[jira] [Assigned] (TIKA-4317) Abusive content on https://corpora.tika.apache.org/
[ https://issues.apache.org/jira/browse/TIKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reassigned TIKA-4317: - Assignee: Tim Allison > Abusive content on https://corpora.tika.apache.org/ > --- > > Key: TIKA-4317 > URL: https://issues.apache.org/jira/browse/TIKA-4317 > Project: Tika > Issue Type: Bug > Components: site >Reporter: Zoran Regvart >Assignee: Tim Allison >Priority: Major > > The Apache Camel team has been notified by Google of abusive content hosted > on https://corpora.tika.apache.org/, with the assumption that this is somehow > related to https://camel.apache.org. The scanning done by Google is against > the whole apache.org domain, so the implication is that any abusive content found > on any domain within apache.org will be attributed to, and affect, other domains > within apache.org. > Learn about abusive experiences here: > https://support.google.com/webtools/answer/7347327. > Singled out page from Google report (content & possibly security warning): > {code}https://corpora.tika.apache.org/base/docs/commoncrawl3/QK/QKKJTNDRIVLIPP7433IFC3EF3UVOSPIB{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888662#comment-17888662 ] Tilman Hausherr commented on TIKA-4278: --- New test result with the latest changes and the colon added to the default configuration; the results are the same: [^reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz] > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, > reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz, > reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz > > > I ran the code from the attached SO issue and yes, it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? The proposed change would modify CSVSniffer so that the delimiters are a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}.
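A minimal sketch of the behaviour described in TIKA-4278, assuming a sniffer that only considers the delimiters it is handed. This is not Tika's CSVSniffer: the class DelimiterSniffSketch, the method sniff and the "same non-zero count on every line" heuristic are hypothetical and serve only to show why a candidate list of comma and tab can never report a semicolon, while a wider set can.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; not the Tika implementation.
class DelimiterSniffSketch {

    // Returns the candidate that occurs the same non-zero number of times on
    // every non-empty line (highest count wins), or null if none qualifies.
    static Character sniff(String text, Set<Character> candidates) {
        String[] lines = text.split("\r?\n");
        Character best = null;
        int bestCount = 0;
        for (char c : candidates) {
            int first = count(lines[0], c);
            if (first == 0) {
                continue; // candidate absent from the first line
            }
            boolean consistent = true;
            for (String line : lines) {
                if (!line.isEmpty() && count(line, c) != first) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && first > bestCount) {
                best = c;
                bestCount = first;
            }
        }
        return best;
    }

    // Counts occurrences of c in s.
    static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String sample = "a;b;c\n1;2;3\n";
        // Mirrors the two-element default from the issue: comma and tab only.
        Set<Character> narrow = new LinkedHashSet<>(Arrays.asList(',', '\t'));
        // Assumed wider set; the real map contents are not shown in the issue.
        Set<Character> wide = new LinkedHashSet<>(Arrays.asList(',', '\t', ';', '|'));
        System.out.println(sniff(sample, narrow)); // prints null
        System.out.println(sniff(sample, wide));   // prints ;
    }
}
```

Taking the candidates as a Set mirrors the proposal in the issue to pass a key set rather than a fixed array; whether the colon belongs in the default set is the open question in the comments, so it is left out here.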
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Attachment: reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, > reports_csv_3.0.0_vs_3.0.0_new_withcolon.tar.xz, > reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888651#comment-17888651 ] Tilman Hausherr edited comment on TIKA-4278 at 10/11/24 1:17 PM: - Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No surprises here. However the test runs only on .csv files so it misses some of the files mentioned in the previous report. (This does not yet contain the latest change, and didn't include the colon) was (Author: tilman): Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No surprises here. However the test runs only on .csv files so it misses some of the files mentioned in the previous report. (This does not yet contain the latest change) > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, > reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. 
[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888651#comment-17888651 ] Tilman Hausherr edited comment on TIKA-4278 at 10/11/24 1:16 PM: - Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No surprises here. However the test runs only on .csv files so it misses some of the files mentioned in the previous report. (This does not yet contain the latest change) was (Author: tilman): Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No surprises here. However the test runs only on .csv files so it misses some of the files mentioned in the previous report. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, > reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888651#comment-17888651 ] Tilman Hausherr commented on TIKA-4278: --- Here's the test result: [^reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz] No surprises here. However the test runs only on .csv files so it misses some of the files mentioned in the previous report. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, > reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Attachment: reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz, > reports_csv_3.0.0_vs_3.0.0_nocolon.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888326#comment-17888326 ] Tilman Hausherr edited comment on TIKA-4278 at 10/10/24 3:40 PM: - 1 and 2, i.e. set a user modifiable default configuration that doesn't contain the colons. I'd love to hear other opinions. was (Author: tilman): 1 and 2, i.e. set a default configuration that doesn't contain the colons. I'd love to hear other opinions. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser > Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888326#comment-17888326 ] Tilman Hausherr commented on TIKA-4278: --- 1 and 2, i.e. set a default configuration that doesn't contain the colons. I'd love to hear other opinions. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888321#comment-17888321 ] Tilman Hausherr commented on TIKA-4280: --- One weird thing: commoncrawl3/2P/2PSMEFJEYU7EPAZXQQDD6OL2WOQLBJRY, this is a compressed file. In "A" it appears as "application/json; charset=ISO-8859-1", in "B" as "text/csv; charset=ISO-8859-1; delimiter=colon". The file itself starts with "PK" so shouldn't this be easy? > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
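The remark above that the file "starts with PK" refers to the ZIP local file header signature (the bytes 0x50 0x4B 0x03 0x04). The following is a minimal sketch of such a magic-byte check, assuming the first bytes of the file are already in memory. The class ZipMagicSketch and the method looksLikeZip are hypothetical names, and this is not how Tika's detectors are wired together.

```java
// Hypothetical illustration only; not the Tika detector chain.
class ZipMagicSketch {

    // True if the buffer begins with the ZIP local file header signature "PK\003\004".
    static boolean looksLikeZip(byte[] head) {
        return head != null
                && head.length >= 4
                && head[0] == 0x50  // 'P'
                && head[1] == 0x4B  // 'K'
                && head[2] == 0x03
                && head[3] == 0x04;
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] textHead = {'a', ':', 'b', '\n'};
        System.out.println(looksLikeZip(zipHead));  // prints true
        System.out.println(looksLikeZip(textHead)); // prints false
    }
}
```

A check like this only looks at the container signature; it says nothing about why the two runs in the comment reported JSON and CSV.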
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888256#comment-17888256 ] Tilman Hausherr commented on TIKA-4280: --- csv changes => TIKA-4278 > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
build timeout fails
https://issues.apache.org/jira/browse/INFRA-26175
Re: 3.0.0 release?
This is weird... Tika itself depends on Apache CXF; it currently uses 4.0.5.

I couldn't believe it but then I looked at https://github.com/apache/cxf/blob/main/parent/pom.xml and it's true... a search finds more: https://github.com/search?q=repo%3Aapache%2Fcxf%20tika&type=code

Tilman

On 24.09.2024 15:29, Gary D. Gregory wrote:

Hi All,

Is there a time frame for 3.0.0? It looks like Apache CXF 4.1.0 depends on 3.0.0 [1] and I'm waiting on CXF 4.1.0... Any guidance would be appreciated.

TY!
Gary

[1] https://issues.apache.org/jira/browse/CXF-8671

On 2024/08/21 17:08:22 Nicholas DiPiazza wrote:

I have a pull request for some class path loading fixes for Tika grpc. Hoping to get that done today but it's a struggle so far.

On Wed, Aug 21, 2024, 11:30 AM Tim Allison wrote:

All,

There are a couple of items documented on https://issues.apache.org/jira/browse/TIKA-4280 that we wanted to take care of before the 3.0.0 release. I can run a comparison between 2.x and 3.x on our regression corpus, and I can try to deal with javadocs. Any recs on how to wrap up the other issues? Are there any other blockers not listed on that issue?

Thank you!

Best,
Tim
[jira] [Closed] (TIKA-2619) Memory leak: PDF meta data detection fails with OutOfMemoryError
[ https://issues.apache.org/jira/browse/TIKA-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-2619. - Resolution: Duplicate > Memory leak: PDF meta data detection fails with OutOfMemoryError > > > Key: TIKA-2619 > URL: https://issues.apache.org/jira/browse/TIKA-2619 > Project: Tika > Issue Type: Bug >Affects Versions: 1.17 > Environment: Linux 4.13.0-37 / JDK 1.8.0_152 >Reporter: Felix Dürrwanger >Priority: Critical > Attachments: Bundesministerium.pdf > > > When analysing the attached PDF with TIKA (embedded and server) the JVM > consumes all available memory and fails with an OutOfMemoryError. The PDF is > an official and public document from a German federal ministry. > > *Client*: > {noformat} > fd@804F9H2:~/TIKA$ time curl -T Bundesministerium.pdf > http://127.0.0.1:9998/meta --header "Accept: application/json" > Error: 500 > real 0m53.417s > user 0m0.020s > sys 0m0.011s > {noformat} > > *Server*: > {noformat} > fd@804F9H2:~/TIKA$ java -Xmx12G -jar tika-server-1.17.jar > Mar 29, 2018 12:25:34 PM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > TIFFImageWriter not loaded. tiff files will not be processed > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > J2KImageReader not loaded. JPEG2000 files will not be processed. > See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io > for optional dependencies. > Mar 29, 2018 12:25:34 PM org.apache.tika.config.InitializableProblemHandler$3 > handleInitializableProblem > WARNING: org.xerial's sqlite-jdbc is not loaded. > Please provide the jar on your classpath to parse sqlite files. > See tika-parsers/pom.xml for the correct version. 
> INFO Starting Apache Tika 1.17 server > INFO Setting the server's publish address to be http://localhost:9998/ > INFO jetty-8.y.z-SNAPSHOT > INFO Started SelectChannelConnector@localhost:9998 > INFO Started Apache Tika server at http://localhost:9998/ > INFO meta (autodetecting type) > WARN Application \{http://resource.server.tika.apache.org/}MetadataResource > has thrown exception, unwinding now > org.apache.cxf.interceptor.Fault: Java heap space > at > org.apache.cxf.service.invoker.AbstractInvoker.createFault(AbstractInvoker.java:163) > at > org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:129) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:202) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:101) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) > at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > at > org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) > at > org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:274) > at > org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261) > at > org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:76) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > at org.eclipse.jetty.server.Server.handle(Server.java:370) > at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) > at > org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:973) > at > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1035) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:647) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:231) > at > org.eclipse.jetty.server.AsyncHttpConnecti
[jira] [Updated] (TIKA-4311) Avoid potential ClassCastException in angle detection PDF extraction
[ https://issues.apache.org/jira/browse/TIKA-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4311: -- Description: There is a programming error in PDFBox ExtractText now fixed in PDFBOX-5879, I'll fix the same code that is in PDF2XHTML here, although I haven't been able to reproduce the ClassCastException. (was: There is a programming error in ExtractText now fixed in PDFBOX-5879, I'll fix it here too, although I haven't been able to reproduce the ClassCastException.) > Avoid potential ClassCastException in angle detection PDF extraction > > > Key: TIKA-4311 > URL: https://issues.apache.org/jira/browse/TIKA-4311 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > There is a programming error in PDFBox ExtractText now fixed in PDFBOX-5879, > I'll fix the same code that is in PDF2XHTML here, although I haven't been > able to reproduce the ClassCastException. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4311) Avoid potential ClassCastException in angle detection PDF extraction
[ https://issues.apache.org/jira/browse/TIKA-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4311. --- Resolution: Fixed > Avoid potential ClassCastException in angle detection PDF extraction > > > Key: TIKA-4311 > URL: https://issues.apache.org/jira/browse/TIKA-4311 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > There is a programming error in PDFBox ExtractText now fixed in PDFBOX-5879, > I'll fix the same code that is in PDF2XHTML here, although I haven't been > able to reproduce the ClassCastException. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4311) Avoid potential ClassCastException in angle detection PDF extraction
Tilman Hausherr created TIKA-4311: - Summary: Avoid potential ClassCastException in angle detection PDF extraction Key: TIKA-4311 URL: https://issues.apache.org/jira/browse/TIKA-4311 Project: Tika Issue Type: Bug Components: parser Affects Versions: 2.9.2, 3.0.0-BETA Reporter: Tilman Hausherr Assignee: Tilman Hausherr Fix For: 3.0.0, 2.9.3 There is a programming error in ExtractText now fixed in PDFBOX-5879, I'll fix it here too, although I haven't been able to reproduce the ClassCastException. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
[ https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4308. --- Fix Version/s: 3.0.0 2.9.3 Resolution: Fixed Thanks for the report! > ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32 > --- > > Key: TIKA-4308 > URL: https://issues.apache.org/jira/browse/TIKA-4308 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Alexey Pelykh >Assignee: Tilman Hausherr >Priority: Trivial > Labels: easyfix > Fix For: 3.0.0, 2.9.3 > > > It seems that a PE executable for 64-bit platform should return > MACHINE_x86_64, not MACHINE_x86_32: > https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144 -- This message was sent by Atlassian Jira (v8.20.10#820010)
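For reference, the two values in the TIKA-4308 title are the PE/COFF Machine field constants for 32-bit x86 (0x14c, IMAGE_FILE_MACHINE_I386) and x64 (0x8664, IMAGE_FILE_MACHINE_AMD64). The following is a minimal sketch of the mapping the reporter expects, assuming the 16-bit Machine field has already been read from the COFF header. The class PeMachineSketch and its enum are hypothetical and do not reproduce the ExecutableParser code.

```java
// Hypothetical illustration only; not the ExecutableParser implementation.
class PeMachineSketch {

    enum Machine { X86_32, X86_64, UNKNOWN }

    // Maps the PE/COFF "Machine" header field to an architecture label.
    static Machine fromPeMachineField(int machine) {
        switch (machine) {
            case 0x14c:   // IMAGE_FILE_MACHINE_I386
                return Machine.X86_32;
            case 0x8664:  // IMAGE_FILE_MACHINE_AMD64
                return Machine.X86_64;
            default:      // other machine types are out of scope for this sketch
                return Machine.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromPeMachineField(0x14c));  // prints X86_32
        System.out.println(fromPeMachineField(0x8664)); // prints X86_64
    }
}
```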
[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
[ https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4308: -- Priority: Trivial (was: Major) > ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32 > --- > > Key: TIKA-4308 > URL: https://issues.apache.org/jira/browse/TIKA-4308 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Alexey Pelykh >Assignee: Tilman Hausherr >Priority: Trivial > Labels: easyfix > > It seems that a PE executable for 64-bit platform should return > MACHINE_x86_64, not MACHINE_x86_32: > https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
[ https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4308: -- Component/s: parser > ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32 > --- > > Key: TIKA-4308 > URL: https://issues.apache.org/jira/browse/TIKA-4308 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Alexey Pelykh >Assignee: Tilman Hausherr >Priority: Major > > It seems that a PE executable for 64-bit platform should return > MACHINE_x86_64, not MACHINE_x86_32: > https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
[ https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4308: -- Labels: easyfix (was: ) > ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32 > --- > > Key: TIKA-4308 > URL: https://issues.apache.org/jira/browse/TIKA-4308 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Alexey Pelykh >Assignee: Tilman Hausherr >Priority: Major > Labels: easyfix > > It seems that a PE executable for 64-bit platform should return > MACHINE_x86_64, not MACHINE_x86_32: > https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
[ https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4308: -- Affects Version/s: 2.9.2 > ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32 > --- > > Key: TIKA-4308 > URL: https://issues.apache.org/jira/browse/TIKA-4308 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Alexey Pelykh > Assignee: Tilman Hausherr >Priority: Major > > It seems that a PE executable for 64-bit platform should return > MACHINE_x86_64, not MACHINE_x86_32: > https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (TIKA-4308) ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32
[ https://issues.apache.org/jira/browse/TIKA-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reassigned TIKA-4308: - Assignee: Tilman Hausherr > ExecutableParser: PE 0x14c and 0x8664 both yield MACHINE_x86_32 > --- > > Key: TIKA-4308 > URL: https://issues.apache.org/jira/browse/TIKA-4308 > Project: Tika > Issue Type: Bug >Reporter: Alexey Pelykh > Assignee: Tilman Hausherr >Priority: Major > > It seems that a PE executable for 64-bit platform should return > MACHINE_x86_64, not MACHINE_x86_32: > https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-code-module/src/main/java/org/apache/tika/parser/executable/ExecutableParser.java#L142-L144 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3970) Certain OneNote documents produce duplicate text
[ https://issues.apache.org/jira/browse/TIKA-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877655#comment-17877655 ] Tilman Hausherr commented on TIKA-3970: --- Rather, the file in TIKA-4303 is missing the Chinese text that was there before the 2023 commit. This happens only in 2.9.2, not in 3.0, even though both commits are identical. > Certain OneNote documents produce duplicate text > > > Key: TIKA-3970 > URL: https://issues.apache.org/jira/browse/TIKA-3970 > Project: Tika > Issue Type: Bug > Components: app >Affects Versions: 2.7.0 >Reporter: David Avant >Priority: Minor > Attachments: Screenshot 2023-02-21 at 3.43.08 PM.png, > lyrics-crawlAllFileNodesFromRoot-false.txt, lyrics.docx, lyrics.one, > lyrics.txt > > > Extracting text from certain OneNote documents produces more text than is > actually in the document. In this case, the OneNote document was created > by opening a Word document and "printing" it to the OneNote. > To reproduce the issue, open the attached "lyrics.one" using the Tika App > version 2.7.0 and view the plain text. Look for the phrase "Sunday > Morning" and observe that there are 14 occurrences. However in the actual > displayed text, it occurs only once. > The original text in this document is only about 12K characters, but the > extracted text from tika is over 300K. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4303) Unable to extract Chinese content in onenote
[ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877653#comment-17877653 ] Tilman Hausherr commented on TIKA-4303: --- I tried reverting TIKA-3970 and now I get what you have. However, the commit diffs in both branches are absolutely identical. Very weird. > Unable to extract Chinese content in onenote > > > Key: TIKA-4303 > URL: https://issues.apache.org/jira/browse/TIKA-4303 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.8.0, 2.9.2 >Reporter: lqangi >Priority: Major > Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png > > > When I tried to extract the contents of onenote file containing Chinese using > tika, the Chinese part of the file could not be extracted, only the > non-Chinese content was extracted. > In addition, some of the extracted content is duplicate, as described in > [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970], it seems to > extract the historical version of the data along with the extraction, I don't > know if this issue (TIKA-3970) has been fixed (I see that the code has been > committed on github, But it doesn't seem to have completely solved the > problem yet) > The software versions I use are as follows: > Tika: 2.8.0 > Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761) > > In order to reproduce this problem, just use the 2.8.0 version of Tika App to > open the attachment "Chinese-Notes.one" and check whether the Chinese content > in the file is extracted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4303) Unable to extract Chinese content in onenote
[ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4303: -- Affects Version/s: 2.9.2 > Unable to extract Chinese content in onenote > > > Key: TIKA-4303 > URL: https://issues.apache.org/jira/browse/TIKA-4303 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.8.0, 2.9.2 >Reporter: lqangi >Priority: Major > Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png > > > When I tried to extract the contents of onenote file containing Chinese using > tika, the Chinese part of the file could not be extracted, only the > non-Chinese content was extracted. > In addition, some of the extracted content is duplicate, as described in > [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970], it seems to > extract the historical version of the data along with the extraction, I don't > know if this issue (TIKA-3970) has been fixed (I see that the code has been > committed on github, But it doesn't seem to have completely solved the > problem yet) > The software versions I use are as follows: > Tika: 2.8.0 > Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761) > > In order to reproduce this problem, just use the 2.8.0 version of Tika App to > open the attachment "Chinese-Notes.one" and check whether the Chinese content > in the file is extracted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4303) Unable to extract Chinese content in onenote
[ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877631#comment-17877631 ] Tilman Hausherr commented on TIKA-4303: --- I tried with the 3 beta and there I get more: 中文标题� � 中文标题� 中文标题� zhongwen� 中文标题� 中文标题� 中文标题� 中文标题� � 14:08 zhongwen zhongwen� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 So maybe changes were done in 3.0 but not committed to 2.9. > Unable to extract Chinese content in onenote > > > Key: TIKA-4303 > URL: https://issues.apache.org/jira/browse/TIKA-4303 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.8.0 >Reporter: lqangi >Priority: Major > Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png > > > When I tried to extract the contents of onenote file containing Chinese using > tika, the Chinese part of the file could not be extracted, only the > non-Chinese content was extracted. 
> In addition, some of the extracted content is duplicate, as described in > [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970], it seems to > extract the historical version of the data along with the extraction, I don't > know if this issue (TIKA-3970) has been fixed (I see that the code has been > committed on github, But it doesn't seem to have completely solved the > problem yet) > The software versions I use are as follows: > Tika: 2.8.0 > Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761) > > In order to reproduce this problem, just use the 2.8.0 version of Tika App to > open the attachment "Chinese-Notes.one" and check whether the Chinese content > in the file is extracted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4303) Unable to extract Chinese content in onenote
[ https://issues.apache.org/jira/browse/TIKA-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877631#comment-17877631 ] Tilman Hausherr edited comment on TIKA-4303 at 8/29/24 8:46 AM: I tried with the 3 beta and there I get more: 中文标题� � 中文标题� 中文标题� zhongwen� 中文标题� 中文标题� 中文标题� 中文标题� � 14:08 zhongwen zhongwen� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 So maybe changes were done in 3.0 but not committed to 2.9 (where I did not get chinese text) was (Author: tilman): I tried with the 3 beta and there I get more: 中文标题� � 中文标题� 中文标题� zhongwen� 中文标题� 中文标题� 中文标题� 中文标题� � 14:08 zhongwen zhongwen� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. 
Access the notebook from any device. Share notebooks to collaborate with others. 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 中文标题� Type information into a notebook or insert information from other apps and web pages. OneNote is a digital notebook that automatically saves and syncs notes as you work. Follow up easily with highlights and tags. Take handwritten notes or draw ideas. Access the notebook from any device. Share notebooks to collaborate with others. 14:08 So maybe changes were done in 3.0 but not committed to 2.9. > Unable to extract Chinese content in onenote > > > Key: TIKA-4303 > URL: https://issues.apache.org/jira/browse/TIKA-4303 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.8.0 >Reporter: lqangi >Priority: Major > Attachments: Chinese-notes.one, tika-parsing-chinese-notes-result.png > > > When I tried to extract the contents of onenote file containing Chinese using > tika, the Chinese part of the file could not be extracted, only the > non-Chinese content was extracted. 
> In addition, some of the extracted content is duplicate, as described in > [TIKA-3970|https://issues.apache.org/jira/browse/TIKA-3970], it seems to > extract the historical version of the data along with the extraction, I don't > know if this issue (TIKA-3970) has been fixed (I see that the code has been > committed on github, But it doesn't seem to have completely solved the > problem yet) > The software versions I use are as follows: > Tika: 2.8.0 > Onenote: Microsoft® OneNote® LTSC MSO (16.0.14332.20761) > > In order to reproduce this problem, just use the 2.8.0 version of Tika App to > open the attachment "Chinese-Notes.one" and check whether the Chinese content > in the file is extracted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4302) Please generate a new 2.9.x deployment
[ https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4302. - Resolution: Duplicate > Please generate a new 2.9.x deployment > -- > > Key: TIKA-4302 > URL: https://issues.apache.org/jira/browse/TIKA-4302 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Alan Klein >Priority: Major > > It appears that a number of dependencies were updated in TIKA-4166 > Would you be able to generate a new 2.9.x deployment that includes the > changes in TIKA-4166 ? I am specifically looking to address CVE-2024-29857 > (High) which is due to Bouncy Castle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4302) Please generate a new 2.9.x deployment
[ https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4302: -- Fix Version/s: (was: TIKA-4239) > Please generate a new 2.9.x deployment > -- > > Key: TIKA-4302 > URL: https://issues.apache.org/jira/browse/TIKA-4302 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Alan Klein >Priority: Major > > It appears that a number of dependencies were updated in TIKA-4166 > Would you be able to generate a new 2.9.x deployment that includes the > changes in TIKA-4166 ? I am specifically looking to address CVE-2024-29857 > (High) which is due to Bouncy Castle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (TIKA-4302) Please generate a new 2.9.x deployment
[ https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reopened TIKA-4302: --- > Please generate a new 2.9.x deployment > -- > > Key: TIKA-4302 > URL: https://issues.apache.org/jira/browse/TIKA-4302 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Alan Klein >Priority: Major > Fix For: TIKA-4239 > > > It appears that a number of dependencies were updated in TIKA-4166 > Would you be able to generate a new 2.9.x deployment that includes the > changes in TIKA-4166 ? I am specifically looking to address CVE-2024-29857 > (High) which is due to Bouncy Castle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4302) Please generate a new 2.9.x deployment
[ https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4302. - Fix Version/s: TIKA-4239 Resolution: Duplicate I don't know the date, but there will definitely be another 2.x release. I'm closing this issue as a duplicate of TIKA-4239. See also the Tika homepage for how we handled end of life for 1.x: there were several 1.x releases while 2.x was already released. > Please generate a new 2.9.x deployment > -- > > Key: TIKA-4302 > URL: https://issues.apache.org/jira/browse/TIKA-4302 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Alan Klein >Priority: Major > Fix For: TIKA-4239 > > > It appears that a number of dependencies were updated in TIKA-4166 > Would you be able to generate a new 2.9.x deployment that includes the > changes in TIKA-4166 ? I am specifically looking to address CVE-2024-29857 > (High) which is due to Bouncy Castle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4239) Update to 2.9.3
[ https://issues.apache.org/jira/browse/TIKA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877407#comment-17877407 ] Tilman Hausherr commented on TIKA-4239: --- I've modified the tika-branch2x-jdk11 build job so that it creates a JIRA comment like already done with the trunk. > Update to 2.9.3 > --- > > Key: TIKA-4239 > URL: https://issues.apache.org/jira/browse/TIKA-4239 > Project: Tika > Issue Type: Task > Components: build >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4302) Please generate a new 2.9.x deployment
[ https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877298#comment-17877298 ] Tilman Hausherr commented on TIKA-4302: --- I looked at CVE-2024-29857. The "worst" that could happen is high CPU load. > Please generate a new 2.9.x deployment > -- > > Key: TIKA-4302 > URL: https://issues.apache.org/jira/browse/TIKA-4302 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Alan Klein >Priority: Major > > It appears that a number of dependencies were updated in TIKA-4166 > Would you be able to generate a new 2.9.x deployment that includes the > changes in TIKA-4166 ? I am specifically looking to address CVE-2024-29857 > (High) which is due to Bouncy Castle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4302) Please generate a new 2.9.x deployment
[ https://issues.apache.org/jira/browse/TIKA-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877242#comment-17877242 ] Tilman Hausherr commented on TIKA-4302: --- Snapshots are here: https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/2.9.3-SNAPSHOT/ > Please generate a new 2.9.x deployment > -- > > Key: TIKA-4302 > URL: https://issues.apache.org/jira/browse/TIKA-4302 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.2 >Reporter: Alan Klein >Priority: Major > > It appears that a number of dependencies were updated in TIKA-4166 > Would you be able to generate a new 2.9.x deployment that includes the > changes in TIKA-4166 ? I am specifically looking to address CVE-2024-29857 > (High) which is due to Bouncy Castle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4231. - Resolution: Duplicate Closing as duplicate. You can still comment. > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Labels: ActualText > Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, > arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876410#comment-17876410 ] Tilman Hausherr edited comment on TIKA-4280 at 8/24/24 9:52 AM: re hdf5 I've created a ticket [https://github.com/bytedeco/javacpp-presets/issues/1533] Before that I contacted the hdf5 people but from their answer and from [https://bytedeco.org|https://bytedeco.org/] I think that they're not the ones responsible. was (Author: tilman): re hdf5 I've created a ticket [https://github.com/bytedeco/javacpp-presets/issues/1533] Before that I conntacted the hdf5 people but from their answer and from https://bytedeco.org I think that they're not the ones responsible. > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876410#comment-17876410 ] Tilman Hausherr commented on TIKA-4280: --- Re hdf5: I've created a ticket [https://github.com/bytedeco/javacpp-presets/issues/1533]. Before that I contacted the hdf5 people, but from their answer and from https://bytedeco.org I think that they're not the ones responsible. > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875620#comment-17875620 ] Tilman Hausherr commented on TIKA-4280: --- I just reverted the collections "-M" version; I think it was me who set it, just so that project gets more tests. Re tika-dl, let's keep it as it is; it has been on non-regular releases since 2018 (TIKA-2672). > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875620#comment-17875620 ] Tilman Hausherr edited comment on TIKA-4280 at 8/21/24 7:04 PM: I just reverted the collections "-M" version; I think it was me who set it, just so that this other project gets more tests. Re tika-dl, let's keep it as it is; it has been on non-regular releases since 2018 (TIKA-2672). was (Author: tilman): I just reverted the collections "-M" version, I think it was me who set it just so that project gets more tests. Re tika-dl lets keep it as it is, it has been with non regular releases since 2018 (TIKA-2672). > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875592#comment-17875592 ] Tilman Hausherr commented on TIKA-4280: --- TIKA-4290 is resolved. He's of course free to bring up more changes, but he has kept quiet for some time now. The ffmpeg issue and the hdf5 issue: hdf5 1.14.3-1.5.10 is the latest version on Maven Central, but it has a CVE. They claim it has been fixed in 1.14.4 [https://www.hdfgroup.org/2024/05/06/new-hdf5-cve-issues-fixed-in-1-14-4/], but that one isn't available. ffmpeg also has a CVE; I've excluded it completely, see my comment in tika-parsers/tika-parsers-ml/tika-dl/pom.xml. At this time it is still at the vulnerable 6.1.1-1.5.10. Do we have a "stakeholder" on these two issues who can help? > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3858) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874491#comment-17874491 ] Tilman Hausherr commented on TIKA-3858: --- Fixed in PDFBOX-5868. > Ligatures convert on text extraction > - > > Key: TIKA-3858 > URL: https://issues.apache.org/jira/browse/TIKA-3858 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.4.1 > Environment: win 8, jre 1.5 >Reporter: tom hill >Priority: Major > Labels: ActualText > Attachments: TikaChromeInboxLigature.pdf > > > It appears that the issue in TIKA-1289 is still present. Ligatures get > replaced by a question mark. > As a particular example, the ft ligature is getting replaced by utf-8: ef bf > bd > Is there any new resolution on this issue? Just returning the fl ligature > would be great, or normalizing it to f, t. > This particular example comes from saving my gmail inbox page as a pdf, in > chrome. It uses the ft ligature in the word "Drafts". > There are many similar examples, it's not specific to one pdf generator. > I'm using tika-app-2.4.1.jar -- This message was sent by Atlassian Jira (v8.20.10#820010)
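As background to the TIKA-3858 report: the byte sequence "ef bf bd" quoted there is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which is what appears when a glyph could not be mapped to text. The sketch below illustrates that, and also shows how Unicode compatibility normalization (NFKC) expands a ligature code point such as U+FB02 into its letters. It is not a description of the PDFBOX-5868 fix; the class and method names are hypothetical, and the example uses the fl ligature purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class LigatureSketch {
    // Returns the UTF-8 bytes of s as lowercase hex, separated by spaces.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // NFKC replaces compatibility characters, including ligature
    // presentation forms, with their plain-letter equivalents.
    public static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\uFFFD"));           // replacement character
        System.out.println(expandLigatures("\uFB02ow")); // fl ligature followed by "ow"
    }
}
```

Normalization of this kind only helps when the extractor actually produced a ligature code point. Once the output contains U+FFFD, the original character is lost and has to be recovered at extraction time.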
[jira] [Comment Edited] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874464#comment-17874464 ] Tilman Hausherr edited comment on TIKA-4231 at 8/17/24 9:21 AM: Here's a new text extraction after fixing PDFBOX-5868: [^TIKA-4231-arabic-new.txt] does this look closer to what you're expecting? was (Author: tilman): Here's a new text extraction: [^TIKA-4231-arabic-new.txt] does this look closer to what you're expecting? > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, > arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874464#comment-17874464 ] Tilman Hausherr commented on TIKA-4231: --- Here's a new text extraction: [^TIKA-4231-arabic-new.txt] does this look closer to what you're expecting? > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, > arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4231: -- Attachment: TIKA-4231-arabic-new.txt > Parsing Arabic PDF is returning bad data > > > Key: TIKA-4231 > URL: https://issues.apache.org/jira/browse/TIKA-4231 > Project: Tika > Issue Type: Bug >Affects Versions: 2.6.0, 2.9.1 > Environment: I am using Java 18. And using maven dependency > tika-parsers-standard-package > ([https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/2.6.0)] > >Reporter: Aamir >Priority: Major > Attachments: TIKA-4231-arabic-new.txt, arabic-pdfbox.txt, arabic.pdf, > arabic.txt > > > Attached is a PDF with arabic text in it. > When parsed using tika version 2.6.0 or 2.9.1, it produces gibberish > characters. > The generated text doc is also attached which contains the parsed text. > Most of the other Arabic PDFs parse fine, but this one is giving this output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name
[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873949#comment-17873949 ] Tilman Hausherr commented on TIKA-4298: --- The problem is that this image might be considered to be a work of art. Your colleague didn't sign an ICLA. IMHO there might be two solutions: 1) you recreate the zip file without the image 2) you change the test so that it loads the zip file from the URL in the ticket. (2) is done a lot in PDFBox but I haven't seen it in tika. > Failed to detect charset for zip entry with short non-Unicode file name > --- > > Key: TIKA-4298 > URL: https://issues.apache.org/jira/browse/TIKA-4298 > Project: Tika > Issue Type: Bug > Components: detector >Reporter: Mingchun Zhao >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip > > > The Japanese file names extracted from a zip file > [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file > name is Shift_JIS, but the detect() method within the PackageParser class was > not able to detect the charset properly. > {code:java} > $ ls -1 testZipEntryNameCharsetShiftSJIS > shiba.png > 文章1.txt > 文章2.txt > {code} > {code:java} > $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip > xmlns="http://www.w3.org/1999/xhtml";> > > > content="org.apache.tika.parser.pkg.PackageParser"/> > > > > > > > > > shiba.png > > > ���1.txt > あいうえお > かきくけこ > > > ���2.txt > さしすせそ > たちつてと > > % {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name
[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873921#comment-17873921 ] Tilman Hausherr commented on TIKA-4298: --- I already tested it locally - nice. But what's with the ZIP file? Is this from the wild, or did you create it yourself? Who has the copyright of the shiba.png image? > Failed to detect charset for zip entry with short non-Unicode file name > --- > > Key: TIKA-4298 > URL: https://issues.apache.org/jira/browse/TIKA-4298 > Project: Tika > Issue Type: Bug > Components: detector >Reporter: Mingchun Zhao >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip > > > The Japanese file names extracted from a zip file > [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file > name is Shift_JIS, but the detect() method within the PackageParser class was > not able to detect the charset properly. > {code:java} > $ ls -1 testZipEntryNameCharsetShiftSJIS > shiba.png > 文章1.txt > 文章2.txt > {code} > {code:java} > $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip > xmlns="http://www.w3.org/1999/xhtml";> > > > content="org.apache.tika.parser.pkg.PackageParser"/> > > > > > > > > > shiba.png > > > ���1.txt > あいうえお > かきくけこ > > > ���2.txt > さしすせそ > たちつてと > > % {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4298) Failed to detect charset for zip entry with short non-Unicode file name
[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4298: -- Fix Version/s: 3.0.0 2.9.3 > Failed to detect charset for zip entry with short non-Unicode file name > --- > > Key: TIKA-4298 > URL: https://issues.apache.org/jira/browse/TIKA-4298 > Project: Tika > Issue Type: Bug > Components: detector >Reporter: Mingchun Zhao >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip > > > The Japanese file names extracted from a zip file > [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file > name is Shift_JIS, but the detect() method within the PackageParser class was > not able to detect the charset properly. > {code:java} > $ ls -1 testZipEntryNameCharsetShiftSJIS > shiba.png > 文章1.txt > 文章2.txt > {code} > {code:java} > $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip > xmlns="http://www.w3.org/1999/xhtml";> > > > content="org.apache.tika.parser.pkg.PackageParser"/> > > > > > > > > > shiba.png > > > ���1.txt > あいうえお > かきくけこ > > > ���2.txt > さしすせそ > たちつてと > > % {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
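Note on the garbled names above: replacement characters like the ones in the tika-app output are what the JDK produces when Shift_JIS bytes are decoded as UTF-8. The following is only a minimal sketch of that effect using plain JDK calls. It is not the PackageParser or detector code from the patch, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Tika.
final class ZipEntryNameDecodingSketch {

    // Decodes the raw bytes of an entry name with the given charset.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        // The first file name from the report, written with Unicode escapes
        // so that the source file encoding does not matter.
        String original = "\u6587\u7AE01.txt";
        byte[] raw = original.getBytes(shiftJis);

        // Wrong charset: the two-byte Shift_JIS sequences are invalid UTF-8,
        // so the decoder substitutes U+FFFD replacement characters.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
        // Right charset: the name round-trips unchanged.
        System.out.println(decode(raw, shiftJis));
    }
}
```

For short names like these there are very few bytes to feed a statistical charset detector, which is presumably why the issue title singles out short file names.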
[jira] [Resolved] (TIKA-4290) Fix code inspection anomalies
[ https://issues.apache.org/jira/browse/TIKA-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4290. --- Resolution: Fixed > Fix code inspection anomalies > - > > Key: TIKA-4290 > URL: https://issues.apache.org/jira/browse/TIKA-4290 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
[ https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4296. --- Resolution: Fixed > "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32 > - > > Key: TIKA-4296 > URL: https://issues.apache.org/jira/browse/TIKA-4296 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Thomas Mortagne >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: pdf.pdf > > > I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of > a pdf file seems to produce the following warning: > {noformat} > WARN o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but > is -1 > {noformat} > The behavior is the same as with 2.0.31, it's just that pdfbox is apparently > not too happy anymore with the way it's used by Tika. > This new warning was apparently introduced by PDFBOX-5822. > Just in case it's not actually any file, here is one with which I reproduce: > [^pdf.pdf] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
[ https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871740#comment-17871740 ] Tilman Hausherr commented on TIKA-4296: --- I'll have to wait until PDFBox 3.0.3 is released (very soon). > "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32 > - > > Key: TIKA-4296 > URL: https://issues.apache.org/jira/browse/TIKA-4296 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Thomas Mortagne >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: pdf.pdf > > > I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of > a pdf file seems to produce the following warning: > {noformat} > WARN o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but > is -1 > {noformat} > The behavior is the same as with 2.0.31, it's just that pdfbox is apparently > not too happy anymore with the way it's used by Tika. > This new warning was apparently introduced by PDFBOX-5822. > Just in case it's not actually any file, here is one with which I reproduce: > [^pdf.pdf] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
[ https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reopened TIKA-4296: --- I reverted the trunk to investigate the test failures. > "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32 > - > > Key: TIKA-4296 > URL: https://issues.apache.org/jira/browse/TIKA-4296 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Thomas Mortagne >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: pdf.pdf > > > I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of > a pdf file seems to produce the following warning: > {noformat} > WARN o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but > is -1 > {noformat} > The behavior is the same as with 2.0.31, it's just that pdfbox is apparently > not too happy anymore with the way it's used by Tika. > This new warning was apparently introduced by PDFBOX-5822. > Just in case it's not actually any file, here is one with which I reproduce: > [^pdf.pdf] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
[ https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4296. --- Resolution: Fixed Thanks, this will be fixed in the next version. It's really just a warning. > "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32 > - > > Key: TIKA-4296 > URL: https://issues.apache.org/jira/browse/TIKA-4296 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Thomas Mortagne >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: pdf.pdf > > > I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of > a pdf file seems to produce the following warning: > {noformat} > WARN o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but > is -1 > {noformat} > The behavior is the same as with 2.0.31, it's just that pdfbox is apparently > not too happy anymore with the way it's used by Tika. > This new warning was apparently introduced by PDFBOX-5822. > Just in case it's not actually any file, here is one with which I reproduce: > [^pdf.pdf] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
[ https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reassigned TIKA-4296: - Assignee: Tilman Hausherr > "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32 > - > > Key: TIKA-4296 > URL: https://issues.apache.org/jira/browse/TIKA-4296 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Thomas Mortagne >Assignee: Tilman Hausherr >Priority: Major > Attachments: pdf.pdf > > > I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of > a pdf file seems to produce the following warning: > {noformat} > WARN o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but > is -1 > {noformat} > The behavior is the same as with 2.0.31, it's just that pdfbox is apparently > not too happy anymore with the way it's used by Tika. > This new warning was apparently introduced by PDFBOX-5822. > Just in case it's not actually any file, here is one with which I reproduce: > [^pdf.pdf] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4296) "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32
[ https://issues.apache.org/jira/browse/TIKA-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4296: -- Fix Version/s: 3.0.0 2.9.3 > "Parameter must be 1-based, but is -1" when using Tika with PDFBox 2.0.32 > - > > Key: TIKA-4296 > URL: https://issues.apache.org/jira/browse/TIKA-4296 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Thomas Mortagne >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > Attachments: pdf.pdf > > > I just upgraded my pdfbox dependency to 2.0.32 and any Tika#parseToString of > a pdf file seems to produce the following warning: > {noformat} > WARN o.apache.pdfbox.text.PDFTextStripper - Parameter must be 1-based, but > is -1 > {noformat} > The behavior is the same as with 2.0.31, it's just that pdfbox is apparently > not too happy anymore with the way it's used by Tika. > This new warning was apparently introduced by PDFBOX-5822. > Just in case it's not actually any file, here is one with which I reproduce: > [^pdf.pdf] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4247) HttpFetcher - add ability to send request headers
[ https://issues.apache.org/jira/browse/TIKA-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4247: -- Fix Version/s: 3.0.0 > HttpFetcher - add ability to send request headers > - > > Key: TIKA-4247 > URL: https://issues.apache.org/jira/browse/TIKA-4247 > Project: Tika > Issue Type: New Feature >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > add ability to send request headers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4252. --- Resolution: Fixed > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
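Note on the "UPDATE: found issue" part above: the described fix is a null fallback, i.e. keep the metadata that arrived with the fetch tuple and only create an empty one when none was supplied. The following is a rough sketch of that pattern only. It uses a plain Map as a stand-in for Tika's Metadata, and the class and method names are hypothetical; it is not the PipesServer code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; a Map stands in for the real Metadata class.
final class MetadataFallbackSketch {

    // Behaviour as reported: a fresh, empty object is always used,
    // so whatever the caller put into the tuple is lost.
    static Map<String, String> alwaysNew(Map<String, String> tupleMetadata) {
        return new HashMap<>();
    }

    // Behaviour as proposed: reuse the tuple's metadata, and fall back
    // to an empty object only when the tuple carried none.
    static Map<String, String> reuseOrEmpty(Map<String, String> tupleMetadata) {
        return tupleMetadata != null ? tupleMetadata : new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> input = new HashMap<>();
        input.put("source", "fetch-request");
        System.out.println(alwaysNew(input).containsKey("source"));
        System.out.println(reuseOrEmpty(input).containsKey("source"));
        System.out.println(reuseOrEmpty(null).isEmpty());
    }
}
```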
[jira] [Comment Edited] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871132#comment-17871132 ] Tilman Hausherr edited comment on TIKA-4294 at 8/5/24 5:36 PM: --- What I mean is that if its name is equal to the superclass, then the result is the superclass. Else if not equal, then the class is created from the superclass name and again, the result is the superclass. {code} Class superClazz = Class.forName(superClassName); {code} would be the same. After writing this I googled... seems that yes it does take time, then your code should stay as it is https://stackoverflow.com/questions/18231991/class-forname-caching https://stackoverflow.com/questions/25967441/difference-between-calling-a-class-constructor-and-using-class-forname-newinst was (Author: tilman): What I mean is that if its name is equal to the superclass, then the result is the superclass. Else if not equal, then the class is created from the superclass name and again, the result is the superclass. {code} Class superClazz = Class.forName(superClassName); {code} would be the same. After writing this I googled... seems that yes it does take time, then your code should stay that way https://stackoverflow.com/questions/18231991/class-forname-caching https://stackoverflow.com/questions/25967441/difference-between-calling-a-class-constructor-and-using-class-forname-newinst > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871132#comment-17871132 ] Tilman Hausherr commented on TIKA-4294: --- What I mean is that if its name is equal to the superclass, then the result is the superclass. Else if not equal, then the class is created from the superclass name and again, the result is the superclass. {code} Class superClazz = Class.forName(superClassName); {code} would be the same. After writing this I googled... seems that yes it does take time, then your code should stay that way https://stackoverflow.com/questions/18231991/class-forname-caching https://stackoverflow.com/questions/25967441/difference-between-calling-a-class-constructor-and-using-class-forname-newinst > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4294) Simplify serialization/deserialization of ParseContext
[ https://issues.apache.org/jira/browse/TIKA-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871127#comment-17871127 ] Tilman Hausherr commented on TIKA-4294: --- that alternative is still there: {code} Class superClazz = className.equals(superClassName) ? clazz : Class.forName(superClassName); {code} The result will always be the {{superClassName}} class. It only makes sense if you'd assume that {{Class.forName}} is a very slow operation (I don't know if it is), or could fail for security reasons. Or I'm missing something here. > Simplify serialization/deserialization of ParseContext > -- > > Key: TIKA-4294 > URL: https://issues.apache.org/jira/browse/TIKA-4294 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Trivial > Fix For: 3.0.0 > > > Via [~dimirsen] (?) and [~tilman]'s ping on TIKA-4252, we should simplify the > serialization and deserialization of ParseContext to avoid redundancy of the > superclass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
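Note on the ternary discussed above: the claim that both branches end up with the same class can be checked with a few lines of plain Java. This is only a sketch under the assumption that both lookups go through the same class loader (true here, since it uses JDK classes). The class and method names are hypothetical and this is not the ParseContext serialization code.

```java
// Hypothetical illustration of the ternary from the comment.
final class SuperClassLookupSketch {

    // Mirrors the quoted expression: reuse clazz when the names match,
    // otherwise load the class by name.
    static Class<?> viaTernary(Class<?> clazz, String superClassName)
            throws ClassNotFoundException {
        String className = clazz.getName();
        return className.equals(superClassName) ? clazz : Class.forName(superClassName);
    }

    // The simplification suggested in the follow-up comment.
    static Class<?> direct(String superClassName) throws ClassNotFoundException {
        return Class.forName(superClassName);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Names are equal: the ternary returns clazz itself.
        System.out.println(viaTernary(java.util.ArrayList.class, "java.util.ArrayList")
                == direct("java.util.ArrayList"));
        // Names differ: the ternary falls through to Class.forName.
        System.out.println(viaTernary(java.util.ArrayList.class, "java.util.AbstractList")
                == direct("java.util.AbstractList"));
    }
}
```

So the ternary only saves a lookup in the equal-name case; it does not change which class is returned.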
[jira] [Resolved] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name
[ https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4291. --- Assignee: Tilman Hausherr Resolution: Fixed Thanks! > In JDBCEmitter local var dateFormats shadows class field with the same name > --- > > Key: TIKA-4291 > URL: https://issues.apache.org/jira/browse/TIKA-4291 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: Dmitrii Kriukov >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > Line 338 of JDBCEmitter > Local variable dateFormats is created, populated with values, but never used > in its scope. > It's not clear how to fix. Was it planned to use class field with the same > type and name? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name
[ https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4291: -- Fix Version/s: 3.0.0 2.9.3 > In JDBCEmitter local var dateFormats shadows class field with the same name > --- > > Key: TIKA-4291 > URL: https://issues.apache.org/jira/browse/TIKA-4291 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: Dmitrii Kriukov >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > Line 338 of JDBCEmitter > Local variable dateFormats is created, populated with values, but never used > in its scope. > It's not clear how to fix. Was it planned to use class field with the same > type and name? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name
[ https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4291: -- Affects Version/s: 2.9.2 3.0.0-BETA > In JDBCEmitter local var dateFormats shadows class field with the same name > --- > > Key: TIKA-4291 > URL: https://issues.apache.org/jira/browse/TIKA-4291 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Affects Versions: 3.0.0-BETA, 2.9.2 >Reporter: Dmitrii Kriukov >Priority: Major > > Line 338 of JDBCEmitter > Local variable dateFormats is created, populated with values, but never used > in its scope. > It's not clear how to fix. Was it planned to use class field with the same > type and name? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4291) In JDBCEmitter local var dateFormats shadows class field with the same name
[ https://issues.apache.org/jira/browse/TIKA-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871083#comment-17871083 ] Tilman Hausherr commented on TIKA-4291: --- This was done in TIKA-3916 / TIKA-3930. I think that code was moved to the top in the constructor and then it was forgotten to delete it. Also, the left index was always the same. ping [~tallison] > In JDBCEmitter local var dateFormats shadows class field with the same name > --- > > Key: TIKA-4291 > URL: https://issues.apache.org/jira/browse/TIKA-4291 > Project: Tika > Issue Type: Bug > Components: tika-pipes >Reporter: Dmitrii Kriukov >Priority: Major > > Line 338 of JDBCEmitter > Local variable dateFormats is created, populated with values, but never used > in its scope. > It's not clear how to fix. Was it planned to use class field with the same > type and name? -- This message was sent by Atlassian Jira (v8.20.10#820010)
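Note on the inspection finding above: a local variable with the same name as a field hides the field for the rest of the method, so values added to the local never reach the field. The following is a generic sketch of that pattern. It is not the JDBCEmitter source, and the class, methods and the format string are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a local variable shadowing a field.
final class ShadowingSketch {

    private final List<String> dateFormats = new ArrayList<>();

    // The local declaration hides the field: the list is filled and then
    // discarded when the method returns, leaving the field untouched.
    void loadShadowed() {
        List<String> dateFormats = new ArrayList<>();
        dateFormats.add("yyyy-MM-dd");
    }

    // Without the local declaration the same statement targets the field.
    void loadIntoField() {
        dateFormats.add("yyyy-MM-dd");
    }

    int fieldSize() {
        return dateFormats.size();
    }

    public static void main(String[] args) {
        ShadowingSketch sketch = new ShadowingSketch();
        sketch.loadShadowed();
        System.out.println(sketch.fieldSize());
        sketch.loadIntoField();
        System.out.println(sketch.fieldSize());
    }
}
```

According to the comment, the real code was apparently a leftover after the initialization had moved to the constructor, so there the remedy would be deleting the dead local rather than redirecting it to the field.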
[jira] [Resolved] (TIKA-4292) Mismatched type in contains() calls in OneNoteTreeWalker
[ https://issues.apache.org/jira/browse/TIKA-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4292. --- Fix Version/s: 3.0.0 Assignee: Tilman Hausherr Resolution: Fixed Thank you, fixed. > Mismatched type in contains() calls in OneNoteTreeWalker > > > Key: TIKA-4292 > URL: https://issues.apache.org/jira/browse/TIKA-4292 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 3.0.0-BETA >Reporter: Dmitrii Kriukov >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0 > > > lines 472 499 - Set can't contain instances of > OneNotePropertyId -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4292) Mismatched type in contains() calls in OneNoteTreeWalker
[ https://issues.apache.org/jira/browse/TIKA-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4292: -- Affects Version/s: 3.0.0-BETA > Mismatched type in contains() calls in OneNoteTreeWalker > > > Key: TIKA-4292 > URL: https://issues.apache.org/jira/browse/TIKA-4292 > Project: Tika > Issue Type: Bug >Affects Versions: 3.0.0-BETA >Reporter: Dmitrii Kriukov >Priority: Major > > lines 472 499 - Set can't contain instances of > OneNotePropertyId -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4292) Mismatched type in contains() calls in OneNoteTreeWalker
[ https://issues.apache.org/jira/browse/TIKA-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4292: -- Component/s: parser > Mismatched type in contains() calls in OneNoteTreeWalker > > > Key: TIKA-4292 > URL: https://issues.apache.org/jira/browse/TIKA-4292 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 3.0.0-BETA >Reporter: Dmitrii Kriukov >Priority: Major > > lines 472 499 - Set can't contain instances of > OneNotePropertyId -- This message was sent by Atlassian Jira (v8.20.10#820010)
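Note on the "mismatched type in contains()" finding above: Collection.contains takes an Object, so a lookup with the wrong type compiles and then quietly returns false. The element type of the set was lost from the description above, so the sketch below simply assumes a Set of Long and a hypothetical wrapper class; it is not the OneNoteTreeWalker code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; the real element type is not shown in the ticket text.
final class MismatchedContainsSketch {

    // Hypothetical stand-in for a property-id wrapper object.
    static final class PropertyId {
        final long id;

        PropertyId(long id) {
            this.id = id;
        }
    }

    // Compiles because contains(Object) accepts anything, but a Set<Long>
    // can never hold a PropertyId, so this returns false for every input.
    static boolean lookupWithWrapper(Set<Long> seen, PropertyId propertyId) {
        return seen.contains(propertyId);
    }

    // Looks up the value of the element type instead (autoboxed to Long).
    static boolean lookupWithValue(Set<Long> seen, PropertyId propertyId) {
        return seen.contains(propertyId.id);
    }

    public static void main(String[] args) {
        Set<Long> seen = new HashSet<>();
        seen.add(5L);
        PropertyId five = new PropertyId(5L);
        System.out.println(lookupWithWrapper(seen, five));
        System.out.println(lookupWithValue(seen, five));
    }
}
```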
[jira] [Resolved] (TIKA-4293) Mismatched type in contains() calls in StreamingDetectContext
[ https://issues.apache.org/jira/browse/TIKA-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4293. --- Assignee: Tilman Hausherr Resolution: Fixed Thanks, fixed. Ignore the Hudson entry. The CI is very unstable currently. > Mismatched type in contains() calls in StreamingDetectContext > - > > Key: TIKA-4293 > URL: https://issues.apache.org/jira/browse/TIKA-4293 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Dmitrii Kriukov > Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > line 80 > Map may not contain keys of type Class -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4293) Mismatched type in contains() calls in StreamingDetectContext
[ https://issues.apache.org/jira/browse/TIKA-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4293: -- Affects Version/s: 2.9.2 > Mismatched type in contains() calls in StreamingDetectContext > - > > Key: TIKA-4293 > URL: https://issues.apache.org/jira/browse/TIKA-4293 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Dmitrii Kriukov >Priority: Major > > line 80 > Map may not contain keys of type Class -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4293) Mismatched type in contains() calls in StreamingDetectContext
[ https://issues.apache.org/jira/browse/TIKA-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4293: -- Fix Version/s: 3.0.0 2.9.3 > Mismatched type in contains() calls in StreamingDetectContext > - > > Key: TIKA-4293 > URL: https://issues.apache.org/jira/browse/TIKA-4293 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Dmitrii Kriukov >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > line 80 > Map may not contain keys of type Class -- This message was sent by Atlassian Jira (v8.20.10#820010)
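Note on the Map variant of this finding: Map.containsKey and Map.get also take an Object, so passing a Class where the map is keyed by something else compiles and never matches. The key type was lost from the description above, so the sketch below assumes a map keyed by class name. All names are hypothetical and this is not the StreamingDetectContext code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; the real key type is not shown in the ticket text.
final class MismatchedMapKeySketch {

    // Compiles, but a Class object never equals a String key.
    static boolean hasByClass(Map<String, Object> registry, Class<?> type) {
        return registry.containsKey(type);
    }

    // Converts the argument to the key type before the lookup.
    static boolean hasByName(Map<String, Object> registry, Class<?> type) {
        return registry.containsKey(type.getName());
    }

    public static void main(String[] args) {
        Map<String, Object> registry = new HashMap<>();
        registry.put(String.class.getName(), "some value");
        System.out.println(hasByClass(registry, String.class));
        System.out.println(hasByName(registry, String.class));
    }
}
```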
[jira] [Updated] (TIKA-4290) Fix code inspection anomalies
[ https://issues.apache.org/jira/browse/TIKA-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4290: -- Fix Version/s: 3.0.0 2.9.3 > Fix code inspection anomalies > - > > Key: TIKA-4290 > URL: https://issues.apache.org/jira/browse/TIKA-4290 > Project: Tika > Issue Type: Bug >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4280) Tasks for the 3.0.0 release
[ https://issues.apache.org/jira/browse/TIKA-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4280: -- Description: I'm too lazy to open separate tickets. Please do so if desired. Some items: * Before releasing the real 3.0.0 we need to remove any "-M" dependencies * Decide about the ffmpeg issue and the hdf5 issue * Run the regression tests vs 2.9.x * Convert tika-grpc to use the dependency plugin instead of the shade plugin * Turn javadocs back on. I got errors during the deploy process because javadoc needed the auto-generated code ("cannot find symbol DeleteFetcherRequest"). We need to enable javadocs for the rest of the project. * TIKA-4290 Tilman question Other things? Thank you [~tilman] for the first two! was: I'm too lazy to open separate tickets. Please do so if desired. Some items: * Before releasing the real 3.0.0 we need to remove any "-M" dependencies * Decide about the ffmpeg issue and the hdf5 issue * Run the regression tests vs 2.9.x * Convert tika-grpc to use the dependency plugin instead of the shade plugin * Turn javadocs back on. I got errors during the deploy process because javadoc needed the auto-generated code ("cannot find symbol DeleteFetcherRequest"). We need to enable javadocs for the rest of the project. Other things? Thank you [~tilman] for the first two! > Tasks for the 3.0.0 release > --- > > Key: TIKA-4280 > URL: https://issues.apache.org/jira/browse/TIKA-4280 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > I'm too lazy to open separate tickets. Please do so if desired. > Some items: > * Before releasing the real 3.0.0 we need to remove any "-M" dependencies > * Decide about the ffmpeg issue and the hdf5 issue > * Run the regression tests vs 2.9.x > * Convert tika-grpc to use the dependency plugin instead of the shade plugin > * Turn javadocs back on. 
I got errors during the deploy process because > javadoc needed the auto-generated code ("cannot find symbol > DeleteFetcherRequest"). We need to enable javadocs for the rest of the > project. > * TIKA-4290 Tilman question > Other things? Thank you [~tilman] for the first two! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reopened TIKA-4252: --- > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870807#comment-17870807 ] Tilman Hausherr commented on TIKA-4252: --- Please have a look at PR# 1872. Even with the proposed correction of {code} Class superClazz = clazz.equals(superClassName) ? clazz : Class.forName(superClassName); {code} to {code} Class superClazz = clazz.toString().equals(superClassName) ? clazz : Class.forName(superClassName); {code} superClazz would always be assigned the same value regardless how the alternative works out. Also, {{clazzName}} from a few lines above is unused. I wonder if something completely different was intended. > PipesClient#process - seems to lose the Fetch input metadata? > - > > Key: TIKA-4252 > URL: https://issues.apache.org/jira/browse/TIKA-4252 > Project: Tika > Issue Type: Bug >Reporter: Nicholas DiPiazza >Priority: Major > Fix For: 3.0.0 > > > when calling: > PipesResult pipesResult = pipesClient.process(new > FetchEmitTuple(request.getFetchKey(), > new FetchKey(fetcher.getName(), request.getFetchKey()), > new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG, > FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP)); > the tikaMetadata is not present in the fetch data when the fetch method is > called. > > It's OK through this part: > UnsynchronizedByteArrayOutputStream bos = > UnsynchronizedByteArrayOutputStream.builder().get(); > try (ObjectOutputStream objectOutputStream = new > ObjectOutputStream(bos)) > { objectOutputStream.writeObject(t); } > byte[] bytes = bos.toByteArray(); > output.write(CALL.getByte()); > output.writeInt(bytes.length); > output.write(bytes); > output.flush(); > > i verified the bytes have the expected metadata from that point. > > UPDATE: found issue > > org.apache.tika.pipes.PipesServer#parseFromTuple > > is using a new Metadata when it should only use empty metadata if fetch tuple > metadata is null. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
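Note on the two comparisons quoted in the comment above: a Class object is never equal to a String, and Class.toString() prefixes the name with "class " (or "interface "), so neither form can match a plain class name. Class.getName() is the call that returns the bare name. A small sketch with hypothetical method names, using only JDK classes:

```java
// Hypothetical illustration of comparing a Class with a class name.
final class ClassNameComparisonSketch {

    // Object.equals with a String argument: different types, always false.
    static boolean equalsOnClass(Class<?> clazz, String name) {
        return clazz.equals(name);
    }

    // toString() yields e.g. "class java.lang.String", so this is false too.
    static boolean equalsOnToString(Class<?> clazz, String name) {
        return clazz.toString().equals(name);
    }

    // getName() yields "java.lang.String", which is what the name holds.
    static boolean equalsOnGetName(Class<?> clazz, String name) {
        return clazz.getName().equals(name);
    }

    public static void main(String[] args) {
        String name = "java.lang.String";
        System.out.println(equalsOnClass(String.class, name));
        System.out.println(equalsOnToString(String.class, name));
        System.out.println(equalsOnGetName(String.class, name));
    }
}
```

As the comment points out, this only affects which branch of the ternary runs; the value assigned to superClazz is the same class either way.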
[jira] [Created] (TIKA-4290) Fix code inspection anomalies
Tilman Hausherr created TIKA-4290: - Summary: Fix code inspection anomalies Key: TIKA-4290 URL: https://issues.apache.org/jira/browse/TIKA-4290 Project: Tika Issue Type: Bug Affects Versions: 2.9.2 Reporter: Tilman Hausherr -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4283: -- Component/s: core parser > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature > Components: core, parser >Affects Versions: 2.9.2 >Reporter: Lonzak >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4283: -- Affects Version/s: 2.9.2 > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature >Affects Versions: 2.9.2 >Reporter: Lonzak >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4283. --- Assignee: Tilman Hausherr Resolution: Fixed Done, it's now in 2.* as well, thanks. > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature > Components: core, parser >Affects Versions: 2.9.2 >Reporter: Lonzak >Assignee: Tilman Hausherr >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4283) Add detection for JKS Keystore
[ https://issues.apache.org/jira/browse/TIKA-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4283: -- Fix Version/s: 3.0.0 > Add detection for JKS Keystore > -- > > Key: TIKA-4283 > URL: https://issues.apache.org/jira/browse/TIKA-4283 > Project: Tika > Issue Type: New Feature >Reporter: Lonzak >Priority: Major > Fix For: 3.0.0, 2.9.3 > > > I added detection for java keystores JKS. It is based on the magic byte. > > Some additional infos: > [https://en.wikipedia.org/wiki/Java_KeyStore] > The magic bytes are described here: > [https://en.wikipedia.org/wiki/List_of_file_signatures] > > A proprietary keystore implementation provided by SUN. > [https://docs.oracle.com/javase/8/docs/technotes/guides/security/crypto/CryptoSpec.html#KeystoreImplementation] > > If possible this should be added to 2.9.X Branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
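The TIKA-4283 messages above say the JKS detection is based on the magic bytes. As a minimal sketch of that idea (not the actual Tika change, which is not shown in these messages), the check below compares the first four bytes of a file header against 0xFEEDFEED, the JKS signature given in the list of file signatures linked from the issue. The class name {{JksMagic}} and the method {{looksLikeJks}} are hypothetical.

```java
// Hypothetical sketch of a magic-byte check for JKS keystores; not Tika's implementation.
final class JksMagic {

    // JKS files start with the four bytes FE ED FE ED.
    private static final byte[] MAGIC = {(byte) 0xFE, (byte) 0xED, (byte) 0xFE, (byte) 0xED};

    // Returns true if the given header starts with the JKS magic bytes.
    static boolean looksLikeJks(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] jksHeader = {(byte) 0xFE, (byte) 0xED, (byte) 0xFE, (byte) 0xED, 0, 0, 0, 2};
        byte[] other = {(byte) 0xCE, (byte) 0xCE, (byte) 0xCE, (byte) 0xCE};
        System.out.println(looksLikeJks(jksHeader));
        System.out.println(looksLikeJks(other));
    }
}
```

A magic-byte match only suggests the type; it does not validate that the keystore is well formed.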
[jira] [Commented] (TIKA-4285) Invalid Link for changelog CHANGES.txt files
[ https://issues.apache.org/jira/browse/TIKA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867684#comment-17867684 ] Tilman Hausherr commented on TIKA-4285: --- Additionally: the 3.0.0-BETA2 link works, however the text mentions "Tika 2.9.2". > Invalid Link for changelog CHANGES.txt files > > > Key: TIKA-4285 > URL: https://issues.apache.org/jira/browse/TIKA-4285 > Project: Tika > Issue Type: Task >Affects Versions: 2.9.0, 2.9.1, 2.9.2 >Reporter: Lonzak >Priority: Major > > On the tika [start page|https://tika.apache.org/] the linked change log files > CHANGES.txt starting with version 2.9.0 are missing/broken. > > Working: > https://archive.apache.org/dist/tika/2.8.0/CHANGES-2.8.0.txt > Not working (the "release/" path segment is struck through in red in the original report): > https://archive.apache.org/dist/release/tika/2.9.0/CHANGES-2.9.0.txt > https://archive.apache.org/dist/release/tika/2.9.1/CHANGES-2.9.1.txt > https://archive.apache.org/dist/release/tika/2.9.2/CHANGES-2.9.2.txt -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
[ https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4284. - Resolution: Invalid > [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and > strudl.0.3.13 > --- > > Key: TIKA-4284 > URL: https://issues.apache.org/jira/browse/TIKA-4284 > Project: Tika > Issue Type: Bug >Reporter: Abhijit Rajwade >Priority: Major > Labels: SECURITY > > CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13 > Description : > Severity :CVE CVSS 3: 7.5Sonatype CVSS 3: 7.5 > Weakness :Sonatype CWE: 400 > Source : National Vulnerability Database > Categories : Data > Description from CVE :An issue was discovered in the stripTags and > unescapeHTML components in Prototype 1.7.3 where an attacker can cause a > Regular Expression Denial of Servicethrough stripping crafted HTML tags. > Explanation : The prototype package is vulnerable to Regular Expression > Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js > file used to unescape HTML fails to efficiently parse and remove tags within > a given string. An attacker can exploit this vulnerability by submitting a > crafted code block which, when parsed by the affected function, will exhaust > system resources and trigger a DoS condition. > Detection : The application is vulnerable by using this component. > Recommendation : There is no non-vulnerable upgrade path for this > component/package. We recommend investigating alternative components or a > potential mitigating control. > Root Cause : activemq-osgi-5.17.6.jarorg/apache/activemq/web/prototype.js : > [ , ] > Advisories : Attack: https://github.com/AlyxRen/prototype.node.js > CVSS Details :CVE CVSS 3: 7.5CVSS Vector: > CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H > CVE : CVE-2020-27511 > URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511 > Remediation : This component does not have any non-vulnerable Version. 
Please > contact the vendor to get this vulnerability fixed. > === > Description : > Severity :CVE CVSS 3: 7.5Sonatype CVSS 3: 7.5 > Weakness :Sonatype CWE: 400 > Source : National Vulnerability Database > Categories : Data > Description from CVE :An issue was discovered in the stripTags and > unescapeHTML components in Prototype 1.7.3 where an attacker can cause a > Regular Expression Denial of Servicethrough stripping crafted HTML tags. > Explanation : The prototype package is vulnerable to Regular Expression > Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js > file used to unescape HTML fails to efficiently parse and remove tags within > a given string. An attacker can exploit this vulnerability by submitting a > crafted code block which, when parsed by the affected function, will exhaust > system resources and trigger a DoS condition. > Detection : The application is vulnerable by using this component. > Recommendation : There is no non-vulnerable upgrade path for this > component/package. We recommend investigating alternative components or a > potential mitigating control. > Root Cause : strudl.0.3.13 : [ , ] > Advisories : Attack: https://github.com/AlyxRen/prototype.node.js > CVSS Details :CVE CVSS 3: 7.5CVSS Vector: > CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H > CVE : CVE-2020-27511 > URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511 > Remediation : This component does not have any non-vulnerable Version. Please > contact the vendor to get this vulnerability fixed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4284) [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13
[ https://issues.apache.org/jira/browse/TIKA-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867236#comment-17867236 ] Tilman Hausherr commented on TIKA-4284: --- How is this related to Tika? What subproject uses activemq-osgi-5.17.6 and strudl.0.3.13? > [Security] CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and > strudl.0.3.13 > --- > > Key: TIKA-4284 > URL: https://issues.apache.org/jira/browse/TIKA-4284 > Project: Tika > Issue Type: Bug >Reporter: Abhijit Rajwade >Priority: Major > Labels: SECURITY > > CVE-2020-27511 fix needed for activemq-osgi-5.17.6 and strudl.0.3.13 > Description : > Severity :CVE CVSS 3: 7.5Sonatype CVSS 3: 7.5 > Weakness :Sonatype CWE: 400 > Source : National Vulnerability Database > Categories : Data > Description from CVE :An issue was discovered in the stripTags and > unescapeHTML components in Prototype 1.7.3 where an attacker can cause a > Regular Expression Denial of Servicethrough stripping crafted HTML tags. > Explanation : The prototype package is vulnerable to Regular Expression > Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js > file used to unescape HTML fails to efficiently parse and remove tags within > a given string. An attacker can exploit this vulnerability by submitting a > crafted code block which, when parsed by the affected function, will exhaust > system resources and trigger a DoS condition. > Detection : The application is vulnerable by using this component. > Recommendation : There is no non-vulnerable upgrade path for this > component/package. We recommend investigating alternative components or a > potential mitigating control. 
> Root Cause : activemq-osgi-5.17.6.jarorg/apache/activemq/web/prototype.js : > [ , ] > Advisories : Attack: https://github.com/AlyxRen/prototype.node.js > CVSS Details :CVE CVSS 3: 7.5CVSS Vector: > CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H > CVE : CVE-2020-27511 > URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511 > Remediation : This component does not have any non-vulnerable Version. Please > contact the vendor to get this vulnerability fixed. > === > Description : > Severity :CVE CVSS 3: 7.5Sonatype CVSS 3: 7.5 > Weakness :Sonatype CWE: 400 > Source : National Vulnerability Database > Categories : Data > Description from CVE :An issue was discovered in the stripTags and > unescapeHTML components in Prototype 1.7.3 where an attacker can cause a > Regular Expression Denial of Servicethrough stripping crafted HTML tags. > Explanation : The prototype package is vulnerable to Regular Expression > Denial of Service [ReDoS] attacks. The stripTags[] function in the String.js > file used to unescape HTML fails to efficiently parse and remove tags within > a given string. An attacker can exploit this vulnerability by submitting a > crafted code block which, when parsed by the affected function, will exhaust > system resources and trigger a DoS condition. > Detection : The application is vulnerable by using this component. > Recommendation : There is no non-vulnerable upgrade path for this > component/package. We recommend investigating alternative components or a > potential mitigating control. > Root Cause : strudl.0.3.13 : [ , ] > Advisories : Attack: https://github.com/AlyxRen/prototype.node.js > CVSS Details :CVE CVSS 3: 7.5CVSS Vector: > CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H > CVE : CVE-2020-27511 > URL : http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27511 > Remediation : This component does not have any non-vulnerable Version. Please > contact the vendor to get this vulnerability fixed. 
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Description: The latest h2 version (which needs jdk11) brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 was: The latest h2 (which needs jdk11) version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > The latest h2 version (which needs jdk11) brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4282. --- Resolution: Fixed > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4282) Syntax error with h2 version 2.3.230
Tilman Hausherr created TIKA-4282: - Summary: Syntax error with h2 version 2.3.230 Key: TIKA-4282 URL: https://issues.apache.org/jira/browse/TIKA-4282 Project: Tika Issue Type: Bug Components: tika-eval Affects Versions: 3.0.0-BETA Reporter: Tilman Hausherr Assignee: Tilman Hausherr Fix For: 3.0.0 The latest h2 version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Description: The latest h2 (which needs jdk11) version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 was: The latest h2 version brings a syntax error because of an unneeded comma in one SQL query. release notes: https://github.com/h2database/h2database/releases/tag/version-2.3.230 likely this: https://github.com/h2database/h2database/issues/3106 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Affects Version/s: 2.9.2 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-4282) Syntax error with h2 version 2.3.230
[ https://issues.apache.org/jira/browse/TIKA-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4282: -- Fix Version/s: 2.9.3 > Syntax error with h2 version 2.3.230 > > > Key: TIKA-4282 > URL: https://issues.apache.org/jira/browse/TIKA-4282 > Project: Tika > Issue Type: Bug > Components: tika-eval >Affects Versions: 3.0.0-BETA, 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 3.0.0, 2.9.3 > > > The latest h2 (which needs jdk11) version brings a syntax error because of an > unneeded comma in one SQL query. > release notes: > https://github.com/h2database/h2database/releases/tag/version-2.3.230 > likely this: > https://github.com/h2database/h2database/issues/3106 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-1155) Number Format is converted with an error
[ https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-1155. - Resolution: Cannot Reproduce Closing because it can no longer be reproduced, it has probably been fixed either by us or in POI. Please comment and/or reopen if you disagree. > Number Format is converted with an error > > > Key: TIKA-1155 > URL: https://issues.apache.org/jira/browse/TIKA-1155 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Evgeniy Buyanov >Priority: Major > Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml > > Original Estimate: 2h > Remaining Estimate: 2h > > {code:Title=Source data} > ><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* > "-"\ _B_F_-;_-@_-"/> > > 10 > -10 > {code} > java -jar tika-app-1.4.jar test.xlsx > test.xml > {code:Title=Result} > * 10 _F > -10 _F > {code} > related ASF Bugzilla – Bug > [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-1155) Number Format is converted with an error
[ https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866408#comment-17866408 ] Tilman Hausherr commented on TIKA-1155: --- Current output: {code:xml} Sheet1 10 - 10 - text {code} Looks like this on the screen: !screenshot-1.png! > Number Format is converted with an error > > > Key: TIKA-1155 > URL: https://issues.apache.org/jira/browse/TIKA-1155 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Evgeniy Buyanov >Priority: Major > Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml > > Original Estimate: 2h > Remaining Estimate: 2h > > {code:Title=Source data} > ><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* > "-"\ _B_F_-;_-@_-"/> > > 10 > -10 > {code} > java -jar tika-app-1.4.jar test.xlsx > test.xml > {code:Title=Result} > * 10 _F > -10 _F > {code} > related ASF Bugzilla – Bug > [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-1155) Number Format is converted with an error
[ https://issues.apache.org/jira/browse/TIKA-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1155: -- Attachment: screenshot-1.png > Number Format is converted with an error > > > Key: TIKA-1155 > URL: https://issues.apache.org/jira/browse/TIKA-1155 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Evgeniy Buyanov >Priority: Major > Attachments: screenshot-1.png, test-Excel.csv, test.xlsx, test.xml > > Original Estimate: 2h > Remaining Estimate: 2h > > {code:Title=Source data} > ><NumberFormat ss:Format="_-* #,##0\ _B_F_-;\-* #,##0\ _B_F_-;_-* > "-"\ _B_F_-;_-@_-"/> > > 10 > -10 > {code} > java -jar tika-app-1.4.jar test.xlsx > test.xml > {code:Title=Result} > * 10 _F > -10 _F > {code} > related ASF Bugzilla – Bug > [52592|https://issues.apache.org/bugzilla/show_bug.cgi?id=52592] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-3028) Failed test at SAS7BDATParserTest:112
[ https://issues.apache.org/jira/browse/TIKA-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-3028. - Resolution: Cannot Reproduce Closing for now because of no activity for years, please reopen if it still happens. I remember I had several problems in my early months as a committer with a German locale, and we did some fixes in the code and some configuration changes in my IDE. > Failed test at SAS7BDATParserTest:112 > - > > Key: TIKA-3028 > URL: https://issues.apache.org/jira/browse/TIKA-3028 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.23 >Reporter: Wknds >Priority: Blocker > Attachments: Bildschirmfoto 2020-01-24 um 23.12.20.png > > > Test fails at > SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107. > Expected date is _01Jan1960:00:00_ > while the dates in the (untouched) test file are abbreviated with a '.' on my > system (please refer to the terminal output below).
> {code:java}
> // code placeholder
> [ERROR] Failures:
> [ERROR] SAS7BDATParserTest.testMultiColumns:112->TikaTest.assertContains:107 01Jan1960:00:00 not found in:
> TESTING
> Record Number  Square of the Record Number  Description of the Row  Percent Done  Percent Increment  date        datetime                time
> 0              0                            This is row 0 of 10     0%                               01-01-1960  01Jan.1960:00:00:01.00  00:00:01
> 1              1                            This is row 1 of 10     10%           0.0%               02-01-1960  01Jan.1960:00:00:10.00  00:00:03
> 2              4                            This is row 2 of 10     20%           50.0%              17-01-1960  01Jan.1960:00:01:40.00  00:00:09
> 3              9                            This is row 3 of 10     30%           66.7%              22-03-1960  01Jan.1960:00:16:40.00  00:00:27
> 4              16                           This is row 4 of 10     40%           75.0%              13-09-1960  01Jan.1960:02:46:40.00  00:01:21
> 5              25                           This is row 5 of 10     50%           80.0%              17-09-1961  02Jan.1960:03:46:40.00  00:04:03
> 6              36                           This is row 6 of 10     60%           83.3%              20-07-1963  12Jan.1960:13:46:40.00  00:12:09
> 7              49                           This is row 7 of 10     70%           85.7%              29-07-1966  25Apr.1960:17:46:40.00  00:36:27
> 8              64                           This is row 8 of 10     80%           87.5%              20-03-1971  03März1963:09:46:40.00  01:49:21
> 9              81                           This is row 9 of 10     90%           88.9%              18-12-1977  09Sep.1991:01:46:40.00  05:28:03
> 10             100                          This is row 10 of 10    100%          90.0%              19-05-1987  19Nov.2276:17:46:40.00  16:24:09
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
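The TIKA-3028 failure above comes down to month abbreviations that depend on the default locale. The sketch below is not the SAS7BDAT parser's code; it only shows, under that assumption, how passing an explicit locale to a formatter makes the expected string stable. The class name {{LocaleSafeDates}} and the pattern are hypothetical, chosen to resemble the expected value _01Jan1960:00:00_.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch: month abbreviations follow the locale given to the formatter.
final class LocaleSafeDates {

    // Formats with an explicit locale instead of relying on the JVM default.
    static String format(LocalDateTime value, Locale locale) {
        return DateTimeFormatter.ofPattern("ddMMMyyyy:HH:mm", locale).format(value);
    }

    public static void main(String[] args) {
        LocalDateTime epoch = LocalDateTime.of(1960, 1, 1, 0, 0);
        // Stable regardless of the machine's default locale.
        System.out.println(format(epoch, Locale.ENGLISH));
        // The German abbreviation depends on the JDK's locale data and may include a '.'.
        System.out.println(format(epoch, Locale.GERMAN));
    }
}
```

Tests that assert on formatted dates are easier to keep portable when the locale is fixed in the code under test or in the test setup rather than inherited from the developer's machine.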
[jira] [Updated] (TIKA-3290) Extension reading it as eml instead of txt
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3290: -- Fix Version/s: (was: 1.24.1) > Extension reading it as eml instead of txt > -- > > Key: TIKA-3290 > URL: https://issues.apache.org/jira/browse/TIKA-3290 > Project: Tika > Issue Type: Bug > Components: core, mime >Affects Versions: 1.25 >Reporter: Tika User >Priority: Major > Labels: tika-parsers > Attachments: image-2021-02-22-10-13-08-447.png, > image-2021-02-23-12-39-00-778.png, test_sample_message.txt > > > The attached file extension is reading it as eml instead of txt. With version > 1.24.1 it is reading it as txt and now with the upgrade to 1.25, it is > reading it as eml. So that while parsing we are getting mail corrupted error. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (TIKA-3172) PDF Parser configuration enable auto space using tika config file
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-3172. --- Fix Version/s: 1.25 Assignee: Tim Allison Resolution: Fixed > PDF Parser configuration enable auto space using tika config file > - > > Key: TIKA-3172 > URL: https://issues.apache.org/jira/browse/TIKA-3172 > Project: Tika > Issue Type: Wish > Components: parser >Affects Versions: 1.24.1 >Reporter: Akash >Assignee: Tim Allison >Priority: Major > Fix For: 1.25 > > > Need information on how to set enableAutoSpace using tika config file. > {code:java} > / > > > > > > > false > > > > / > {code} > Above configuration is not working. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (TIKA-3155) Parse Error while extracting CSV files
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-3155. - Resolution: Duplicate Closing as duplicate of TIKA-4278. This isn't a CSV file by the improved logic. > Parse Error while extracting CSV files > -- > > Key: TIKA-3155 > URL: https://issues.apache.org/jira/browse/TIKA-3155 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.24.1 >Reporter: Akash >Priority: Major > Attachments: UTF-8_chars.csv > > > We are getting parse error while trying to extract csv files. > This was working in version 1.9, but exception coming in 1.24.1 > > {code:java} > /Exception in thread "main" org.apache.tika.exception.TikaException: > exception parsing the csv > at > org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:198 > undefined) > at > org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 > undefined) > at > org.apache.tikar.CompositeParser.parse.parse(CompositeParser.java:280 > undefined) > at > org.apache.tikar.AutoDetectParser.parse.parse(AutoDetectParser.java:143 > undefined) > at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209 > undefined) > at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496 undefined) > at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149 undefined) > Caused by: java.lang.IllegalStateException: IOException reading next record: > java.io.IOException: (startline 39) EOF reached before encapsulated token > finished > at > org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145 > undefined) > at > org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:155 > undefined) > at > org.apache.tikar.csv.TextAndCSVParser.parse.parse(TextAndCSVParser.java:178 > undefined) > ... 
6 more > Caused by: java.io.IOException: (startline 39) EOF reached before > encapsulated token finished > at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288 > undefined) > at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158 undefined) > at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674 > undefined) > at > org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142 > undefined)/ > {code} > Issue is coming when we encounter double quotes in one of the cells. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866277#comment-17866277 ] Tilman Hausherr commented on TIKA-4278: --- If colon and another delimiter have been detected with the same confidence, use the other one. > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
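The tie-break rule from the TIKA-4278 comment above (if colon and another delimiter are detected with the same confidence, use the other one) can be sketched as follows. This is not the {{CSVSniffer}} code; the class {{DelimiterTieBreak}}, the method {{pick}} and the confidence map are hypothetical stand-ins for whatever the sniffer computes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the colon tie-break; not Tika's CSVSniffer.
final class DelimiterTieBreak {

    // Returns the delimiter with the highest confidence; on an exact tie
    // between ':' and any other delimiter, the other delimiter wins.
    static Character pick(Map<Character, Double> confidence) {
        Character best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Character, Double> entry : confidence.entrySet()) {
            double score = entry.getValue();
            boolean better = score > bestScore;
            boolean tieAgainstColon = score == bestScore
                    && best != null && best == ':' && entry.getKey() != ':';
            if (better || tieAgainstColon) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best; // null when the map is empty
    }

    public static void main(String[] args) {
        Map<Character, Double> scores = new LinkedHashMap<>();
        scores.put(':', 0.8);
        scores.put(';', 0.8);
        System.out.println(pick(scores));
    }
}
```

The colon still wins when its confidence is strictly higher, so the rule only demotes it in ambiguous cases such as timestamps appearing inside semicolon-separated rows.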
[jira] [Updated] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4278: -- Attachment: reports_csv_2.9.2_vs_2.9.3_4.tar.xz > TextAndCSVParser doesn't detect semicolon separated file > > > Key: TIKA-4278 > URL: https://issues.apache.org/jira/browse/TIKA-4278 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.2 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: csv, csvparser > Fix For: 3.0.0, 2.9.3 > > Attachments: reports_csv_2.9.2_vs_2.9.3.tar.xz, > reports_csv_2.9.2_vs_2.9.3_3.tar.xz, reports_csv_2.9.2_vs_2.9.3_4.tar.xz > > > I ran the code from the attached SO issue and yes it doesn't detect semicolon > separated files. The reason is this line in {{TextAndCSVParser.java}}: > {code:java} > private static final char[] DEFAULT_DELIMITERS = new char[]\{',', '\t'}; > {code} > This is later used by {{CSVSniffer}}. For some reason the other delimiters > (pipe, colon and semicolon) aren't in that array, although they are in > {{CHAR_TO_STRING_DELIMITER_MAP}}. I modified {{DEFAULT_DELIMITERS}} and now > it works for semicolon. > Can I change this by adding the missing delimiters or was there a reason that > I missed? Proposed change would change CSVSniffer so that delimiters is a set > and then pass {{CHAR_TO_STRING_DELIMITER_MAP.keySet()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866147#comment-17866147 ]
Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:40 PM:
I've now added a check: if the delimiter isn't in row zero, later hits don't count. This fixes the problem that too many files that are not CSV were recognized as CSV. Only one problem is left now: false colon-separated lines. I have never had any colon-separated files in decades, but a Google search does find some SO questions, so I'll leave the colon in for now. We can still change it after the "big" regression tests.
[jira] [Comment Edited] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866147#comment-17866147 ]
Tilman Hausherr edited comment on TIKA-4278 at 7/15/24 6:24 PM:
I've now added a check: if the delimiter isn't in row zero, later hits don't count. This fixes the problem that too many files that are not CSV were recognized as CSV. Only one problem is left now: colon-separated lines. I have never had any colon-separated files in decades, but a Google search does find some SO questions, so I'll leave the colon in for now. We can still change it after the "big" regression tests.
[jira] [Commented] (TIKA-4278) TextAndCSVParser doesn't detect semicolon separated file
[ https://issues.apache.org/jira/browse/TIKA-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866147#comment-17866147 ]
Tilman Hausherr commented on TIKA-4278:
I've now added a check: if the delimiter isn't in row zero, later hits don't count. This fixes the problem that too many files that are not CSV were recognized as CSV. Only one problem is left now: colon-separated lines. I have never had any colon-separated files in decades, but a Google search does find some SO questions, so I'll leave the colon in for now.
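The row-zero check described in this comment could look roughly like the sketch below. It is an illustration only, not the change made in Tika: {{RowZeroFilter}} and {{countHits}} are hypothetical names, the input is assumed to be already split into rows, and quoting and escaping are ignored for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not Tika's CSVSniffer. */
final class RowZeroFilter {

    private RowZeroFilter() {
    }

    /**
     * Counts occurrences of each candidate delimiter over all rows, but
     * only for delimiters that occur in the first row (row zero).
     * Delimiters missing from row zero get no entry in the result.
     */
    static Map<Character, Integer> countHits(List<String> rows, char[] delimiters) {
        Map<Character, Integer> hits = new LinkedHashMap<>();
        if (rows.isEmpty()) {
            return hits;
        }
        for (char delimiter : delimiters) {
            if (rows.get(0).indexOf(delimiter) < 0) {
                continue; // absent from row zero: later hits do not count
            }
            int total = 0;
            for (String row : rows) {
                for (int i = 0; i < row.length(); i++) {
                    if (row.charAt(i) == delimiter) {
                        total++;
                    }
                }
            }
            hits.put(delimiter, total);
        }
        return hits;
    }
}
```

For rows such as {{"a;b"}}, {{"c:d;e"}}, {{"f:g;h"}}, the colon never becomes a candidate because it does not occur in the first row, while the semicolon is counted in every row.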