[jira] [Commented] (TIKA-1675) please avoid xmlbeans dependency
[ https://issues.apache.org/jira/browse/TIKA-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617457#comment-14617457 ] Michael McCandless commented on TIKA-1675: -- bq. If the project is dead and not fixing packaging bugs like this, i think its irresponsible to depend on it. +1 Maybe POI could/should absorb the parts of xmlbeans it depends on? please avoid xmlbeans dependency Key: TIKA-1675 URL: https://issues.apache.org/jira/browse/TIKA-1675 Project: Tika Issue Type: Bug Reporter: Robert Muir This dependency (e.g jar file) is fundamentally broken... XMLBEANS-499 Is there an alternative that could be used? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1628) ExternalParser.check should return false if it hits SecurityException
[ https://issues.apache.org/jira/browse/TIKA-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1628. -- Resolution: Pending Closed Thanks [~gagravarr] and [~thetaphi] ExternalParser.check should return false if it hits SecurityException - Key: TIKA-1628 URL: https://issues.apache.org/jira/browse/TIKA-1628 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.9 Attachments: TIKA-1628.patch If you run Tika with a Java security manager that blocks execution of external processes, ExternalParser.check throws SecurityException, but I think it should just return false? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
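The behaviour requested in TIKA-1628 can be sketched as below. This is only an illustration, not the attached patch: the class name ExternalCheckSketch and the exact method shape are assumptions, and the real ExternalParser.check may take different arguments. The point is simply that a SecurityException raised while launching the process is treated like any other "tool not available" outcome.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not Tika's ExternalParser code. */
final class ExternalCheckSketch {

    /**
     * Returns true only if the command starts and exits with code 0.
     * Any failure to launch it, including a security manager that
     * forbids process execution, yields false instead of an exception.
     */
    static boolean check(String... command) {
        if (command.length == 0) {
            return false;
        }
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[1024];
                while (out.read(buffer) != -1) {
                    // discard
                }
            }
            return process.waitFor() == 0;
        } catch (IOException | SecurityException e) {
            // Command missing, not executable, or blocked by policy.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller can then skip the external tool quietly when check(...) returns false.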
[jira] [Commented] (TIKA-1544) empty lines are not preserved
[ https://issues.apache.org/jira/browse/TIKA-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309956#comment-14309956 ] Michael McCandless commented on TIKA-1544: -- bq. Michael McCandless, is the fix this simple? Hmm, maybe :) It's strange that we are calling endParagraph when inParagraph is false. Maybe we are missing a lazyStartParagraph somewhere? empty lines are not preserved - Key: TIKA-1544 URL: https://issues.apache.org/jira/browse/TIKA-1544 Project: Tika Issue Type: Bug Affects Versions: 1.6 Environment: Windows 8, Java 1.8 Reporter: mortee Priority: Minor Attachments: preserve_new_lines_in_rtf.patch, testRTFNewlines.rtf I'm trying to extract the text content from RTF documents. The files contain empty lines (two or more consecutive paragraph-end marks), on which the further processing relies to tell apart different parts of the text. But unfortunately Tika (with the --text switch) eliminates all those empty lines instead of preserving them.
[jira] [Commented] (TIKA-1544) empty lines are not preserved
[ https://issues.apache.org/jira/browse/TIKA-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310014#comment-14310014 ] Michael McCandless commented on TIKA-1544: -- bq. I have hesitation about changing the original logic, because I'm not sure why if(inParagraph) was added...maybe for properly closing formatting? I don't remember why... bq. would be more general and handle the formatting stuff properly? I think that may be safer? In case there was pending styling that needs to be closed ... I think if Tika's tests pass with that change you should commit! empty lines are not preserved - Key: TIKA-1544 URL: https://issues.apache.org/jira/browse/TIKA-1544 Project: Tika Issue Type: Bug Affects Versions: 1.6 Environment: Windows 8, Java 1.8 Reporter: mortee Priority: Minor Attachments: preserve_new_lines_in_rtf.patch, testRTFNewlines.rtf I'm trying to extract the text content from RTF documents. The files contain empty lines (two or more consecutive paragraph-end marks), on which the further processing relies to tell apart different parts of the text. But unfortunately Tika (with the --text switch) eliminates all those empty lines instead of preserving them.
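The idea discussed in the two TIKA-1544 comments (opening a paragraph lazily so that a paragraph-end mark always closes something) might look roughly like the sketch below. It borrows the names inParagraph, lazyStartParagraph and endParagraph from the discussion, but the class itself and its string output are assumptions for illustration; it is not the RTF parser's actual code or the attached patch.

```java
/** Hypothetical paragraph tracker for illustration; not Tika's TextExtractor. */
final class ParagraphTrackerSketch {
    private final StringBuilder out = new StringBuilder();
    private boolean inParagraph = false;

    /** Opens a paragraph only if none is currently open. */
    void lazyStartParagraph() {
        if (!inParagraph) {
            out.append("<p>");
            inParagraph = true;
        }
    }

    /** Text always lands inside a paragraph. */
    void characters(String text) {
        lazyStartParagraph();
        out.append(text);
    }

    /**
     * Closes the current paragraph. If no paragraph is open, an empty one
     * is opened first, so consecutive paragraph-end marks are preserved
     * as empty paragraphs instead of being dropped.
     */
    void endParagraph() {
        lazyStartParagraph();
        out.append("</p>");
        inParagraph = false;
    }

    String result() {
        return out.toString();
    }
}
```

With this shape, two consecutive endParagraph() calls produce an empty paragraph between the surrounding text paragraphs, which is what the reporter relies on.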
[jira] [Commented] (TIKA-1305) New list processing changes appear to be causing RTFParser exception
[ https://issues.apache.org/jira/browse/TIKA-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013647#comment-14013647 ] Michael McCandless commented on TIKA-1305: -- Net/net the RTF is corrupted right? But we want to make a best-effort to gloss over the corruption and still extract what we can? I think that makes sense. +1 for the simple solution, maybe w/ a comment explaining it's best effort when we see a corrupted doc? New list processing changes appear to be causing RTFParser exception Key: TIKA-1305 URL: https://issues.apache.org/jira/browse/TIKA-1305 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Environment: Mac OSX 10.7.5 Tika 1.6-SNAPSHOT Reporter: Chris Bamford Priority: Minor Labels: newbie Attachments: rtfparsererror_2.rtf Some RTFs cause RTFParser to throw a RuntimeException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@425e60f2 When tracing in the debugger (surfaces in CompositeParser.parse() where it catches the RuntimeException, line 244 in my copy), the exception (e) is: java.lang.ArrayIndexOutOfBoundsException: -1 A committer (Tim Allison) believes that it is being caused by recent list processing changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
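A best-effort guard of the kind suggested in the comment could be as small as the sketch below. The helper name, the String[] table and the fallback value are all assumptions made for illustration; the actual RTFParser data structures are not shown in this thread.

```java
/** Hypothetical bounds-checked lookup for illustration; not Tika's RTF list code. */
final class BestEffortLookupSketch {

    /**
     * Returns levels[index], or the fallback when the index is out of range.
     * Best effort: a corrupted document may reference a list level that was
     * never defined (for example index -1), and we prefer to keep extracting
     * text rather than fail with ArrayIndexOutOfBoundsException.
     */
    static String levelOrDefault(String[] levels, int index, String fallback) {
        if (index < 0 || index >= levels.length) {
            return fallback;
        }
        return levels[index];
    }
}
```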
[jira] [Resolved] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1078. -- Resolution: Fixed Thanks Stefano, I made one small change (added generics: HashSet<Character>) and committed. TikaCLI: invalid characters in embedded document name causes FNFE when trying to save - Key: TIKA-1078 URL: https://issues.apache.org/jira/browse/TIKA-1078 Project: Tika Issue Type: Bug Components: cli, parser Reporter: Michael McCandless Fix For: 1.5 Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078-2.patch, tika-1078.patch Attached document hits this on Windows:
{noformat}
C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails. On Linux it runs fine. I think somehow ... we have to sanitize the embedded file name ...
[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869205#comment-13869205 ] Michael McCandless commented on TIKA-1078: -- Thanks Stefano! Can you fix the license header on the two new files to match the current sources? Thanks. Also, we don't normally include @author tags. Maybe use a HashSet instead of an array for RESERVED, so it's not an O(N) lookup per character? Also, since you check for ' ', you shouldn't need any entries 0x20? Sometimes (rarely?), attachment filenames have their own sub-directories, and the code today will happily .mkdirs those subdirectories, but it looks like with this patch we now replace / and \ with their hex equivalents, instead? I think that's OK... TikaCLI: invalid characters in embedded document name causes FNFE when trying to save - Key: TIKA-1078 URL: https://issues.apache.org/jira/browse/TIKA-1078 Project: Tika Issue Type: Bug Components: cli, parser Reporter: Michael McCandless Fix For: 1.5 Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078.patch Attached document hits this on Windows:
{noformat}
C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails. On Linux it runs fine. I think somehow ... we have to sanitize the embedded file name ...
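The review points in this comment (a set lookup for RESERVED instead of scanning an array, and replacing / and \ with hex equivalents) can be sketched as follows. This is not the committed patch: the class name, the particular reserved-character list, and the %XX escape format are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical file name sanitizer for illustration; not the TIKA-1078 patch. */
final class FileNameSanitizerSketch {

    // Characters that Windows rejects in file names, plus '%' because this
    // sketch uses it as the escape marker. A HashSet gives O(1) lookups.
    private static final Set<Character> RESERVED = new HashSet<Character>(
            Arrays.asList('<', '>', ':', '"', '/', '\\', '|', '?', '*', '%'));

    /** Replaces control characters and reserved characters with %XX hex escapes. */
    static String sanitize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < ' ' || RESERVED.contains(c)) {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because the control-character range is handled by the c < ' ' comparison, the set only needs the printable reserved characters. Escaping the path separators means an embedded name with sub-directories is flattened into a single file name, which is the trade-off the comment accepts.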
[jira] [Commented] (TIKA-1211) OpenDocument (ODF) parser produces multiple startDocument() events
[ https://issues.apache.org/jira/browse/TIKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850567#comment-13850567 ] Michael McCandless commented on TIKA-1211: -- +1 to fix XHTMLContentHandler to allow only one startDocument event. OpenDocument (ODF) parser produces multiple startDocument() events -- Key: TIKA-1211 URL: https://issues.apache.org/jira/browse/TIKA-1211 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Uwe Schindler Related to SOLR-4809: Solr receives multiple startDocument events when parsing OpenDocumentFiles. The parser already prevents multiple endDocuments, but not multiple startDocuments. The bug was introduced when we added parsing content.xml and meta.xml (TIKA-736, but both feed elements to the XHTML output, so we get multiple start/endDocuments). -- This message was sent by Atlassian JIRA (v6.1.4#6159)
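One way to allow only a single startDocument event is a small SAX filter that remembers whether the event was already forwarded. The sketch below uses only the JDK's org.xml.sax classes; the class name is hypothetical and this is not the change made to XHTMLContentHandler, just an illustration of the guard.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical guard for illustration; not Tika's XHTMLContentHandler. */
final class SingleDocumentFilterSketch extends XMLFilterImpl {
    private boolean started = false;
    private boolean ended = false;

    SingleDocumentFilterSketch(ContentHandler delegate) {
        // XMLFilterImpl forwards every event to this handler.
        setContentHandler(delegate);
    }

    @Override
    public void startDocument() throws SAXException {
        // Forward only the first startDocument; later ones are swallowed.
        if (!started) {
            started = true;
            super.startDocument();
        }
    }

    @Override
    public void endDocument() throws SAXException {
        // Same guard for endDocument.
        if (!ended) {
            ended = true;
            super.endDocument();
        }
    }
}
```

All other SAX events pass through unchanged, so feeding both content.xml and meta.xml through the same filter would reach the downstream handler as one document.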
[jira] [Resolved] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1192. -- Resolution: Fixed Fix Version/s: 1.5 Thanks Dave, I just committed this. ArrayIndexOutOfBoundsException: 9 parsing RTF - Key: TIKA-1192 URL: https://issues.apache.org/jira/browse/TIKA-1192 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Dave Kincaid Assignee: Michael McCandless Labels: rtf Fix For: 1.5 Attachments: testRTFListOverride.rtf, tika-1192-test-case.patch, tika-1192.patch When trying to parse an RTF file I'm getting the following exception. I am not able to attach the file for privacy reasons: {noformat} java.lang.ArrayIndexOutOfBoundsException: 9 TextExtractor.java:872 org.apache.tika.parser.rtf.TextExtractor.processControlWord TextExtractor.java:566 org.apache.tika.parser.rtf.TextExtractor.parseControlWord TextExtractor.java:492 org.apache.tika.parser.rtf.TextExtractor.parseControlToken TextExtractor.java:459 org.apache.tika.parser.rtf.TextExtractor.extract TextExtractor.java:448 org.apache.tika.parser.rtf.TextExtractor.extract RTFParser.java:56 org.apache.tika.parser.rtf.RTFParser.parse (Unknown Source) sun.reflect.NativeMethodAccessorImpl.invoke0 NativeMethodAccessorImpl.java:57 sun.reflect.NativeMethodAccessorImpl.invoke DelegatingMethodAccessorImpl.java:43 sun.reflect.DelegatingMethodAccessorImpl.invoke Method.java:606 java.lang.reflect.Method.invoke Reflector.java:93 clojure.lang.Reflector.invokeMatchingMethod Reflector.java:28 clojure.lang.Reflector.invokeInstanceMethod tika_parser.clj:20 rtf-parser.tika-parser/parse form-init2921349737948661927.clj:1 rtf-parser.tika-parser/eval4200 Compiler.java:6619 clojure.lang.Compiler.eval Compiler.java:6582 clojure.lang.Compiler.eval core.clj:2852 clojure.core/eval main.clj:259 clojure.main/repl[fn] main.clj:259 clojure.main/repl[fn] main.clj:277 clojure.main/repl[fn] main.clj:277 clojure.main/repl RestFn.java:1096 
clojure.lang.RestFn.invoke interruptible_eval.clj:56 clojure.tools.nrepl.middleware.interruptible-eval/evaluate[fn] AFn.java:159 clojure.lang.AFn.applyToHelper AFn.java:151 clojure.lang.AFn.applyTo core.clj:617 clojure.core/apply core.clj:1788 clojure.core/with-bindings* RestFn.java:425 clojure.lang.RestFn.invoke interruptible_eval.clj:41 clojure.tools.nrepl.middleware.interruptible-eval/evaluate interruptible_eval.clj:171 clojure.tools.nrepl.middleware.interruptible-eval/interruptible-eval[fn] core.clj:2330 clojure.core/comp[fn] interruptible_eval.clj:138 clojure.tools.nrepl.middleware.interruptible-eval/run-next[fn] AFn.java:24 clojure.lang.AFn.run ThreadPoolExecutor.java:1145 java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java:615 java.util.concurrent.ThreadPoolExecutor$Worker.run Thread.java:724 java.lang.Thread.run {noformat} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1192: Assignee: Michael McCandless ArrayIndexOutOfBoundsException: 9 parsing RTF - Key: TIKA-1192 URL: https://issues.apache.org/jira/browse/TIKA-1192 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Dave Kincaid Assignee: Michael McCandless Labels: rtf Attachments: tika-1192.patch When trying to parse an RTF file I'm getting the following exception. I am not able to attach the file for privacy reasons: {noformat} java.lang.ArrayIndexOutOfBoundsException: 9 TextExtractor.java:872 org.apache.tika.parser.rtf.TextExtractor.processControlWord TextExtractor.java:566 org.apache.tika.parser.rtf.TextExtractor.parseControlWord TextExtractor.java:492 org.apache.tika.parser.rtf.TextExtractor.parseControlToken TextExtractor.java:459 org.apache.tika.parser.rtf.TextExtractor.extract TextExtractor.java:448 org.apache.tika.parser.rtf.TextExtractor.extract RTFParser.java:56 org.apache.tika.parser.rtf.RTFParser.parse (Unknown Source) sun.reflect.NativeMethodAccessorImpl.invoke0 NativeMethodAccessorImpl.java:57 sun.reflect.NativeMethodAccessorImpl.invoke DelegatingMethodAccessorImpl.java:43 sun.reflect.DelegatingMethodAccessorImpl.invoke Method.java:606 java.lang.reflect.Method.invoke Reflector.java:93 clojure.lang.Reflector.invokeMatchingMethod Reflector.java:28 clojure.lang.Reflector.invokeInstanceMethod tika_parser.clj:20 rtf-parser.tika-parser/parse form-init2921349737948661927.clj:1 rtf-parser.tika-parser/eval4200 Compiler.java:6619 clojure.lang.Compiler.eval Compiler.java:6582 clojure.lang.Compiler.eval core.clj:2852 clojure.core/eval main.clj:259 clojure.main/repl[fn] main.clj:259 clojure.main/repl[fn] main.clj:277 clojure.main/repl[fn] main.clj:277 clojure.main/repl RestFn.java:1096 clojure.lang.RestFn.invoke interruptible_eval.clj:56 
clojure.tools.nrepl.middleware.interruptible-eval/evaluate[fn] AFn.java:159 clojure.lang.AFn.applyToHelper AFn.java:151 clojure.lang.AFn.applyTo core.clj:617 clojure.core/apply core.clj:1788 clojure.core/with-bindings* RestFn.java:425 clojure.lang.RestFn.invoke interruptible_eval.clj:41 clojure.tools.nrepl.middleware.interruptible-eval/evaluate interruptible_eval.clj:171 clojure.tools.nrepl.middleware.interruptible-eval/interruptible-eval[fn] core.clj:2330 clojure.core/comp[fn] interruptible_eval.clj:138 clojure.tools.nrepl.middleware.interruptible-eval/run-next[fn] AFn.java:24 clojure.lang.AFn.run ThreadPoolExecutor.java:1145 java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java:615 java.util.concurrent.ThreadPoolExecutor$Worker.run Thread.java:724 java.lang.Thread.run {noformat} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817472#comment-13817472 ] Michael McCandless commented on TIKA-1192: -- bq. Yes, when that fragment is part of an RTF file it provokes the exception, so if you could put it into a valid RTF file it should throw the exception. Hmm, I've tried that (quickly) but so far cannot provoke it ... I'd really prefer to commit a test case along w/ this fix ... ArrayIndexOutOfBoundsException: 9 parsing RTF - Key: TIKA-1192 URL: https://issues.apache.org/jira/browse/TIKA-1192 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Dave Kincaid Assignee: Michael McCandless Labels: rtf Attachments: tika-1192.patch When trying to parse an RTF file I'm getting the following exception. I am not able to attach the file for privacy reasons: {noformat} java.lang.ArrayIndexOutOfBoundsException: 9 TextExtractor.java:872 org.apache.tika.parser.rtf.TextExtractor.processControlWord TextExtractor.java:566 org.apache.tika.parser.rtf.TextExtractor.parseControlWord TextExtractor.java:492 org.apache.tika.parser.rtf.TextExtractor.parseControlToken TextExtractor.java:459 org.apache.tika.parser.rtf.TextExtractor.extract TextExtractor.java:448 org.apache.tika.parser.rtf.TextExtractor.extract RTFParser.java:56 org.apache.tika.parser.rtf.RTFParser.parse (Unknown Source) sun.reflect.NativeMethodAccessorImpl.invoke0 NativeMethodAccessorImpl.java:57 sun.reflect.NativeMethodAccessorImpl.invoke DelegatingMethodAccessorImpl.java:43 sun.reflect.DelegatingMethodAccessorImpl.invoke Method.java:606 java.lang.reflect.Method.invoke Reflector.java:93 clojure.lang.Reflector.invokeMatchingMethod Reflector.java:28 clojure.lang.Reflector.invokeInstanceMethod tika_parser.clj:20 rtf-parser.tika-parser/parse form-init2921349737948661927.clj:1 rtf-parser.tika-parser/eval4200 Compiler.java:6619 clojure.lang.Compiler.eval Compiler.java:6582 
clojure.lang.Compiler.eval core.clj:2852 clojure.core/eval main.clj:259 clojure.main/repl[fn] main.clj:259 clojure.main/repl[fn] main.clj:277 clojure.main/repl[fn] main.clj:277 clojure.main/repl RestFn.java:1096 clojure.lang.RestFn.invoke interruptible_eval.clj:56 clojure.tools.nrepl.middleware.interruptible-eval/evaluate[fn] AFn.java:159 clojure.lang.AFn.applyToHelper AFn.java:151 clojure.lang.AFn.applyTo core.clj:617 clojure.core/apply core.clj:1788 clojure.core/with-bindings* RestFn.java:425 clojure.lang.RestFn.invoke interruptible_eval.clj:41 clojure.tools.nrepl.middleware.interruptible-eval/evaluate interruptible_eval.clj:171 clojure.tools.nrepl.middleware.interruptible-eval/interruptible-eval[fn] core.clj:2330 clojure.core/comp[fn] interruptible_eval.clj:138 clojure.tools.nrepl.middleware.interruptible-eval/run-next[fn] AFn.java:24 clojure.lang.AFn.run ThreadPoolExecutor.java:1145 java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java:615 java.util.concurrent.ThreadPoolExecutor$Worker.run Thread.java:724 java.lang.Thread.run {noformat} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (TIKA-1192) ArrayIndexOutOfBoundsException: 9 parsing RTF
[ https://issues.apache.org/jira/browse/TIKA-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817512#comment-13817512 ] Michael McCandless commented on TIKA-1192: -- Thanks Dave. ArrayIndexOutOfBoundsException: 9 parsing RTF - Key: TIKA-1192 URL: https://issues.apache.org/jira/browse/TIKA-1192 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Dave Kincaid Assignee: Michael McCandless Labels: rtf Attachments: tika-1192.patch When trying to parse an RTF file I'm getting the following exception. I am not able to attach the file for privacy reasons: {noformat} java.lang.ArrayIndexOutOfBoundsException: 9 TextExtractor.java:872 org.apache.tika.parser.rtf.TextExtractor.processControlWord TextExtractor.java:566 org.apache.tika.parser.rtf.TextExtractor.parseControlWord TextExtractor.java:492 org.apache.tika.parser.rtf.TextExtractor.parseControlToken TextExtractor.java:459 org.apache.tika.parser.rtf.TextExtractor.extract TextExtractor.java:448 org.apache.tika.parser.rtf.TextExtractor.extract RTFParser.java:56 org.apache.tika.parser.rtf.RTFParser.parse (Unknown Source) sun.reflect.NativeMethodAccessorImpl.invoke0 NativeMethodAccessorImpl.java:57 sun.reflect.NativeMethodAccessorImpl.invoke DelegatingMethodAccessorImpl.java:43 sun.reflect.DelegatingMethodAccessorImpl.invoke Method.java:606 java.lang.reflect.Method.invoke Reflector.java:93 clojure.lang.Reflector.invokeMatchingMethod Reflector.java:28 clojure.lang.Reflector.invokeInstanceMethod tika_parser.clj:20 rtf-parser.tika-parser/parse form-init2921349737948661927.clj:1 rtf-parser.tika-parser/eval4200 Compiler.java:6619 clojure.lang.Compiler.eval Compiler.java:6582 clojure.lang.Compiler.eval core.clj:2852 clojure.core/eval main.clj:259 clojure.main/repl[fn] main.clj:259 clojure.main/repl[fn] main.clj:277 clojure.main/repl[fn] main.clj:277 clojure.main/repl RestFn.java:1096 clojure.lang.RestFn.invoke interruptible_eval.clj:56 
clojure.tools.nrepl.middleware.interruptible-eval/evaluate[fn] AFn.java:159 clojure.lang.AFn.applyToHelper AFn.java:151 clojure.lang.AFn.applyTo core.clj:617 clojure.core/apply core.clj:1788 clojure.core/with-bindings* RestFn.java:425 clojure.lang.RestFn.invoke interruptible_eval.clj:41 clojure.tools.nrepl.middleware.interruptible-eval/evaluate interruptible_eval.clj:171 clojure.tools.nrepl.middleware.interruptible-eval/interruptible-eval[fn] core.clj:2330 clojure.core/comp[fn] interruptible_eval.clj:138 clojure.tools.nrepl.middleware.interruptible-eval/run-next[fn] AFn.java:24 clojure.lang.AFn.run ThreadPoolExecutor.java:1145 java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java:615 java.util.concurrent.ThreadPoolExecutor$Worker.run Thread.java:724 java.lang.Thread.run {noformat} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (TIKA-1181) RTFParser not keeping HTML font colors and underscore tags.
[ https://issues.apache.org/jira/browse/TIKA-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788163#comment-13788163 ] Michael McCandless commented on TIKA-1181: -- The RTFParser currently only carries bold and italic styling through; I guess we could add underline. It doesn't try to preserve any colors. Really the goal is (mostly?) text extraction, not precise formatting of the extracted text, so I think colors/styling are somewhat low priority. But I suppose underlining/colors can convey information about how important that text was, and so could be useful to stages (like indexing with Lucene) after text extraction. RTFParser not keeping HTML font colors and underscore tags. --- Key: TIKA-1181 URL: https://issues.apache.org/jira/browse/TIKA-1181 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows server 2008 Reporter: Leo Labels: RTFParser Hi, I'm having problems with this code. It does not put the font colors and underscore <u></u> tags in the HTML from the RTF string. Is there anything I can do to put them there? Code:
InputStream in = new ByteArrayInputStream(rtfString.getBytes("UTF-8"));
org.apache.tika.parser.rtf.RTFParser parser = new org.apache.tika.parser.rtf.RTFParser();
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(in, handler, metadata, new ParseContext());
String xhtml = sw.toString();
xhtml = xhtml.replaceAll("\r\n", "<br>\r\n");
Thanks for looking at it. Leo
[jira] [Commented] (TIKA-1143) Fails to parse some PPT file
[ https://issues.apache.org/jira/browse/TIKA-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698837#comment-13698837 ] Michael McCandless commented on TIKA-1143: -- Are you able to extract text from the rest of the document? Those logged exceptions look like warnings, indicating that the summary information failed to parse ... Fails to parse some PPT file Key: TIKA-1143 URL: https://issues.apache.org/jira/browse/TIKA-1143 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Vincent Massol Attachments: XWikiIExpoPresentation.ppt See also http://jira.xwiki.org/browse/XWIKI-9308 Here's what I get with the attached file: {noformat} 2013-07-03 11:52:45,332 [XWiki Solr index thread] WARN a.t.p.m.AbstractPOIFSExtractor - Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation java.lang.ClassCastException: [B cannot be cast to java.lang.String at org.apache.poi.hpsf.DocumentSummaryInformation.getCategory(DocumentSummaryInformation.java:78) ~[poi-3.9.jar:3.9] at org.apache.tika.parser.microsoft.SummaryExtractor.parse(SummaryExtractor.java:143) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:88) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:73) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) [tika-core-1.4.jar:na] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) [tika-core-1.4.jar:na] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) [tika-core-1.4.jar:na] at org.apache.tika.Tika.parseToString(Tika.java:380) 
[tika-core-1.4.jar:na] at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.getContentAsText(AttachmentSolrMetadataExtractor.java:130) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na] at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:97) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na] at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:79) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na] at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:114) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na] at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:465) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na] at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:378) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na] at org.xwiki.search.solr.internal.DefaultSolrIndexer.runInternal(DefaultSolrIndexer.java:353) [xwiki-platform-search-solr-api-5.2-20130702.194010-10.jar:na] at com.xpn.xwiki.util.AbstractXWikiRunnable.run(AbstractXWikiRunnable.java:121) [xwiki-platform-oldcore-5.2-20130702.190754-22.jar:na] at java.lang.Thread.run(Thread.java:680) [na:1.6.0_51] 2013-07-03 11:52:49,985 [Lucene Index Updater] WARN a.t.p.m.AbstractPOIFSExtractor - Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation java.lang.ClassCastException: [B cannot be cast to java.lang.String at org.apache.poi.hpsf.DocumentSummaryInformation.getCategory(DocumentSummaryInformation.java:78) ~[poi-3.9.jar:3.9] at org.apache.tika.parser.microsoft.SummaryExtractor.parse(SummaryExtractor.java:143) [tika-parsers-1.4.jar:na] at 
org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:88) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:73) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) [tika-parsers-1.4.jar:na] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) [tika-core-1.4.jar:na] at
[jira] [Assigned] (TIKA-1128) Replace line tabulation with line break
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1128: Assignee: Michael McCandless Replace line tabulation with line break --- Key: TIKA-1128 URL: https://issues.apache.org/jira/browse/TIKA-1128 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Privezentsev Konstantin Assignee: Michael McCandless Priority: Trivial Attachments: 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch Tika's WordExtractor does not replace the line tabulation character with a line break, as POI's WordExtractor does. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1128) Replace line tabulation with line break
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1128: - Fix Version/s: 1.5 Replace line tabulation with line break --- Key: TIKA-1128 URL: https://issues.apache.org/jira/browse/TIKA-1128 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Privezentsev Konstantin Assignee: Michael McCandless Priority: Trivial Fix For: 1.5 Attachments: 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch Tika's WordExtractor does not replace the line tabulation character with a line break, as POI's WordExtractor does.
[jira] [Commented] (TIKA-1128) Replace line tabulation with line break
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13670252#comment-13670252 ] Michael McCandless commented on TIKA-1128: -- Thanks Privezentsev. Do you have an example Word document that emits the line tabulation character? I'd like to add a basic test case ... thanks. Replace line tabulation with line break --- Key: TIKA-1128 URL: https://issues.apache.org/jira/browse/TIKA-1128 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Privezentsev Konstantin Assignee: Michael McCandless Priority: Trivial Fix For: 1.5 Attachments: 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch Tika's WordExtractor does not replace the line tabulation character with a line break, as POI's WordExtractor does.
[jira] [Resolved] (TIKA-1128) Replace line tabulation with line break
[ https://issues.apache.org/jira/browse/TIKA-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1128. -- Resolution: Fixed Fix Version/s: (was: 1.5) 1.4 Thanks Konstantin, I just committed this! Replace line tabulation with line break --- Key: TIKA-1128 URL: https://issues.apache.org/jira/browse/TIKA-1128 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Konstantin Privezentsev Assignee: Michael McCandless Priority: Trivial Fix For: 1.4 Attachments: 0001-TIKA-1128-Replace-line-tabular-by-line-break-when-ex.patch, tabular_symbol.doc Tika's WordExtractor does not replace the line tabulation character with a line break, as POI's WordExtractor does.
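As a rough illustration of what TIKA-1128 asks for, the sketch below maps the line tabulation character (U+000B) to a newline in already-extracted text. It is not the committed patch; the class and method names are hypothetical, and it assumes the character arrives unchanged in the extracted string.

```java
public class LineTabulation {
    /** Replaces every line tabulation character (U+000B) with a newline. */
    static String replaceLineTabulation(String text) {
        return text.replace('\u000B', '\n');
    }

    public static void main(String[] args) {
        // Two logical lines separated by a line tabulation character.
        String raw = "first line\u000Bsecond line";
        System.out.println(replaceLineTabulation(raw));
    }
}
```

Doing the replacement on the final string keeps the sketch independent of any parser internals; a real fix would more likely handle the character while the text is being emitted.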
[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
[ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615793#comment-13615793 ] Michael McCandless commented on TIKA-1098: -- Hmm, PDFBox is hitting that exception when Tika calls .getAnnotations. You might be able to work around this if you call PDFParser.setExtractAnnotationText(false)? Then Tika shouldn't call .getAnnotations... It looks like PDFBOX-1273 is the same issue. not able to parse pdfs/docs/ppts using 1.1 tika parser Key: TIKA-1098 URL: https://issues.apache.org/jira/browse/TIKA-1098 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: linux redhat Reporter: Qian Diao Attachments: url_1763_approx-alg-notes.pdf Hi, I got some parsing problems when using Tika 1.1 for the attached pdf file. my code (Test.java):
{noformat}
import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Test {
    private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

    public String parseFile(File inFile) {
        if (inFile == null || !inFile.isFile() || !inFile.canRead()) return null;
        InputStream is = null;
        String outputText = "";
        try {
            // Open input stream
            is = new FileInputStream(inFile);
            // Prepare parser
            BodyContentHandler contenthandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
            ParseContext pc = new ParseContext();
            // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
            if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                Parser parser = new AutoDetectParser();
                parser.parse(is, contenthandler, metadata, pc);
            } else {
                Parser parser = new HtmlParser();
                BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                parser.parse(is, bh, metadata, pc);
            }
            // Prepare text for write
            outputText = contenthandler.toString();
        } catch (Exception e) {
            System.out.println(e);
            return null;
        } finally {
            try {
                if (is != null) is.close();
            } catch (Exception e) {}
        }
        return outputText;
    }
}
{noformat}
Output: org.apache.tika.exception.TikaException: Unable to extract PDF content url_1763_approx-alg-notes.pdf
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585082#comment-13585082 ] Michael McCandless commented on TIKA-1074: --
bq. My app needs to extract text even from corrupt documents.
That's exactly the intent here as well.
bq. Currently I am setting ParseContext with a custom AutoDetectParser that, when an exception is hit, e.g. visiting an embedded, catches the exception, logs it AND extracts raw/binary strings from the problematic doc (or embedded)
Wait, the exception that this change now catches and logs occurs while decoding an OLE10 embedded entry (into its byte[] data), not while actually parsing the resulting byte[] data. If the exception is hit later, when we recurse into parseEmbedded, it is still thrown as before, so your custom AutoDetectParser will still see/handle it. But I think this is separately a good idea (an AutoDetectParser that logs and continues by default): is this something you could possibly contribute...? Do you have an example corrupted document? We could test before/after this change and see.
Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch, TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ...
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584176#comment-13584176 ] Michael McCandless commented on TIKA-1074: -- Thanks Jukka. InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry). So, the instanceof check for IE is there in case we do handle interrupts in these places in the future ... we could just remove it and add it back later if we add IE (seems risky). Or I can change that code to throw TikaException on interrupt instead (and restore the interrupt bit), except that in the TikaCLI case, EmbeddedDocumentExtractor.parseEmbedded doesn't throw TikaException today (the other two places already do). But it's a little weird to throw TikaException in response to an interrupt (ie, code above will be trying to catch an IE) ... I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE? Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch, TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ...
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584249#comment-13584249 ] Michael McCandless commented on TIKA-1074: --
{quote}
bq. InterruptedException is never thrown in these places today, so I can't add the separate catch clause (compiler is angry).
It's a checked exception, so if it isn't declared to be thrown by POI, it shouldn't get thrown here (even though the VM doesn't strictly prohibit that).
{quote}
Exactly: I'm trying to future proof.
bq. So in that case the extra check shouldn't even be needed.
Wait, do you mean I should remove the handling entirely (not bother future proofing)?
{quote}
bq. I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?
I don't really like this approach. We're essentially saying: "Yes, you asked me to stop what I'm doing, but instead I'll just finish up what I was doing and ask the next guy to stop." Instead, when receiving an IE I'd prefer Tika to stop immediately, either by letting the IE bubble up or (where necessary) by throwing a TikaException that wraps the IE.
{quote}
OK, maybe we can throw TikaException today (*and* set the interrupt bit), and then in the future (if/when these places really do throw IE), we can change this to throwing an IE instead of TikaException. I can put that as a TODO.
Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch, TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ...
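The option weighed in this comment (throw a checked parser exception on interrupt while also setting the interrupt bit) could look roughly like the sketch below. ExtractionException is a hypothetical stand-in for TikaException, and Step is an invented callback type; neither is Tika API.

```java
public class InterruptWrapping {
    /** Hypothetical stand-in for a checked parser exception such as TikaException. */
    static class ExtractionException extends Exception {
        ExtractionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical unit of extraction work that may fail or be interrupted. */
    interface Step {
        void run() throws Exception;
    }

    static void runStep(Step step) throws ExtractionException {
        try {
            step.run();
        } catch (InterruptedException e) {
            // Restore the interrupt bit so callers further up can still observe it,
            // then stop immediately by throwing the wrapping exception.
            Thread.currentThread().interrupt();
            throw new ExtractionException("interrupted during extraction", e);
        } catch (Exception e) {
            throw new ExtractionException("extraction step failed", e);
        }
    }
}
```

A caller that only knows about ExtractionException still stops right away, while code that polls Thread.interrupted() further up can tell that an interrupt was the cause.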
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584362#comment-13584362 ] Michael McCandless commented on TIKA-1074: -- OK I'll remove the future proofing. Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch, TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1074. -- Resolution: Fixed Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch, TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1074. -- Resolution: Fixed Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1074: - Attachment: TIKA-1074.patch Patch, catching Exception not Throwable, and restoring the interrupt bit if the exc was InterruptedException. Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch, TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
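The behaviour this patch describes (catch Exception rather than Throwable, restore the interrupt bit, keep going) can be sketched in isolation as follows. This is an illustration only, not the attached patch: EmbeddedHandler, extractAll and the problems list are hypothetical names, and plain strings stand in for embedded documents.

```java
import java.util.ArrayList;
import java.util.List;

public class TolerantExtraction {
    /** Hypothetical handler for one embedded entry; may throw any checked exception. */
    interface EmbeddedHandler {
        String handle(String name) throws Exception;
    }

    /**
     * Processes every entry, recording failures in problems instead of aborting.
     * Errors (OutOfMemoryError and friends) are deliberately not caught.
     */
    static List<String> extractAll(List<String> names, EmbeddedHandler handler,
                                   List<Exception> problems) {
        List<String> results = new ArrayList<>();
        for (String name : names) {
            try {
                results.add(handler.handle(name));
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Do not swallow the interrupt: restore the bit for later code.
                    Thread.currentThread().interrupt();
                }
                problems.add(e);
            }
        }
        return results;
    }
}
```

Collecting the exceptions in a list, rather than only logging them, matches the idea in the issue description of letting the application decide what to do after parsing.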
[jira] [Reopened] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened TIKA-1074: -- Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13582481#comment-13582481 ] Michael McCandless commented on TIKA-1074: -- Thanks Uwe, I'll change to catching Exception not Throwable, and restoring the interrupt bit for InterruptedException. Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1074: Assignee: Michael McCandless Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1074: - Attachment: TIKA-1074.patch Patch, just logging a warning and continuing, if we hit the exceptions in TIKA-1072, TIKA-1078 or TIKA-1079. I think it's ready. Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1074.patch Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-369) Improve accuracy of language detection
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13573492#comment-13573492 ] Michael McCandless commented on TIKA-369: - The language-detection lib is now in Maven: http://search.maven.org/#artifactdetails|com.cybozu.labs|langdetect|1.1-20120112|jar And it's compiled to Java 5 ... I think we should do a hard cutover (replace Tika's current language detection with this library)? Any objections? Improve accuracy of language detection -- Key: TIKA-369 URL: https://issues.apache.org/jira/browse/TIKA-369 Project: Tika Issue Type: Improvement Components: languageidentifier Affects Versions: 0.6 Reporter: Ken Krugler Assignee: Ken Krugler Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf, textcat.pdf Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:
1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed.
2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.
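The third point in the issue description, judging certainty by the gap to the runner-up rather than by an absolute threshold alone, can be sketched as below. This is not Tika's LanguageIdentifier code; the class, the parameter names and the assumption that each language has a distance score where lower is better are all hypothetical.

```java
import java.util.Map;

public class MarginCertainty {
    /**
     * Returns true only if the best (lowest) distance is within maxDistance AND
     * it beats the second-best distance by at least minRelativeGap (0..1).
     */
    static boolean isReasonablyCertain(Map<String, Double> distances,
                                       double maxDistance, double minRelativeGap) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        if (best > maxDistance) {
            return false; // no candidate is close enough (also covers an empty map)
        }
        if (second == Double.POSITIVE_INFINITY) {
            return true; // only one candidate, nothing to compare against
        }
        if (second == 0.0) {
            return false; // both candidates match perfectly: an exact tie
        }
        return (second - best) / second >= minRelativeGap;
    }
}
```

With this check, two languages with identical scores are never reported as certain, even when both fall under the absolute threshold.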
[jira] [Resolved] (TIKA-1053) Upgrade Tika Parsers to use ASM 4.x
[ https://issues.apache.org/jira/browse/TIKA-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1053. -- Resolution: Fixed Fix Version/s: 1.4 Thanks Uwe. Upgrade Tika Parsers to use ASM 4.x --- Key: TIKA-1053 URL: https://issues.apache.org/jira/browse/TIKA-1053 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.2 Reporter: Vincent Massol Assignee: Michael McCandless Fix For: 1.4 Attachments: TIKA-1053.patch Right now Tika 1.2 uses ASM 3.1. However this is causing some issues for us on the XWiki project since we also bundle other framework that use a more recent version of ASM (we use pegdown which uses parboiled which draws ASM 4.0). The problem is that ASM 3.x and 4.0 are not compatible... See http://jira.xwiki.org/browse/XE-1269 for more details about the issue we're facing. Thanks for considering upgrading to ASM 4.x :) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
Michael McCandless created TIKA-1078: Summary: TikaCLI: invalid characters in embedded document name causes FNFE when trying to save Key: TIKA-1078 URL: https://issues.apache.org/jira/browse/TIKA-1078 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: T-DS_Excel2003-PPT2003_1.xls Attached document hits this on Windows:
{noformat}
C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails. On Linux it runs fine. I think somehow ... we have to sanitize the embedded file name ...
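One possible shape for the sanitizing step suggested at the end of this report is sketched below. It is not the fix that went into TikaCLI; the class and method names are hypothetical, and it only covers the characters Windows rejects in a single path component (the reserved punctuation plus control characters below 0x20), not reserved device names or trailing dots and spaces.

```java
public class EmbeddedNameSanitizer {
    // Characters that Windows does not allow in a file or directory name.
    private static final String RESERVED = "<>:\"/\\|?*";

    /** Replaces characters that are invalid in a Windows path component with '_'. */
    static String sanitize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean invalid = RESERVED.indexOf(c) >= 0 || c < 0x20;
            out.append(invalid ? '_' : c);
        }
        // Never return an empty name, which could not be used as a file name.
        return out.length() == 0 ? "_" : out.toString();
    }

    public static void main(String[] args) {
        // A name with a question mark, similar to the one in the report above.
        System.out.println(sanitize("?report.bin"));
    }
}
```

Because the separators are replaced too, the method should be applied to each path component separately, after the embedded name has been split into directory and file parts.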
[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1078: - Attachment: T-DS_Excel2003-PPT2003_1.xls TikaCLI: invalid characters in embedded document name causes FNFE when trying to save - Key: TIKA-1078 URL: https://issues.apache.org/jira/browse/TIKA-1078 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: T-DS_Excel2003-PPT2003_1.xls Attached document hits this on Windows:
{noformat}
C:\java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails. On Linux it runs fine. I think somehow ... we have to sanitize the embedded file name ...
[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries
[ https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1079: - Attachment: guide_to_daips_(id_3152_ver_1.0.0).doc Word document hits AIOOBE in SummaryExtractor.parseSummaries Key: TIKA-1079 URL: https://issues.apache.org/jira/browse/TIKA-1079 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc I'm not yet sure if this is a corrupted document (though, MS Word opens it just fine) or a bug in POI ... but I hit this exc when running it through TikaCLI:
{noformat}
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.poi.hpsf.CodePageString.init(CodePageString.java:161)
    at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
    at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
    at org.apache.poi.hpsf.Property.init(Property.java:164)
    at org.apache.poi.hpsf.Section.init(Section.java:277)
    at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
    at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:246)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
{noformat}
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13571288#comment-13571288 ] Michael McCandless commented on TIKA-1074: -- TIKA-1079 is another example where if we recorded/logged an exc and moved on we could have parsed the rest of the document ... Extraction should continue if an exception is hit visiting an embedded document --- Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Fix For: 1.4 Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570208#comment-13570208 ] Michael McCandless commented on TIKA-1072: -- OK I did some digging on this. The DirectoryNode of this embedded document has these entries: {noformat} ent=PICT size=797 ent=ObjInfo size=4 ent=Ole10Native size=40 ent=Ole10FmtProgID size=13 ent=OlePres000 size=40 ent=CompObj size=82 ent=PIC size=100 ent=META size=582 ent=Ole size=20 {noformat} And so I believe it really is an OLE10Native record... OLE10Native then tries to parse it, with plain=false, but then runs out of bytes on this line: {noformat} flags2 = LittleEndian.getShort(data, ofs); {noformat} It seems likely something is corrupt about this entry? Does 40 bytes seem way too small for an OLE10Native entry? If so, I wonder if we could fix AbstractPOIFSExtractor to log the exception and then skip this one embedded document and then go on to parsing the others? Ie, isolate the exception, rather than aborting the entire extraction; in this case the main document extracts fine. 
AIOOBE when handling embedded document in .doc file --- Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: 20-Force-on-a-current-S00.doc I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
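One way to picture the last question in this issue (should the Ole10Native code fail with its own exception instead of an AIOOBE?) is a bounds-checked little-endian read. This is only a minimal sketch, not POI's actual code: `TruncatedRecordException` and `CheckedLittleEndian` are hypothetical names invented for illustration.

```java
// Hypothetical checked exception standing in for a format-specific one
// such as Ole10NativeException; it is not part of POI.
class TruncatedRecordException extends Exception {
    TruncatedRecordException(String message) {
        super(message);
    }
}

// Hypothetical helper, not POI's LittleEndian class.
class CheckedLittleEndian {
    // Reads a 16-bit little-endian value, refusing to read past the end of the array.
    static short getShort(byte[] data, int offset) throws TruncatedRecordException {
        if (offset < 0 || offset > data.length - 2) {
            throw new TruncatedRecordException(
                "need 2 bytes at offset " + offset + " but record has " + data.length);
        }
        int low = data[offset] & 0xFF;
        int high = data[offset + 1] & 0xFF;
        return (short) ((high << 8) | low);
    }
}
```

With a check like this, a caller that already catches the format-specific exception and skips the entry would also skip a truncated 40-byte entry, rather than seeing an unchecked exception escape.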
[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570305#comment-13570305 ] Michael McCandless commented on TIKA-1072: -- Thanks Nick, I'll try asking on dev@poi. I'll open a separate issue about continuing parsing even when an embedded doc hits an exception ... AIOOBE when handling embedded document in .doc file --- Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: 20-Force-on-a-current-S00.doc I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? -- This message is automatically generated by JIRA. 
[jira] [Commented] (TIKA-1072) AIOOBE when handling embedded document in .doc file
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570308#comment-13570308 ] Michael McCandless commented on TIKA-1072: -- OK I opened TIKA-1074; this issue will explore whether this document is corrupt or not ... AIOOBE when handling embedded document in .doc file --- Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: 20-Force-on-a-current-S00.doc I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? -- This message is automatically generated by JIRA. 
[jira] [Created] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
Michael McCandless created TIKA-1074: Summary: Extraction should continue if an exception is hit visiting an embedded document Key: TIKA-1074 URL: https://issues.apache.org/jira/browse/TIKA-1074 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Fix For: 1.4 Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if document is corrupt, or possible POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
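The "record the exception and move on" idea from this issue could look roughly like the following. It is a sketch under stated assumptions, not Tika's implementation: `EmbeddedVisitor`, `EmbeddedHandler`, and the `failures` list are hypothetical names, and embedded documents are represented by plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; not a Tika API.
class EmbeddedVisitor {
    // Hypothetical callback standing in for whatever parses one embedded document.
    interface EmbeddedHandler {
        void handle(String name) throws Exception;
    }

    // Failures recorded during the visit, for the application to inspect afterwards.
    final List<Exception> failures = new ArrayList<>();

    void visitAll(List<String> names, EmbeddedHandler handler) {
        for (String name : names) {
            try {
                handler.handle(name);
            } catch (Exception e) {
                // Record and continue, so one bad entry does not abort the rest.
                failures.add(new Exception("embedded document '" + name + "' failed", e));
            }
        }
    }
}
```

After `visitAll` returns, the caller can check `failures` and decide whether to log them, surface them, or treat the extraction as partial.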
[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1072: - Attachment: Ole10NativeEntry.bin I'm attaching the 40-byte \u0001Ole10Native entry; here's the hex dump:
0000  24 00 00 00 02 00 01 01 00 0a 01 12 83 46 02 86  |$............F..|
0010  3d 12 83 49 12 83 6c 12 83 42 12 82 73 12 82 69  |=..I..l..B..s..i|
0020  12 82 6e 02 84 71 00 00                          |..n..q..|
0028
AIOOBE when handling embedded document in .doc file --- Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? 
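For anyone who wants to poke at the dump, the sketch below loads the 40 bytes into a little-endian `ByteBuffer` and reads the first two fields. The class name `Ole10NativeDump` is made up for illustration, and treating the first four bytes as a length prefix is an assumption here, not something this message confirms.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class; only the byte values come from the hex dump.
class Ole10NativeDump {
    // The 40 bytes from the hex dump in this message.
    static final int[] DUMP = {
        0x24, 0x00, 0x00, 0x00, 0x02, 0x00, 0x01, 0x01, 0x00, 0x0a, 0x01, 0x12, 0x83, 0x46, 0x02, 0x86,
        0x3d, 0x12, 0x83, 0x49, 0x12, 0x83, 0x6c, 0x12, 0x83, 0x42, 0x12, 0x82, 0x73, 0x12, 0x82, 0x69,
        0x12, 0x82, 0x6e, 0x02, 0x84, 0x71, 0x00, 0x00
    };

    // Wraps the dump in a buffer that decodes multi-byte values as little-endian.
    static ByteBuffer buffer() {
        byte[] bytes = new byte[DUMP.length];
        for (int i = 0; i < DUMP.length; i++) {
            bytes[i] = (byte) DUMP[i];
        }
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        ByteBuffer buf = buffer();
        int firstInt = buf.getInt();       // bytes 24 00 00 00
        short nextShort = buf.getShort();  // bytes 02 00
        System.out.println("total bytes = " + DUMP.length);
        System.out.println("first int   = " + firstInt);
        System.out.println("next short  = " + nextShort);
    }
}
```

The first integer decodes to 36, which is the entry size (40) minus the four bytes of the integer itself, so if it is a length prefix it is at least self-consistent.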
[jira] [Created] (TIKA-1072) AIOOBE when handling embedded document in .doc file
Michael McCandless created TIKA-1072: Summary: AIOOBE when handling embedded document in .doc file Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: 20-Force-on-a-current-S00.doc I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1072: - Attachment: 20-Force-on-a-current-S00.doc AIOOBE when handling embedded document in .doc file --- Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Reporter: Michael McCandless Fix For: 1.4 Attachments: 20-Force-on-a-current-S00.doc I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1067) Tika extracts non-existent asterisks (*) from .ppt files
Michael McCandless created TIKA-1067: Summary: Tika extracts non-existent asterisks (*) from .ppt files Key: TIKA-1067 URL: https://issues.apache.org/jira/browse/TIKA-1067 Project: Tika Issue Type: Bug Reporter: Michael McCandless I created a new blank presentation, put in title + subtitle, saved it as .ppt, and then ran TikaCLI -t: {noformat} <body><div class="slideShow"><div class="slide"><p class="slide-master-content">*<br/> *<br/> </p> <p class="slide-content">Testing<br/> testing<br/> </p> </div> </div> <div class="slideNotes"/> {noformat} The two extra *'s seem to be coming from the master slide, but I'm not sure which text runs they are and how to stop them ...
[jira] [Commented] (TIKA-1062) Add list detection to RTFParser
[ https://issues.apache.org/jira/browse/TIKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562060#comment-13562060 ] Michael McCandless commented on TIKA-1062: -- Hi Axel, I don't actually know that Tika has adopted official code style (anyone?). Really I was just carrying forward Lucene's code style (put {} around even single-line code blocks to avoid future bug risk...). You succeeded very well, and, yes, the current code style varies :) Add list detection to RTFParser --- Key: TIKA-1062 URL: https://issues.apache.org/jira/browse/TIKA-1062 Project: Tika Issue Type: Improvement Components: parser Reporter: Axel Dörfler Assignee: Michael McCandless Priority: Minor Labels: patch Fix For: 1.4 Attachments: testRTFListLibreOffice.rtf, testRTFListMicrosoftWord.rtf, tika-rtf-lists.patch RTF supports lists, and the parser could support those, too, using HTML ul/ol/li tags. I'm attaching a patch that implements basic support for Word 97 and newer lists. Nested lists are not supported correctly, yet, though, and a number of formatting options are ignored. I've also added test cases for this, and adapted existing tests where needed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1062) Add list detection to RTFParser
[ https://issues.apache.org/jira/browse/TIKA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13560927#comment-13560927 ] Michael McCandless commented on TIKA-1062: -- Should the ListDescriptor list = listTable.get(listID); in isUnorderedList be currentListTable.get instead? Add list detection to RTFParser --- Key: TIKA-1062 URL: https://issues.apache.org/jira/browse/TIKA-1062 Project: Tika Issue Type: Improvement Components: parser Reporter: Axel Dörfler Assignee: Michael McCandless Priority: Minor Labels: patch Attachments: testRTFListLibreOffice.rtf, testRTFListMicrosoftWord.rtf, tika-rtf-lists.patch RTF supports lists, and the parser could support those, too, using HTML ul/ol/li tags. I'm attaching a patch that implements basic support for Word 97 and newer lists. Nested lists are not supported correctly, yet, though, and a number of formatting options are ignored. I've also added test cases for this, and adapted existing tests where needed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1048) XMLParser should add whitespace between elements
[ https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1048. -- Resolution: Fixed XMLParser should add whitespace between elements Key: TIKA-1048 URL: https://issues.apache.org/jira/browse/TIKA-1048 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1048.patch, TIKA-1048.patch If the incoming XML is compact (ie doesn't have whitespace between elements), I think we should somehow add whitespace between elements when extracting text? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1048) XMLParser should add whitespace between elements
Michael McCandless created TIKA-1048: Summary: XMLParser should add whitespace between elements Key: TIKA-1048 URL: https://issues.apache.org/jira/browse/TIKA-1048 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.3 Attachments: TIKA-1048.patch If the incoming XML is compact (ie doesn't have whitespace between elements), I think we should somehow add whitespace between elements when extracting text? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
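To make the problem concrete: with compact XML such as `<a><b>one</b><c>two</c></a>`, concatenating character data alone yields `onetwo`. The sketch below shows one possible approach using the JDK's SAX API, appending a single space when an element closes. It is not the attached patch, and `CompactXmlText` is a hypothetical class name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not Tika's XMLParser.
class CompactXmlText {
    // Extracts text from XML, separating the text of adjacent elements with a space.
    static String extract(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Add a separator unless the output already ends in whitespace.
                if (out.length() > 0 && !Character.isWhitespace(out.charAt(out.length() - 1))) {
                    out.append(' ');
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString().trim();
    }
}
```

A space after every closing element is a blunt rule: it also splits text around inline elements, so a real fix might want to be more selective about which elements act as separators.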
[jira] [Updated] (TIKA-1048) XMLParser should add whitespace between elements
[ https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1048: - Attachment: TIKA-1048.patch Patch w/ failing test ... I'm not sure where/how to best fix this yet ... XMLParser should add whitespace between elements Key: TIKA-1048 URL: https://issues.apache.org/jira/browse/TIKA-1048 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.3 Attachments: TIKA-1048.patch If the incoming XML is compact (ie doesn't have whitespace between elements), I think we should somehow add whitespace between elements when extracting text? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files
[ https://issues.apache.org/jira/browse/TIKA-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1031. -- Resolution: Fixed TikaCLI doesn't create sub-dirs when extracting Zip files - Key: TIKA-1031 URL: https://issues.apache.org/jira/browse/TIKA-1031 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1031.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
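The patch itself is not reproduced in this message. As a general illustration of the issue title, the sketch below extracts a zip stream with `java.util.zip` and creates any missing parent directories before writing each file. `ZipExtractSketch` is a hypothetical name, and this is not TikaCLI's code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical illustration; not TikaCLI's extraction code.
class ZipExtractSketch {
    static void extract(InputStream zip, Path outputDir) throws IOException {
        Path root = outputDir.toAbsolutePath().normalize();
        try (ZipInputStream in = new ZipInputStream(zip)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject names such as "../x" that would land outside the output directory.
                if (!target.startsWith(root) || target.equals(root)) {
                    throw new IOException("bad entry name: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    // Create missing sub-directories before writing the file.
                    Files.createDirectories(target.getParent());
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

Creating the parent directory per file, rather than relying on explicit directory entries, matters because zip archives are not required to contain an entry for each directory.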
[jira] [Resolved] (TIKA-1032) Powerpoint (.pptx) can have duplicate embedded ids
[ https://issues.apache.org/jira/browse/TIKA-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1032. -- Resolution: Fixed Fix Version/s: 1.3 Powerpoint (.pptx) can have duplicate embedded ids -- Key: TIKA-1032 URL: https://issues.apache.org/jira/browse/TIKA-1032 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1032.patch Apparently the relId is only unique within one slide ... I fixed it to prefix slideN_. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
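The fix described here, prefixing the relId with slideN_, can be pictured with a tiny helper. This is a sketch only: `SlideScopedIds` and `embeddedId` are hypothetical names, and the exact string Tika builds is not shown in this message.

```java
// Hypothetical helper; the real id format used by Tika may differ.
class SlideScopedIds {
    // Builds an id that stays unique across slides by prefixing the slide number.
    static String embeddedId(int slideNumber, String relId) {
        return "slide" + slideNumber + "_" + relId;
    }
}
```

For example, the same relId `rId2` on slides 1 and 2 becomes `slide1_rId2` and `slide2_rId2`, so the two embedded resources no longer collide.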
[jira] [Commented] (TIKA-712) Master slide text isn't extracted
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13508010#comment-13508010 ] Michael McCandless commented on TIKA-712: - I committed the patch; I'll leave this issue open for a possible future correct fix where we can detect boilerplate text in PPT. Master slide text isn't extracted - Key: TIKA-712 URL: https://issues.apache.org/jira/browse/TIKA-712 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Attachments: testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx, testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, TIKA-712-master-slide.xml, TIKA-712.patch, TIKA-712.patch It looks like we are not getting text from the master slide for PPT and PPTX. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1035) PDF bookmark text is not extracted
[ https://issues.apache.org/jira/browse/TIKA-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1035. -- Resolution: Fixed PDF bookmark text is not extracted -- Key: TIKA-1035 URL: https://issues.apache.org/jira/browse/TIKA-1035 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1035.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry
[ https://issues.apache.org/jira/browse/TIKA-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1036. -- Resolution: Fixed Fix Version/s: 1.3 ZIP parsing doesn't leave placeholders for each package entry - Key: TIKA-1036 URL: https://issues.apache.org/jira/browse/TIKA-1036 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1036.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1035) PDF bookmark text is not extracted
Michael McCandless created TIKA-1035: Summary: PDF bookmark text is not extracted Key: TIKA-1035 URL: https://issues.apache.org/jira/browse/TIKA-1035 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1035) PDF bookmark text is not extracted
[ https://issues.apache.org/jira/browse/TIKA-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1035: - Attachment: TIKA-1035.patch Patch w/ test ... PDF bookmark text is not extracted -- Key: TIKA-1035 URL: https://issues.apache.org/jira/browse/TIKA-1035 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1035.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry
Michael McCandless created TIKA-1036: Summary: ZIP parsing doesn't leave placeholders for each package entry Key: TIKA-1036 URL: https://issues.apache.org/jira/browse/TIKA-1036 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1036) ZIP parsing doesn't leave placeholders for each package entry
[ https://issues.apache.org/jira/browse/TIKA-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1036: - Attachment: TIKA-1036.patch Patch w/ test ... ZIP parsing doesn't leave placeholders for each package entry - Key: TIKA-1036 URL: https://issues.apache.org/jira/browse/TIKA-1036 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Attachments: TIKA-1036.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-712) Master slide text isn't extracted
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-712: Attachment: TIKA-712.patch I think I found a committable workaround (patch) for including text from the master slide for PPT documents: I uncommented the existing code, but then exclude text that is type 0 (TITLE_TYPE) or 1 (BODY_TYPE), just for the master slide. In my ad-hoc testing this eliminates the boilerplate text but lets other user changes to the master slide come through correctly ... this isn't perfect but I think it's a good step forward. Master slide text isn't extracted - Key: TIKA-712 URL: https://issues.apache.org/jira/browse/TIKA-712 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Attachments: testPPT_masterFooter2.ppt, testPPT_masterFooter2.pptx, testPPT_masterFooter.ppt, testPPT_masterFooter.pptx, TIKA-712-master-slide.xml, TIKA-712.patch, TIKA-712.patch It looks like we are not getting text from the master slide for PPT and PPTX. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
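The filtering rule in this workaround (on the master slide, drop text of type 0 or 1 and keep the rest) might be sketched as follows. The `Run` class is a hypothetical stand-in for a POI text run, and the type value 4 used in the usage note is an arbitrary example, not a documented POI constant.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the rule; not the committed patch.
class MasterSlideFilter {
    static final int TITLE_TYPE = 0;
    static final int BODY_TYPE = 1;

    // Hypothetical stand-in for a text run: a type code plus its text.
    static final class Run {
        final int type;
        final String text;

        Run(int type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Returns the master-slide text worth extracting, skipping title and body placeholders.
    static List<String> masterText(List<Run> runs) {
        List<String> kept = new ArrayList<>();
        for (Run run : runs) {
            if (run.type == TITLE_TYPE || run.type == BODY_TYPE) {
                continue; // boilerplate placeholder text
            }
            kept.add(run.text);
        }
        return kept;
    }
}
```

Given runs of type 0 ("Click to edit Master title style"), type 1 ("Click to edit Master text styles"), and type 4 ("Confidential"), only "Confidential" survives, which matches the intent of keeping user-added master-slide text while dropping placeholders.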
[jira] [Created] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
Michael McCandless created TIKA-1033: Summary: Tika doesn't parse embedded OLE Chart/Graph objects Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: emb.ppt I have an example ppt that embeds a chart, but Tika mis-identifies it as an XLS document. The progID (oleShape.getProgID() in HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and we seem to detect it as Excel (application/vnd.ms-excel) but then the ExcelExtractor hits this exception: {noformat} org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65) at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301) at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285) at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251) at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143) at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106) at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302) at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147) {noformat} Since DelegatingParser silently suppresses all exceptions, when you run TikaCLI you won't see any exception nor text extracted, but if you run with -z, it will save 1.xls which if you then try to parse with TikaCLI hits the above exception. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1033: - Attachment: emb.ppt Tika doesn't parse embedded OLE Chart/Graph objects --- Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: emb.ppt
[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504563#comment-13504563 ] Michael McCandless commented on TIKA-1033: -- Here's the full stack trace when I parse the .xls file that TikaCLI extracts:

{noformat}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4eaf6cb1
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:292)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:144)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (2) bytes
    at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
    at org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
    at org.apache.poi.hssf.record.WindowOneRecord.<init>(WindowOneRecord.java:71)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
    ... 15 more
{noformat}

Tika doesn't parse embedded OLE Chart/Graph objects --- Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: emb.ppt
[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504668#comment-13504668 ] Michael McCandless commented on TIKA-1033: -- I asked the person who created this test file; here's his answer: {noformat} I created the file with my PowerPoint (PowerPoint 2003). To embed the chart: 1. Select from the menu Insert 2. Select chart (I selected the default chart) 3. Place the chart {noformat} Tika doesn't parse embedded OLE Chart/Graph objects --- Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: emb.ppt
[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504673#comment-13504673 ] Michael McCandless commented on TIKA-1033: -- bq. The raw chart object looks to actually be an excel file, Hmm, so now I'm very confused :) Did something go wrong when Tika pulled out the bits from emb.ppt to create 1.xls? When I try to open 1.xls in Excel it's unhappy (Cannot open Microsoft Graph chart gallery files.). bq. Note that embedded objects in office files are actually stored as the raw object (used for editing), and a rendered version of the file (so that viewing the parent document is quick, normally an EMF) Yeah I see separately the *.emf files being extracted by TikaCLI. Tika doesn't parse embedded OLE Chart/Graph objects --- Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: emb.ppt
[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504703#comment-13504703 ] Michael McCandless commented on TIKA-1033: -- Interesting: with PowerPoint 2007, when I double-click the embedded chart, it pops up a dialogue box saying To edit this chart using the new features available in the 2007 Microsoft Office system, you must first convert it to the 2007 Office system format. Do you want to convert this chart to the new format? [Convert] [Convert All] [Edit Existing]. If I click [Edit Existing] it lets me edit the chart data in what looks like Excel, in Compatibility Mode. OK I'll open a POI bug and reference back to this issue... Thanks Nick. Tika doesn't parse embedded OLE Chart/Graph objects --- Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: emb.ppt
[jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504726#comment-13504726 ] Michael McCandless commented on TIKA-1033: -- OK I opened https://issues.apache.org/bugzilla/show_bug.cgi?id=54213 Tika doesn't parse embedded OLE Chart/Graph objects --- Key: TIKA-1033 URL: https://issues.apache.org/jira/browse/TIKA-1033 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Priority: Minor Attachments: emb.ppt
[jira] [Created] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files
Michael McCandless created TIKA-1031: Summary: TikaCLI doesn't create sub-dirs when extracting Zip files Key: TIKA-1031 URL: https://issues.apache.org/jira/browse/TIKA-1031 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3
[jira] [Updated] (TIKA-1031) TikaCLI doesn't create sub-dirs when extracting Zip files
[ https://issues.apache.org/jira/browse/TIKA-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1031: - Attachment: TIKA-1031.patch Patch w/ test fix. TikaCLI doesn't create sub-dirs when extracting Zip files - Key: TIKA-1031 URL: https://issues.apache.org/jira/browse/TIKA-1031 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1031.patch
[jira] [Resolved] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag
[ https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1024. -- Resolution: Fixed An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag Key: TIKA-1024 URL: https://issues.apache.org/jira/browse/TIKA-1024 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: testNakedUTF16BOM.mp3, TIKA-1024.patch
[jira] [Resolved] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded
[ https://issues.apache.org/jira/browse/TIKA-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1025. -- Resolution: Fixed Fix Version/s: 1.3 Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded --- Key: TIKA-1025 URL: https://issues.apache.org/jira/browse/TIKA-1025 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1025.patch
[jira] [Commented] (TIKA-369) Improve accuracy of language detection
[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13499838#comment-13499838 ] Michael McCandless commented on TIKA-369: - +1 to cut over to https://code.google.com/p/language-detection Improve accuracy of language detection -- Key: TIKA-369 URL: https://issues.apache.org/jira/browse/TIKA-369 Project: Tika Issue Type: Improvement Components: languageidentifier Affects Versions: 0.6 Reporter: Ken Krugler Assignee: Ken Krugler Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf, textcat.pdf Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues: 1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed. 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text. 3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.
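Point 3 of the TIKA-369 description (certainty should also depend on the gap to the next-best language) can be sketched as follows. This is only an illustration, not Tika's LanguageIdentifier code: the class name `MarginCertainty`, the parameters `maxDistance` and `minRelativeGap`, and the relative-gap formula are all assumptions made for the example, and the sketch assumes each candidate language has a distance score where lower means a better match.

```java
import java.util.Map;

// Hypothetical helper, not part of Tika: combines an absolute threshold
// with a check on how far the best score is from the runner-up.
final class MarginCertainty {

    // distances: language code -> distance score (lower is better; assumed).
    // maxDistance: absolute threshold the best score must not exceed.
    // minRelativeGap: required value of (second - best) / second.
    public static boolean isReasonablyCertain(Map<String, Double> distances,
                                              double maxDistance,
                                              double minRelativeGap) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        // No candidates, or the best candidate is too far away.
        if (best > maxDistance) {
            return false;
        }
        // A single candidate has no competitor to be confused with.
        if (second == Double.POSITIVE_INFINITY) {
            return true;
        }
        // Both scores are zero: an exact tie, so not certain.
        if (second == 0.0) {
            return false;
        }
        return (second - best) / second >= minRelativeGap;
    }
}
```

With this shape, two languages with near-identical scores are rejected even when both fall under the absolute threshold, which is the case the description calls out.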
[jira] [Created] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag
Michael McCandless created TIKA-1024: Summary: An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag Key: TIKA-1024 URL: https://issues.apache.org/jira/browse/TIKA-1024 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 This seems to be a difference between JVMs: on IBM's JVM I incorrectly see the BOM as the value of the tag, while on Oracle's JVM I correctly get the empty string. I'm not sure if this is a bug in IBM's JVM ... the javadocs are not totally clear how a UTF-16 string containing only the BOM should be decoded by new String(...) ... to fix this I think we should just detect this case and short-circuit empty string return.
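The short-circuit proposed in the TIKA-1024 description could look roughly like the sketch below. It is not the attached TIKA-1024.patch: the class name `Utf16TagDecoder` and the method `decodeUtf16` are hypothetical, and the sketch assumes the tag bytes are handed over with an explicit offset and length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika's ID3 code: decodes a UTF-16 tag value
// and returns "" when the value consists of nothing but a byte order mark.
final class Utf16TagDecoder {

    public static String decodeUtf16(byte[] data, int offset, int length) {
        if (length == 2) {
            int b0 = data[offset] & 0xFF;
            int b1 = data[offset + 1] & 0xFF;
            boolean bigEndianBom = (b0 == 0xFE && b1 == 0xFF);
            boolean littleEndianBom = (b0 == 0xFF && b1 == 0xFE);
            if (bigEndianBom || littleEndianBom) {
                // Only the BOM is present: skip the decoder entirely so the
                // result does not depend on how a given JVM handles this case.
                return "";
            }
        }
        // Otherwise let the UTF-16 decoder consume the BOM and the text.
        return new String(data, offset, length, StandardCharsets.UTF_16);
    }
}
```

Checking the two BOM byte orders up front keeps the behaviour identical across JVMs for this one edge case while leaving every other value to the standard decoder.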
[jira] [Updated] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag
[ https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1024: - Attachment: testNakedUTF16BOM.mp3 An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag Key: TIKA-1024 URL: https://issues.apache.org/jira/browse/TIKA-1024 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: testNakedUTF16BOM.mp3, TIKA-1024.patch
[jira] [Updated] (TIKA-1024) An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag
[ https://issues.apache.org/jira/browse/TIKA-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1024: - Attachment: TIKA-1024.patch Patch w/ failing test and fix. An MP3 with an UTF-16 ID3 tag containing only the BOM should produce empty string value for that tag Key: TIKA-1024 URL: https://issues.apache.org/jira/browse/TIKA-1024 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: testNakedUTF16BOM.mp3, TIKA-1024.patch
[jira] [Created] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded
Michael McCandless created TIKA-1025: Summary: Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded Key: TIKA-1025 URL: https://issues.apache.org/jira/browse/TIKA-1025 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless
[jira] [Updated] (TIKA-1025) Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded
[ https://issues.apache.org/jira/browse/TIKA-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1025: - Attachment: TIKA-1025.patch Patch w/ test fix. Powerpoint (.ppt) parser doesn't leave placeholder where documents are embedded --- Key: TIKA-1025 URL: https://issues.apache.org/jira/browse/TIKA-1025 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Attachments: TIKA-1025.patch
[jira] [Resolved] (TIKA-1019) Document links in Word documents don't leave a placeholder
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1019. -- Resolution: Fixed Document links in Word documents don't leave a placeholder -- Key: TIKA-1019 URL: https://issues.apache.org/jira/browse/TIKA-1019 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: testDocumentLink.doc, TIKA-1019.patch
[jira] [Reopened] (TIKA-1019) Document links in Word documents don't leave a placeholder
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened TIKA-1019: -- I reverted my commit for now ... the test file was way too large ... Document links in Word documents don't leave a placeholder -- Key: TIKA-1019 URL: https://issues.apache.org/jira/browse/TIKA-1019 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: testDocumentLink.doc, TIKA-1019.patch
[jira] [Assigned] (TIKA-1019) Document links in Word documents don't leave a placeholder
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1019: Assignee: Michael McCandless Document links in Word documents don't leave a placeholder -- Key: TIKA-1019 URL: https://issues.apache.org/jira/browse/TIKA-1019 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3
[jira] [Updated] (TIKA-1019) Document links in Word documents don't leave a placeholder
[ https://issues.apache.org/jira/browse/TIKA-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1019: - Attachment: testDocumentLink.doc TIKA-1019.patch Patch w/ test and fix. Document links in Word documents don't leave a placeholder -- Key: TIKA-1019 URL: https://issues.apache.org/jira/browse/TIKA-1019 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: testDocumentLink.doc, TIKA-1019.patch
[jira] [Resolved] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata
[ https://issues.apache.org/jira/browse/TIKA-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1015. -- Resolution: Fixed Word (.doc) embedded files don't set relationship ID in the Metadata Key: TIKA-1015 URL: https://issues.apache.org/jira/browse/TIKA-1015 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1015.patch
[jira] [Reopened] (TIKA-953) Tika failed to recognize non-ustar Tar file?
[ https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened TIKA-953: - I have another non-ustar tar file that's incorrectly detected as application/octet-stream (though file identifies it as a tar archive) ... Tika failed to recognize non-ustar Tar file? - Key: TIKA-953 URL: https://issues.apache.org/jira/browse/TIKA-953 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.1 Reporter: Jing Li Fix For: 1.2 Attachments: test.tar The file type indeed is POSIX tar archive (GNU) when I use command file in linux, but Tika recognize it as application/xhtml+xml. The class I used with is DefaultDetector. Below is the head data of the file: 99, 102, 101, 114, 98, 114, 97, 99, 104, 101, 46, 48, 48, 54, 55, 54, 50, 55, 57, 45, 53, 54, 54, 55, 50, 52, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 48, 48, 48, 48, 55, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 0, 49, 49, 55, 55, 55, 49, 49, 52, 50, 48, 53, 0, 48, 49, 51, 51, 51, 49, 0, 32, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 117, 115, 116, 97, 114, 32, 32, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-953) Tika failed to recognize non-ustar Tar file?
[ https://issues.apache.org/jira/browse/TIKA-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-953: Attachment: test2.tar file reports this as a tar archive, but: {noformat} cat test2.tar | java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar --detect {noformat} says application/octet-stream. I created the tar file with 7z: {noformat} 7z a -ttar test2.tar New\ Text\ Document.txt {noformat} Tika failed to recognize non-ustar Tar file? - Key: TIKA-953 URL: https://issues.apache.org/jira/browse/TIKA-953 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.1 Reporter: Jing Li Fix For: 1.2 Attachments: test2.tar, test.tar The file type indeed is POSIX tar archive (GNU) when I use command file in linux, but Tika recognize it as application/xhtml+xml. The class I used with is DefaultDetector. Below is the head data of the file: 99, 102, 101, 114, 98, 114, 97, 99, 104, 101, 46, 48, 48, 54, 55, 54, 50, 55, 57, 45, 53, 54, 54, 55, 50, 52, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 48, 48, 48, 48, 55, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 0, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 0, 49, 49, 55, 55, 55, 49, 49, 52, 50, 48, 53, 0, 48, 49, 51, 51, 51, 49, 0, 32, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 117, 115, 116, 97, 114, 32, 32, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 114, 111, 111, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
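A note on how a tar archive can be recognised when the "ustar" magic at offset 257 is missing: the 512-byte header still carries an octal checksum at offset 148, computed over the whole header with the checksum field itself counted as eight spaces. The sketch below checks that property using only the JDK. It is an illustration of the general technique, not the change made in Tika for this issue; the class name `TarChecksumSketch` and the helper `sampleHeader` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class TarChecksumSketch {
    static final int HEADER_SIZE = 512;
    static final int CHECKSUM_OFFSET = 148;
    static final int CHECKSUM_LENGTH = 8;

    /** Sums all header bytes as unsigned values, counting the checksum field as eight spaces. */
    static int computedChecksum(byte[] header) {
        int sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inField = i >= CHECKSUM_OFFSET && i < CHECKSUM_OFFSET + CHECKSUM_LENGTH;
            sum += inField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Parses the octal digits stored in the checksum field; returns -1 if there are none. */
    static int storedChecksum(byte[] header) {
        int i = CHECKSUM_OFFSET;
        int end = CHECKSUM_OFFSET + CHECKSUM_LENGTH;
        while (i < end && header[i] == ' ') {
            i++; // some writers pad with leading spaces
        }
        int value = 0;
        int digits = 0;
        while (i < end && header[i] >= '0' && header[i] <= '7') {
            value = value * 8 + (header[i] - '0');
            i++;
            digits++;
        }
        return digits == 0 ? -1 : value;
    }

    /** True when the first 512 bytes carry a checksum that matches their content. */
    static boolean looksLikeTarHeader(byte[] block) {
        return block != null
                && block.length >= HEADER_SIZE
                && storedChecksum(block) == computedChecksum(block);
    }

    /** Hypothetical demo helper: a header holding only a file name and a valid checksum. */
    static byte[] sampleHeader(String name) {
        byte[] header = new byte[HEADER_SIZE];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, Math.min(nameBytes.length, 100));
        byte[] octal = String.format("%06o", computedChecksum(header))
                .getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(octal, 0, header, CHECKSUM_OFFSET, octal.length);
        header[CHECKSUM_OFFSET + 6] = 0;
        header[CHECKSUM_OFFSET + 7] = ' ';
        return header;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTarHeader(sampleHeader("a.txt")));
        System.out.println(looksLikeTarHeader(new byte[HEADER_SIZE]));
    }
}
```

A checksum match alone can still produce false positives on arbitrary binary data, so a real detector would probably combine it with other sanity checks on the header fields.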
[jira] [Created] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata
Michael McCandless created TIKA-1015: Summary: Word (.doc) embedded files don't set relationship ID in the Metadata Key: TIKA-1015 URL: https://issues.apache.org/jira/browse/TIKA-1015 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3
[jira] [Updated] (TIKA-1015) Word (.doc) embedded files don't set relationship ID in the Metadata
[ https://issues.apache.org/jira/browse/TIKA-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1015: - Attachment: TIKA-1015.patch Simple patch, but my only slight hesitation is I added an argument to the protected AbstractPOIFSExtractor.handleEmbeddedResource method. I'm assuming this is OK to do (it's intended for the concrete per-document-type subclasses we have), but if an expert Tika user out there has a custom subclass, and they invoke this method, then they'll have to update their sources ... but this is very expert so I think it's OK. Word (.doc) embedded files don't set relationship ID in the Metadata Key: TIKA-1015 URL: https://issues.apache.org/jira/browse/TIKA-1015 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1015.patch
[jira] [Resolved] (TIKA-1011) Exception (Null charset name) processing .mhtml file
[ https://issues.apache.org/jira/browse/TIKA-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1011. -- Resolution: Fixed Exception (Null charset name) processing .mhtml file Key: TIKA-1011 URL: https://issues.apache.org/jira/browse/TIKA-1011 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1011.patch
This small test.mhtml file:
{noformat}
From: Saved by Windows Internet Explorer 8
Subject: Index Pages
Date: Tue, 28 Aug 2012 09:53:28 +0300
MIME-Version: 1.0
Content-Type: multipart/related; type=multipart/alternative; boundary==_NextPart_000__01CD8502.F991E790
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157

This is a multi-part message in MIME format.

--=_NextPart_000__01CD8502.F991E790
Content-Type: multipart/alternative; boundary==_NextPart_001_0023_01CD8502.F99DCE70

--=_NextPart_001_0023_01CD8502.F99DCE70
Content-Type: text/html; charset=x-user-defined
Content-Transfer-Encoding: quoted-printable
{noformat}
Hits this exception when run through TikaCLI:
{noformat}
<?xml version="1.0" encoding="UTF-8"?>Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.html.HtmlParser@37e67d34
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: java.lang.IllegalArgumentException: Null charset name
    at java.nio.charset.Charset.lookup(Charset.java:467)
    at java.nio.charset.Charset.forName(Charset.java:540)
    at org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
    at org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
    at org.apache.tika.parser.txt.Icu4jEncodingDetector.detect(Icu4jEncodingDetector.java:49)
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:51)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:92)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:98)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 11 more
{noformat}
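The trace for TIKA-1011 ends in `Charset.forName` being handed a null name, which the JDK rejects with `IllegalArgumentException`. A minimal sketch of a defensive lookup follows, assuming the caller is happy to treat a missing, malformed, or unsupported charset hint as "no hint". The class `SafeCharsetLookup` is hypothetical, and the contents of TIKA-1011.patch are not shown in this thread, so this is not necessarily how the issue was fixed.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Optional;

class SafeCharsetLookup {
    /**
     * Resolves a declared charset name without throwing.
     * Returns an empty Optional for null, blank, syntactically illegal,
     * or unsupported names, so the caller can fall back to detection.
     */
    static Optional<Charset> lookup(String declaredName) {
        if (declaredName == null || declaredName.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(declaredName.trim()));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup(null));
        System.out.println(lookup("UTF-8"));
        System.out.println(lookup("x-user-defined"));
    }
}
```

Returning `Optional` rather than null keeps the "no usable hint" case explicit at the call site, which is the case the test document's `charset=x-user-defined` header appears to trigger.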
[jira] [Created] (TIKA-1011) Exception (Null charset name) processing .mhtml file
Michael McCandless created TIKA-1011: Summary: Exception (Null charset name) processing .mhtml file Key: TIKA-1011 URL: https://issues.apache.org/jira/browse/TIKA-1011 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3
This small test.mhtml file:
{noformat}
From: Saved by Windows Internet Explorer 8
Subject: Index Pages
Date: Tue, 28 Aug 2012 09:53:28 +0300
MIME-Version: 1.0
Content-Type: multipart/related; type=multipart/alternative; boundary==_NextPart_000__01CD8502.F991E790
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157

This is a multi-part message in MIME format.

--=_NextPart_000__01CD8502.F991E790
Content-Type: multipart/alternative; boundary==_NextPart_001_0023_01CD8502.F99DCE70

--=_NextPart_001_0023_01CD8502.F99DCE70
Content-Type: text/html; charset=x-user-defined
Content-Transfer-Encoding: quoted-printable
{noformat}
Hits this exception when run through TikaCLI:
{noformat}
<?xml version="1.0" encoding="UTF-8"?>Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.html.HtmlParser@37e67d34
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: java.lang.IllegalArgumentException: Null charset name
    at java.nio.charset.Charset.lookup(Charset.java:467)
    at java.nio.charset.Charset.forName(Charset.java:540)
    at org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
    at org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
    at org.apache.tika.parser.txt.Icu4jEncodingDetector.detect(Icu4jEncodingDetector.java:49)
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:51)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:92)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:98)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 11 more
{noformat}
[jira] [Updated] (TIKA-1011) Exception (Null charset name) processing .mhtml file
[ https://issues.apache.org/jira/browse/TIKA-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1011: - Attachment: TIKA-1011.patch Exception (Null charset name) processing .mhtml file Key: TIKA-1011 URL: https://issues.apache.org/jira/browse/TIKA-1011 Project: Tika Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-1011.patch
This small test.mhtml file:
{noformat}
From: Saved by Windows Internet Explorer 8
Subject: Index Pages
Date: Tue, 28 Aug 2012 09:53:28 +0300
MIME-Version: 1.0
Content-Type: multipart/related; type=multipart/alternative; boundary==_NextPart_000__01CD8502.F991E790
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157

This is a multi-part message in MIME format.

--=_NextPart_000__01CD8502.F991E790
Content-Type: multipart/alternative; boundary==_NextPart_001_0023_01CD8502.F99DCE70

--=_NextPart_001_0023_01CD8502.F99DCE70
Content-Type: text/html; charset=x-user-defined
Content-Transfer-Encoding: quoted-printable
{noformat}
Hits this exception when run through TikaCLI:
{noformat}
<?xml version="1.0" encoding="UTF-8"?>Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.html.HtmlParser@37e67d34
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
Caused by: java.lang.IllegalArgumentException: Null charset name
    at java.nio.charset.Charset.lookup(Charset.java:467)
    at java.nio.charset.Charset.forName(Charset.java:540)
    at org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
    at org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
    at org.apache.tika.parser.txt.Icu4jEncodingDetector.detect(Icu4jEncodingDetector.java:49)
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:51)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:92)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:98)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 11 more
{noformat}
[jira] [Created] (TIKA-1010) Embedded documents in RTF are not extracted
Michael McCandless created TIKA-1010: Summary: Embedded documents in RTF are not extracted Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless When an RTF doc embeds a doc it looks like this: {noformat} {\object\objemb \objw628\objh765{\*\objclass Package}{\*\objdata 0105020008005061636b616765006600 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400 5404bbfaee00080054044505 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300} {noformat} But, unfortunately, the format of those hex bytes is not spelled out in the RTF spec ... the spec merely says the bytes are saved by the OLESaveToStream function ... and I haven't been able to find a description of what the bytes mean. In this case they are a Package object (\objclass Package), which I think is an [old?] way to wrap any non-OLE file (this is just a .txt file). 
Here's the hex dump:
{noformat}
0000  01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b  |............Pack|
0010  61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00  |age.........f...|
0020  02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
0030  4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
0040  74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22  |top\HW.txt....."|
0050  00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
0060  67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
0070  2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
0080  6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00  |orld............|
0090  00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
00a0  04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45  |.............T.E|
00b0  05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c  |.........s......|
00c0  00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05  |................|
00d0  00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5  |.....2.)........|
00e0  ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00  |................|
00f0  00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
0100  00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
0110  83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00  |.....-..........|
0120  00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00  |................|
0130  00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00  |................|
0140  00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
0150  00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00  |................|
0160  00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65  |.........."Syste|
0170  6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
0180  00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d  |...............-|
0190  01 01 00 03 00 00 00 00 00                       |.........|
0199
{noformat}
Anyway, I have no idea how to decode the bytes at this point ... just opening the issue in case anyone else does!
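One observation from the hex dump above, offered as a guess rather than a decoding of the format: the first 32 bytes read cleanly as little-endian 32-bit integers and length-prefixed, NUL-terminated strings. Under that reading the block starts with 0x501, then 2, then the 8-byte string "Package", two zero-length strings, and the value 0x66. Taking 0x66 as a payload length lands at offset 0x86, where the bytes 01 05 00 00 appear again followed by a length-prefixed "METAFILEPICT", which is at least consistent. The sketch below only parses those leading fields; the field names (`version`, `formatId`, `topic`, `item`, `payloadBytes`) and the class `ObjDataHeaderSketch` are hypothetical labels, not names taken from any specification quoted in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class ObjDataHeaderSketch {
    /** First 32 bytes of the dump in this issue (offsets 0000 and 0010). */
    static final String FIRST_ROWS =
            "01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b "
          + "61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00";

    /** Converts hex text to bytes, ignoring whitespace. */
    static byte[] fromHex(String hex) {
        String clean = hex.replaceAll("\\s+", "");
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads a 4-byte little-endian length, then that many bytes, dropping a trailing NUL. */
    static String readLengthPrefixedString(ByteBuffer buf) {
        int length = buf.getInt();
        byte[] raw = new byte[length];
        buf.get(raw);
        int end = (length > 0 && raw[length - 1] == 0) ? length - 1 : length;
        return new String(raw, 0, end, StandardCharsets.US_ASCII);
    }

    /** Parses the leading fields under the assumed layout and renders them as text. */
    static String describe(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int version = buf.getInt();                   // assumed: a version marker
        int formatId = buf.getInt();                  // assumed: a format/type id
        String className = readLengthPrefixedString(buf);
        String topic = readLengthPrefixedString(buf); // assumed: empty string
        String item = readLengthPrefixedString(buf);  // assumed: empty string
        int payloadBytes = buf.getInt();              // assumed: length of what follows
        return String.format(
                "version=0x%x formatId=%d class=%s topic='%s' item='%s' payloadBytes=%d",
                version, formatId, className, topic, item, payloadBytes);
    }

    public static void main(String[] args) {
        System.out.println(describe(fromHex(FIRST_ROWS)));
    }
}
```

If the reading holds, the 102 payload bytes starting at offset 0x20 would contain the file name, the two copies of the original path, and the 11-byte text "Hello World" that are visible in the ASCII column.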
[jira] [Updated] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-1005: - Attachment: TIKA-1005.patch Patch w/ test ... In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out. --- Key: TIKA-1005 URL: https://issues.apache.org/jira/browse/TIKA-1005 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 (32bit and 64bit each) Reporter: David A. Patterson Assignee: Michael McCandless Attachments: Textbox example.docx, TIKA-1005.patch Text inside a textbox, which itself can be in the body, the header or the footer, is not extracted using any type of parser (including AutoDetectParser) in combination with any type of ContentHandler. This is NOT a duplicate of TIKA-904. This specifically concerns the .docx file format.
[jira] [Assigned] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator
[ https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1006: Assignee: Michael McCandless NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator -- Key: TIKA-1006 URL: https://issues.apache.org/jira/browse/TIKA-1006 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: Sture Svensson Assignee: Michael McCandless Priority: Minor Attachments: fix.patch The following line TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(style.getName(),paragraph.getPartType() == BodyType.TABLECELL); throws an NPE if style is null. This should be checked; a patch is attached.
[jira] [Commented] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator
[ https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474947#comment-13474947 ] Michael McCandless commented on TIKA-1006: -- Thanks Sture, that patch looks good! Do you have an example .docx showing the issue? Would be nice to commit a test case along with the bug fix ... NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator -- Key: TIKA-1006 URL: https://issues.apache.org/jira/browse/TIKA-1006 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: Sture Svensson Assignee: Michael McCandless Priority: Minor Attachments: fix.patch The following line TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(style.getName(),paragraph.getPartType() == BodyType.TABLECELL); throws an NPE if style is null. This should be checked; a patch is attached.
[jira] [Assigned] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned TIKA-1005: Assignee: Michael McCandless In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out. --- Key: TIKA-1005 URL: https://issues.apache.org/jira/browse/TIKA-1005 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 (32bit and 64bit each) Reporter: David A. Patterson Assignee: Michael McCandless Attachments: Textbox example.docx Text inside a textbox, which itself can be in the body, the header or the footer, is not extracted using any type of parser (including AutoDetectParser) in combination with any type of ContentHandler. This is NOT a duplicate of TIKA-904. This specifically concerns the .docx file format.
[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474958#comment-13474958 ] Michael McCandless commented on TIKA-1005: -- Thanks David, I'll dig! In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out. --- Key: TIKA-1005 URL: https://issues.apache.org/jira/browse/TIKA-1005 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 (32bit and 64bit each) Reporter: David A. Patterson Assignee: Michael McCandless Attachments: Textbox example.docx Text inside a textbox, which itself can be in the body, the header or the footer, is not extracted using any type of parser (including AutoDetectParser) in combination with any type of ContentHandler. This is NOT a duplicate of TIKA-904. This specifically concerns the .docx file format.
[jira] [Resolved] (TIKA-1006) NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator
[ https://issues.apache.org/jira/browse/TIKA-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-1006. -- Resolution: Fixed Fix Version/s: 1.3 Thanks Sture, I just committed the test document fix! NPE in extractParagraph (styleClass) in XWPFWordExtractorDecorator -- Key: TIKA-1006 URL: https://issues.apache.org/jira/browse/TIKA-1006 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: Sture Svensson Assignee: Michael McCandless Priority: Minor Fix For: 1.3 Attachments: example.docx, fix.patch The following line TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(style.getName(),paragraph.getPartType() == BodyType.TABLECELL); throws an NPE if style is null. This should be checked; a patch is attached.
[jira] [Commented] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.
[ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13474250#comment-13474250 ] Michael McCandless commented on TIKA-1005: -- Could you attach an example showing the problem? Thanks. In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out. --- Key: TIKA-1005 URL: https://issues.apache.org/jira/browse/TIKA-1005 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 (32bit and 64bit each) Reporter: David A. Patterson Text inside a textbox, which itself can be in the body, the header or the footer, is not extracted using any type of parser (including AutoDetectParser) in combination with any type of ContentHandler. This is NOT a duplicate of TIKA-904. This specifically concerns the .docx file format.
[jira] [Resolved] (TIKA-997) Leave a placeholder when documents are embedded in .pptx documents
[ https://issues.apache.org/jira/browse/TIKA-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved TIKA-997. - Resolution: Fixed Fix Version/s: 1.3 Leave a placeholder when documents are embedded in .pptx documents -- Key: TIKA-997 URL: https://issues.apache.org/jira/browse/TIKA-997 Project: Tika Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3 Attachments: TIKA-997.patch Just like TIKA-956, we should leave a <div class="embedded" id="XXX"> to record where a given sub-document appeared.
[jira] [Updated] (TIKA-997) Leave a placeholder when documents are embedded in .pptx documents
[ https://issues.apache.org/jira/browse/TIKA-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-997: Attachment: TIKA-997.patch Patch. It's not perfect, because the placeholder will appear at the end of the slide that embedded the document. I think to do better we'd need to parse out x/y positions of each element and sort that, but that's going to get rather hairy ... so at least this is progress. Leave a placeholder when documents are embedded in .pptx documents -- Key: TIKA-997 URL: https://issues.apache.org/jira/browse/TIKA-997 Project: Tika Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Attachments: TIKA-997.patch Just like TIKA-956, we should leave a <div class="embedded" id="XXX"> to record where a given sub-document appeared.
[jira] [Created] (TIKA-999) RTF Parser doesn't extract page/word/character count metadata
Michael McCandless created TIKA-999: --- Summary: RTF Parser doesn't extract page/word/character count metadata Key: TIKA-999 URL: https://issues.apache.org/jira/browse/TIKA-999 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 1.3