[jira] [Commented] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources
[ https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492084#comment-14492084 ] Hong-Thai Nguyen commented on TIKA-1600: The root exception is an NPE when parsing ODT files with elements in a footnote:
{code}
java.lang.NullPointerException
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startSpan(OpenDocumentContentParser.java:174)
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startElement(OpenDocumentContentParser.java:287)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:69)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2756)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:647)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.odf.OpenDocumentContentParser.parseInternal(OpenDocumentContentParser.java:503)
    at org.apache.tika.parser.odf.OpenDocumentParser.handleZipEntry(OpenDocumentParser.java:187)
    at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:164)
    at org.apache.tika.parser.odf.OpenDocumentParserTest.can_parse_odt_file(OpenDocumentParserTest.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
{code}
It seems that style support for ODF was added only recently, in 1.8:
{noformat}
Revision: 107
Author: tpalsulich
Date: Saturday, 14 March 2015 00:25:53
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386900#comment-14386900 ] Hong-Thai Nguyen commented on TIKA-1581: Many thanks also to [~kkrugler] for the investigation and effort that went into getting jhighlight 1.0.2 released.
jhighlight license concerns --- Key: TIKA-1581 URL: https://issues.apache.org/jira/browse/TIKA-1581 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Karl Wright Fix For: 1.8
jhighlight jar is a Tika dependency. The Lucene team discovered that, while it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL only: {code} Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself as dual CDDL or LGPL license. However, some of its classes are distributed only under LGPL, e.g. com.uwyn.jhighlight.highlighter. CppHighlighter.java GroovyHighlighter.java JavaHighlighter.java XmlHighlighter.java I downloaded the sources from Maven (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar) to confirm that, and also found this SVN repo: http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's website seems to not exist anymore (https://jhighlight.dev.java.net/). I didn't find any direct usage of it in our code, so I guess it's probably needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, things will compile, but may fail at runtime. {code} Is it possible to remove this dependency for future releases, or allow only optional inclusion of this package? It is of concern to the ManifoldCF project because we distribute a binary package that includes Tika and its required dependencies, which currently includes jHighlight.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1581: --- Fix Version/s: 1.8
[jira] [Resolved] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1581. Resolution: Fixed
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432 ] Hong-Thai Nguyen commented on TIKA-1581: I have also contacted 'gbe...@uwyn.com', which seems to be his email address. Let's wait a few days for his feedback. Otherwise, we can create an 'unshipped' module to group all parsers and their dependencies that are not under the Apache license.
[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432 ] Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:10 PM: - I have also contacted 'gbe...@uwyn.com', which seems to be his email address. Let's wait a few days for his feedback. Otherwise, we can create an 'unshipped' module to group all parsers and their dependencies that are not under the Apache license. [~steve_rowe], the forked version you mentioned doesn't change anything about the original license terms of JHighlight.
was (Author: thaichat04): I have also contacted 'gbe...@uwyn.com', which seems to be his email address. Let's wait a few days for his feedback. Otherwise, we can create an 'unshipped' module to group all parsers and their dependencies that are not under the Apache license.
[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432 ] Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:36 PM: - I have also contacted 'gbe...@uwyn.com', which seems to be his email address. Let's wait a few days for his feedback. Otherwise, we can create an 'unshipped' module to group all parsers and their dependencies that are not under the Apache license.
was (Author: thaichat04): I have also contacted 'gbe...@uwyn.com', which seems to be his email address. Let's wait a few days for his feedback. Otherwise, we can create an 'unshipped' module to group all parsers and their dependencies that are not under the Apache license. [~steve_rowe], the forked version you mentioned doesn't change anything about the original license terms of JHighlight.
[jira] [Commented] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3
[ https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264786#comment-14264786 ] Hong-Thai Nguyen commented on TIKA-1505: Can you also provide the problem files and tests? Also, 1.7 is in the process of being released; this issue is not really blocking, so we can postpone it to 1.8.
chmparser breaks down when extracting from file of CHM format v3 Key: TIKA-1505 URL: https://issues.apache.org/jira/browse/TIKA-1505 Project: Tika Issue Type: Bug Reporter: Bin Hawking Fix For: 1.7
chmparser throws an exception or returns faulty text when:
1. extracting from file of CHM format version 3
2. chm file with lzx reset interval 2
3. chm file with 5000 objects
I am making the fix now.
[jira] [Resolved] (TIKA-1447) CHM parser: wrong directory list
[ https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1447. Resolution: Fixed
CHM parser: wrong directory list Key: TIKA-1447 URL: https://issues.apache.org/jira/browse/TIKA-1447 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Bin Hawking Priority: Critical
The CHM parser gets a wrong directory list for a chm file (e.g. testChm2.chm in tika-parser's test-resources):
1. Duplicate entries (mostly from PMGI chunks, which should have been ignored).
2. Invalid entries (usually with an unreadable entry name).
3. Missed entries (sometimes it is like TIKA-1176).
I have fixed it (to some degree) by using the PMGL header to find dir chunks and their respective meaningful parts.
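The approach described in TIKA-1447, keeping PMGL (listing) chunks and ignoring PMGI (index) chunks, can be illustrated with a minimal sketch. It assumes each directory chunk is available as a byte array that begins with its 4-byte ASCII signature; the class and method names are hypothetical and are not taken from Tika's CHM parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: filters CHM directory chunks by signature.
final class DirectoryChunkSketch {
    private static final String LISTING_SIGNATURE = "PMGL";

    /** Returns true only if the chunk starts with the PMGL listing signature. */
    static boolean isListingChunk(byte[] chunk) {
        if (chunk == null || chunk.length < 4) {
            return false;
        }
        String signature = new String(chunk, 0, 4, StandardCharsets.US_ASCII);
        return LISTING_SIGNATURE.equals(signature);
    }

    /** Keeps listing chunks; index chunks (PMGI) and anything else are skipped. */
    static List<byte[]> listingChunks(List<byte[]> chunks) {
        List<byte[]> result = new ArrayList<>();
        for (byte[] chunk : chunks) {
            if (isListingChunk(chunk)) {
                result.add(chunk);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<byte[]> chunks = new ArrayList<>();
        chunks.add("PMGL-entries".getBytes(StandardCharsets.US_ASCII));
        chunks.add("PMGI-index".getBytes(StandardCharsets.US_ASCII));
        System.out.println("listing chunks kept: " + listingChunks(chunks).size());
    }
}
```

Skipping the index chunks in this way is one plausible reason the duplicate entries would disappear, since the index only points back at entries already present in the listing chunks.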
[jira] [Resolved] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1446. Resolution: Fixed
CHM parser : wrong decompression of aligned blocks -- Key: TIKA-1446 URL: https://issues.apache.org/jira/browse/TIKA-1446 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Bin Hawking Priority: Critical Attachments: chm.zip
If an embedded file contains aligned blocks, the parser outputs chaotic or empty text for this file. I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, align tree and length tree, and some of the trees are built incorrectly.
[jira] [Updated] (TIKA-1447) CHM parser: wrong directory list
[ https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1447: --- Fix Version/s: 1.7
[jira] [Resolved] (TIKA-1448) CHM parser : defect in file extraction
[ https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1448. Resolution: Fixed
CHM parser : defect in file extraction -- Key: TIKA-1448 URL: https://issues.apache.org/jira/browse/TIKA-1448 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Bin Hawking Fix For: 1.7
In the ChmBlockInfo class, the statement
    chmBlockInfo.setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock) % (int) clcd.getResetInterval());
always sets 0. According to the LZX algorithm, it should be
    chmBlockInfo.setIniBlock(chmBlockInfo.startBlock - chmBlockInfo.startBlock % (int) clcd.getResetInterval());
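The difference between the two expressions quoted in TIKA-1448 comes down to operator precedence: in Java, % binds more tightly than binary minus. The sketch below isolates the arithmetic in two plain methods; the class name, method names and the sample reset interval are illustrative assumptions, not Tika code.

```java
// Hypothetical stand-in for the arithmetic discussed in TIKA-1448; not Tika's ChmBlockInfo.
final class IniBlockDemo {
    /** Parenthesised form: (startBlock - startBlock) is 0, so the result is always 0. */
    static int buggyIniBlock(int startBlock, int resetInterval) {
        return (startBlock - startBlock) % resetInterval;
    }

    /**
     * Form proposed in the issue: % is evaluated first, so startBlock is
     * rounded down to the nearest multiple of resetInterval.
     */
    static int fixedIniBlock(int startBlock, int resetInterval) {
        return startBlock - startBlock % resetInterval;
    }

    public static void main(String[] args) {
        int resetInterval = 4; // assumed value, for illustration only
        for (int startBlock = 0; startBlock < 10; startBlock++) {
            System.out.println(startBlock
                    + " -> buggy=" + buggyIniBlock(startBlock, resetInterval)
                    + ", fixed=" + fixedIniBlock(startBlock, resetInterval));
        }
    }
}
```

With the corrected form, the initial block is the last block at or before startBlock that falls on a reset-interval boundary, which is presumably where decompression has to restart.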
[jira] [Resolved] (TIKA-1430) CHM parser gets faulty text (fix found)
[ https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1430. Resolution: Fixed
CHM parser gets faulty text (fix found) --- Key: TIKA-1430 URL: https://issues.apache.org/jira/browse/TIKA-1430 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5, 1.6 Environment: Windows 7; JDK 7 or 8 Reporter: Bin Hawking Priority: Critical Fix For: 1.7
Tika extracts partially wrong text from CHM files, including the chm files in tika-parsers/src/test/resources/test-documents/testChm*.chm. I tried 1.6 and 1.5, with the same bad result. I wonder why no one complained before?
I checked the source code. The cause is obvious: when Tika decompresses the LZX data, the first block is handled correctly, but for the 2nd block and later, Tika uses the previous content as the compressed data. See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
    if (prevBlock != null && prevBlock.getState().getBlockLength() > prevBlock.getState().getBlockRemaining())
        setChmSection(new ChmSection(prevBlock.getContent())); // NOTE: the dataSegment to be decompressed is not kept
    else
        setChmSection(new ChmSection(dataSegment));
My fix:
1. Add a prevcontent member variable to the ChmSection class, so that dataSegment and prevBlock.getContent() are both kept in it.
2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(), pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
Now I have tried some chm files and got correct-looking text. BTW, the unit test should be tougher, since in this case a small text (the first block) is decompressed correctly.
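The first step of the fix described in TIKA-1430 can be sketched as a section object that always holds the new compressed segment and optionally the previous block's output. This is a simplified illustration under assumptions: SectionSketch and forBlock are hypothetical names, and the real ChmSection and ChmLzxBlock carry considerably more state.

```java
// Hypothetical, simplified stand-in for the idea in TIKA-1430; not Tika's ChmSection.
final class SectionSketch {
    private final byte[] data;        // compressed segment that still has to be decoded
    private final byte[] prevContent; // decompressed output of the previous block, or null

    SectionSketch(byte[] data) {
        this(data, null);
    }

    SectionSketch(byte[] data, byte[] prevContent) {
        this.data = data.clone();
        this.prevContent = prevContent == null ? null : prevContent.clone();
    }

    byte[] getData() {
        return data.clone();
    }

    byte[] getPrevContent() {
        return prevContent == null ? null : prevContent.clone();
    }

    /**
     * Builds the section for a block. The compressed segment is always kept;
     * the previous content is attached only when the previous block was left unfinished.
     */
    static SectionSketch forBlock(byte[] dataSegment, byte[] previousContent,
                                  boolean previousBlockUnfinished) {
        if (previousBlockUnfinished && previousContent != null) {
            return new SectionSketch(dataSegment, previousContent);
        }
        return new SectionSketch(dataSegment);
    }

    public static void main(String[] args) {
        SectionSketch section = forBlock(new byte[] {1, 2, 3}, new byte[] {9, 9}, true);
        System.out.println("data bytes: " + section.getData().length
                + ", has previous content: " + (section.getPrevContent() != null));
    }
}
```

The point of the design is that the previous content supplements the data to be decoded rather than replacing it, which is the defect the issue describes.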
[jira] [Updated] (TIKA-1430) CHM parser gets faulty text (fix found)
[ https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1430: --- Fix Version/s: 1.7
[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1446: --- Fix Version/s: 1.7
[jira] [Updated] (TIKA-1448) CHM parser : defect in file extraction
[ https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1448: --- Fix Version/s: 1.7
[jira] [Updated] (TIKA-672) Proper error handling in the CHM parser
[ https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-672: -- Fix Version/s: 1.7
Proper error handling in the CHM parser --- Key: TIKA-672 URL: https://issues.apache.org/jira/browse/TIKA-672 Project: Tika Issue Type: Bug Components: parser Reporter: Jukka Zitting Priority: Minor Fix For: 1.7
The new CHM parser (TIKA-245) swallows exceptions and uses System.err and System.out prints to report problems in many places. We should change that to properly throw exceptions as follows:
- IOExceptions when the document stream can not be read
- TikaExceptions when the stream can not be parsed
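The split requested in TIKA-672, I/O failures as IOException and format problems as a parse exception, can be sketched as follows. To keep the example free of dependencies, ParseFailedException is a hypothetical stand-in for TikaException, and readSignature is an illustrative helper that checks the "ITSF" signature found at the start of CHM files; neither is taken from Tika's CHM parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the error-handling policy in TIKA-672; not Tika code.
final class ChmErrorHandlingSketch {
    /** Stand-in for TikaException so that the sketch compiles on its own. */
    static final class ParseFailedException extends Exception {
        ParseFailedException(String message) {
            super(message);
        }
    }

    /**
     * Reads the 4-byte file signature. Read failures propagate as IOException;
     * a truncated or unexpected signature is reported as ParseFailedException
     * instead of being printed to System.err and swallowed.
     */
    static String readSignature(InputStream in) throws IOException, ParseFailedException {
        byte[] signature = new byte[4];
        int offset = 0;
        while (offset < signature.length) {
            int read = in.read(signature, offset, signature.length - offset);
            if (read < 0) {
                throw new ParseFailedException(
                        "Truncated header: expected 4 bytes, got " + offset);
            }
            offset += read;
        }
        String text = new String(signature, StandardCharsets.US_ASCII);
        if (!"ITSF".equals(text)) {
            throw new ParseFailedException("Not a CHM file, signature was: " + text);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        byte[] header = "ITSF....".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readSignature(new ByteArrayInputStream(header)));
    }
}
```

Callers can then decide how to react to each failure type, which is not possible when the parser only prints a message and continues.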
[jira] [Resolved] (TIKA-672) Proper error handling in the CHM parser
[ https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-672. --- Resolution: Fixed Checked: there are no more System.err/System.out calls inside the CHM parser.
[jira] [Commented] (TIKA-1447) CHM parser: wrong directory list
[ https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214535#comment-14214535 ] Hong-Thai Nguyen commented on TIKA-1447: [~binhawking], did the work on TIKA-1446 fix this issue? Any chance you could double-check? Thanks.
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079 ] Hong-Thai Nguyen commented on TIKA-1446: Hi [~binhawking], I've merged your pull request and run a before/after title comparison on a local corpus of CHM files. Before the merge only one file failed; after the merge 10 files fail. I've pushed the failing CHM files under _test-documents/chm_, together with a test case that checks them, to https://github.com/thaichat04/tika. I also did some clean-up. Any chance you could have another look?
[jira] [Comment Edited] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208079#comment-14208079 ] Hong-Thai Nguyen edited comment on TIKA-1446 at 11/12/14 2:38 PM: -- Hi [~binhawking], I've merged your contribution and run a before/after title comparison on a local corpus of CHM files. Before the merge only one file failed; after the merge 10 files fail. I've pushed the failing CHM files under _test-documents/chm_, together with a test case that checks them, to https://github.com/thaichat04/tika. I also did some clean-up. Any chance you could have another look?
was (Author: thaichat04): Hi [~binhawking], I've merged your pull request and run a before/after title comparison on a local corpus of CHM files. Before the merge only one file failed; after the merge 10 files fail. I've pushed the failing CHM files under _test-documents/chm_, together with a test case that checks them, to https://github.com/thaichat04/tika. I also did some clean-up. Any chance you could have another look?
[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196343#comment-14196343 ] Hong-Thai Nguyen commented on TIKA-1463: Thanks [~lfcnassif], it does indeed also work without .exe. BTW, a path containing spaces is buggy. I'm leaving this fix in because adding .exe only on Windows doesn't hurt anything.
TesseractOCRParser does not work in Windows --- Key: TIKA-1463 URL: https://issues.apache.org/jira/browse/TIKA-1463 Project: Tika Issue Type: Bug Reporter: Hong-Thai Nguyen
STR:
* Case 1:
** Set tesseractPath to a common Tesseract installation path: C:\Program Files (x86)\Tesseract-OCR
** The check for an available Tesseract command always returns false
* Case 2:
** Even with tesseractPath set to a value without spaces, say C:\Tesseract-OCR
** the command used to check Tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
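The two cases in TIKA-1463 can be illustrated with a small sketch that builds the executable path per operating system and keeps each argument as a separate list element, so that a directory such as C:\Program Files (x86)\Tesseract-OCR does not need manual quoting when the list is handed to ProcessBuilder. The class, its method names and the way the OS name is passed in are assumptions made for illustration; this is not the code committed in r1636382.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch for the cases in TIKA-1463; not Tika's TesseractOCRParser.
final class TesseractCommandSketch {
    /** Builds the executable path, appending ".exe" only on Windows. */
    static String executablePath(String tesseractPath, String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        String executable = windows ? "tesseract.exe" : "tesseract";
        if (tesseractPath == null || tesseractPath.isEmpty()) {
            return executable; // rely on the PATH lookup
        }
        char last = tesseractPath.charAt(tesseractPath.length() - 1);
        boolean endsWithSeparator = last == '/' || last == '\\';
        String separator = windows ? "\\" : "/";
        return endsWithSeparator
                ? tesseractPath + executable
                : tesseractPath + separator + executable;
    }

    /** Returns the command as a list so each element is passed as one argument. */
    static List<String> command(String tesseractPath, String osName, String... arguments) {
        List<String> command = new ArrayList<>();
        command.add(executablePath(tesseractPath, osName));
        command.addAll(Arrays.asList(arguments));
        return command;
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        // The list could be given to new ProcessBuilder(command); it is only printed here.
        System.out.println(command("C:\\Program Files (x86)\\Tesseract-OCR", osName));
    }
}
```

Passing the path as a single list element rather than concatenating a command string is the usual way to sidestep the space problem mentioned in the comment above.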
[jira] [Created] (TIKA-1463) TesseractOCRParser does work in Windows
Hong-Thai Nguyen created TIKA-1463:
--
Summary: TesseractOCRParser does work in Windows
Key: TIKA-1463
URL: https://issues.apache.org/jira/browse/TIKA-1463
Project: Tika
Issue Type: Bug
Reporter: Hong-Thai Nguyen

STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the check for an available Tesseract command always returns false
* Case 2:
** Even when tesseractPath is set to a value with no wildcard, say C:\Tesseract-OCR
** the command run to check for Tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
[jira] [Commented] (TIKA-1463) TesseractOCRParser does work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194694#comment-14194694 ]

Hong-Thai Nguyen commented on TIKA-1463:
Fixed in r1636382

TesseractOCRParser does work in Windows
---
Key: TIKA-1463
URL: https://issues.apache.org/jira/browse/TIKA-1463
Project: Tika
Issue Type: Bug
Reporter: Hong-Thai Nguyen

STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the check for an available Tesseract command always returns false
* Case 2:
** Even when tesseractPath is set to a value with no wildcard, say C:\Tesseract-OCR
** the command run to check for Tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1463:
---
Summary: TesseractOCRParser does not work in Windows (was: TesseractOCRParser does work in Windows)

TesseractOCRParser does not work in Windows
---
Key: TIKA-1463
URL: https://issues.apache.org/jira/browse/TIKA-1463
Project: Tika
Issue Type: Bug
Reporter: Hong-Thai Nguyen

STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the check for an available Tesseract command always returns false
* Case 2:
** Even when tesseractPath is set to a value with no wildcard, say C:\Tesseract-OCR
** the command run to check for Tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1463:
---
Description:
STR:
* Case 1:
** Setting tesseractPath to a common installation path of Tesseract: C:\Program Files (x86)\Tesseract-OCR
** the check for an available Tesseract command always returns false
* Case 2:
** Even when tesseractPath is set to a value without spaces, say C:\Tesseract-OCR
** the command run to check for Tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe

was:
STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false
* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking running command of tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe

TesseractOCRParser does not work in Windows
---
Key: TIKA-1463
URL: https://issues.apache.org/jira/browse/TIKA-1463
Project: Tika
Issue Type: Bug
Reporter: Hong-Thai Nguyen

STR:
* Case 1:
** Setting tesseractPath to a common installation path of Tesseract: C:\Program Files (x86)\Tesseract-OCR
** the check for an available Tesseract command always returns false
* Case 2:
** Even when tesseractPath is set to a value without spaces, say C:\Tesseract-OCR
** the command run to check for Tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
[jira] [Closed] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen closed TIKA-1463.
--
Resolution: Fixed

TesseractOCRParser does not work in Windows
---
Key: TIKA-1463
URL: https://issues.apache.org/jira/browse/TIKA-1463
Project: Tika
Issue Type: Bug
Reporter: Hong-Thai Nguyen

STR:
* Case 1:
** Setting tesseractPath to a common installation path of Tesseract: C:\Program Files (x86)\Tesseract-OCR
** the check for an available Tesseract command always returns false
* Case 2:
** Even when tesseractPath is set to a value without spaces, say C:\Tesseract-OCR
** the command run to check for Tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181530#comment-14181530 ]

Hong-Thai Nguyen commented on TIKA-1446:
Thanks a lot [~binhawking]. I've had a quick look at your fix. Indeed, there are quite a lot of changes. After cleaning up and fixing some minor issues, I broke the CHM tests. We really appreciate your contribution and should continue to finalize it.
I've created a new pull request based on a branch with your fix plus my cleanup:
https://github.com/apache/tika/pull/21
https://github.com/thaichat04/tika.git, branch TIKA-1446

CHM parser : wrong decompression of aligned blocks
--
Key: TIKA-1446
URL: https://issues.apache.org/jira/browse/TIKA-1446
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
Attachments: chm.zip

If an embedded file contains aligned blocks, the parser outputs chaotic or empty text for this file. I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree/align tree/length tree, and some trees are built incorrectly.
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178186#comment-14178186 ] Hong-Thai Nguyen commented on TIKA-1422: Applied latest fix on r1633325 with some formatting. Thank org.apache.tika.parser.mail.RFC822ParserTest fails -- Key: TIKA-1422 URL: https://issues.apache.org/jira/browse/TIKA-1422 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch I'm seeing test failures from: {noformat} Results : Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..) Tests run: 538, Failures: 1, Errors: 0, Skipped: 1 {noformat} CentOS6 VM image, running: {noformat} [mattmann@memex tika]$ java -version java version 1.7.0_67 Java(TM) SE Runtime Environment (build 1.7.0_67-b01) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) [mattmann@memex tika]$ mvn -version Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00) Maven home: /usr/share/apache-maven Java version: 1.7.0_65, vendor: Oracle Corporation Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix [mattmann@memex tika]$ {noformat} Here are the surefire reports - no clue what's up here: {noformat} [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt --- Test set: org.apache.tika.parser.mail.RFC822ParserTest --- Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE! testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec FAILURE! 
org.mockito.exceptions.verification.TooManyActualInvocations: xHTMLContentHandler.startElement( http://www.w3.org/1999/xhtml;, div, div, isA(org.xml.sax.Attributes) ); Wanted 4 times but was 5 at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87) Caused by: org.mockito.exceptions.cause.UndesiredInvocation: Undesired invocation: at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284) at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243) at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102) at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133) at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76) at 
org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178186#comment-14178186 ] Hong-Thai Nguyen edited comment on TIKA-1422 at 10/21/14 9:48 AM: -- Applied latest fix on r1633325 r161 with some formatting. Thank was (Author: thaichat04): Applied latest fix on r1633325 with some formatting. Thank org.apache.tika.parser.mail.RFC822ParserTest fails -- Key: TIKA-1422 URL: https://issues.apache.org/jira/browse/TIKA-1422 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch I'm seeing test failures from: {noformat} Results : Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..) Tests run: 538, Failures: 1, Errors: 0, Skipped: 1 {noformat} CentOS6 VM image, running: {noformat} [mattmann@memex tika]$ java -version java version 1.7.0_67 Java(TM) SE Runtime Environment (build 1.7.0_67-b01) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) [mattmann@memex tika]$ mvn -version Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00) Maven home: /usr/share/apache-maven Java version: 1.7.0_65, vendor: Oracle Corporation Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix [mattmann@memex tika]$ {noformat} Here are the surefire reports - no clue what's up here: {noformat} [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt --- Test set: org.apache.tika.parser.mail.RFC822ParserTest --- Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE! 
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec FAILURE! org.mockito.exceptions.verification.TooManyActualInvocations: xHTMLContentHandler.startElement( http://www.w3.org/1999/xhtml;, div, div, isA(org.xml.sax.Attributes) ); Wanted 4 times but was 5 at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87) Caused by: org.mockito.exceptions.cause.UndesiredInvocation: Undesired invocation: at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284) at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243) at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102) at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133) at 
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76) at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173537#comment-14173537 ] Hong-Thai Nguyen commented on TIKA-1422: I'm not using Tesseract org.apache.tika.parser.mail.RFC822ParserTest fails -- Key: TIKA-1422 URL: https://issues.apache.org/jira/browse/TIKA-1422 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch I'm seeing test failures from: {noformat} Results : Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..) Tests run: 538, Failures: 1, Errors: 0, Skipped: 1 {noformat} CentOS6 VM image, running: {noformat} [mattmann@memex tika]$ java -version java version 1.7.0_67 Java(TM) SE Runtime Environment (build 1.7.0_67-b01) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) [mattmann@memex tika]$ mvn -version Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00) Maven home: /usr/share/apache-maven Java version: 1.7.0_65, vendor: Oracle Corporation Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix [mattmann@memex tika]$ {noformat} Here are the surefire reports - no clue what's up here: {noformat} [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt --- Test set: org.apache.tika.parser.mail.RFC822ParserTest --- Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE! testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec FAILURE! 
org.mockito.exceptions.verification.TooManyActualInvocations: xHTMLContentHandler.startElement( http://www.w3.org/1999/xhtml;, div, div, isA(org.xml.sax.Attributes) ); Wanted 4 times but was 5 at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87) Caused by: org.mockito.exceptions.cause.UndesiredInvocation: Undesired invocation: at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284) at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243) at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102) at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133) at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76) at 
org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169090#comment-14169090 ]

Hong-Thai Nguyen commented on TIKA-1445:
Interesting question! In my view, parser selection and parser priority should be decided at runtime by configuration, not inside a parser. Image parsing is an interesting case of competing parsers (Tesseract vs. the classical image parsers). We have two problems here:
1. When several parsers can handle the same MIME type, which one is selected?
2. When several parsers match, can we apply all of them and merge their results (metadata, handler)?
* For case 1, if we use an override configuration of parsers at runtime, we can declare several parsers with a matching MIME type, and the later one in the list will be selected. We could extend the CLI/web service to inject this kind of configuration.
* For case 2, we don't have a solution for now. We could extend CompositeParser with a 'many' mode that calls the matching parsers in a chain. Merging the results is another problem: we could accept that a metadata name set by one parser is overridden by another. The ideal solution is (again) a nested structure in our metadata that stores each parser's result separately.

Figure out how to add Image metadata extraction to Tesseract parser
---
Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1445.Mattmann.101214.patch.txt

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
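The two cases discussed in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: SketchParser, FixedParser and ParserSelectionSketch are hypothetical stand-ins, not Tika's Parser or CompositeParser API. The select method models "the later one in the list wins" (case 1), and parseWithAll models a 'many' mode that keeps each parser's metadata in its own nested map (case 2), so no value is overwritten.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser; this is not Tika's Parser interface.
interface SketchParser {
    String name();
    boolean supports(String mimeType);
    Map<String, String> extractMetadata();
}

// Trivial parser that supports one MIME type and returns one metadata entry.
final class FixedParser implements SketchParser {
    private final String name;
    private final String mimeType;
    private final Map<String, String> metadata;

    FixedParser(String name, String mimeType, String key, String value) {
        this.name = name;
        this.mimeType = mimeType;
        this.metadata = Collections.singletonMap(key, value);
    }

    public String name() { return name; }
    public boolean supports(String type) { return mimeType.equals(type); }
    public Map<String, String> extractMetadata() { return metadata; }
}

final class ParserSelectionSketch {

    // Case 1: when several parsers match, the one declared last wins.
    // Returns null when no parser supports the MIME type.
    static SketchParser select(List<SketchParser> parsers, String mimeType) {
        SketchParser chosen = null;
        for (SketchParser parser : parsers) {
            if (parser.supports(mimeType)) {
                chosen = parser;
            }
        }
        return chosen;
    }

    // Case 2: call every matching parser in list order and store each
    // result under the parser's name (a nested structure).
    static Map<String, Map<String, String>> parseWithAll(
            List<SketchParser> parsers, String mimeType) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        for (SketchParser parser : parsers) {
            if (parser.supports(mimeType)) {
                merged.put(parser.name(), parser.extractMetadata());
            }
        }
        return merged;
    }
}
```

Keying the merged result by parser name sidesteps the question of which parser may override a shared metadata name, at the cost of consumers having to know which parser produced a value.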
[jira] [Commented] (TIKA-1176) ChmDirectoryListingSet does not correctly enumerate directory entries
[ https://issues.apache.org/jira/browse/TIKA-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169146#comment-14169146 ] Hong-Thai Nguyen commented on TIKA-1176: Hi [~mdgeek], thank for your offering code testing file. Unfortunately, this check raised other exception on this file: {code} The full exception stack trace is included below: org.apache.tika.exception.TikaException at org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:355) at org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:70) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:326) at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:285) at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94) at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77) at javax.swing.TransferHandler.importData(TransferHandler.java:755) at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1478) at java.awt.dnd.DropTarget.drop(DropTarget.java:434) at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1203) at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:519) at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:832) at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:756) at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:30) at java.awt.Component.dispatchEventImpl(Component.java:4517) at java.awt.Container.dispatchEventImpl(Container.java:2097) at java.awt.Component.dispatchEvent(Component.java:4488) at 
java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4575) at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4310) at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4161) at java.awt.Container.dispatchEventImpl(Container.java:2083) at java.awt.Window.dispatchEventImpl(Window.java:2489) at java.awt.Component.dispatchEvent(Component.java:4488) at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:674) at java.awt.EventQueue.access$400(EventQueue.java:81) at java.awt.EventQueue$2.run(EventQueue.java:633) at java.awt.EventQueue$2.run(EventQueue.java:631) at java.security.AccessController.doPrivileged(Native Method) at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87) at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98) at java.awt.EventQueue$3.run(EventQueue.java:647) at java.awt.EventQueue$3.run(EventQueue.java:645) at java.security.AccessController.doPrivileged(Native Method) at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87) at java.awt.EventQueue.dispatchEvent(EventQueue.java:644) at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269) at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184) at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161) at java.awt.EventDispatchThread.run(EventDispatchThread.java:122) Caused by: java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at org.apache.tika.parser.chm.core.ChmCommons.copyOfRange(ChmCommons.java:342) at org.apache.tika.parser.chm.core.ChmCommons.getChmBlockSegment(ChmCommons.java:108) at org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:337) ... 
43 more {code}
Our CHM parser is quite complex. Can you provide a full fix, along with a test that checks for the expected content in the output for your file?
Thanks,

ChmDirectoryListingSet does not correctly enumerate directory entries
-
Key: TIKA-1176
URL: https://issues.apache.org/jira/browse/TIKA-1176
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Doug Martin
Attachments: HelpStudioSample.chm
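The stack trace in the comment above ends in an ArrayIndexOutOfBoundsException thrown by System.arraycopy from a range-copy helper. As a minimal sketch only (SafeRangeCopySketch is a hypothetical class, not Tika's ChmCommons, and this is not the fix for TIKA-1176), a range copy can validate its bounds first so that a bad segment offset is reported with the offending values instead of a bare exception:

```java
// Hypothetical helper for illustration only; not Tika's ChmCommons.
final class SafeRangeCopySketch {

    // Copies source[from, to) into a new array. Bounds are checked up
    // front so that an invalid range is reported with its actual values.
    static byte[] copyOfRange(byte[] source, int from, int to) {
        if (source == null) {
            throw new IllegalArgumentException("source array is null");
        }
        if (from < 0 || to < from || to > source.length) {
            throw new IllegalArgumentException(
                "invalid range [" + from + ", " + to + ") for array of length "
                    + source.length);
        }
        byte[] result = new byte[to - from];
        System.arraycopy(source, from, result, 0, result.length);
        return result;
    }
}
```

Unlike java.util.Arrays.copyOfRange, which pads with zeros when the end index exceeds the array length, this sketch rejects such a range, on the assumption that a too-large offset in a CHM segment indicates a parsing error rather than data to be padded.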
[jira] [Commented] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character
[ https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147880#comment-14147880 ]

Hong-Thai Nguyen commented on TIKA-1428:
Thanks [~theoettheo], any chance of a patch with a test case for this problem?

Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character
-
Key: TIKA-1428
URL: https://issues.apache.org/jira/browse/TIKA-1428
Project: Tika
Issue Type: Bug
Affects Versions: 1.4, 1.6
Reporter: Theodor Sjöstedt
Priority: Minor
Attachments: TIKA-doc-footnotes-issue.png

Footnotes from {{.doc}} documents are extracted, but the references to the footnotes are replaced by the Unicode Replacement Character (�). I have tried this in 1.4 and 1.6. In 1.4, both the reference in the text and the reference at the footnote have been replaced. In 1.6, the reference in the text has disappeared completely. See the attached image for the original document, the 1.4 formatted text, and the 1.6 formatted text.
[jira] [Commented] (TIKA-1421) Tika-Parsers tests fail on CentOS6 if tesseract isn't installed
[ https://issues.apache.org/jira/browse/TIKA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14143041#comment-14143041 ] Hong-Thai Nguyen commented on TIKA-1421: Not only CentOS, this test failed also on my Windows without Tesseract installed. Tika-Parsers tests fail on CentOS6 if tesseract isn't installed --- Key: TIKA-1421 URL: https://issues.apache.org/jira/browse/TIKA-1421 Project: Tika Issue Type: Bug Components: parser Environment: CentOS6 AWS VM for DARPA Memex Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 While testing TIKA-93 on CentOS6, I ran into some test failing issues on a 1.7-trunk fresh install of tika in tika-parsers: {noformat} Running org.apache.tika.parser.chm.TestChmLzxcControlData Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec Running org.apache.tika.parser.chm.TestChmBlockInfo Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec Running org.apache.tika.parser.chm.TestChmItsfHeader Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec Running org.apache.tika.parser.txt.TXTParserTest Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec Running org.apache.tika.parser.txt.CharsetDetectorTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec Running org.apache.tika.parser.image.xmp.JempboxExtractorTest Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec Running org.apache.tika.parser.image.PSDParserTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec Running org.apache.tika.parser.image.ImageParserTest Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec Running org.apache.tika.parser.image.ImageMetadataExtractorTest Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec Running org.apache.tika.parser.image.MetadataFieldsTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time 
elapsed: 0 sec Running org.apache.tika.parser.image.TiffParserTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec Running org.apache.tika.parser.font.FontParsersTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec Running org.apache.tika.parser.mp4.MP4ParserTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec Running org.apache.tika.parser.mp3.Mp3ParserTest Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec Running org.apache.tika.parser.mp3.MpegStreamTest Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec Running org.apache.tika.parser.dwg.DWGParserTest Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec Running org.apache.tika.parser.pkg.GzipParserTest Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec Running org.apache.tika.parser.pkg.Seven7ParserTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec Running org.apache.tika.parser.pkg.TarParserTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec Running org.apache.tika.parser.pkg.Bzip2ParserTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec Running org.apache.tika.parser.pkg.ArParserTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec Running org.apache.tika.parser.pkg.ZipParserTest Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec Running org.apache.tika.parser.video.FLVParserTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec Running org.apache.tika.parser.solidworks.SolidworksParserTest Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec Running org.apache.tika.parser.ibooks.iBooksParserTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec Running org.apache.tika.parser.ParsingReaderTest Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time 
elapsed: 0.018 sec Running org.apache.tika.parser.mail.RFC822ParserTest Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec FAILURE! Running org.apache.tika.parser.mbox.MboxParserTest Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec Running org.apache.tika.parser.mbox.OutlookPSTParserTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec Running org.apache.tika.parser.jpeg.JpegParserTest Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec Running org.apache.tika.parser.executable.ExecutableParserTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec Running
[jira] [Updated] (TIKA-1421) Tika-Parsers tests fail on CentOS6 if tesseract isn't installed
[ https://issues.apache.org/jira/browse/TIKA-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1421:
---
Priority: Blocker (was: Major)

Tika-Parsers tests fail on CentOS6 if tesseract isn't installed
---
Key: TIKA-1421
URL: https://issues.apache.org/jira/browse/TIKA-1421
Project: Tika
Issue Type: Bug
Components: parser
Environment: CentOS6 AWS VM for DARPA Memex
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Blocker
Fix For: 1.7

While testing TIKA-93 on CentOS6, I ran into some test failures in tika-parsers on a fresh 1.7-trunk install of Tika:
{noformat}
Running org.apache.tika.parser.chm.TestChmLzxcControlData
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.apache.tika.parser.chm.TestChmBlockInfo
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.chm.TestChmItsfHeader
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.apache.tika.parser.txt.TXTParserTest
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
Running org.apache.tika.parser.txt.CharsetDetectorTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.image.xmp.JempboxExtractorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.apache.tika.parser.image.PSDParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.apache.tika.parser.image.ImageParserTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
Running org.apache.tika.parser.image.ImageMetadataExtractorTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec
Running org.apache.tika.parser.image.MetadataFieldsTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.apache.tika.parser.image.TiffParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.font.FontParsersTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.192 sec
Running org.apache.tika.parser.mp4.MP4ParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.07 sec
Running org.apache.tika.parser.mp3.Mp3ParserTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
Running org.apache.tika.parser.mp3.MpegStreamTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.dwg.DWGParserTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.apache.tika.parser.pkg.GzipParserTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.252 sec
Running org.apache.tika.parser.pkg.Seven7ParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.37 sec
Running org.apache.tika.parser.pkg.TarParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.118 sec
Running org.apache.tika.parser.pkg.Bzip2ParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.233 sec
Running org.apache.tika.parser.pkg.ArParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running org.apache.tika.parser.pkg.ZipParserTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.302 sec
Running org.apache.tika.parser.video.FLVParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
Running org.apache.tika.parser.solidworks.SolidworksParserTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running org.apache.tika.parser.ibooks.iBooksParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running org.apache.tika.parser.ParsingReaderTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
Running org.apache.tika.parser.mail.RFC822ParserTest
Tests run: 8, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 0.31 sec FAILURE!
Running org.apache.tika.parser.mbox.MboxParserTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec
Running org.apache.tika.parser.mbox.OutlookPSTParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.094 sec
Running org.apache.tika.parser.jpeg.JpegParserTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.153 sec
Running org.apache.tika.parser.executable.ExecutableParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.apache.tika.parser.rtf.RTFParserTest
Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time
[jira] [Commented] (TIKA-1412) NPE in OpenDocumentParser
[ https://issues.apache.org/jira/browse/TIKA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143043#comment-14143043 ]

Hong-Thai Nguyen commented on TIKA-1412:

Added a test in r1626706.

NPE in OpenDocumentParser
-
Key: TIKA-1412
URL: https://issues.apache.org/jira/browse/TIKA-1412
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6
Reporter: Andrzej Bialecki
Fix For: 1.7
Attachments: TIKA-1412.diff

There's a missing else in OpenDocumentParser where it constructs a ZipInputStream from the InputStream. This results in an NPE when the InputStream is an instance of TikaInputStream but has neither an openContainer nor a file:
{code}
...
Caused by: java.lang.NullPointerException
	at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161) ~[tika-parsers-1.6.jar:1.6]
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) ~[tika-core-1.6.jar:1.6]
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
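The "missing else" described in TIKA-1412 can be illustrated with a minimal, self-contained sketch. This is not the actual OpenDocumentParser code: `WrappedStream`, `Source`, and `choose` are hypothetical stand-ins (WrappedStream plays the role of TikaInputStream), and the sketch only shows why the final fallback branch matters.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;

final class ZipSourceSelection {
    /** Where the ZIP data would be read from (hypothetical enum for illustration). */
    enum Source { OPEN_CONTAINER, BACKING_FILE, RAW_STREAM }

    /** Hypothetical stand-in for a stream that may carry an open container or a backing file. */
    static final class WrappedStream extends ByteArrayInputStream {
        final Object openContainer; // may be null
        final File file;            // may be null

        WrappedStream(byte[] bytes, Object openContainer, File file) {
            super(bytes);
            this.openContainer = openContainer;
            this.file = file;
        }
    }

    static Source choose(InputStream stream) {
        if (stream instanceof WrappedStream) {
            WrappedStream wrapped = (WrappedStream) stream;
            if (wrapped.openContainer != null) {
                return Source.OPEN_CONTAINER;
            } else if (wrapped.file != null) {
                return Source.BACKING_FILE;
            } else {
                // Without this branch, a wrapped stream with neither a container
                // nor a file selects no source at all, and later use fails with an NPE.
                return Source.RAW_STREAM;
            }
        }
        // Plain streams are always read directly.
        return Source.RAW_STREAM;
    }
}
```

The point of the sketch is that every path through the selection assigns a source, so nothing downstream has to cope with a null.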
[jira] [Resolved] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1413.

Resolution: Fixed

OOXML thumbnail name added to body
--
Key: TIKA-1413
URL: https://issues.apache.org/jira/browse/TIKA-1413
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6
Reporter: Andrzej Bialecki

AbstractOOXMLExtractor.handleThumbnail processes thumbnails using EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike other embedded parts in handleEmbeddedParts(...)). This results in adding the thumbnail name to the main body of the document (as a package-entry), which in my opinion is wrong. Example:
{code}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="meta:slide-count" content="1"/>
<meta name="cp:revision" content="5"/>
<meta name="meta:last-author" content="Nick Burch"/>
<meta name="Slide-Count" content="1"/>
<meta name="Last-Author" content="Nick Burch"/>
<meta name="meta:save-date" content="2010-09-08T16:15:14Z"/>
<meta name="Content-Length" content="202969"/>
<meta name="subject" content="Gym class featuring a brown fox and lazy dog"/>
<meta name="Application-Name" content="Microsoft Office PowerPoint"/>
<meta name="Author" content="Nevin Nollop"/>
<meta name="dcterms:created" content="1601-01-01T00:00:00Z"/>
<meta name="Application-Version" content="12."/>
<meta name="date" content="2010-09-08T16:15:14Z"/>
<meta name="Total-Time" content="2"/>
<meta name="extended-properties:Template" content=""/>
<meta name="publisher" content=""/>
<meta name="creator" content="Nevin Nollop"/>
<meta name="Word-Count" content="9"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="extended-properties:AppVersion" content="12."/>
<meta name="Creation-Date" content="1601-01-01T00:00:00Z"/>
<meta name="meta:author" content="Nevin Nollop"/>
<meta name="cp:subject" content="Gym class featuring a brown fox and lazy dog"/>
<meta name="extended-properties:Application" content="Microsoft Office PowerPoint"/>
<meta name="resourceName" content="testPPT_embeded.pptx"/>
<meta name="Paragraph-Count" content="1"/>
<meta name="dc:title" content="The quick brown fox jumps over the lazy dog"/>
<meta name="Last-Save-Date" content="2010-09-08T16:15:14Z"/>
<meta name="custom:Version" content="1"/>
<meta name="Revision-Number" content="5"/>
<meta name="Last-Printed" content="1601-01-01T00:00:00Z"/>
<meta name="meta:print-date" content="1601-01-01T00:00:00Z"/>
<meta name="meta:creation-date" content="1601-01-01T00:00:00Z"/>
<meta name="dcterms:modified" content="2010-09-08T16:15:14Z"/>
<meta name="Template" content=""/>
<meta name="dc:creator" content="Nevin Nollop"/>
<meta name="meta:word-count" content="9"/>
<meta name="extended-properties:Company" content=""/>
<meta name="Last-Modified" content="2010-09-08T16:15:14Z"/>
<meta name="extended-properties:PresentationFormat" content="On-screen Show (4:3)"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="modified" content="2010-09-08T16:15:14Z"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="extended-properties:TotalTime" content="2"/>
<meta name="dc:publisher" content=""/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
<meta name="Presentation-Format" content="On-screen Show (4:3)"/>
<title>The quick brown fox jumps over the lazy dog</title>
</head>
<body><p>The quick brown fox jumps over the lazy dog</p>
<div class="embedded" id="slide1_rId4"/>
<div class="embedded" id="slide1_rId5"/>
<div class="embedded" id="slide1_rId6"/>
<div class="embedded" id="slide1_rId7"/>
<div class="embedded" id="slide1_rId8"/>
<div class="embedded" id="slide1_rId9"/>
<div class="embedded" id="thumbnail_0.jpeg"/><div class="package-entry"><h1>thumbnail_0.jpeg</h1></div></body></html>
{code}
The extracted plain text looks like this (using tika-app):
{code}
The quick brown fox jumps over the lazy dog
thumbnail_0.jpeg
{code}
The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false. I think also that the id attribute should be set to the real thumbnail path within the package (i.e. tPart.getPartName().getName()) instead of the artificially created sequential name.
[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126949#comment-14126949 ]

Hong-Thai Nguyen commented on TIKA-1413:

I agree. Fixed in r1623819; the _id_ now comes from partName().
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077885#comment-14077885 ]

Hong-Thai Nguyen commented on TIKA-1373:

The fix should ship with the next official release, 1.6. In the meantime you can try this release candidate: http://people.apache.org/~mattmann/apache-tika-1.6/rc1/

AutoDetectParser extracts no text when SourceCodeParser is selected
---
Key: TIKA-1373
URL: https://issues.apache.org/jira/browse/TIKA-1373
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

When using the AutoDetectParser in Java code, and the SourceCodeParser is selected (e.g. for Java files), the handler gets no text. I have this test program:
{code}
import java.io.ByteArrayInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Declaring the Java source type routes the input to the SourceCodeParser.
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
It returns (using the SourceCodeParser):
{code}
Text extracted:
{code}
But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Declaring plain text routes the input to the text parser instead.
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
The Text Parser is used and I get:
{code}
Text extracted: public class HelloWorld {}
{code}
I have also tested this command:
{code}
java -jar tika-app-1.5.jar -t D:\text.java
(no text)
{code}

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073042#comment-14073042 ]

Hong-Thai Nguyen commented on TIKA-1373:

HtmlParser skips the tags generated by JHighlight. I found a solution by using the TagSoup parser directly; committed in r1613051.

As I mentioned in TIKA-1224, this parser is a quick-and-dirty approach to parsing source code files. Again, the _right_ approach is a dedicated parser per language that parses the elements in depth and builds events on the fly.
[jira] [Resolved] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1373.

Resolution: Fixed
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643 ]

Hong-Thai Nguyen commented on TIKA-1373:

Can you format your description with the {code} annotation? And, if I understand correctly, the output of the first section is empty?

AutoDetectParser extracts no text when SourceCodeParser is selected
---
Key: TIKA-1373
URL: https://issues.apache.org/jira/browse/TIKA-1373
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

When using the AutoDetectParser in Java code, and the SourceCodeParser is selected (e.g. for Java files), the handler gets no text. I have this test program:

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); }
System.out.println("Text extracted: " + bch.toString());

It returns (using the SourceCodeParser):

Text extracted:

But when I use this code:

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); }
System.out.println("Text extracted: " + bch.toString());

The Text Parser is used and I get:

Text extracted: public class HelloWorld {}

I have also tested this command:

java -jar tika-app-1.5.jar -t D:\text.java
(no text)
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071713#comment-14071713 ]

Hong-Thai Nguyen commented on TIKA-1373:

Yes, I saw this problem when implementing the parser. How can we tell that the caller is asking for text rather than HTML? Is checking that the handler is an instance of BodyContentHandler enough?
[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643 ]

Hong-Thai Nguyen edited comment on TIKA-1373 at 7/23/14 1:42 PM:
-
Can you format your description with the {noformat}{code}{noformat} annotation? And, if I understand correctly, the output of the first section is empty?

was (Author: thaichat04):
Can you format your description with the {code} annotation? And, if I understand correctly, the output of the first section is empty?
[jira] [Commented] (TIKA-1095) Only gibberish extracted from this PDF
[ https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061867#comment-14061867 ]

Hong-Thai Nguyen commented on TIKA-1095:

Even the latest Tika can't convert this file. It seems to be a font problem in this PDF file. Can you report this to the PDFBox tracker at https://issues.apache.org/jira/browse/PDFBOX/ ?

Only gibberish extracted from this PDF
--
Key: TIKA-1095
URL: https://issues.apache.org/jira/browse/TIKA-1095
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.3
Environment: Probably any
Reporter: Bas van Meurs
Labels: patch
Attachments: ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf, test.txt

java -jar /usr/share/tika/tika-app-1.3.jar -t /home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf /tmp/test.txt

This produces all gibberish.
[jira] [Updated] (TIKA-1095) Only gibberish extracted from this PDF
[ https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1095: --- Component/s: (was: general) parser Only gibberish extracted from this PDF -- Key: TIKA-1095 URL: https://issues.apache.org/jira/browse/TIKA-1095 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Probably any Reporter: Bas van Meurs Labels: pdfbox Attachments: ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf, test.txt java -jar /usr/share/tika/tika-app-1.3.jar -t /home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf /tmp/test.txt This produces all gibberish. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note
[ https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519 ]

Hong-Thai Nguyen commented on TIKA-1350:

Richard Johnson (author of java-libpst) is trying to deploy the new version 0.8.1 to Maven Central (ref. https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254). Once that is done, we can upgrade the Tika dependency to 0.8.1 to pick up the fix.

OutlookPSTParser: Unknown message type: IPM.Note
Key: TIKA-1350
URL: https://issues.apache.org/jira/browse/TIKA-1350
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Reporter: Jonathan Evans
Labels: libpst, parser, pst
Fix For: 1.7
Original Estimate: 0.2h
Remaining Estimate: 0.2h

When parsing some emails in a PST file I get the error "Unknown message type: IPM.Note", which prevents them from being parsed. This is caused by an extra null byte at the end of the message class string. It has been fixed in version 0.8.1 of java-libpst, so a version bump is all that is required.
https://github.com/rjohnsondev/java-libpst/issues/14

I would attempt to do this myself, but I am unsure how to open a pull request with SVN.
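To make the failure mode in TIKA-1350 concrete, here is a minimal sketch of how a trailing null byte defeats a plain string comparison and how stripping it restores the match. It only illustrates the symptom described in the issue; it is not the code used in java-libpst 0.8.1, and the class and method names are hypothetical.

```java
final class MessageClassNormalizer {
    /** Removes any trailing NUL (U+0000) characters from a message class string. */
    static String stripTrailingNuls(String raw) {
        int end = raw.length();
        while (end > 0 && raw.charAt(end - 1) == '\0') {
            end--;
        }
        return raw.substring(0, end);
    }

    /** Returns true if the (possibly NUL-padded) message class denotes a plain note. */
    static boolean isNote(String rawMessageClass) {
        return "IPM.Note".equals(stripTrailingNuls(rawMessageClass));
    }

    public static void main(String[] args) {
        String padded = "IPM.Note\0";
        // A direct comparison fails because of the invisible trailing NUL.
        System.out.println("IPM.Note".equals(padded));
        // After normalization the message class is recognized.
        System.out.println(isNote(padded));
    }
}
```

Because the NUL character is invisible in most log output, the error message reads "Unknown message type: IPM.Note" even though the string being compared is not exactly "IPM.Note".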
[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE
[ https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008704#comment-14008704 ]

Hong-Thai Nguyen commented on TIKA-1308:

A virtual file system may be a solution if you're on Java 7. The NIO APIs with FileSystemProvider [1] allow you to define or inject a virtual file system (e.g. Commons VFS [2]).

[1] http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileSystemProvider.html
[2] http://commons.apache.org/proper/commons-vfs/filesystems.html

Support in memory parse mode(don't create temp file): to support run Tika in GAE
Key: TIKA-1308
URL: https://issues.apache.org/jira/browse/TIKA-1308
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Reporter: yuanyun.cn
Labels: gae
Fix For: 1.6

I am trying to use Tika in GAE and wrote a simple servlet to extract metadata from a JPEG:

String urlStr = req.getParameter("imageUrl");
byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
Metadata metadata = new Metadata();
BodyContentHandler ch = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(bais, ch, metadata, new ParseContext());
bais.close();

This fails with the exception:

Caused by: java.lang.SecurityException: Unable to create temporary file
	at java.io.File.createTempFile(File.java:1986)
	at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
	at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

I checked the code: org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, Metadata, ParseContext) creates a temp file from the input stream. I can understand why Tika creates a temp file from the stream: so that it can parse it multiple times.

But as GAE and other cloud servers are getting more popular, is it possible to avoid creating the temp file? Instead we could copy the original stream to a byte array stream, so Tika can still parse it multiple times. This limits the file size, since Tika keeps the whole file in memory, but it would make Tika work in GAE and maybe on other cloud servers. We could add a parameter to parser.parse to indicate whether to parse in memory only.
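The in-memory approach proposed in TIKA-1308 can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not Tika code: `InMemoryRereadableStream` and its `maxBytes` limit are hypothetical names, and the sketch only shows how a one-shot stream can be buffered once and then re-read several times without touching the file system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class InMemoryRereadableStream {
    private final byte[] data;

    /** Copies the source fully into memory, refusing inputs larger than maxBytes. */
    InMemoryRereadableStream(InputStream source, int maxBytes) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = source.read(chunk)) != -1) {
            if (buffer.size() + n > maxBytes) {
                throw new IOException("Input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            buffer.write(chunk, 0, n);
        }
        this.data = buffer.toByteArray();
    }

    /** Returns a fresh stream positioned at the start of the buffered data. */
    InputStream open() {
        return new ByteArrayInputStream(data);
    }

    /** Number of bytes held in memory. */
    int size() {
        return data.length;
    }
}
```

Each call to `open()` yields an independent stream over the same bytes, which is the property the temp file provides today. The explicit size limit reflects the trade-off noted in the issue: the whole document lives in memory, so large inputs have to be rejected or handled some other way.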
[jira] [Updated] (TIKA-1290) Upgrade to PDFBOX 1.8.5
[ https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1290:
---
Labels: trivial (was: )

Upgrade to PDFBOX 1.8.5
---
Key: TIKA-1290
URL: https://issues.apache.org/jira/browse/TIKA-1290
Project: Tika
Issue Type: Improvement
Reporter: Hong-Thai Nguyen
Labels: trivial

PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent
We can update to this version, and possibly also check whether it fixes TIKA-1231.
[jira] [Resolved] (TIKA-1290) Upgrade to PDFBOX 1.8.5
[ https://issues.apache.org/jira/browse/TIKA-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1290. Resolution: Fixed r1592780 Upgrade to PDFBOX 1.8.5 --- Key: TIKA-1290 URL: https://issues.apache.org/jira/browse/TIKA-1290 Project: Tika Issue Type: Improvement Reporter: Hong-Thai Nguyen Labels: trivial PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent We can update to this version, and possibly test whether it also fixes TIKA-1231 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1287) Update NetCDF .jar file on Maven Central
[ https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13987521#comment-13987521 ] Hong-Thai Nguyen commented on TIKA-1287: Technically, it is not difficult to upload a new jar to Maven Central: just follow the steps mentioned by [~gagravarr]; I did this recently for java-libpst. BTW, we must be careful about the library's license if you are not its author. See http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/documentation.htm, netCDF's license is not the Apache license. You should contact them first to ask for authorization if you want to upload this library yourself. Update NetCDF .jar file on Maven Central Key: TIKA-1287 URL: https://issues.apache.org/jira/browse/TIKA-1287 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Ann Burgess Labels: jar, maven, netcdf, tika, unit-test, update I am working to update the NetCDFParser file. When using the most recent .jar file available from http://www.unidata.ucar.edu/ at the command line I receive a note about a deprecated API:
javac -classpath ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar org/apache/tika/parser/netcdf/NetCDFParser.java
Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
After updating the NetCDFParser file with non-deprecated methods (e.g. changing dimension.getName() to dimension.getFullName()), however, I get failed unit tests in Maven, which I assume is because the Maven Central Repo has an outdated version of the .jar file needed for NetCDF files ( http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22) . Can anyone provide insight into how I get the updated .jar file into the Maven Central Repository? Is there an alternative method to update Tika so I can run my unit tests in Maven? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1290) Upgrade to PDFBOX 1.8.5
Hong-Thai Nguyen created TIKA-1290: -- Summary: Upgrade to PDFBOX 1.8.5 Key: TIKA-1290 URL: https://issues.apache.org/jira/browse/TIKA-1290 Project: Tika Issue Type: Improvement Reporter: Hong-Thai Nguyen PDFBOX 1.8.5 has been released: http://pdfbox.apache.org/downloads.html#recent We can update to this version, and possibly test whether it also fixes TIKA-1231 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983434#comment-13983434 ] Hong-Thai Nguyen commented on TIKA-1283: +1 from me for creating a thumbnail field in the metadata set.
- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). PowerPoint files always have an embedded JPEG thumbnail, but it is optional for docx and xlsx (available only when the user checks the 'save preview' option when saving the document).
- For OLE documents, see: http://poi.apache.org/hpsf/thumbnails.html. You can get the thumbnail content from the POI API:
{code}
import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

static byte[] process(File docFile) throws Exception {
    // Load the Word document and read its summary information stream
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + " : " + summaryInformation.getTitle());
    // Wrap the raw thumbnail bytes to inspect the clipboard format
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    return thumbnail.getThumbnailAsWMF();
}
{code}
Unfortunately, there is an open bug in POI about getting the thumbnail content properly: https://issues.apache.org/bugzilla/show_bug.cgi?id=56194 In the OLE formats the thumbnails are WMF or EMF images, which are quite difficult to handle. But this is out of our scope. Add thumbnail as possible metadata item to TikaCoreProperties --- Key: TIKA-1283 URL: https://issues.apache.org/jira/browse/TIKA-1283 Project: Tika Issue Type: Improvement Components: metadata Reporter: Tim Allison Priority: Minor TIKA-90 originally requested to add thumbnails to a document's metadata. I'd like to have a unified way of determining whether an embedded document/resource is a thumbnail or a regular attachment. With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling out more thumbnails than before. 
I propose adding tika:thumbnail to the metadata of each thumbnail image. The consumer can then determine what to do with the embedded resource based on the metadata. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1279) Missing return lines at output of SourceCodeParser
Hong-Thai Nguyen created TIKA-1279: -- Summary: Missing return lines at output of SourceCodeParser Key: TIKA-1279 URL: https://issues.apache.org/jira/browse/TIKA-1279 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Trivial Fix For: 1.6 xhtml output is on a single line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979614#comment-13979614 ] Hong-Thai Nguyen commented on TIKA-1224: Thanks [~ben.12] for the feedback. For the line return problem in the output, I created a new issue: TIKA-1279. For the -t option in TikaCLI, the MIME type of a Java file is ambiguous. It could be text/plain (in which case TxtParser is used and returns the original text as is) or x-java-source (SourceCodeParser is used). For the -h option, the output normally looks something like:
{code}
Author: Hong-Thai.Nguyen
Content-Encoding: windows-1252
Content-Length: 4899
Content-Type: text/x-java-source
LoC: 133
creator: Hong-Thai.Nguyen
dc:creator: Hong-Thai.Nguyen
meta:author: Hong-Thai.Nguyen
resourceName: SourceCodeParser.java
{code}
The creator comes from the 'author' annotation in the Javadoc. This parser is quite generic (quick and dirty, as mentioned by [~kkrugler]) and simplistic. We could make a more dedicated Java source parser and extract more metadata (members, attributes...). If you are interested in this kind of parser, please create a new issue; an investigation into this work is warmly welcome. Regards, Adding Source code (Java, Groovy, C) parser --- Key: TIKA-1224 URL: https://issues.apache.org/jira/browse/TIKA-1224 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Minor We can parse some source code file formats: text/x-java-source text/x-groovy text/x-c. For HTML rendering of code, we can use jhighlight: http://www.ohloh.net/p/jhighlight -- This message was sent by Atlassian JIRA (v6.2#6252)
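As a rough illustration of the kind of metadata shown in the comment above (a line count and an author taken from the Javadoc `@author` tag), here is a minimal sketch. It is not the actual SourceCodeParser implementation; the class name `SourceMetadataSketch` and the choice to count every line, blank or not, are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: derives "LoC" and "Author" entries from source text. */
class SourceMetadataSketch {
    // Matches "@author Some Name" up to the end of the line.
    private static final Pattern AUTHOR = Pattern.compile("@author\\s+([^\\r\\n]+)");

    static Map<String, String> extract(String source) {
        Map<String, String> meta = new LinkedHashMap<>();

        // Count line terminators, plus one for a final line without a terminator.
        int lines = 0;
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) == '\n') {
                lines++;
            }
        }
        if (!source.isEmpty() && !source.endsWith("\n")) {
            lines++;
        }
        meta.put("LoC", Integer.toString(lines));

        // Only the first @author tag is used in this sketch.
        Matcher m = AUTHOR.matcher(source);
        if (m.find()) {
            meta.put("Author", m.group(1).trim());
        }
        return meta;
    }

    public static void main(String[] args) {
        String source = "/**\n * @author Jane Doe\n */\nclass A {}\n";
        System.out.println(extract(source));
    }
}
```

A dedicated Java source parser, as suggested in the comment, would presumably use a real Java grammar rather than a regular expression.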
[jira] [Resolved] (TIKA-1279) Missing return lines at output of SourceCodeParser
[ https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1279. Resolution: Fixed Fixed at r1589687 Missing return lines at output of SourceCodeParser -- Key: TIKA-1279 URL: https://issues.apache.org/jira/browse/TIKA-1279 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Trivial Fix For: 1.6 xhtml output is on a single line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1276: --- Fix Version/s: 1.6 Missing embedded dependencies in tika-bundle Key: TIKA-1276 URL: https://issues.apache.org/jira/browse/TIKA-1276 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.5 Environment: OSGI, Apache Felix via Apache Sling Launcher Reporter: Rupert Westenthaler Fix For: 1.6 Attachments: TIKA-1276_20140423_rwesten.diff While updating from tika 1.2 to 1.5 I noticed that the `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
1. `com.uwyn:jhighlight:1.0` is not embedded. Because of that, installing the bundle results in the following exception:
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer)
at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
{code}
2. `org.ow2.asm:asm:4.1` is not embedded because `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and therefore the `Embed-Dependency` directive `asm` does not match any dependency. Because of that, one gets the following exception (after fixing (1)):
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
{code}
There are two ways to fix this: (a) change the `Embed-Dependency` directive to `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the tika-bundle pom file.
3. `edu.ucar:netcdf:4.2-min` is not embedded. Because of that, one gets the following exception (after fixing (1) and (2)):
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
{code}
4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime. After fixing the above issues, the tika-bundle started successfully. However, when extracting EXIF metadata from a JPEG image I got the following exception:
{code}
java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
[..]
{code}
Embedding xmpcore in the tika-bundle solved this issue. NOTES: * The Apache Stanbol integration tests only cover PDF, JPEG, DOCX. So there might be
[jira] [Resolved] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1276. Resolution: Fixed Thanks [~rwesten], added your patch at r1589717 Missing embedded dependencies in tika-bundle Key: TIKA-1276 URL: https://issues.apache.org/jira/browse/TIKA-1276 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.5 Environment: OSGI, Apache Felix via Apache Sling Launcher Reporter: Rupert Westenthaler Fix For: 1.6 Attachments: TIKA-1276_20140423_rwesten.diff
[jira] [Resolved] (TIKA-1279) Missing return lines at output of SourceCodeParser
[ https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1279. Resolution: Fixed Thanks [~rgauss] for this good catch. I fixed it with more tests in r1589742. Hoping that we can move away from Java 6 soon :) Missing return lines at output of SourceCodeParser -- Key: TIKA-1279 URL: https://issues.apache.org/jira/browse/TIKA-1279 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Priority: Trivial Fix For: 1.6 xhtml output is on a single line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-623: -- Assignee: (was: Hong-Thai Nguyen) Add support for Outlook PST --- Key: TIKA-623 URL: https://issues.apache.org/jira/browse/TIKA-623 Project: Tika Issue Type: New Feature Components: parser Reporter: Tran Nam Quang Fix For: 1.6 Attachments: OutlookPSTParser.java Hello everyone, As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/ I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika. Best regards Tran Nam Quang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-1244) Better parsing of Mbox files
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1244. Resolution: Fixed Fix Version/s: 1.6 Committed in r1583305, thanks [~lfcnassif]. I preserved the metadata extraction from the current MboxParser because the message/rfc822 parser does not seem able to extract all header fields. Better parsing of Mbox files Key: TIKA-1244 URL: https://issues.apache.org/jira/browse/TIKA-1244 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Luis Filipe Nassif Assignee: Hong-Thai Nguyen Fix For: 1.6 Attachments: MboxParser.java.patch MboxParser currently loses the metadata of all emails except the first. It does not extract/parse emails, nor decode parts. It should handle embedded emails like other container parsers do, so emails will be automatically parsed by RFC822Parser. I will try to add a patch for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
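To make the container idea in this issue concrete, the following sketch splits mbox text into individual messages, each of which could then be handed to an RFC 822 parser as an embedded document. It is not the committed MboxParser code: the class name `MboxSplitSketch` is hypothetical, and the sketch assumes the simplest mbox convention, where every line starting with "From " begins a new message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits mbox text into raw messages on "From " separator lines. */
class MboxSplitSketch {

    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\\r?\\n")) {
            if (line.startsWith("From ")) {
                // A separator line closes the previous message and opens a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
                continue; // the separator itself is not part of the message
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox = "From a@example.org Mon Jan 1\nSubject: one\n\nbody1\n"
                + "From b@example.org Tue Jan 2\nSubject: two\n\nbody2\n";
        for (String message : split(mbox)) {
            System.out.println("--- message ---");
            System.out.print(message);
        }
    }
}
```

Because each message keeps its own header block, per-message metadata is no longer lost after the first email, which is the problem the issue describes.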
[jira] [Assigned] (TIKA-1244) Better parsing of Mbox files
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen reassigned TIKA-1244: -- Assignee: Hong-Thai Nguyen Better parsing of Mbox files Key: TIKA-1244 URL: https://issues.apache.org/jira/browse/TIKA-1244 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Luis Filipe Nassif Assignee: Hong-Thai Nguyen Attachments: MboxParser.java.patch MboxParser currently loses the metadata of all emails except the first. It does not extract/parse emails, nor decode parts. It should handle embedded emails like other container parsers do, so emails will be automatically parsed by RFC822Parser. I will try to add a patch for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1244) Better parsing of Mbox files
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942965#comment-13942965 ] Hong-Thai Nguyen commented on TIKA-1244: +1 from me too, I had the same intention of redoing this parser when working on PST. I'll have some time next week, and hope to have a look at your patch. Thanks Better parsing of Mbox files Key: TIKA-1244 URL: https://issues.apache.org/jira/browse/TIKA-1244 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Luis Filipe Nassif Attachments: MboxParser.java.patch MboxParser currently loses the metadata of all emails except the first. It does not extract/parse emails, nor decode parts. It should handle embedded emails like other container parsers do, so emails will be automatically parsed by RFC822Parser. I will try to add a patch for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923703#comment-13923703 ] Hong-Thai Nguyen commented on TIKA-623: --- [~lfcnassif], binary attachments are handled with the embeddedExtractor. BTW, I agree that we can split each mail into a separate unit. [~talli...@apache.org], we couldn't fix .pst and .msg (msg is already handled as part of OfficeParser); feel free to finish this issue properly as you can :) Add support for Outlook PST --- Key: TIKA-623 URL: https://issues.apache.org/jira/browse/TIKA-623 Project: Tika Issue Type: New Feature Components: parser Reporter: Tran Nam Quang Assignee: Hong-Thai Nguyen Fix For: 1.6 Attachments: OutlookPSTParser.java Hello everyone, As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/ I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika. Best regards Tran Nam Quang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1257) MS Word Filter out control characters on ouput
Hong-Thai Nguyen created TIKA-1257: -- Summary: MS Word Filter out control characters on ouput Key: TIKA-1257 URL: https://issues.apache.org/jira/browse/TIKA-1257 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Fix For: 1.6 Control characters appear mostly in the table of index and are not displayable. We should filter them out. -- This message was sent by Atlassian JIRA (v6.2#6252)
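The filtering requested in this issue could look roughly like the sketch below. It is not the change that was committed for the issue: the class name `ControlCharFilter` is hypothetical, and keeping tab, line feed and carriage return while dropping every other ISO control character is an assumption about what "filter out" should mean here.

```java
/** Hypothetical sketch: removes non-printable control characters from extracted text. */
class ControlCharFilter {

    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Keep ordinary whitespace controls; drop all other ISO control characters.
            boolean keep = c == '\t' || c == '\n' || c == '\r' || !Character.isISOControl(c);
            if (keep) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Chapter 1\u0013 TOC \u0014\u00151\tend\n";
        System.out.print(strip(raw));
    }
}
```

`Character.isISOControl` covers U+0000 to U+001F and U+007F to U+009F, so characters such as U+0013 are removed while accented letters and other printable text pass through unchanged.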
[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1257: --- Attachment: tika-doc-control-char.png 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc MS Word Filter out control characters on ouput -- Key: TIKA-1257 URL: https://issues.apache.org/jira/browse/TIKA-1257 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Fix For: 1.6 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, tika-doc-control-char.png Control characters appear mostly in the table of index and are not displayable. We should filter them out. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1257. Resolution: Fixed Fixed on r1574874 MS Word Filter out control characters on ouput -- Key: TIKA-1257 URL: https://issues.apache.org/jira/browse/TIKA-1257 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Fix For: 1.6 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, tika-doc-control-char.png Control characters appear mostly in the table of index and are not displayable. We should filter them out. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922490#comment-13922490 ] Hong-Thai Nguyen edited comment on TIKA-1257 at 3/6/14 1:50 PM: Fixed on r1574874 r1574877 was (Author: thaichat04): Fixed on r1574874 MS Word Filter out control characters on ouput -- Key: TIKA-1257 URL: https://issues.apache.org/jira/browse/TIKA-1257 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Fix For: 1.6 Attachments: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc, tika-doc-control-char.png Control characters appear mostly in the table of index and are not displayable. We should filter them out. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1257: --- Attachment: (was: 5f01ae23-9e6e-4faa-808a-f78dbb20cc71.doc) MS Word Filter out control characters on ouput -- Key: TIKA-1257 URL: https://issues.apache.org/jira/browse/TIKA-1257 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Fix For: 1.6 Attachments: tika-doc-control-char.png Control characters appear mostly in the table of index and are not displayable. We should filter them out. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1257) MS Word Filter out control characters on ouput
[ https://issues.apache.org/jira/browse/TIKA-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1257: --- Attachment: testControlCharacters.doc MS Word Filter out control characters on ouput -- Key: TIKA-1257 URL: https://issues.apache.org/jira/browse/TIKA-1257 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Fix For: 1.6 Attachments: testControlCharacters.doc, tika-doc-control-char.png Control characters appear mostly in the table of index and are not displayable. We should filter them out. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-623: -- Fix Version/s: 1.6 Add support for Outlook PST --- Key: TIKA-623 URL: https://issues.apache.org/jira/browse/TIKA-623 Project: Tika Issue Type: New Feature Components: parser Reporter: Tran Nam Quang Fix For: 1.6 Attachments: OutlookPSTParser.java Hello everyone, As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/ I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika. Best regards Tran Nam Quang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920692#comment-13920692 ] Hong-Thai Nguyen edited comment on TIKA-623 at 3/5/14 9:30 AM: --- java-libpst-0.7 has been uploaded to oss sonatype nexus: https://issues.sonatype.org/browse/OSSRH-8965 If there's no objection, I'll refactor the attached parser and provide output as:
{code}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="271360" />
<meta name="isValid" content="true" />
<meta name="Content-Type" content="application/vnd.ms-outlook" />
<title></title>
</head>
<body>
<div class="email-folder">
<h1>Début du fichier de données Outlook</h1>
<div class="email-entry">
<h1>&lt;530d9cac.5080...@gmail.com&gt;</h1>
<meta subject="Re: Feature Generators" />
<meta internetMessageId="&lt;530d9cac.5080...@gmail.com&gt;" />
<meta descriptorNodeId="2097188" />
<meta lastModificationTime="1393418263291" />
<meta senderName="Jörn Kottmann" />
<meta senderEmailAddress="kottm...@gmail.com" />
<meta recipients="No recipients table!" />
<p>mail content</p>
</div>
<div class="email-folder">
<h1>Éléments supprimés</h1>
</div>
</div>
<div class="email-folder">
<h1>Racine (pour la recherche)</h1>
</div>
<div class="email-folder">
<h1>SPAM Search Folder 2</h1>
</div>
</body>
</html>
{code}
was (Author: thaichat04): java-libpst-0.7 has been uploaded to oss sonatype nexus. If there's no objection, I'll refactor the attached parser and provide output as:
{code}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="271360" />
<meta name="isValid" content="true" />
<meta name="Content-Type" content="application/vnd.ms-outlook" />
<title></title>
</head>
<body>
<div class="email-folder">
<h1>Début du fichier de données Outlook</h1>
<div class="email-entry">
<h1>&lt;530d9cac.5080...@gmail.com&gt;</h1>
<meta subject="Re: Feature Generators" />
<meta internetMessageId="&lt;530d9cac.5080...@gmail.com&gt;" />
<meta descriptorNodeId="2097188" />
<meta lastModificationTime="1393418263291" />
<meta senderName="Jörn Kottmann" />
<meta senderEmailAddress="kottm...@gmail.com" />
<meta recipients="No recipients table!" />
<p>mail content</p>
</div>
<div class="email-folder">
<h1>Éléments supprimés</h1>
</div>
</div>
<div class="email-folder">
<h1>Racine (pour la recherche)</h1>
</div>
<div class="email-folder">
<h1>SPAM Search Folder 2</h1>
</div>
</body>
</html>
{code}
Add support for Outlook PST --- Key: TIKA-623 URL: https://issues.apache.org/jira/browse/TIKA-623 Project: Tika Issue Type: New Feature Components: parser Reporter: Tran Nam Quang Assignee: Hong-Thai Nguyen Fix For: 1.6 Attachments: OutlookPSTParser.java Hello everyone, As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/ I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika. Best regards Tran Nam Quang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen reassigned TIKA-623: - Assignee: Hong-Thai Nguyen Add support for Outlook PST --- Key: TIKA-623 URL: https://issues.apache.org/jira/browse/TIKA-623 Project: Tika Issue Type: New Feature Components: parser Reporter: Tran Nam Quang Assignee: Hong-Thai Nguyen Fix For: 1.6 Attachments: OutlookPSTParser.java Hello everyone, As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/ I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika. Best regards Tran Nam Quang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-623. --- Resolution: Fixed Committed in r1574411 Add support for Outlook PST --- Key: TIKA-623 URL: https://issues.apache.org/jira/browse/TIKA-623 Project: Tika Issue Type: New Feature Components: parser Reporter: Tran Nam Quang Assignee: Hong-Thai Nguyen Fix For: 1.6 Attachments: OutlookPSTParser.java Hello everyone, As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/ I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika. Best regards Tran Nam Quang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-1089) Tika conversion failed on following documents
[ https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1089. Resolution: Invalid Fix Version/s: 1.5 Assignee: Hong-Thai Nguyen A separate issue should be created for each file, so that they can be investigated and resolved one by one. Tika conversion failed on following documents - Key: TIKA-1089 URL: https://issues.apache.org/jira/browse/TIKA-1089 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: windows, api Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Labels: test Fix For: 1.5 Attachments: crawler.log We are using Tika as our main converter of diverse file formats to text and HTML in a search engine. We've collected some documents (46) which Tika cannot convert: http://www.mediafire.com/?60clr812lerx3gy -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (TIKA-1223) Extract thumbnail of OOXML Office files
[ https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen reassigned TIKA-1223: -- Assignee: Hong-Thai Nguyen Extract thumbnail of OOXML Office files --- Key: TIKA-1223 URL: https://issues.apache.org/jira/browse/TIKA-1223 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Priority: Minor Fix For: 1.6 Attachments: TIKA-1223.patch Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in the package. We can extract this embedded thumbnail from OOXML files. As discussed on the mailing list, we should extract the thumbnail as an attachment, not as metadata (TIKA-90). {noformat} embeddedRelationId format is thumbnail_{i}.{extension}. {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (TIKA-1223) Extract thumbnail of OOXML Office files
[ https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1223. Resolution: Fixed r1568954 Extract thumbnail of OOXML Office files --- Key: TIKA-1223 URL: https://issues.apache.org/jira/browse/TIKA-1223 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Priority: Minor Fix For: 1.6 Attachments: TIKA-1223.patch Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in the package. We can extract this embedded thumbnail from OOXML files. As discussed on the mailing list, we should extract the thumbnail as an attachment, not as metadata (TIKA-90). {noformat} embeddedRelationId format is thumbnail_{i}.{extension}. {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
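The naming rule quoted in the issue (thumbnail_{i}.{extension}) can be illustrated with a small Java sketch. This is not the code from TIKA-1223.patch. The class name ThumbnailSketch, the zero-based index, and the assumption that thumbnails are package parts named docProps/thumbnail.<extension> are illustrative choices only; a real implementation would more likely resolve the thumbnail through the package relationships.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative sketch only; hypothetical class, not the patch attached to TIKA-1223.
class ThumbnailSketch {

    // Builds a name following the thumbnail_{i}.{extension} convention from the issue.
    static String embeddedRelationId(int index, String extension) {
        return "thumbnail_" + index + "." + extension;
    }

    // Assumption: thumbnails are ZIP entries named docProps/thumbnail.<extension>.
    // The index starts at 0 here; the issue does not say where numbering starts.
    static List<String> thumbnailIds(InputStream ooxmlPackage) throws IOException {
        List<String> ids = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(ooxmlPackage)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("docProps/thumbnail.")) {
                    String extension = name.substring(name.lastIndexOf('.') + 1);
                    ids.add(embeddedRelationId(ids.size(), extension));
                }
            }
        }
        return ids;
    }

    // Creates a tiny in-memory ZIP so the sketch can run without a real Office file.
    static byte[] samplePackage() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes("UTF-8"));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docProps/thumbnail.jpeg"));
            zip.write(new byte[] {1, 2, 3});
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(thumbnailIds(new ByteArrayInputStream(samplePackage())));
    }
}
```

Treating the thumbnail as an embedded resource with its own id, rather than as a metadata value, matches the decision recorded in the issue (attachment, not metadata).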
[jira] [Assigned] (TIKA-1223) Extract thumbnail of OOXML Office files
[ https://issues.apache.org/jira/browse/TIKA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen reassigned TIKA-1223: -- Assignee: (was: Hong-Thai Nguyen) Extract thumbnail of OOXML Office files --- Key: TIKA-1223 URL: https://issues.apache.org/jira/browse/TIKA-1223 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Hong-Thai Nguyen Priority: Minor Fix For: 1.6 Attachments: TIKA-1223.patch Starting with the Microsoft Office 2007 file formats, a thumbnail can be included in the package. We can extract this embedded thumbnail from OOXML files. As discussed on the mailing list, we should extract the thumbnail as an attachment, not as metadata (TIKA-90). {noformat} embeddedRelationId format is thumbnail_{i}.{extension}. {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1224. Resolution: Fixed Adding Source code (Java, Groovy, C) parser --- Key: TIKA-1224 URL: https://issues.apache.org/jira/browse/TIKA-1224 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Minor We can parse some source code file formats: text/x-java-source, text/x-groovy, text/x-c. For HTML rendering of the code, we can use jhighlight: http://www.ohloh.net/p/jhighlight -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889491#comment-13889491 ] Hong-Thai Nguyen commented on TIKA-1224: Committed in 1563902 Adding Source code (Java, Groovy, C) parser --- Key: TIKA-1224 URL: https://issues.apache.org/jira/browse/TIKA-1224 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Minor We can parse some source code file formats: text/x-java-source, text/x-groovy, text/x-c. For HTML rendering of the code, we can use jhighlight: http://www.ohloh.net/p/jhighlight -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1224) Adding Source code (Java, Groovy, C) parser
[ https://issues.apache.org/jira/browse/TIKA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877343#comment-13877343 ] Hong-Thai Nguyen commented on TIKA-1224: I agree that parsing each language in depth is not simple. This work (already done) only provides an HTML rendering of the source code, plus whatever metadata can be extracted from the javadoc comment (such as author, version ...) and possibly other interesting values such as LoC. When we need a more detailed result for a language, we must implement a dedicated parser. This parser is useful in search applications. Adding Source code (Java, Groovy, C) parser --- Key: TIKA-1224 URL: https://issues.apache.org/jira/browse/TIKA-1224 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Minor We can parse some source code file formats: text/x-java-source, text/x-groovy, text/x-c. For HTML rendering of the code, we can use jhighlight: http://www.ohloh.net/p/jhighlight -- This message was sent by Atlassian JIRA (v6.1.5#6160)
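The lightweight metadata extraction described in the comment above (author from the javadoc comment, LoC) might look roughly like the following sketch. It is not the code committed for TIKA-1224. The class SourceMetadataSketch and its methods are hypothetical, and "LoC" is assumed here to mean non-blank lines, which the comment does not define.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; hypothetical class, not the parser committed for TIKA-1224.
class SourceMetadataSketch {

    // Matches a javadoc-style "@author Name" tag up to the end of its line.
    private static final Pattern AUTHOR = Pattern.compile("@author\\s+(.+)");

    // Returns the first @author value found in the source text, or null if there is none.
    static String firstAuthor(String source) {
        Matcher matcher = AUTHOR.matcher(source);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    // Assumption: "LoC" means the number of non-blank lines, comments included.
    static int linesOfCode(String source) {
        int count = 0;
        for (String line : source.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "/**\n * Demo.\n * @author Jane Doe\n */\npublic class Demo {\n\n}\n";
        System.out.println("author=" + firstAuthor(source) + ", loc=" + linesOfCode(source));
    }
}
```

A regular expression is enough for this kind of shallow extraction; anything that needs real language structure is the "dedicated parser" case the comment mentions.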
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870573#comment-13870573 ] Hong-Thai Nguyen commented on TIKA-1215: Great catch. Thanks, [~jukkaz]! Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4 -- Key: TIKA-1215 URL: https://issues.apache.org/jira/browse/TIKA-1215 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Critical Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, tika-1215-without-wildcard.patch With the attached file, 1.5 raises this exception on parsing. This file has no problem on 1.4. {code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
    ... 15 more
{code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1215: --- Attachment: tika-1215-without-wildcard.patch [~gagravarr], my code style differs from the Apache convention; apologies for that. I have attached a new patch file containing only the changes. Thanks Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4 -- Key: TIKA-1215 URL: https://issues.apache.org/jira/browse/TIKA-1215 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Critical Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, tika-1215-without-wildcard.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869590#comment-13869590 ] Hong-Thai Nguyen commented on TIKA-1215: [~talli...@apache.org], here's the XML input to parse: {noformat}
<h1 xmlns="http://www.w3.org/1999/xhtml">Matin Première - Tour des régions 080806</h1>
<p>RTBF - La Première</p>
<p>Speech</p>
<p>101698.914</p>
<p>XXX - A propos du contrat de quartier rues Dublin/Dubreucq</p>
{noformat} I think this regression came from TIKA-1070: {code} currentElement = currentElement.parent; {code} The parent element of the p element is null, so getPrefix() raises an exception; this differs from the behaviour in 1.4. Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4 -- Key: TIKA-1215 URL: https://issues.apache.org/jira/browse/TIKA-1215 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Critical Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, tika-1215-without-wildcard.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
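The failure mode described in the comment above can be modelled with a short Java sketch. This is a simplified illustration, not Tika's ToXMLContentHandler code. The Scope class, its fields, and the exception type are assumptions made for the example; only the idea (a prefix lookup that walks the parent chain and fails once the declaring element has been popped) comes from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; hypothetical classes, not Tika's actual implementation.
class NamespaceScopeSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // One scope per open element; a declaration is visible to the element and its descendants.
    static final class Scope {
        final Scope parent;
        final Map<String, String> prefixesByUri = new HashMap<>();

        Scope(Scope parent) {
            this.parent = parent;
        }

        // Walks up the parent chain until some scope declares the namespace URI.
        String getPrefix(String uri) {
            for (Scope scope = this; scope != null; scope = scope.parent) {
                String prefix = scope.prefixesByUri.get(uri);
                if (prefix != null) {
                    return prefix;
                }
            }
            throw new IllegalStateException("Namespace " + uri + " not declared");
        }
    }

    // Returns true when the lookup succeeds, false when it throws.
    static boolean canResolve(Scope scope, String uri) {
        try {
            scope.getPrefix(uri);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // <h1 xmlns="..."> is a top-level element that carries the declaration itself.
        Scope h1 = new Scope(null);
        h1.prefixesByUri.put(XHTML, "");
        System.out.println("h1 resolves: " + canResolve(h1, XHTML));

        // After </h1>, the current element becomes h1.parent, which is null.
        Scope current = h1.parent;

        // The following <p> has a null parent and no declaration of its own.
        Scope p = new Scope(current);
        System.out.println("p resolves: " + canResolve(p, XHTML));
    }
}
```

In this model the declaration lives only on the h1 scope, so any sibling opened after h1 is closed has nothing to inherit from. That corresponds to the "parent element of p is null" observation in the comment.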
[jira] [Commented] (TIKA-90) Allow thumbnails as document metadata
[ https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866498#comment-13866498 ] Hong-Thai Nguyen commented on TIKA-90: -- Useful for Office Open XML and OpenOffice files, and for some other formats with an embedded thumbnail. Allow thumbnails as document metadata - Key: TIKA-90 URL: https://issues.apache.org/jira/browse/TIKA-90 Project: Tika Issue Type: New Feature Components: general Reporter: Jukka Zitting It would be nice if parser components could produce thumbnail images and other non-string metadata when parsing documents. To do this, we could either generalize the current Metadata methods, or introduce new methods for handling such non-string metadata. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864202#comment-13864202 ] Hong-Thai Nguyen commented on TIKA-1216: I've tested this file with a simple test case. It seems that this problem is identical to TIKA-1215. parse method of Mp3Parser doesn't work for few mp3 files Key: TIKA-1216 URL: https://issues.apache.org/jira/browse/TIKA-1216 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows 7 ultimate 32-bit OS, Java 1.7 Reporter: Sumeet Gorab Priority: Blocker Labels: patch Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3 When trying to parse an MP3 file, the parse method of the Mp3Parser class is not able to parse it. The parse method is not able to complete its execution; there is some issue in that method. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1215: --- Attachment: TIKA-1215-fix-prefix-namespaces.patch I made a fix with a test for this issue. Please review it and commit it soon. Thanks Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4 -- Key: TIKA-1215 URL: https://issues.apache.org/jira/browse/TIKA-1215 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Critical Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864202#comment-13864202 ] Hong-Thai Nguyen edited comment on TIKA-1216 at 1/7/14 3:57 PM: I've tested this file with a simple test case. It seems that this problem is identical to TIKA-1215. A patch has been submitted on that issue and is waiting for review and commit. Thanks was (Author: thaichat04): I've test with a simple test case with this file. It seems that, this problem is identical with TIKA-1215. parse method of Mp3Parser doesn't work for few mp3 files Key: TIKA-1216 URL: https://issues.apache.org/jira/browse/TIKA-1216 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: Windows 7 ultimate 32-bit OS, Java 1.7 Reporter: Sumeet Gorab Priority: Blocker Labels: patch Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3 When trying to parse an MP3 file, the parse method of the Mp3Parser class is not able to parse it. The parse method is not able to complete its execution; there is some issue in that method. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246 ] Hong-Thai Nguyen commented on TIKA-1215: [~davemeikle], here's a sample test that fails on this file: {code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {
    private ConverterConfiguration config;
    private CompositeParser parser;

    @Before
    public void before() throws Exception {
        config = new ConverterConfiguration();
        config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
        config.setSizeLimit(40);
        TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
        parser = (CompositeParser) tikaConf.getParser();
    }

    @Test
    public void can_parse_mp3_files() throws Exception {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        // Extract always HTML by default
        ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8");
        WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
        ContentHandler bodyHandler = new BodyContentHandler(handler);
        InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
        try {
            ParseContext context = new ParseContext();
            // parsing
            context.set(Parser.class, parser);
            parser.parse(input, bodyHandler, new Metadata(), context);
        } finally {
            IOUtils.closeQuietly(input);
        }
        String output = outputStream.toString("UTF-8");
        assertThat(output).isNotEmpty(); // failed
    }
}
{code} Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4 --- Key: TIKA-1215 URL: https://issues.apache.org/jira/browse/TIKA-1215 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Critical Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1215: --- Summary: Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4 (was: Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4 -- Key: TIKA-1215 URL: https://issues.apache.org/jira/browse/TIKA-1215 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Priority: Critical Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246 ] Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 3:11 PM: [~davemeikle], here's a sample test that fails on this file: {code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {
    private ConverterConfiguration config;
    private CompositeParser parser;

    @Before
    public void before() throws Exception {
        config = new ConverterConfiguration();
        config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
        config.setSizeLimit(40);
        TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
        parser = (CompositeParser) tikaConf.getParser();
    }

    @Test
    public void can_parse_mp3_files() throws Exception {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        // Extract always HTML by default
        ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8");
        WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
        ContentHandler bodyHandler = new BodyContentHandler(handler);
        InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
        try {
            ParseContext context = new ParseContext();
            // parsing
            context.set(Parser.class, parser);
            Metadata metadata = new Metadata();
            metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
            metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
            parser.parse(input, bodyHandler, metadata, context);
        } finally {
            IOUtils.closeQuietly(input);
        }
        String output = outputStream.toString("UTF-8");
        assertThat(output).isNotEmpty(); // failed
    }
}
{code} Here's the stack trace: {noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246 ]

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 3:12 PM:

[~davemeikle], here is a sample test that fails on this file with 1.5-SNAPSHOT but passes on 1.4:

{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

    private ConverterConfiguration config;
    private CompositeParser parser;

    @Before
    public void before() throws Exception {
        config = new ConverterConfiguration();
        config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
        config.setSizeLimit(40);
        TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
        parser = (CompositeParser) tikaConf.getParser();
    }

    @Test
    public void can_parse_mp3_files() throws Exception {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        // Extract always HTML by default
        ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8");
        WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
        ContentHandler bodyHandler = new BodyContentHandler(handler);
        InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
        try {
            ParseContext context = new ParseContext();
            // parsing
            context.set(Parser.class, parser);
            Metadata metadata = new Metadata();
            metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
            metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
            parser.parse(input, bodyHandler, metadata, context);
        } finally {
            IOUtils.closeQuietly(input);
        }
        String output = outputStream.toString("UTF-8");
        assertThat(output).isNotEmpty(); // failed
    }
}
{code}

Here is the stack trace:

{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246 ]

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 5:20 PM:

[~davemeikle], here is a sample test that fails on this file with 1.5-SNAPSHOT but passes on 1.4:

{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

    private CompositeParser parser;

    @Before
    public void before() throws Exception {
        TikaConfig tikaConf = new TikaConfig();
        parser = (CompositeParser) tikaConf.getParser();
    }

    @Test
    public void can_parse_mp3_files() throws Exception {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        // Extract always HTML by default
        ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8");
        WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
        ContentHandler bodyHandler = new BodyContentHandler(handler);
        InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
        try {
            ParseContext context = new ParseContext();
            // parsing
            context.set(Parser.class, parser);
            Metadata metadata = new Metadata();
            metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
            metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
            parser.parse(input, bodyHandler, metadata, context);
        } finally {
            IOUtils.closeQuietly(input);
        }
        String output = outputStream.toString("UTF-8");
        assertThat(output).isNotEmpty(); // failed
        System.out.println(output);
    }
}
{code}

Here is the stack trace:

{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at
[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file
[ https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857418#comment-13857418 ]

Hong-Thai Nguyen commented on TIKA-1152:

Thanks [~jukkaz], I've checked on trunk. It seems OK now.

Process loops infinitely on parsing of a CHM file

Key: TIKA-1152
URL: https://issues.apache.org/jira/browse/TIKA-1152
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Assignee: Jukka Zitting
Priority: Critical
Fix For: 1.5
Attachments: ChmLzxBlock.java.patch, eventcombmt.chm

When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help file), the Java process gets stuck.

{code}
Thread[main,5,main]
org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
org.apache.tika.parser.chm.lzx.ChmLzxBlock.init(ChmLzxBlock.java:77)
org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
...
{code}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1215:

Attachment: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3

Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

Key: TIKA-1215
URL: https://issues.apache.org/jira/browse/TIKA-1215
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3

With the attached file, 1.5 raises this exception on parsing. The file parses without problems on 1.4.

{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
    ... 15 more
{code}
[jira] [Created] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
Hong-Thai Nguyen created TIKA-1215:

Summary: Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
Key: TIKA-1215
URL: https://issues.apache.org/jira/browse/TIKA-1215
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3

With the attached file, 1.5 raises this exception on parsing. The file parses without problems on 1.4.

{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
    ... 15 more
{code}
[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857542#comment-13857542 ]

Hong-Thai Nguyen edited comment on TIKA-1215 at 12/27/13 3:59 PM:

I built the latest trunk of git://git.apache.org/tika.git and ran it via the Java API.

was (Author: thaichat04):
I built on latest trunk of git://git.apache.org/tika.git

Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

Key: TIKA-1215
URL: https://issues.apache.org/jira/browse/TIKA-1215
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3

With the attached file, 1.5 raises this exception on parsing. The file parses without problems on 1.4.

{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
    ... 15 more
{code}
[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file
[ https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855528#comment-13855528 ]

Hong-Thai Nguyen commented on TIKA-1152:

[~gagravarr], or anyone else: could you please have a look at the patch and integrate it into trunk before the 1.5 release? Thanks.

Process loops infinitely on parsing of a CHM file

Key: TIKA-1152
URL: https://issues.apache.org/jira/browse/TIKA-1152
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
Fix For: 1.5
Attachments: ChmLzxBlock.java.patch, eventcombmt.chm

When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help file), the Java process gets stuck.

{code}
Thread[main,5,main]
org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
org.apache.tika.parser.chm.lzx.ChmLzxBlock.init(ChmLzxBlock.java:77)
org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
...
{code}
[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845398#comment-13845398 ]

Hong-Thai Nguyen commented on TIKA-1205:

Just a (newbie) question: why limit this to PDFParser rather than offering it for any parser? I agree that a fallback is necessary when there is an exception. But the worst case is an infinite loop while parsing a document. To cover both cases, could we generalize this into a wrapper that handles exceptions and timeouts properly?

Allow PDFParser to fallback to other parser if there is an exception

Key: TIKA-1205
URL: https://issues.apache.org/jira/browse/TIKA-1205
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
Fix For: 1.5

With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser instead of the traditional parser for parsing PDF files. Following the description in PDFBOX-1199, it would be useful to allow fallback to the classic parser if NonSequentialPDFParser throws an IOException.

For the sake of symmetry, I propose a boolean useParserFallbackOnException parameter. If this parameter is true and Tika's PDFParser is using the classic parser, Tika will fall back to the NonSequentialPDFParser if there is an IOException; if this parameter is true and Tika's PDFParser is using the NonSequentialPDFParser, it will fall back to the classic parser if there is an IOException.

Many thanks to Hong-Thai for championing the addition of the NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for PDFBox's NonSequentialPDFParser (PDFBOX-1199)!

--
This message was sent by Atlassian JIRA (v6.1.4#6159)
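
The generalization asked about in the TIKA-1205 comment above (one wrapper that handles both an exception and a parse that never returns) could look roughly like the sketch below. It is only an illustration under stated assumptions: `GuardedCall`, `callWithFallback`, and `callWithTimeout` are hypothetical names that exist in neither Tika nor PDFBox, and the tasks are plain `Callable`s standing in for whatever parser invocation would be wrapped. It relies only on `java.util.concurrent` from the JDK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a Tika class): time-limited call with one fallback attempt. */
final class GuardedCall {

    private GuardedCall() {
    }

    /**
     * Runs the primary task on a worker thread. If it throws, or does not
     * finish within timeoutMillis, the fallback task is run under the same limit.
     * A failure or timeout of the fallback is propagated to the caller.
     */
    static <T> T callWithFallback(Callable<T> primary, Callable<T> fallback,
                                  long timeoutMillis) throws Exception {
        try {
            return callWithTimeout(primary, timeoutMillis);
        } catch (ExecutionException | TimeoutException e) {
            // The primary task failed or hung: give the fallback one attempt.
            return callWithTimeout(fallback, timeoutMillis);
        }
    }

    /** Runs a single task and gives up after timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws ExecutionException, TimeoutException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "guarded-call");
            worker.setDaemon(true); // a stuck task must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that never checks for interruption keeps running.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In a Tika setting the primary task would presumably wrap a `parser.parse(...)` call and the fallback a second parser, with the input reopened for the second attempt because the first one may already have consumed the stream. Note that a thread-based timeout can only abandon a busy loop that ignores interruption, not stop it; isolating such a parse in a separate process is the more robust option.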