[jira] [Commented] (TIKA-775) Embed Capabilities
[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532033#comment-13532033 ]

Nick Burch commented on TIKA-775:
---------------------------------

Could you maybe add a simple dummy parser for testing with?

Another option is to have the test in the parsers package, even though the main code is in core. We have quite a few examples of that, eg some of the mime magic stuff is tested in parsers because that's where the test files live

> Embed Capabilities
> ------------------
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
> Issue Type: Improvement
> Components: general, metadata
> Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be installed.
> Reporter: Ray Gauss II
> Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed_20121029.diff, embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed ExternalEmbedder implementation meant to be extended or configured are added. These classes are essentially a reverse flow of the existing Parser and ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which uses the default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION then verify the operation by parsing the resulting stream.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1044) Can't parse Word files with no format set
[ https://issues.apache.org/jira/browse/TIKA-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1044.
------------------------------

Resolution: Fixed
Fix Version/s: 1.3

Fixed in r1421646, along with a unit test based on your files, thanks!

> Can't parse Word files with no format set
> -----------------------------------------
>
> Key: TIKA-1044
> URL: https://issues.apache.org/jira/browse/TIKA-1044
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.0
> Reporter: Jonas Wilhelmsson
> Priority: Trivial
> Fix For: 1.3
>
> Attachments: test2.doc, test.docx
>
>
> When we were using Solr for indexing we came across this Tika bug.
> While parsing a doc or docx file that contains text without any format set (format inside Microsoft Word) the parser will throw exceptions.
> By setting a format to the text the file can be correctly parsed without unexpected errors.
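The stack trace attached to this issue ends in a NullPointerException inside the paragraph style handling, which suggests that a missing style was dereferenced. The actual change in r1421646 is not shown in this thread, so the following is only a generic sketch of a null guard for that situation. The class name, method name, and the style-to-tag mapping are all hypothetical and are not taken from Tika's WordExtractor.

```java
import java.util.Locale;

// Hypothetical sketch; none of these names come from Tika's WordExtractor.
final class ParagraphTags {

    /**
     * Maps a Word style name to an XHTML tag name.
     * Text with no style set (null or empty name) falls back to a plain paragraph
     * instead of triggering a NullPointerException.
     */
    static String tagFor(String styleName) {
        if (styleName == null || styleName.isEmpty()) {
            return "p"; // unstyled text is treated as an ordinary paragraph
        }
        String lower = styleName.toLowerCase(Locale.ROOT);
        if (lower.startsWith("heading")) {
            // Accept a single trailing digit such as "Heading 2".
            String rest = lower.substring("heading".length()).trim();
            if (rest.length() == 1 && Character.isDigit(rest.charAt(0))) {
                int level = rest.charAt(0) - '0';
                if (level >= 1 && level <= 6) {
                    return "h" + level;
                }
            }
        }
        return "p"; // any other style is rendered as a paragraph
    }
}
```

With this guard, a call such as ParagraphTags.tagFor(null) yields a paragraph tag rather than an exception, which is the behaviour the reporter expected for unformatted text.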
[jira] [Commented] (TIKA-775) Embed Capabilities
[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531938#comment-13531938 ]

Ray Gauss II commented on TIKA-775:
-----------------------------------

According to a few posts on the subject including one on developerWorks [1] it looks like it's more appropriate to reassert the thread's interrupt status with:
{code}
...
} catch (InterruptedException ignore) {
    Thread.currentThread().interrupt();
}
...
{code}
rather than refactoring {{ExternalParser}} and {{ExternalEmbedder}} to re-throw it or wrap in a {{TikaException}}.

I too would prefer {{ExternalEmbedderTest}} to be in core, but I do feel that we want to confirm the embedding with a known working parser.

Would anyone have issue with moving {{TXTParser}} and its test into core? There don't seem to be any issues with dependencies when trying it.

[1] http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html
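For readers who want to see the interrupt idiom from the comment above in a complete form, here is a minimal standalone sketch. It is not the code of {{ExternalParser}} or {{ExternalEmbedder}}; the class and method names are made up for illustration, and Thread.sleep stands in for whatever blocking call is being made.

```java
// Hypothetical helper, not part of Tika: shows the "restore the interrupt" idiom in isolation.
final class InterruptAwareWait {

    /**
     * Sleeps for the given time.
     * Returns true if the sleep completed, false if it was interrupted.
     */
    static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException ignore) {
            // Catching the exception clears the interrupt flag, so set it again
            // to let code further up the call stack see the interruption.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The point of the design is that the method signature stays free of InterruptedException, while callers can still check Thread.currentThread().isInterrupted() afterwards and stop their own work if needed.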
[jira] [Commented] (TIKA-1043) Tika parser v1.2 fails on legacy power point documents
[ https://issues.apache.org/jira/browse/TIKA-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531691#comment-13531691 ]

Nick Burch commented on TIKA-1043:
----------------------------------

Could you post the full stacktrace? That one is missing the interesting bit of why the parser broke...

Also, do you have an example file that shows the problem?

> Tika parser v1.2 fails on legacy power point documents
> ------------------------------------------------------
>
> Key: TIKA-1043
> URL: https://issues.apache.org/jira/browse/TIKA-1043
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.2
> Environment: Solr 4.0 on Tomcat 7 with manifoldcf v1.1 dev
> Reporter: David Morana
> Fix For: 1.2
>
>
> I can't index "older" powerpoint documents
> I did some research and the current "fix" is to open the legacy ppt and save it as a newer version of PowerPoint.
> I have over 3000 ppt docs in my development environment alone so that's not an option.
> Here's the error in solr:
> SEVERE: null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@e86b202
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:244)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:383)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:243)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:188)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:166)
>     at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:288)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
[jira] [Resolved] (TIKA-990) Mp3Parser extracts wrong number of channels
[ https://issues.apache.org/jira/browse/TIKA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-990.
------------------------------

Resolution: Fixed
Fix Version/s: 1.3

Resolved in r1421584.

> Mp3Parser extracts wrong number of channels
> -------------------------------------------
>
> Key: TIKA-990
> URL: https://issues.apache.org/jira/browse/TIKA-990
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Reporter: Oliver Heger
> Assignee: Ray Gauss II
> Fix For: 1.3
>
>
> In class {{AudioFrame}} the last two bits of the MPEG frame header are used to determine the number of channels. According to my documentation, this information is encoded in bits 7 and 6.
> I did a cross check with the ID3 tag editor tool ID3-TagIT (http://www.id3-tagit.de/). The unit tests expect that the test MP3 files have 2 channels. However, the tool reports that the files are mono.
[jira] [Commented] (TIKA-990) Mp3Parser extracts wrong number of channels
[ https://issues.apache.org/jira/browse/TIKA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531588#comment-13531588 ]

Ray Gauss II commented on TIKA-990:
-----------------------------------

After investigating this:
* Confirmed the test files are in fact mono, with the exception of {{testMP3lyrics.mp3}}
* Confirmed the MPEG spec does define the channels in bits 7 and 6
* After creating my own test files, confirmed Tika does report the channels incorrectly

Committing a fix shortly.
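To make the bit positions discussed above concrete, here is a small sketch of reading the channel mode from a 32-bit MPEG audio frame header. It is not the code of {{AudioFrame}} and not the committed fix; the class and method names are hypothetical, and it assumes bits are numbered from 0 at the least significant end, so that "bits 7 and 6" are the top two bits of the last header byte.

```java
// Hypothetical helper for illustration; this is not Tika's AudioFrame code.
final class Mp3ChannelMode {

    /**
     * Returns the channel count encoded in a 32-bit MPEG audio frame header.
     * The channel mode sits in bits 7 and 6 (bit 0 is the least significant):
     * 00 stereo, 01 joint stereo, 10 dual channel, 11 single channel (mono).
     * Bits 1 and 0 hold the emphasis field, so reading the last two bits
     * of the header gives the wrong answer.
     */
    static int channels(int header) {
        int mode = (header >>> 6) & 0x3;
        return mode == 3 ? 1 : 2;
    }
}
```

For example, a header whose last byte is 0xC0 has mode 11 and is reported as one channel, while a last byte of 0x00 or 0x64 is reported as two channels.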
[jira] [Assigned] (TIKA-990) Mp3Parser extracts wrong number of channels
[ https://issues.apache.org/jira/browse/TIKA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-990:
--------------------------------

Assignee: Ray Gauss II
[jira] [Created] (TIKA-1044) Can't parse Word files with no format set
Jonas Wilhelmsson created TIKA-1044:
---------------------------------------

Summary: Can't parse Word files with no format set
Key: TIKA-1044
URL: https://issues.apache.org/jira/browse/TIKA-1044
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.0
Reporter: Jonas Wilhelmsson
Priority: Trivial
Attachments: test2.doc, test.docx

When we were using Solr for indexing we came across this Tika bug.
While parsing a doc or docx file that contains text without any format set (format inside Microsoft Word) the parser will throw exceptions.
By setting a format to the text the file can be correctly parsed without unexpected errors.
[jira] [Updated] (TIKA-1044) Can't parse Word files with no format set
[ https://issues.apache.org/jira/browse/TIKA-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonas Wilhelmsson updated TIKA-1044:
------------------------------------

Attachment: test2.doc
            test.docx

Stacktrace while parsing test.docx:
2012-dec-13 15:51:39 org.apache.solr.common.SolrException log
ALLVARLIG: org.apache.solr.common.SolrException
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:233)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@299629
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:227)
    ... 19 more
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.microsoft.WordExtractor.buildParagraphTagAndStyle(WordExtractor.java:463)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:108)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:76)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:63)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:97)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 23 more

Stacktrace while parsing test2.doc:
2012-dec-13 15:51:52 org.apache.solr.common.SolrException log
ALLVARLIG: org.apache.solr.common.SolrException
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:233)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:
[jira] [Created] (TIKA-1043) Tika parser v1.2 fails on legacy power point documents
David Morana created TIKA-1043:
----------------------------------

Summary: Tika parser v1.2 fails on legacy power point documents
Key: TIKA-1043
URL: https://issues.apache.org/jira/browse/TIKA-1043
Project: Tika
Issue Type: Bug
Affects Versions: 1.2
Environment: Solr 4.0 on Tomcat 7 with manifoldcf v1.1 dev
Reporter: David Morana
Fix For: 1.2

I can't index "older" powerpoint documents
I did some research and the current "fix" is to open the legacy ppt and save it as a newer version of PowerPoint.
I have over 3000 ppt docs in my development environment alone so that's not an option.
Here's the error in solr:
SEVERE: null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@e86b202
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:244)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:383)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:243)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:188)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:166)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:288)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
[jira] [Resolved] (TIKA-1041) Tika 1.2 universalcharset errors
[ https://issues.apache.org/jira/browse/TIKA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-1041.
---------------------------------

Resolution: Fixed
Fix Version/s: (was: 1.2)
Assignee: Jukka Zitting

I fixed this in revision 1421141 by catching the NoClassDefFoundError and just ignoring the missing functionality when the required dependency is not present.

A deployment can pass in a ServiceLoader with a custom LoadErrorHandler through the ParseContext to log or otherwise handle such dependency issues.

> Tika 1.2 universalcharset errors
> --------------------------------
>
> Key: TIKA-1041
> URL: https://issues.apache.org/jira/browse/TIKA-1041
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.2
> Environment: I'm running solr 4.0 with tika 1.2 on tomcat 7.0.8 with manifoldcf v1.1dev
> Reporter: David Morana
> Assignee: Jukka Zitting
> Fix For: 1.3
>
>
> This is somewhat confusing and frustrating. I successfully crawled Opentext using all of the above. then I recrawled and it aborted almost immediately.
> It choked on images, so I excluded them for now.
> but now it's choking on txt files!
> sometimes I get this error
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/mozilla/universalchardet/CharsetListener
> and sometimes I get this one
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/apache/tika/parser/txt/UniversalEncodingListener
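The approach described in the resolution (catch the NoClassDefFoundError, carry on without the optional feature, and let a handler decide what to do with the error) can be sketched in plain Java as follows. This is not the change made in revision 1421141 and it does not use Tika's ServiceLoader or LoadErrorHandler; the class name, method name, and the Consumer-based handler are assumptions made purely for illustration.

```java
import java.util.Optional;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper; it only mimics the idea, not Tika's ServiceLoader or LoadErrorHandler.
final class OptionalFeature {

    /**
     * Runs the supplier and returns its result.
     * If a class the feature depends on is missing at runtime, the error is
     * passed to the handler and an empty Optional is returned instead.
     */
    static <T> Optional<T> tryLoad(Supplier<T> feature, Consumer<Throwable> onError) {
        try {
            return Optional.ofNullable(feature.get());
        } catch (NoClassDefFoundError e) {
            // The optional dependency is not on the classpath: report and skip it.
            onError.accept(e);
            return Optional.empty();
        }
    }
}
```

A caller that wants the silent behaviour can pass a handler that does nothing, while a deployment that cares about missing jars can pass one that logs the error.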