Re: 1.0 RC in next 2 weeks
Hey Kevin,

At one point we talked about a 0.9.1 type of point release, but decided upon 1.0 instead:

http://s.apache.org/Jvs

That said, I'm happy whenever the dev community of Tika is ready to cut a release, and will gladly RC it. It's one of my favorite things to do!

Cheers,
Chris

On Sep 15, 2011, at 3:32 PM, Kevin Clark wrote:

> In light of the recent file handle bug (via parseToString) would it be
> possible to get a point release in the meantime?

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: 1.0 RC in next 2 weeks
In light of the recent file handle bug (via parseToString), would it be possible to get a point release in the meantime?

On Thu, Sep 15, 2011 at 3:06 PM, Mattmann, Chris A (388J) wrote:
> Hi there Jan,
>
> I was hoping to have time to spin it up by now, but haven't yet. Plus there's
> a ton of bug fixes and development going on, so I'm going to let things
> settle a bit.
>
> Jukka and I have been communicating with Sally and press@ and we hope to have
> a 1.0 RC and release in time for ApacheCon NA 2011 in November, so I think
> that's the target. Would be great to get it before then though!
>
> Cheers,
> Chris
Re: 1.0 RC in next 2 weeks
Hi there Jan,

I was hoping to have time to spin it up by now, but haven't yet. Plus there's a ton of bug fixes and development going on, so I'm going to let things settle a bit.

Jukka and I have been communicating with Sally and press@ and we hope to have a 1.0 RC and release in time for ApacheCon NA 2011 in November, so I think that's the target. Would be great to get it before then though!

Cheers,
Chris

On Sep 15, 2011, at 2:59 PM, Jan Høydahl wrote:

> Hi,
>
> I'm planning on upgrading Solr ExtractingHandler from Tika 0.8 to 0.9, for
> next release. Is 1.0 close enough that I should wait, e.g. within next 3-4
> weeks?
Re: 1.0 RC in next 2 weeks
Hi,

I'm planning on upgrading Solr ExtractingHandler from Tika 0.8 to 0.9 for the next release. Is 1.0 close enough that I should wait, e.g. within the next 3-4 weeks?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 3. aug. 2011, at 13:31, Jukka Zitting wrote:

> Hi,
>
> On Wed, Aug 3, 2011 at 1:32 AM, Mattmann, Chris A (388J) wrote:
>> If others have items on their 1.0 TODO list, please wrap them up in the next
>> week or so.
>> I'll try and cut the 1.0 RC-1 by next weekend if that works for everyone.
>
> Thanks for pushing this forward!
>
> I'm back from a long vacation now and should be able to help clean up
> things in preparation for the release. I think the main thing on my
> part would be getting rid of the deprecated stuff on the Parser
> interface and elsewhere, and of course come up with a migration
> document that outlines how to update code that still uses the old
> method signatures.
>
> BR,
>
> Jukka Zitting
[jira] [Commented] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105542#comment-13105542 ]

Ken Krugler commented on TIKA-715:
----------------------------------

Hi Mike - excellent work re adding these checks.

I think as part of this, all of the unit tests in the HTML parser should be modified to validate the resulting XHTML for compliance with XHTML 1.0.

I remember Andrzej (???) bringing up this issue a while ago, in the context of providing a general helper class that all parsers could use to generate the appropriate elements, and avoid emitting invalid tags/attributes. There was some discussion about this on the list, but I don't think anything came of it.

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-715.patch
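Ken's suggestion above, that parser unit tests validate their XHTML output, could be approximated with a small test helper. The following is only a sketch under stated assumptions: the class name `XhtmlChecks` and method `isWellFormed` are hypothetical (not part of Tika), the test is assumed to already have the parser output serialized as a string without a DOCTYPE, and the check covers XML well-formedness only, not full XHTML 1.0 validity.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical test helper; not a Tika class. */
final class XhtmlChecks {

    private XhtmlChecks() {
    }

    /**
     * Returns true if the given markup parses as well-formed XML.
     * This does not validate against the XHTML 1.0 DTD.
     */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler rethrows fatal errors, so malformed input
            // surfaces as a SAXException from parse().
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A unit test could then assert `XhtmlChecks.isWellFormed(output)` on the serialized result of each parser. Checking tag and attribute validity against XHTML 1.0 would need a DTD- or schema-based validator on top of this.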
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-715:
------------------------------------

    Component/s: parser

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-715.patch
[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105499#comment-13105499 ]

Michael McCandless commented on TIKA-683:
-----------------------------------------

I opened TIKA-715 for the mis-matched XHTML events.

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch,
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf,
> testUnicodeUCNControlWordCharacterDoubling.rtf,
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-715:
------------------------------------

    Attachment: TIKA-715.patch

Patch turning on the asserts in SafeContentHandler.java.

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-715.patch
[jira] [Created] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
Some parsers produce non-well-formed XHTML SAX events
-----------------------------------------------------

                 Key: TIKA-715
                 URL: https://issues.apache.org/jira/browse/TIKA-715
             Project: Tika
          Issue Type: Bug
            Reporter: Michael McCandless
             Fix For: 1.0


With TIKA-683 I committed simple, commented-out code to
SafeContentHandler, to verify that the SAX events produced by the
parser have valid (matched) tags. I.e., each startElement("foo") is
matched by the closing endElement("foo").

I only did a basic nesting test, plus checking that <p> is never
embedded inside another <p>; we could strengthen this further to check
that all tags only appear in valid parents...

I was able to use this to fix issues with the new RTF parser
(TIKA-683), but I was surprised that some other parsers failed the new
asserts.

It could be these are relatively minor offenses (e.g. closing a table
w/o closing the tr) and we need not do anything here... but I think
it'd be cleaner if all our parsers produced matched, well-formed XHTML
events.

I haven't looked into any of these... it could be they are easy to fix.

Failures:

{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
java.lang.AssertionError: end tag=body with no startElement
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)

testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
    at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
    at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)

testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec  <<< ERROR!
java.lang.AssertionError: p inside p
    at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
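The kind of check described in this issue (every startElement matched by an endElement, and no p nested inside p) can be illustrated with a stack of open element names. This is a minimal sketch, not the actual SafeContentHandler code: the class name `NestingCheckHandler` is hypothetical, the error messages merely mirror those in the failures above, and treating "p inside p" as "any open ancestor is a p" is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration of a nesting check; not Tika's implementation. */
class NestingCheckHandler extends DefaultHandler {

    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Assumption: a p may not appear anywhere below another open p.
        if ("p".equals(localName) && open.contains("p")) {
            throw new AssertionError("p inside p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new AssertionError(
                    "end tag=" + localName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new AssertionError(
                    "mismatched elements open=" + expected
                            + " close=" + localName);
        }
    }

    @Override
    public void endDocument() {
        // Anything still open at the end was never closed.
        if (!open.isEmpty()) {
            throw new AssertionError("unclosed elements: " + open);
        }
    }
}
```

Feeding this handler a startElement for table, then tr, then an endElement for table reproduces the "mismatched elements open=tr close=table" condition reported for the Keynote parser. Checking that tags only appear in valid parents, as the description suggests, would need a table of allowed parent/child pairs on top of the stack.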
[jira] [Resolved] (TIKA-683) RTF Parser issues with non european characters
[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-683.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0

I'll open a follow-on issue for the mis-matched XHTML events from some parsers

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch,
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf,
> testUnicodeUCNControlWordCharacterDoubling.rtf,
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105413#comment-13105413 ]

Michael McCandless commented on TIKA-683:
-----------------------------------------

Thanks Chris, I'll commit today!

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch,
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf,
> testUnicodeUCNControlWordCharacterDoubling.rtf,
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
[jira] [Resolved] (TIKA-666) Unable to extract content from RTF files
[ https://issues.apache.org/jira/browse/TIKA-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-666.
------------------------------------

    Resolution: Fixed

- closed per Mike's comment, and the fix for TIKA-683

> Unable to extract content from RTF files
> ----------------------------------------
>
>                 Key: TIKA-666
>                 URL: https://issues.apache.org/jira/browse/TIKA-666
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8, 0.9
>         Environment: Windows 32 bit OS, JDK 1.6.19
>            Reporter: samraj
>              Labels: RTF
>         Attachments: Redline.rtf
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi,
> I have tried to extract the text content from various RTF documents,
> using many different techniques, and it failed every time. I have
> attached the RTF document here.
[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters
[ https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105362#comment-13105362 ]

Chris A. Mattmann commented on TIKA-683:
----------------------------------------

Hey Mike, +1 to commit, go for it!

> RTF Parser issues with non european characters
> ----------------------------------------------
>
>                 Key: TIKA-683
>                 URL: https://issues.apache.org/jira/browse/TIKA-683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Michael McCandless
>
>         Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch,
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf,
> testUnicodeUCNControlWordCharacterDoubling.rtf,
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
[jira] [Commented] (TIKA-666) Unable to extract content from RTF files
[ https://issues.apache.org/jira/browse/TIKA-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105310#comment-13105310 ]

Michael McCandless commented on TIKA-666:
-----------------------------------------

It looks like TIKA-683 fixes this issue, or at least I'm able to extract text for Redline.rtf.

> Unable to extract content from RTF files
> ----------------------------------------
>
>                 Key: TIKA-666
>                 URL: https://issues.apache.org/jira/browse/TIKA-666
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8, 0.9
>         Environment: Windows 32 bit OS, JDK 1.6.19
>            Reporter: samraj
>              Labels: RTF
>         Attachments: Redline.rtf
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h