[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-715:
-----------------------------
    Fix Version/s:     (was: 2.0.0)
                   2.0.0-BETA

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>         Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented-out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags. I.e., each startElement("foo") is
> matched by the closing endElement("foo").
> I only did a basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (e.g. closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>     at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>     at
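
The check described in the issue (each startElement matched by its endElement, and no <p> inside another <p>) can be sketched with a stack of open element names. This is only an illustration under stated assumptions: the class name TagBalanceChecker is hypothetical, it is not the code that was committed to SafeContentHandler, it throws SAXException where the original uses asserts, and "embedded inside another <p>" is read here as "any open <p> ancestor". The two error messages are copied from the stack traces above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker for matched SAX events; not Tika's SafeContentHandler. */
public class TagBalanceChecker extends DefaultHandler {

    // Names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Assumption: a <p> may not appear under any open <p> ancestor.
        if ("p".equals(localName) && open.contains("p")) {
            throw new SAXException("nested p elements");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + localName + " with no startElement");
        }
        // The innermost open element must be the one being closed.
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new SAXException(
                    "mismatched elements open=" + expected + " close=" + localName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed elements at endDocument: " + open);
        }
    }

    public static void main(String[] args) throws SAXException {
        TagBalanceChecker checker = new TagBalanceChecker();
        Attributes none = new AttributesImpl();
        // Reproduce the Keynote-style failure: close the table while tr is still open.
        checker.startElement("", "table", "table", none);
        checker.startElement("", "tr", "tr", none);
        try {
            checker.endElement("", "table", "table");
        } catch (SAXException e) {
            System.out.println(e.getMessage()); // mismatched elements open=tr close=table
        }
    }
}
```

A parser test could wrap its ContentHandler in a checker like this, so that unbalanced events fail at the offending endElement call rather than surfacing later as malformed output.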
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-715:
-----------------------------
    Fix Version/s:     (was: 2.0.0)
                   2.0.1

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
>         Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
-----------------------------------
    Fix Version/s:     (was: 1.15)
                   1.16

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>              Labels: newbie
>             Fix For: 1.16
>
>         Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented-out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags. I.e., each startElement("foo") is
> matched by the closing endElement("foo").
> I only did a basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (e.g. closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>     at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>     at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
-----------------------------------
    Fix Version/s:     (was: 1.14)
                   1.15

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>              Labels: newbie
>             Fix For: 1.15
>
>         Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
-----------------------------------
    Fix Version/s:     (was: 1.12)
                   1.13

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>              Labels: newbie
>             Fix For: 1.13
>
>         Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
-----------------------------------
    Fix Version/s:     (was: 1.11)
                   1.12

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>              Labels: newbie
>             Fix For: 1.12
>
>         Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
     [ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-715:
-----------------------------
    Fix Version/s:     (was: 1.10)
                   1.11

* Pushed to 1.11 following 1.10 release

Some parsers produce non-well-formed XHTML SAX events
-----------------------------------------------------

                Key: TIKA-715
                URL: https://issues.apache.org/jira/browse/TIKA-715
            Project: Tika
         Issue Type: Bug
         Components: parser
   Affects Versions: 0.10
           Reporter: Michael McCandless
             Labels: newbie
            Fix For: 1.11

        Attachments: TIKA-715.patch


With TIKA-683 I committed simple, commented-out code to
SafeContentHandler, to verify that the SAX events produced by the
parser have valid (matched) tags. I.e., each startElement("foo") is
matched by the closing endElement("foo").
I only did a basic nesting test, plus checking that <p> is never
embedded inside another <p>; we could strengthen this further to check
that all tags only appear in valid parents...
I was able to use this to fix issues with the new RTF parser
(TIKA-683), but I was surprised that some other parsers failed the new
asserts.
It could be these are relatively minor offenses (e.g. closing a table
w/o closing the tr) and we need not do anything here... but I think
it'd be cleaner if all our parsers produced matched, well-formed XHTML
events.
I haven't looked into any of these... it could be they are easy to fix.
Failures:
{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
java.lang.AssertionError: end tag=body with no startElement
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
    at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
    at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec  <<< ERROR!
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-715:
    Labels: newbie (was: )

Some parsers produce non-well-formed XHTML SAX events

    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Labels: newbie
    Fix For: 1.8
    Attachments: TIKA-715.patch

With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. I.e., each startElement("foo") is matched by the closing endElement("foo").

I only did a basic nesting test, plus checking that <p> is never embedded inside another <p>; we could strengthen this further to check that all tags only appear in valid parents...

I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts.

It could be these are relatively minor offenses (e.g. closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events.

I haven't looked into any of these... it could be they are easy to fix.

Failures:

{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec <<< ERROR!
java.lang.AssertionError: end tag=body with no startElement
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec <<< ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
    at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
    at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec <<< ERROR!
java.lang.AssertionError: p inside p
    at
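
For readers following the thread, the kind of check the issue describes (every startElement matched by its endElement, and no p nested inside another p) might look roughly like the sketch below. This is not the code from SafeContentHandler or from TIKA-715.patch: the class name MatchedTagChecker is hypothetical, it throws SAXException where the real code uses asserts, and treating "p inside p" as "any open ancestor is a p" is an assumption. The error messages are modeled on the ones in the failures quoted above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration only; not Tika's SafeContentHandler. */
class MatchedTagChecker extends DefaultHandler {

    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    // Fall back to the qualified name when the producer leaves localName empty.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String tag = name(localName, qName);
        // Assumption: any open <p> ancestor counts as "p inside p".
        if ("p".equals(tag) && open.contains("p")) {
            throw new SAXException("p inside p");
        }
        open.push(tag);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String tag = name(localName, qName);
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + tag + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(tag)) {
            throw new SAXException("mismatched elements open=" + expected + " close=" + tag);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        // Anything still on the stack was opened but never closed.
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element " + open.peek());
        }
    }
}
```

Feeding it startElement for table and tr followed by endElement for table raises the "mismatched elements open=tr close=table" case, which is the shape of the Keynote failure in the trace. A stack is enough here because well-formedness only requires that each close matches the most recently opened element.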
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Meikle updated TIKA-715:
    Fix Version/s: (was: 1.5) 1.6

Pushed out to 1.6, preparing for 1.5 RC

Some parsers produce non-well-formed XHTML SAX events

    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Fix For: 1.6
    Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
    Fix Version/s: (was: 1.4) 1.5

- push to 1.5, get ready for 1.4 RC #1.

Some parsers produce non-well-formed XHTML SAX events

    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Fix For: 1.5
    Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
    Fix Version/s: (was: 1.2) 1.3

- push to 1.3

Some parsers produce non-well-formed XHTML SAX events

    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Fix For: 1.3
    Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
    Fix Version/s: (was: 1.1) 1.2

- push out to 1.2

Some parsers produce non-well-formed XHTML SAX events

    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Fix For: 1.2
    Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
    Fix Version/s: (was: 1.0) 1.1

- push out to 1.1: prep for 1.0.

Some parsers produce non-well-formed XHTML SAX events

    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Fix For: 1.1
    Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
    Affects Version/s: 0.10
    Fix Version/s: (was: 0.10) 1.0

- pushing out: rolling 0.10 RC today.

Some parsers produce non-well-formed XHTML SAX events

    Key: TIKA-715
    URL: https://issues.apache.org/jira/browse/TIKA-715
    Project: Tika
    Issue Type: Bug
    Components: parser
    Affects Versions: 0.10
    Reporter: Michael McCandless
    Fix For: 1.0
    Attachments: TIKA-715.patch
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-715: Attachment: TIKA-715.patch Patch turning on the asserts in SafeContentHandler.java. Some parsers produce non-well-formed XHTML SAX events - Key: TIKA-715 URL: https://issues.apache.org/jira/browse/TIKA-715 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.0 Attachments: TIKA-715.patch With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. Ie, each startElement(foo) is matched by the closing endElement(foo). I only did basic nesting test, plus checking that p is never embedded inside another p; we could strengthen this further to check that all tags only appear in valid parents... I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be these are relatively minor offenses (eg closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix. Failures: {noformat} testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR! 
java.lang.AssertionError: end tag=body with no startElement
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
    at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
    at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
    at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
    at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec ERROR!
java.lang.AssertionError: p inside p
    at
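
For readers who want a concrete picture of the check described in the issue (every startElement matched by its endElement, and no p embedded inside another p), here is a minimal sketch in Java. It is not the code in Tika's SafeContentHandler; the class name MatchedTagChecker, the use of IllegalStateException instead of asserts, the endDocument check and the choice to reject a nested p at any depth are all assumptions made for illustration. The messages simply mirror the ones in the failures above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not Tika's SafeContentHandler.
class MatchedTagChecker extends DefaultHandler {

    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: a p is rejected if any ancestor is a p, not only the direct parent.
        if ("p".equals(localName) && open.contains("p")) {
            throw new IllegalStateException("p inside p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end tag=" + localName + " with no startElement");
        }
        String innermost = open.pop();
        if (!innermost.equals(localName)) {
            throw new IllegalStateException(
                    "mismatched elements open=" + innermost + " close=" + localName);
        }
    }

    @Override
    public void endDocument() {
        // Assumption: anything still open at the end of the document is also an error.
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed element=" + open.peek());
        }
    }

    public static void main(String[] args) {
        MatchedTagChecker checker = new MatchedTagChecker();
        Attributes none = new AttributesImpl();
        checker.startElement("", "table", "table", none);
        checker.startElement("", "tr", "tr", none);
        try {
            // Closing the table while the tr is still open, as in the Keynote failure.
            checker.endElement("", "table", "table");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // mismatched elements open=tr close=table
        }
    }
}
```

A stack is enough here because well-formedness only requires that elements close in the reverse order they were opened; checking that tags appear only in valid parents, as the issue suggests, would need extra rules on top of this.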
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-715:

Component/s: parser