[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342526#comment-14342526 ]
Tyler Palsulich commented on TIKA-715:
--------------------------------------

List of parser tests that fail after applying the patch:
{code}
AutoDetectParserTest.testKeynote:164->assertAutoDetect:148->assertAutoDetect:132->assertAutoDetect:99 null
AutoDetectParserTest.testPages:169->assertAutoDetect:148->assertAutoDetect:132->assertAutoDetect:99 mismatched elements open=div close=body
AutoDetectParserTest.testZipBombPrevention:271 mismatched elements open=p close=div
iBooksParserTest.testiBooksParser:40 mismatched elements open=title close=head
IWorkParserTest.testKeynoteBulletPoints:115 null
IWorkParserTest.testKeynoteMasterSlideTable:140 mismatched elements open=tr close=table
IWorkParserTest.testKeynoteTables:127 null
IWorkParserTest.testKeynoteTextBoxes:103 null
IWorkParserTest.testPagesLayoutMode:204 mismatched elements open=div close=body
IWorkParserTest.testParseKeynote:57 null
IWorkParserTest.testParsePages:154 mismatched elements open=div close=body
IWorkParserTest.testParsePagesHeadersAlphaLower:406 mismatched elements open=p close=div
IWorkParserTest.testParsePagesHeadersAlphaUpper:385 mismatched elements open=p close=div
IWorkParserTest.testParsePagesHeadersFootersFootnotes:316 mismatched elements open=p close=div
IWorkParserTest.testParsePagesHeadersFootersRomanLower:364 mismatched elements open=p close=div
IWorkParserTest.testParsePagesHeadersFootersRomanUpper:343 mismatched elements open=p close=div
RFC822ParserTest.testEncryptedZipAttachment:277 null
RFC822ParserTest.testMultipart:93 null
RFC822ParserTest.testNormalZipAttachment:332 null
RFC822ParserTest.testUnusualFromAddress:197 null
MboxParserTest.testComplex:150 null
ExcelParserTest.testExcel95:380 end tag=body with no startElement
WordParserTest.testControlCharacter:383->TikaTest.getXML:114->TikaTest.getXML:123 mismatched elements open=a close=b
OOXMLParserTest.testTextInsideTextBox:971->TikaTest.getXML:114->TikaTest.getXML:123 null
ODFParserTest.testFromFile:342 null
ODFParserTest.testOO3:58 null
ODFParserTest.testOO3Metadata:218 null
{code}

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
>                 Key: TIKA-715
>                 URL: https://issues.apache.org/jira/browse/TIKA-715
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Michael McCandless
>             Fix For: 1.8
>
>         Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags. Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>     at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>     at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
>     at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:203)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:267)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:271)
>     at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:128)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
>     at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:77)
>     at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:101)
>     at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:72)
>     at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:75)
> testUnusualFromAddress(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.037 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
>     at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:203)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:267)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:283)
>     at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:201)
>     at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
>     at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
>     at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
>     at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
>     at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:472)
>     at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:202)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
>     at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:77)
>     at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:101)
>     at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:72)
>     at org.apache.tika.parser.mail.RFC822ParserTest.testUnusualFromAddress(RFC822ParserTest.java:166)
> testOO3(org.apache.tika.parser.odf.ODFParserTest)  Time elapsed: 0.003 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
>     at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ElementMappingContentHandler.startElement(ElementMappingContentHandler.java:54)
>     at org.apache.tika.parser.odf.OpenDocumentContentParser$1.startElement(OpenDocumentContentParser.java:271)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:68)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2755)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.odf.OpenDocumentContentParser.parse(OpenDocumentContentParser.java:335)
>     at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:125)
>     at org.apache.tika.parser.odf.ODFParserTest.testOO3(ODFParserTest.java:49)
> testOO3Metadata(org.apache.tika.parser.odf.ODFParserTest)  Time elapsed: 0.001 sec  <<< ERROR!
> java.lang.AssertionError: p inside p
>     at org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ElementMappingContentHandler.startElement(ElementMappingContentHandler.java:54)
>     at org.apache.tika.parser.odf.OpenDocumentContentParser$1.startElement(OpenDocumentContentParser.java:271)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:68)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2755)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.odf.OpenDocumentContentParser.parse(OpenDocumentContentParser.java:335)
>     at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:125)
>     at org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:48)
>     at org.apache.tika.parser.odf.ODFParserTest.testOO3Metadata(ODFParserTest.java:168)
> testZipBombPrevention(org.apache.tika.parser.AutoDetectParserTest)  Time elapsed: 0.055 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=p close=div
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:106)
>     at org.apache.tika.parser.pkg.PackageExtractor.unpack(PackageExtractor.java:167)
>     at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:107)
>     at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:61)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:95)
>     at org.apache.tika.parser.pkg.PackageExtractor.decompress(PackageExtractor.java:135)
>     at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:93)
>     at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:61)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
>     at org.apache.tika.parser.AutoDetectParserTest.testZipBombPrevention(AutoDetectParserTest.java:224)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
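
For reference, the check described in the issue (each startElement matched by an endElement of the same name, and no <p> embedded inside another <p>) can be sketched with a stack of open element names. This is only a minimal sketch and not the code in SafeContentHandler: the class name TagBalanceChecker, the use of SAXException rather than assertions, and the extra endDocument check are assumptions made for illustration. The first three messages are copied from the failures quoted above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker for illustration; not Tika's SafeContentHandler. */
public class TagBalanceChecker extends DefaultHandler {

    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Reject a <p> that starts while any enclosing <p> is still open.
        if ("p".equals(localName) && open.contains("p")) {
            throw new SAXException("p inside p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + localName + " with no startElement");
        }
        // The element being closed must be the innermost open one.
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new SAXException(
                    "mismatched elements open=" + expected + " close=" + localName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        // Assumption: also flag elements left open at the end (message is made up here).
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element=" + open.peek());
        }
    }

    public static void main(String[] args) throws SAXException {
        TagBalanceChecker checker = new TagBalanceChecker();
        Attributes none = new AttributesImpl();
        checker.startDocument();
        checker.startElement("", "table", "table", none);
        checker.startElement("", "tr", "tr", none);
        try {
            checker.endElement("", "table", "table"); // tr is still open
        } catch (SAXException e) {
            System.out.println(e.getMessage()); // mismatched elements open=tr close=table
        }
    }
}
```

The main method replays the "closing a table w/o closing the tr" case mentioned in the description. Checking that tags only appear in valid parents, as the description suggests, would need a table of allowed parent/child pairs on top of this stack.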