[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-715:
-
Fix Version/s: (was: 2.0.0)
   2.0.0-BETA

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Michael McCandless
> Priority: Major
> Labels: newbie
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> 
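
For illustration, the matched-tag check described in the issue could look roughly like the sketch below. This is not the code committed under TIKA-683: `MatchedTagChecker` is a hypothetical class built only on the JDK's `org.xml.sax` API, it assumes `localName` is populated on every event, and its messages are merely modeled on the assertion text in the traces above. It keeps the open elements on a stack, rejects a `<p>` opened inside another `<p>`, and rejects an end tag that does not match the innermost open element.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not Tika's SafeContentHandler.
class MatchedTagChecker extends DefaultHandler {
    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The only parent/child rule checked here: no <p> inside another <p>.
        if ("p".equals(localName) && open.contains("p")) {
            throw new AssertionError("p nested inside another p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + localName + " with no startElement");
        }
        String innermost = open.pop();
        if (!innermost.equals(localName)) {
            throw new AssertionError(
                    "mismatched elements open=" + innermost + " close=" + localName);
        }
    }

    @Override
    public void endDocument() {
        // Anything still on the stack was opened but never closed.
        if (!open.isEmpty()) {
            throw new AssertionError("unclosed elements at endDocument: " + open);
        }
    }

    public static void main(String[] args) {
        MatchedTagChecker checker = new MatchedTagChecker();
        Attributes none = new AttributesImpl();
        // Same shape as the Keynote failure: the table is closed while tr is still open.
        checker.startElement("", "table", "table", none);
        checker.startElement("", "tr", "tr", none);
        try {
            checker.endElement("", "table", "table");
        } catch (AssertionError e) {
            System.out.println(e.getMessage()); // prints: mismatched elements open=tr close=table
        }
    }
}
```

Wrapping a parser's output handler in a checker of this kind during tests would surface the same class of failures listed in the issue; whether to throw or to auto-close the dangling element is a separate design decision that the issue leaves open.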

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2021-07-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-715:
-
Fix Version/s: (was: 2.0.0)
   2.0.1

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Michael McCandless
> Priority: Major
> Labels: newbie
> Fix For: 1.17, 2.0.0-BETA, 2.0.1
>
> Attachments: TIKA-715.patch
>
>

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.15)
   1.16

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Michael McCandless
> Labels: newbie
> Fix For: 1.16
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> 

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2016-10-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.14)
   1.15

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Michael McCandless
> Labels: newbie
> Fix For: 1.15
>
> Attachments: TIKA-715.patch
>
>

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2016-01-24 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.12)
   1.13

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Michael McCandless
> Labels: newbie
> Fix For: 1.13
>
> Attachments: TIKA-715.patch
>
>

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.11)
   1.12

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Michael McCandless
> Labels: newbie
> Fix For: 1.12
>
> Attachments: TIKA-715.patch
>
>

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-715:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
  Labels: newbie
 Fix For: 1.11

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  Ie, each startElement("foo") is
 matched by the closing endElement("foo").
 I only did basic nesting test, plus checking that <p> is never
 embedded inside another <p>; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec  <<< ERROR!
 

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2015-03-01 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-715:
-
Labels: newbie  (was: )

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
  Labels: newbie
 Fix For: 1.8

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags, i.e., each startElement("foo") is
 matched by the closing endElement("foo").
 I only did a basic nesting test, plus checking that <p> is never
 embedded inside another <p>; we could strengthen this further to check
 that all tags appear only in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (e.g. closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
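The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in Tika's SafeContentHandler: the class name TagBalanceChecker and the helper firstError are hypothetical, it throws SAXException where the issue talks about asserts, and it assumes that "embedded inside another p" means any open ancestor p rather than only the direct parent. The messages are modelled on the failures quoted in this issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker for matched start/end SAX events (not Tika code). */
class TagBalanceChecker extends DefaultHandler {
    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Assumption: a p may not appear under any open p, direct parent or not.
        if ("p".equals(localName) && open.contains("p")) {
            throw new SAXException("p inside p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + localName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new SAXException(
                    "mismatched elements open=" + expected + " close=" + localName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element at endDocument: " + open.peek());
        }
    }

    /**
     * Replays events written as "+name" (start) or "-name" (end) and
     * returns the first error message, or null if the stream is balanced.
     */
    static String firstError(String... events) {
        TagBalanceChecker checker = new TagBalanceChecker();
        try {
            checker.startDocument();
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    checker.startElement("", name, name, new AttributesImpl());
                } else {
                    checker.endElement("", name, name);
                }
            }
            checker.endDocument();
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }
}
```

For example, TagBalanceChecker.firstError("+table", "+tr", "-table") reports the same kind of tr/table mismatch as the Keynote failure, while a balanced sequence such as "+html", "+body", "-body", "-html" returns null. The events are replayed by hand because a conforming XML parser would never emit unbalanced events itself; the problem only shows up when a parser generates SAX events directly.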
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec  <<< ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec  <<< ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec  <<< ERROR!
 java.lang.AssertionError: p inside p
   at 
 

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2014-02-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-715:
-

Fix Version/s: (was: 1.5)
   1.6

Pushed out to 1.6, preparing for 1.5 RC

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.6

 Attachments: TIKA-715.patch



[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2013-05-27 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.4)
   1.5

- push to 1.5, get ready for 1.4 RC #1.

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.5

 Attachments: TIKA-715.patch



[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.3

 Attachments: TIKA-715.patch



[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2012-03-07 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.1)
   1.2

- push out to 1.2

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.2

 Attachments: TIKA-715.patch



[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.1

 Attachments: TIKA-715.patch



[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-09-25 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Affects Version/s: 0.10
Fix Version/s: (was: 0.10)
   1.0

- pushing out: rolling 0.10 RC today.

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.0

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags, i.e., each startElement("foo") is
 matched by the closing endElement("foo").
 I only did a basic nesting test, plus checking that <p> is never
 embedded inside another <p>; we could strengthen this further to check
 that all tags appear only in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (e.g. closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec  <<< ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec   

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-09-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-715:


Attachment: TIKA-715.patch

Patch turning on the asserts in SafeContentHandler.java.

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.0

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented-out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags, i.e., each startElement("foo") is
 matched by the closing endElement("foo").
 I only did a basic nesting test, plus checking that <p> is never
 embedded inside another <p>; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec  <<< ERROR!
 java.lang.AssertionError: p inside p
   at

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-09-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-715:


Component/s: parser
