[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870573#comment-13870573 ]

Hong-Thai Nguyen commented on TIKA-1215:
----------------------------------------

Great catch. Thanks, [~jukkaz].

Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
------------------------------------------------------------------------------

Key: TIKA-1215
URL: https://issues.apache.org/jira/browse/TIKA-1215
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, tika-1215-without-wildcard.patch

With attached file, 1.5 raises this exception on parsing. This file has no problem on 1.4
{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
    ... 15 more
{code}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869528#comment-13869528 ]

Tim Allison commented on TIKA-1215:
-----------------------------------

[~thaichat04] thank you for sending a clean patch. This area of the code base is not very familiar to me, but if I understand Tika's history and your code correctly, your if statement wasn't necessary in 1.4, and (based on a very quick look) it looks like nothing else in the relevant lines of the MP3 parser changed between 1.4 and trunk. Are you able to determine what changed between 1.4 and trunk that led to this regression? Thank you!
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869590#comment-13869590 ]

Hong-Thai Nguyen commented on TIKA-1215:
----------------------------------------

[~talli...@apache.org], here is the XML for the input being parsed:
{noformat}
<h1 xmlns="http://www.w3.org/1999/xhtml">Matin Première - Tour des régions 080806</h1>
<p>RTBF - La Première</p>
<p>Speech</p>
<p>101698.914</p>
<p>XXX - A propos du contrat de quartier rues Dublin/Dubreucq</p>
{noformat}
I think this regression came from TIKA-1070:
{code}
currentElement = currentElement.parent;
{code}
The parent element of p is null, so getPrefix() raises the exception. That behaviour is different from 1.4.
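To make the failure mode described above easier to picture, here is a minimal sketch of scoped prefix lookup. It is not Tika's source: the class name NamespaceScopeSketch, the Scope type and the resolve() helper are hypothetical, and the lookup order is only assumed to resemble what ToXMLContentHandler does. Each open element keeps the mappings declared on it plus a link to its parent, and a lookup walks up that chain. An element that has neither its own declaration nor a parent cannot resolve the XHTML namespace, which gives the same message as in the stack trace.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;

// Hypothetical, simplified stand-in for per-element namespace bookkeeping.
class NamespaceScopeSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** One open element: the mappings declared on it and a link to its parent. */
    static final class Scope {
        final Scope parent;
        final Map<String, String> prefixesByUri;

        Scope(Scope parent, Map<String, String> prefixesByUri) {
            this.parent = parent;
            this.prefixesByUri = new HashMap<String, String>(prefixesByUri);
        }

        /** Walks up the parent chain until some scope declares the URI. */
        String getPrefix(String uri) throws SAXException {
            String prefix = prefixesByUri.get(uri);
            if (prefix != null) {
                return prefix;
            }
            if (parent != null) {
                return parent.getPrefix(uri);
            }
            if (uri == null || uri.isEmpty()) {
                return ""; // element is in no namespace, nothing to resolve
            }
            throw new SAXException("Namespace " + uri + " not declared");
        }
    }

    /** Returns the resolved prefix, or the error message if the lookup fails. */
    static String resolve(Scope scope, String uri) {
        try {
            return "prefix='" + scope.getPrefix(uri) + "'";
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        Map<String, String> declared = new HashMap<String, String>();
        declared.put(XHTML, ""); // default namespace, as on the h1 above
        Map<String, String> none = new HashMap<String, String>();

        Scope h1 = new Scope(null, declared);   // carries its own xmlns
        Scope nested = new Scope(h1, none);     // inherits from its parent
        Scope siblingP = new Scope(null, none); // no declaration, no parent

        System.out.println("h1:        " + resolve(h1, XHTML));
        System.out.println("nested:    " + resolve(nested, XHTML));
        System.out.println("sibling p: " + resolve(siblingP, XHTML));
    }
}
```

In this sketch only the sibling p fails, because nothing in scope declares the namespace once the h1 scope has been left. Whether the right fix is to declare the mapping earlier or to tolerate the missing declaration is a question for the patch review.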
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868868#comment-13868868 ]

Nick Burch commented on TIKA-1215:
----------------------------------

Are you able to reproduce the problem with a smaller MP3 than the one in your patch?

Also, your patch is a bit hard to review, as most of it is whitespace changes. If there is inconsistent whitespace in a file that needs fixing, it's normally better to post separate patches for the whitespace part and the bug fix part. That makes it easier to see what changed where, and so focuses the review on the important parts.