[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1215:
---
Attachment: tika-1215-without-wildcard.patch

[~gagravarr], my code style differs from the Apache convention; apologies for that. I have attached a new patch file that contains only the actual changes.

Thanks

Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
--
Key: TIKA-1215
URL: https://issues.apache.org/jira/browse/TIKA-1215
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
Attachments: Centres 080805@0650 RTBF Matin Première - A propos des rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, tika-1215-without-wildcard.patch

With the attached file, 1.5 raises this exception on parsing. The same file parses without problems on 1.4.

{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
    ... 15 more
{code}

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869528#comment-13869528 ]

Tim Allison commented on TIKA-1215:
---
[~thaichat04] thank you for sending a clean patch. I am not very familiar with this area of the code base, but if I understand Tika's history and your code correctly, your if statement wasn't necessary in 1.4, and (based on a very quick look) nothing else in the relevant lines of the MP3 parser seems to have changed between 1.4 and trunk. Are you able to determine what changed between 1.4 and trunk that led to this regression?

Thank you!
[jira] [Closed] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files
[ https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumeet Gorab closed TIKA-1216.
--
Resolution: Fixed

parse method of Mp3Parser doesn't work for few mp3 files
Key: TIKA-1216
URL: https://issues.apache.org/jira/browse/TIKA-1216
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: Windows 7 Ultimate 32-bit OS, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
Labels: patch
Fix For: 1.5
Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3

When trying to parse the attached MP3 file, the parse method of the Mp3Parser class is not able to parse it. The parse method does not complete its execution, so there is some issue in that method.
[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869590#comment-13869590 ]

Hong-Thai Nguyen commented on TIKA-1215:

[~talli...@apache.org], here is the XML of the input to parse:
{noformat}
<h1 xmlns="http://www.w3.org/1999/xhtml">Matin Première - Tour des régions 080806</h1>
<p>RTBF - La Première</p>
<p>Speech</p>
<p>101698.914</p>
<p>XXX - A propos du contrat de quartier rues Dublin/Dubreucq</p>
{noformat}
I think this regression came from TIKA-1070:
{code}
currentElement = currentElement.parent;
{code}
The parent element of p is null, so getPrefix() raises the exception. That behaviour differs from 1.4.
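Editor's sketch: the failure mode described in the comment above can be modelled in a few lines of plain Java. This is a simplified toy, not Tika's source; the class and method names (NamespaceScopeDemo, ElementInfo, replay) are assumptions chosen to echo the stack trace. It assumes a serializer that attaches pending namespace declarations to the next element and resolves prefixes by walking up the chain of open elements. When the enclosing root element never arrives, the declaration lands on the first sibling only and the second sibling cannot resolve it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy serializer that resolves namespace prefixes through the chain of open elements. */
final class NamespaceScopeDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** One open element: its parent and the namespaces declared on it. */
    static final class ElementInfo {
        final ElementInfo parent;
        final Map<String, String> namespaces;

        ElementInfo(ElementInfo parent, Map<String, String> namespaces) {
            this.parent = parent;
            this.namespaces = new HashMap<>(namespaces);
        }

        /** Looks for the URI on this element, then on its ancestors. */
        String getPrefix(String uri) {
            String prefix = namespaces.get(uri);
            if (prefix != null) {
                return prefix;
            }
            if (parent != null) {
                return parent.getPrefix(uri);
            }
            throw new IllegalStateException("Namespace " + uri + " not declared");
        }
    }

    private final Map<String, String> pending = new HashMap<>();
    private ElementInfo current;
    final List<String> output = new ArrayList<>();

    void startPrefixMapping(String prefix, String uri) {
        pending.put(uri, prefix);
    }

    void startElement(String uri, String localName) {
        // Pending declarations belong to the element that is being opened.
        current = new ElementInfo(current, pending);
        pending.clear();
        String prefix = current.getPrefix(uri);
        output.add(prefix.isEmpty() ? localName : prefix + ":" + localName);
    }

    void endElement() {
        current = current.parent;
    }

    /** Replays h1 followed by p, with or without an enclosing root element. */
    static String replay(boolean withRoot) {
        NamespaceScopeDemo handler = new NamespaceScopeDemo();
        handler.startPrefixMapping("", XHTML);
        try {
            if (withRoot) {
                handler.startElement(XHTML, "html");
            }
            handler.startElement(XHTML, "h1");
            handler.endElement();
            handler.startElement(XHTML, "p");
            handler.endElement();
            if (withRoot) {
                handler.endElement();
            }
            return String.join(",", handler.output);
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay(true));   // html,h1,p
        System.out.println(replay(false));  // Namespace http://www.w3.org/1999/xhtml not declared
    }
}
```

With the root element present, every sibling inherits the declaration from its parent. Without it, the declaration goes out of scope as soon as h1 closes, which matches the exception message in the report.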
[jira] [Updated] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception
[ https://issues.apache.org/jira/browse/TIKA-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumeet Gorab updated TIKA-1218:
---
Attachment: Save-the-World-Knife-Party-Remix.mp3

Getting exception using attached file.

Unable to parse a mp3 file on 1.5 getting a exception
-
Key: TIKA-1218
URL: https://issues.apache.org/jira/browse/TIKA-1218
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Environment: Win 7, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
Attachments: Save-the-World-Knife-Party-Remix.mp3

Unable to parse a mp3 file on 1.5 getting following exception:

Exception in thread "main" java.lang.NegativeArraySizeException
    at org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:417)
    at org.apache.tika.parser.mp3.ID3v2Frame$RawTag.<init>(ID3v2Frame.java:382)
    at org.apache.tika.parser.mp3.ID3v2Frame$RawTagIterator.next(ID3v2Frame.java:371)
    at org.apache.tika.parser.mp3.ID3v24Handler.<init>(ID3v24Handler.java:49)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:174)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
[jira] [Created] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception
Sumeet Gorab created TIKA-1218:
--
Summary: Unable to parse a mp3 file on 1.5 getting a exception
Key: TIKA-1218
URL: https://issues.apache.org/jira/browse/TIKA-1218
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Environment: Win 7, Java 1.7
Reporter: Sumeet Gorab
Priority: Blocker
[jira] [Commented] (TIKA-1218) Unable to parse a mp3 file on 1.5 getting a exception
[ https://issues.apache.org/jira/browse/TIKA-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869662#comment-13869662 ]

Jukka Zitting commented on TIKA-1218:
-
Reproduced. It looks like the last frame that Tika can interpret is a TRAKTOR4 PRIV frame at offset 114 with size of 335387 bytes. It could be that this frame is malformed (wrong size, etc.), or there might be a bug in the way Tika handles the frame.

The tooling at http://dope.cz/code/ might be helpful in debugging this case.
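Editor's sketch: whichever explanation turns out to be right, a tag reader can avoid a NegativeArraySizeException by validating a declared frame size before allocating a buffer for it. The snippet below is a hedged illustration, not Tika's code; FrameSizeCheck and its methods are hypothetical names. It also shows why a size field is easy to misread: ID3v2.4 stores sizes as synchsafe integers (7 significant bits per byte), so the same four bytes give different values depending on the interpretation. Which interpretation applies to the attached file is not established here.

```java
/** Hypothetical helpers for decoding and validating an ID3v2 frame size. */
final class FrameSizeCheck {

    /** Decodes a 4-byte synchsafe integer: 7 significant bits per byte. */
    static int synchsafe(byte[] b, int offset) {
        return ((b[offset] & 0x7F) << 21)
                | ((b[offset + 1] & 0x7F) << 14)
                | ((b[offset + 2] & 0x7F) << 7)
                | (b[offset + 3] & 0x7F);
    }

    /** Decodes a plain big-endian 32-bit integer; the result can be negative. */
    static int plain(byte[] b, int offset) {
        return ((b[offset] & 0xFF) << 24)
                | ((b[offset + 1] & 0xFF) << 16)
                | ((b[offset + 2] & 0xFF) << 8)
                | (b[offset + 3] & 0xFF);
    }

    /** Returns the size if a frame body of that size fits into the remaining bytes, else -1. */
    static int checkedSize(int declaredSize, int remaining) {
        return (declaredSize < 0 || declaredSize > remaining) ? -1 : declaredSize;
    }

    public static void main(String[] args) {
        // 335387 (the size quoted in the comment) written as four big-endian bytes.
        byte[] size = {0x00, 0x05, 0x1E, 0x1B};
        System.out.println(plain(size, 0));      // 335387
        System.out.println(synchsafe(size, 0));  // 85787

        // A corrupt size field that decodes to a negative number is rejected, not allocated.
        byte[] corrupt = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xF0};
        System.out.println(checkedSize(plain(corrupt, 0), 1000));  // -1
    }
}
```

A caller would stop iterating over frames when checkedSize returns -1, treating the rest of the tag as unreadable.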
[jira] [Resolved] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
[ https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-1215.
-
Resolution: Not A Problem

You're misusing the {{ToHTMLContentHandler}} class:
{code}
ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8");
WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
ContentHandler bodyHandler = new BodyContentHandler(handler);
{code}
The {{ToHTMLContentHandler}} javadoc says:
bq. The incoming SAX events are expected to be well-formed (properly nested, etc.) and valid HTML.
This is not true here, since you're using the {{BodyContentHandler}} to strip out anything outside the {{body}} element. Thus resolving as Not A Problem.

If you want to format the parse output as HTML, you should pass the {{ToHTMLContentHandler}} directly to the parser, without the {{BodyContentHandler}} wrapper.
[jira] [Resolved] (TIKA-1214) Infinity Loop in Mpeg Stream
[ https://issues.apache.org/jira/browse/TIKA-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-1214.
-
Resolution: Duplicate
Fix Version/s: (was: 1.5)

Resolving as duplicate of TIKA-1179. Please reopen, preferably with a test case/document, if this problem still occurs.

Infinity Loop in Mpeg Stream
Key: TIKA-1214
URL: https://issues.apache.org/jira/browse/TIKA-1214
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: local system
Reporter: Georg Hartmann

Scanning MP3 files runs into an infinite loop in the MpegStream method skipStream. The call to in.skip returns zero, so the loop never ends. A simple fix with a zero counter is below:

    private static void skipStream(InputStream in, long count) throws IOException {
        long size = count;
        long skipped = 0;
        // Five zero-length skips in a row count as an error and break the loop
        int zeroCount = 5;
        while (size > 0 && skipped >= 0) {
            skipped = in.skip(size);
            if (skipped != -1) {
                size -= skipped;
            }
            // Check for zero to break the infinite loop
            if (skipped == 0) {
                zeroCount--;
            }
            if (zeroCount < 0) {
                break;
            }
        }
    }
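Editor's sketch: an alternative to counting zero-length skips is to fall back to a single read() whenever skip() makes no progress, which tells a stream that declines to skip apart from one that has reached its end. The code below is a minimal illustration under that assumption. It is neither the reporter's patch nor the fix applied for TIKA-1179; SkipFully, StubbornStream and demo are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper that skips bytes without looping forever when skip() returns 0. */
final class SkipFully {

    /** Skips up to count bytes and returns how many were actually skipped. */
    static long skip(InputStream in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // No progress: read one byte to distinguish "skip unsupported" from end of stream.
            if (in.read() == -1) {
                break;
            }
            remaining--;
        }
        return count - remaining;
    }

    /** Test stream whose skip() never makes progress. */
    static final class StubbornStream extends ByteArrayInputStream {
        StubbornStream(byte[] data) {
            super(data);
        }

        @Override
        public long skip(long n) {
            return 0;
        }
    }

    /** Runs skip() against a StubbornStream holding the given number of bytes. */
    static long demo(int available, long count) {
        try {
            return skip(new StubbornStream(new byte[available]), count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(10, 4));    // 4
        System.out.println(demo(10, 100));  // 10, the stream ended early
    }
}
```

The loop always terminates because every iteration either consumes at least one byte or detects the end of the stream.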
[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869780#comment-13869780 ]

Jukka Zitting commented on TIKA-1217:
-
Thanks! I committed the patch in revision 1557795.

Before we resolve this as fixed, I think it would be better to use {{detect(File)}} directly, without first trying type detection based on just the file name. Name-based type lookup is less accurate than file-based, and AFAICT there are few applications where file type lookup is performance-critical (and if it is, like in a directory browser, the results are often cached). And assuming the file-based lookup was conditional on the existence of the file, a client could still do name-only lookups by prepending a fictional directory name to the path passed to the {{probeContentType()}} call. WDYT?

Integrate with Java-7 FileTypeDetector API
--
Key: TIKA-1217
URL: https://issues.apache.org/jira/browse/TIKA-1217
Project: Tika
Issue Type: New Feature
Components: detector, mime
Reporter: Peter Ansell
Attachments: TIKA-1217-v2.patch, TIKA-1217.patch

It would be useful if Tika natively provided Java-7 FileTypeDetector [1] implementations.
Adding the corresponding META-INF/services/java.nio.file.spi.FileTypeDetector files would allow the use of Files.probeContentType [2] without any specific links to Tika for this functionality.
If you do not want to rely on Java-7 for the core, then this could be added as an extension module.

[1] http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileTypeDetector.html
[2] http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)
[jira] [Commented] (TIKA-1217) Integrate with Java-7 FileTypeDetector API
[ https://issues.apache.org/jira/browse/TIKA-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870087#comment-13870087 ]

Peter Ansell commented on TIKA-1217:

[~jukkaz] The rationale for checking first on filename, in a Java-7 context, was that Path objects do not hold File Descriptors. Hence, a content type detection method taking a Path object may also be able to avoid getting a File Descriptor. However, if there is an unacceptable loss in fidelity by checking first on the filename then feel free to remove that clause, as it isn't critical to the functionality for me.

There cannot, however, easily be two different implementations in the same module, as java.util.ServiceLoader isn't ordered, so it cannot prefer one over the other. In addition, there are no OpenOptions or LinkOptions attached to Files.probeContentType as there are with other methods such as Files.isRegularFile. That makes it difficult for users to pass in their preferences about how Files.probeContentType should operate (i.e., whether it should try to avoid getting a file descriptor if possible, or whether it should not follow symbolic links).

If we wanted to do a second implementation that always used File it would be perfectly possible, but it would need to go in a separate module to distinguish between the META-INF/services files based on which module is loaded. We would also have to rename the current module from tika-java7 to something more specific.

As you say, in a performance-critical application, the results will be cached to avoid duplication, so it isn't a big deal in the greater scheme of things.

[~lewismc] You can find the patch that Jukka committed in the Tika trunk if you want to test it, but it isn't necessary to do it now if you have other things to do.

https://github.com/apache/tika/commit/39370848b8bd9214dc4b7720539edc0eb595300c
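Editor's sketch: the trade-off discussed in the two comments above can be seen in a tiny FileTypeDetector that looks at the content when the file exists and falls back to the name otherwise. This is an illustration using only the JDK, not the committed tika-java7 code. SketchDetector, its helper methods and the crude "ID3" magic check are assumptions; a real implementation would delegate to Tika's detection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

/** Hypothetical detector: content-based when the file exists, name-based otherwise. */
final class SketchDetector extends FileTypeDetector {

    @Override
    public String probeContentType(Path path) throws IOException {
        if (Files.isRegularFile(path)) {
            byte[] head = new byte[3];
            int n;
            try (InputStream in = Files.newInputStream(path)) {
                n = in.readNBytes(head, 0, head.length);
            }
            // Assumption for the sketch: a leading ID3 tag is treated as an MP3 file.
            if (n == 3 && head[0] == 'I' && head[1] == 'D' && head[2] == '3') {
                return "audio/mpeg";
            }
        }
        Path name = path.getFileName();
        if (name != null && name.toString().toLowerCase(Locale.ROOT).endsWith(".mp3")) {
            return "audio/mpeg";
        }
        return null; // unknown: lets other installed detectors try
    }

    /** Convenience wrapper that turns IOException into an unchecked exception. */
    static String probeQuietly(Path path) {
        try {
            return new SketchDetector().probeContentType(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the bytes to a temporary file with the given suffix and probes it. */
    static String probeTempFile(byte[] content, String suffix) {
        try {
            Path tmp = Files.createTempFile("sketch", suffix);
            try {
                Files.write(tmp, content);
                return new SketchDetector().probeContentType(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Name-only lookup: the fictional directory guarantees that no file is opened.
        System.out.println(probeQuietly(Paths.get("no-such-dir-for-sketch", "song.mp3")));
        // Content lookup: the name says .bin, the leading bytes say ID3.
        System.out.println(probeTempFile(new byte[] {'I', 'D', '3', 4, 0}, ".bin"));
    }
}
```

To make Files.probeContentType pick up such a class, its fully qualified name would be listed in a META-INF/services/java.nio.file.spi.FileTypeDetector file, as the issue description says. The sketch calls the detector directly so that it runs without that registration.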
[jira] [Updated] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefano Fornari updated TIKA-1078:
--
Attachment: tika-1078-2.patch

TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
-
Key: TIKA-1078
URL: https://issues.apache.org/jira/browse/TIKA-1078
Project: Tika
Issue Type: Bug
Components: cli, parser
Reporter: Michael McCandless
Fix For: 1.5
Attachments: T-DS_Excel2003-PPT2003_1.xls, tika-1078-2.patch, tika-1078.patch

Attached document hits this on Windows:
{noformat}
C:\>java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
Extracting 'file0.png' (image/png) to .\file0.png
Extracting 'file1.emf' (application/x-emf) to .\file1.emf
Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
Extracting 'file3.emf' (application/x-emf) to .\file3.emf
Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
    at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails. On Linux it runs fine.
I think somehow ... we have to sanitize the embedded file name ...
[jira] [Commented] (TIKA-1078) TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870125#comment-13870125 ]

Stefano Fornari commented on TIKA-1078:
---
Hi Michael, thanks for the review. I have taken all your comments into account. As for the directory structure, I reverted my change now that I understand the original behaviour better; I think the original behaviour is cleaner and nicer. I am attaching the new patch.
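Editor's sketch: one way to sanitize an embedded file name while keeping the sub-directory structure is to clean each path segment separately. The helper below is a hypothetical illustration and not the content of tika-1078-2.patch; EmbeddedNames and its methods are made-up names. It replaces the characters that Windows rejects in file names (control characters and < > : " / \ | ? *) with an underscore and neutralises empty, "." and ".." segments.

```java
/** Hypothetical helper that makes embedded document names safe to use as relative paths. */
final class EmbeddedNames {
    private static final String ILLEGAL = "<>:\"/\\|?*";

    /** Sanitizes a single path segment (no directory separators expected). */
    static String sanitizeSegment(String segment) {
        StringBuilder sb = new StringBuilder(segment.length());
        for (char c : segment.toCharArray()) {
            boolean illegal = c < 0x20 || ILLEGAL.indexOf(c) >= 0;
            sb.append(illegal ? '_' : c);
        }
        String cleaned = sb.toString();
        // Empty, "." and ".." segments would collapse or escape the output directory.
        if (cleaned.isEmpty() || cleaned.equals(".") || cleaned.equals("..")) {
            return "_";
        }
        return cleaned;
    }

    /** Sanitizes a '/'-separated embedded name segment by segment, keeping sub-directories. */
    static String sanitizePath(String embeddedName) {
        String[] segments = embeddedName.split("/");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                sb.append('/');
            }
            sb.append(sanitizeSegment(segments[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The name from the report, written with escapes to stay independent of source encoding.
        String reported = "MBD0016BDE4/?\u00A3\u263A.bin";
        System.out.println(sanitizePath(reported).equals("MBD0016BDE4/_\u00A3\u263A.bin")); // true
        System.out.println(sanitizePath("a/../b?.txt")); // a/_/b_.txt
    }
}
```

Replacing rather than dropping characters keeps the name length stable, but two different embedded names can still map to the same output name, so a caller may want to add a counter on collision.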