[ https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869021#comment-13869021 ]
Michael McCandless commented on TIKA-1078:
------------------------------------------

bq. I'd like to fix this one as a way to get familiar with tika.

Wonderful!

bq. 1. As far as I understand it (and based on the tests I have done) the problem here is with special characters not allowed in file names by the different file systems, not to special (i.e. not ASCII or UTF8) characters. can anyone confirm?

Yes, I think so. I.e., each OS/filesystem imposes its own restrictions on what characters are allowed in a filename.

bq. 2. Is there any general policy in tika development I should follow wrt java version? shall I stick to a particular version of java, or can I go with Java 7?

Tika must work with Java 6 ... so you can use Java 7 for development, but before committing we need to make sure it works on Java 6 as well.

> TikaCLI: invalid characters in embedded document name causes FNFE when trying to save
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1078
>                 URL: https://issues.apache.org/jira/browse/TIKA-1078
>             Project: Tika
>          Issue Type: Bug
>          Components: cli, parser
>            Reporter: Michael McCandless
>             Fix For: 1.5
>
>         Attachments: T-DS_Excel2003-PPT2003_1.xls
>
>
> Attached document hits this on Windows:
> {noformat}
> C:\>java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
> Extracting 'file0.png' (image/png) to .\file0.png
> Extracting 'file1.emf' (application/x-emf) to .\file1.emf
> Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
> Extracting 'file3.emf' (application/x-emf) to .\file3.emf
> Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
> Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
>     at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 5 more
> {noformat}
> TikaCLI manages to create the sub-directory, but because the embedded fileName has invalid (for Windows) characters, it fails.
> On Linux it runs fine.
> I think somehow ... we have to sanitize the embedded file name ...
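
For illustration only, here is a rough sketch of what "sanitize the embedded file name" could look like. It is not the actual Tika fix. The class name EmbeddedNameSanitizer and the method sanitize are hypothetical, and the choice of '_' as the replacement character is an assumption. The sketch takes a single path component (the caller would need to split 'MBD0016BDE4/...' on the separator first, since '/' and '\' are themselves replaced), swaps out the characters Windows rejects plus control characters, and strips trailing dots and spaces. It sticks to Java 6 language features, and it does not handle reserved device names such as CON or NUL.

```java
public final class EmbeddedNameSanitizer {

    // Characters Windows does not allow in a file name component.
    // '/' and '\' are included, so pass one path component at a time.
    private static final String RESERVED = "\\/:*?\"<>|";

    private EmbeddedNameSanitizer() {
    }

    /**
     * Returns a copy of the given path component with reserved and
     * control characters replaced by '_' (an assumed replacement).
     */
    public static String sanitize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < 0x20 || RESERVED.indexOf(c) >= 0) {
                sb.append('_');
            } else {
                sb.append(c);
            }
        }
        // Trailing dots and spaces are not preserved by Windows, so drop them.
        int end = sb.length();
        while (end > 0 && (sb.charAt(end - 1) == '.' || sb.charAt(end - 1) == ' ')) {
            end--;
        }
        sb.setLength(end);
        // Never return an empty name.
        return sb.length() == 0 ? "_" : sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes stand in for the pound sign and smiley shown in the log.
        System.out.println(sanitize("?\u00A3\u263A.bin"));
    }
}
```

Applying the same replacement on every platform, rather than only on Windows, would keep the extracted file names identical across operating systems, at the cost of renaming some files that Linux would have accepted as-is.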