[ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868987#comment-13868987
 ] 

Stefano Fornari commented on TIKA-1078:
---------------------------------------

I'd like to fix this one as a way to get familiar with Tika.
I have a couple of questions:

1. As far as I understand it (and based on the tests I have done), the problem 
here is with characters that the various file systems do not allow in file 
names, not with special (i.e. non-ASCII or UTF-8) characters as such. Can 
anyone confirm?
2. Is there a general policy in Tika development that I should follow regarding 
the Java version? Should I stick to a particular version, or can I go with Java 7?
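
To make question 1 concrete, here is a minimal sketch of the kind of 
sanitization that would apply if the diagnosis is right. It is not existing 
Tika code: the class and method names are hypothetical, and the character set 
is an assumption based on what Windows rejects in file names. Non-ASCII 
characters such as the pound sign are deliberately left alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika; names are made up for illustration.
final class FileNameSanitizer {

    // Assumed set of characters Windows rejects in a file name:
    // < > : " / \ | ? * and the control characters U+0000..U+001F.
    private static final Pattern ILLEGAL =
            Pattern.compile("[<>:\"/\\\\|?*\\x00-\\x1F]");

    private FileNameSanitizer() {}

    // Replaces every illegal character in a single path segment with '_'.
    static String sanitizeSegment(String segment) {
        return ILLEGAL.matcher(segment).replaceAll("_");
    }

    // Sanitizes an embedded resource name segment by segment, treating '/'
    // as the separator so that a sub-directory component is preserved.
    static String sanitizePath(String name) {
        StringBuilder out = new StringBuilder();
        for (String segment : name.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip empty segments from leading or doubled '/'
            }
            if (out.length() > 0) {
                out.append('/');
            }
            out.append(sanitizeSegment(segment));
        }
        return out.toString();
    }
}
```

With this sketch, the name from the report would keep its MBD0016BDE4 
directory and only the '?' in the file name would be replaced. It does not 
handle reserved device names (CON, NUL, ...), trailing dots or spaces, or ".." 
segments, so a real fix would need more than this.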



> TikaCLI: invalid characters in embedded document name causes FNFE when trying 
> to save
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1078
>                 URL: https://issues.apache.org/jira/browse/TIKA-1078
>             Project: Tika
>          Issue Type: Bug
>          Components: cli, parser
>            Reporter: Michael McCandless
>             Fix For: 1.5
>
>         Attachments: T-DS_Excel2003-PPT2003_1.xls
>
>
> Attached document hits this on Windows:
> {noformat}
> C:\>java.exe -jar tika-app-1.3.jar -z -x 
> c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
> Extracting 'file0.png' (image/png) to .\file0.png
> Extracting 'file1.emf' (application/x-emf) to .\file1.emf
> Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
> Extracting 'file3.emf' (application/x-emf) to .\file3.emf
> Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
> Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to 
> .\MBD0016BDE4\?£☺.bin
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: 
> Illegal IOException from 
> org.apache.tika.parser.microsoft.OfficeParser@75f875f8
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The 
> filename, directory name, or volume label syntax is incorrect.)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
>         at 
> org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
>         at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
>         at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
>         at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
>         at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         ... 5 more
> {noformat}
> TikaCLI manages to create the sub-directory, but because the embedded 
> fileName has invalid (for Windows) characters, it fails.
> On Linux it runs fine.
> I think somehow ... we have to sanitize the embedded file name ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
