[ 
https://issues.apache.org/jira/browse/TIKA-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869021#comment-13869021
 ] 

Michael McCandless commented on TIKA-1078:
------------------------------------------

bq. I'd like to fix this one as a way to get familiar with Tika.

Wonderful!

bq. 1. As far as I understand it (and based on the tests I have done), the 
problem here is with characters that the different file systems do not allow 
in file names, not with special (i.e. not ASCII or UTF8) characters. Can 
anyone confirm?

Yes, I think so.  I.e., each OS/filesystem imposes its own restrictions on what 
characters are allowed in a filename.
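
To make that concrete, here is a minimal sketch of one way an embedded name 
could be cleaned up before it is used as an output path. It is not the actual 
TikaCLI code or an agreed fix: the class name EmbeddedNameSanitizer, the 
choice of "_" as the replacement, and the "embedded" fallback are all 
assumptions for illustration. It targets the characters Windows rejects in a 
file name component (plus ASCII control characters) and sticks to Java 6 
syntax.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
final class EmbeddedNameSanitizer {

    // Characters Windows does not allow in a single file name component,
    // plus ASCII control characters. '/' and '\' are handled as separators.
    private static final Pattern ILLEGAL =
            Pattern.compile("[<>:\"|?*\\x00-\\x1F]");

    private EmbeddedNameSanitizer() {
    }

    // Cleans each path segment and rejoins the result with '/'.
    static String sanitize(String embeddedName) {
        String[] segments = embeddedName.split("[/\\\\]");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            String cleaned = ILLEGAL.matcher(segments[i]).replaceAll("_");
            // Skip empty segments and "." / ".." so the result cannot
            // point outside the extraction directory.
            if (cleaned.length() == 0
                    || cleaned.equals(".")
                    || cleaned.equals("..")) {
                continue;
            }
            if (out.length() > 0) {
                out.append('/');
            }
            out.append(cleaned);
        }
        // Assumed fallback name when nothing usable is left.
        return out.length() == 0 ? "embedded" : out.toString();
    }
}
```

With this sketch, a literal "?" in a name such as the one in the report would 
become "_", while characters like "£" are left alone. It does not handle 
Windows reserved device names (CON, NUL, ...) or trailing dots and spaces, so 
a real fix would need more than this.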

bq. 2. Is there any general policy in Tika development I should follow with 
respect to the Java version? Shall I stick to a particular version of Java, or 
can I go with Java 7?

Tika must work with Java 6 ... so you can use Java 7 for development, but 
before committing we need to make sure it works on Java 6 as well.

> TikaCLI: invalid characters in embedded document name causes FNFE when trying 
> to save
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1078
>                 URL: https://issues.apache.org/jira/browse/TIKA-1078
>             Project: Tika
>          Issue Type: Bug
>          Components: cli, parser
>            Reporter: Michael McCandless
>             Fix For: 1.5
>
>         Attachments: T-DS_Excel2003-PPT2003_1.xls
>
>
> Attached document hits this on Windows:
> {noformat}
> C:\>java.exe -jar tika-app-1.3.jar -z -x c:\data\idit\T-DS_Excel2003-PPT2003_1.xls
> Extracting 'file0.png' (image/png) to .\file0.png
> Extracting 'file1.emf' (application/x-emf) to .\file1.emf
> Extracting 'file2.jpg' (image/jpeg) to .\file2.jpg
> Extracting 'file3.emf' (application/x-emf) to .\file3.emf
> Extracting 'file4.wmf' (application/x-msmetafile) to .\file4.wmf
> Extracting 'MBD0016BDE4/?£☺.bin' (application/octet-stream) to .\MBD0016BDE4\?£☺.bin
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@75f875f8
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.FileNotFoundException: .\MBD0016BDE4\?£☺.bin (The filename, directory name, or volume label syntax is incorrect.)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:205)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:156)
>         at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:722)
>         at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:201)
>         at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>         ... 5 more
> {noformat}
> TikaCLI manages to create the sub-directory, but because the embedded 
> fileName has invalid (for Windows) characters, it fails.
> On Linux it runs fine.
> I think somehow ... we have to sanitize the embedded file name ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
