[ 
https://issues.apache.org/jira/browse/TIKA-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3864.
-------------------------------
    Fix Version/s: 2.5.1
       Resolution: Fixed

> Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3864
>                 URL: https://issues.apache.org/jira/browse/TIKA-3864
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-pipes, tika-server
>    Affects Versions: 2.4.1
>         Environment: debian:bullseye docker container running 
> tika-server-standard-2.4.1.jar
>            Reporter: Tong Wang
>            Priority: Major
>             Fix For: 2.5.1
>
>
> When using FileSystemFetcher, if the fetchKey contains non-ASCII characters, 
> Tika Server throws an exception because the resolved file name is incorrect. 
> Here is an example:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" 
> --header "fetchKey: 中文.txt" {code}
> I get java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/ä¸æ–‡.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
>  
> When I percent-encode the characters instead:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" 
> --header "fetchKey: %E4%B8%AD%E6%96%87.txt" {code}
> I still get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
> BTW, locale is set to C.UTF-8 on Tika Server:
> {code:java}
> # locale
> LANG=C.UTF-8
> LANGUAGE=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_PAPER="C.UTF-8"
> LC_NAME="C.UTF-8"
> LC_ADDRESS="C.UTF-8"
> LC_TELEPHONE="C.UTF-8"
> LC_MEASUREMENT="C.UTF-8"
> LC_IDENTIFICATION="C.UTF-8"
> LC_ALL= {code}
>  
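
The garbled path in the first stack trace (ä¸æ–‡.txt) is consistent with the UTF-8 bytes of the fetchKey being decoded with a single-byte charset somewhere between the HTTP header and the file system lookup. The sketch below only illustrates that mechanism and how such a value can be recovered. It is not taken from the Tika code base and makes no claim about how the 2.5.1 fix works. The class and method names are hypothetical, and ISO-8859-1 is an assumption about the charset involved.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Tika code.
public class HeaderMojibakeSketch {

    // Simulates a server that reads UTF-8 header bytes as ISO-8859-1
    // (assumed charset; each byte becomes exactly one char).
    public static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mis-decoding: ISO-8859-1 maps chars U+0000..U+00FF
    // back to the same byte values, so the original UTF-8 bytes survive.
    public static String recover(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587.txt" is the fetchKey from the report, written with
        // Unicode escapes so the source file encoding does not matter.
        String key = "\u4e2d\u6587.txt";
        String garbled = misdecode(key);

        // 6 UTF-8 bytes for the two CJK characters plus 4 ASCII chars.
        System.out.println(garbled.length());             // 10
        System.out.println(recover(garbled).equals(key)); // true
    }
}
```

The recovery step is only safe when the value really was mis-decoded this way. A string that already contains characters above U+00FF would be corrupted by the ISO-8859-1 encode step.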

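The second stack trace shows the percent-encoded fetchKey reaching the file system unchanged, so in 2.4.1 the value was evidently not URL-decoded before the path was resolved. As a minimal sketch of what decoding such a key looks like in plain Java (10 or later), assuming client and server agree that fetchKey values are percent-encoded UTF-8, which the report does not establish, and using a hypothetical class name:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Tika code.
public class FetchKeyDecodeSketch {

    // Decodes a percent-encoded header value as UTF-8.
    // Note: URLDecoder follows form-encoding rules, so '+' becomes a space.
    public static String decodeFetchKey(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeFetchKey("%E4%B8%AD%E6%96%87.txt");
        // Compare against the expected name via Unicode escapes.
        System.out.println(decoded.equals("\u4e2d\u6587.txt")); // true
    }
}
```

Because of the '+' handling noted in the comment, file names that contain a literal plus sign would need to be sent as %2B under this convention.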


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
