Tong Wang created TIKA-3864:
-------------------------------

             Summary: Non-ascii UTF-8 characters in fetchKey not working with 
FileSystemFetcher
                 Key: TIKA-3864
                 URL: https://issues.apache.org/jira/browse/TIKA-3864
             Project: Tika
          Issue Type: Bug
          Components: tika-pipes, tika-server
    Affects Versions: 2.4.1
         Environment: debian:bullseye docker container running 
tika-server-standard-2.4.1.jar
            Reporter: Tong Wang


When using FileSystemFetcher, if the fetchKey contains non-ASCII characters, 
Tika Server throws an exception. Here is an example:
{code:java}
curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
  --header "fetchKey: 中文.txt"
{code}
I get a java.nio.file.NoSuchFileException, and the file name in the message is garbled:
{code:java}
Caused by: java.nio.file.NoSuchFileException: /restricted/䏿–‡.txt
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
    at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
    at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
    at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
 {code}
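The garbled name in the trace above looks like what happens when the UTF-8 bytes of the key are decoded with a single-byte charset somewhere between curl and FileSystemFetcher. That is only a guess on my part; this report does not confirm where the decoding happens. A minimal sketch of the effect, assuming ISO-8859-1 as the wrong charset and using hypothetical helper names (misdecode, repair) that are not part of Tika:

```java
import java.nio.charset.StandardCharsets;

public class HeaderMojibakeDemo {

    // Hypothetical helper: simulates a component that reads the UTF-8
    // bytes of a header value as if they were ISO-8859-1 (an assumption).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: reverses the mis-decoding by recovering the raw
    // bytes and decoding them as UTF-8. ISO-8859-1 maps every byte to one
    // char, so the round trip loses nothing.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587.txt"; // the fetchKey from the curl example
        String garbled = misdecode(key);

        // Two CJK characters are 6 bytes in UTF-8, so the garbled form has
        // 6 + 4 = 10 chars instead of the original 6.
        System.out.println(garbled.length());           // prints 10
        System.out.println(repair(garbled).equals(key)); // prints true
    }
}
```

If this is what happens, the bytes on the wire are fine and only the charset used to read the header value is wrong.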
When I percent-encode the characters instead:
{code:java}
curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
  --header "fetchKey: %E4%B8%AD%E6%96%87.txt"
{code}
I still get a java.nio.file.NoSuchFileException. This time the path in the message contains the key exactly as sent, without any decoding:
{code:java}
Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
    at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
    at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
    at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
    at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
{code}
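The second trace suggests the percent-encoded key is passed to the file system as-is. If percent-encoded keys were to be supported, the decoding step might look like the sketch below. Here decodeKey is a hypothetical helper, not an existing Tika method, and UTF-8 is an assumed encoding for the escapes:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class FetchKeyDecodeDemo {

    // Hypothetical helper (not part of Tika): percent-decodes a fetchKey,
    // assuming the escapes encode UTF-8 bytes.
    // URLDecoder.decode(String, Charset) requires Java 10 or later.
    static String decodeKey(String fetchKey) {
        return URLDecoder.decode(fetchKey, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = "%E4%B8%AD%E6%96%87.txt"; // key from the second curl call
        String decoded = decodeKey(encoded);

        // The decoded key matches the file name from the first curl call.
        System.out.println(decoded.equals("\u4e2d\u6587.txt")); // prints true
    }
}
```

One caveat with this approach: URLDecoder follows form-encoding rules, so it also turns a literal "+" in the key into a space, which may not be wanted for file names.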
BTW, locale is set to C.UTF-8 on Tika Server:
{code:java}
# locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL= {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
