[ https://issues.apache.org/jira/browse/TIKA-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-3864.
-------------------------------
    Fix Version/s: 2.5.1
       Resolution: Fixed

> Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3864
>                 URL: https://issues.apache.org/jira/browse/TIKA-3864
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-pipes, tika-server
>    Affects Versions: 2.4.1
>         Environment: debian:bullseye docker container running tika-server-standard-2.4.1.jar
>            Reporter: Tong Wang
>            Priority: Major
>             Fix For: 2.5.1
>
>
> When using FileSystemFetcher, if there are non-ASCII characters in fetchKey,
> Tika Server throws an exception because the file name is incorrect. Here is an
> example:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: 中文.txt"
> {code}
> I get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/ä¸æ.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
>
> When I try to percent-encode the characters:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" --header "fetchKey: %E4%B8%AD%E6%96%87.txt"
> {code}
> I still get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
> BTW, the locale is set to C.UTF-8 on the Tika server:
> {code:java}
> # locale
> LANG=C.UTF-8
> LANGUAGE=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_PAPER="C.UTF-8"
> LC_NAME="C.UTF-8"
> LC_ADDRESS="C.UTF-8"
> LC_TELEPHONE="C.UTF-8"
> LC_MEASUREMENT="C.UTF-8"
> LC_IDENTIFICATION="C.UTF-8"
> LC_ALL=
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
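
For reference, the garbled path in the first trace is consistent with the UTF-8 bytes of the fetchKey header being decoded as ISO-8859-1 somewhere between the HTTP layer and FileSystemFetcher. The issue text does not state the root cause or describe the fix, so the following is only a minimal sketch under that assumption. The class and method names (`FetchKeyMojibake`, `misdecode`, `repair`, `percentDecode`) are hypothetical and are not part of Tika.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class FetchKeyMojibake {

    // Simulates the assumed failure: UTF-8 bytes read back as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mis-decoding. This is lossless because ISO-8859-1
    // maps every byte value to exactly one character and back.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes a percent-encoded key as UTF-8 (this overload needs Java 10+).
    // Note that URLDecoder also turns '+' into a space.
    static String percentDecode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fetchKey = "\u4e2d\u6587.txt"; // the file name from the report
        String garbled = misdecode(fetchKey);

        // Two 3-byte characters become six Latin-1 characters.
        System.out.println(fetchKey.length() + " -> " + garbled.length());
        System.out.println(repair(garbled).equals(fetchKey));
        System.out.println(percentDecode("%E4%B8%AD%E6%96%87.txt").equals(fetchKey));
    }
}
```

The second trace shows the percent-encoded key reaching the file system unchanged, which suggests no decoding step such as `percentDecode` was applied to the header value in 2.4.1.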