Hi,

thank you for your curl examples. I experimented a bit and I think I have found the main problem: all POST requests also send a Content-Type that depends on the filename's extension. I think tools like curl and Postman add the Content-Type themselves. So in the examples you wrote, the Content-Type is automatically set to "text/plain" (without adding it explicitly). I verified this with Wireshark, a tool that captures the requests that are sent, so you can see what really happens in the background.

My guess is that Tika only works with the Content-Type and not with the filename hint. The problem is that in my project I don't want to implement logic to add a Content-Type. I could add a default Content-Type like application/octet-stream, but I don't want to guess the type before I send the file to Tika.

Is there a workaround? Or is this a bug in Tika?
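One direction that might be worth testing, sketched in Java under assumptions: build the multipart body by hand so that the file part carries only a Content-Disposition header with the filename and no part-level Content-Type. The class and method names below are hypothetical; the endpoint http://localhost:9998/tika/form and the field name "upload" are taken from the curl examples quoted below. Whether tika-server then falls back to magic plus filename detection for such a part is not confirmed anywhere in this thread and would need to be checked, for example with Wireshark.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not part of Tika or of Apache HttpClient.
final class TikaMultipartSketch {

    // Builds a multipart/form-data body with a single file part.
    // The part has a Content-Disposition header only; no part-level
    // Content-Type is written, so the client does not guess the type.
    static byte[] buildBody(String boundary, String fieldName,
                            String filename, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + filename + "\"\r\n"
                + "\r\n";
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(content); // raw file bytes, e.g. Shift_JIS text
        String tail = "\r\n--" + boundary + "--\r\n";
        out.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    // Wraps the body in a POST request. The request-level Content-Type
    // only declares the multipart container and its boundary.
    static HttpRequest buildRequest(String url, String boundary, byte[] body) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type",
                        "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

The request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) from java.net.http (Java 11 or later). The boundary must be a string that does not occur in the file content.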

-----Original Message-----
From: Tim Allison <talli...@apache.org>
Sent: Monday, 8 May 2023 15:59
To: user@tika.apache.org
Subject: Re: post request with shift_jis encoding and filename hint

The file looks to work for me (if google translate is any indication) with curl and tika-server 2.7.0.

curl commands that work:

curl -T 20161110_20161017_shiftjis.txt -H "Content-Disposition: attachment; filename=blah.txt" http://localhost:9998/tika

curl -F upload=@20161110_20161017_shiftjis.txt -H "Content-Disposition: attachment; filename=blah.txt" http://localhost:9998/tika/form

curl -F upload=@20161110_20161017_shiftjis.txt http://localhost:9998/tika/form

curl -F upload=@20161110_20161017_shiftjis.txt http://localhost:9998/rmeta/form

result for the /tika or /tika/form endpoints:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.csv.TextAndCSVParser"/>
<meta name="Content-Encoding" content="Shift_JIS"/>
<meta name="resourceName" content="blah.txt"/>
<meta name="Content-Length" content="378"/>
<meta name="Content-Type" content="text/plain; charset=Shift_JIS"/>
<title>
</title>
</head>
<body>
<p>全文検索の技術調査について2
■Apache Solr
ライセンス:Apache Lisence 2.0
開発環境:Java (サードパーティー提供の環境は Javascript、Ruby 等多数)
http://lucene.apache.org/solr/
■Elasticsearch
ライセンス:Apache Lisence 2.0
開発環境:Java、Javascript、.Net Framework、Ruby 等多数
https://www.elastic.co/jp/products/elasticsearch
</p>
</body>
</html>

On Mon, May 8, 2023 at 4:05 AM Medea Springmeier <medea.springme...@raytion.com> wrote:
>
> Hi,
>
> thanks for the hint. I tested the new version of Tika (2.7.0), but I cannot
> see any difference (the detection of the shift_jis file does not work).
> Did you test it as a server? I must use Tika as a server and with a POST
> request.
>
> -----Original Message-----
> From: Tim Allison <talli...@apache.org>
> Sent: Monday, 1 May 2023 16:47
> To: user@tika.apache.org
> Subject: Re: post request with shift_jis encoding and filename hint
>
> In Tika 2.7.0, we migrated to a living fork of the Universal Charset Detector
> (TIKA-3213). I just tried the main branch's detection of the file attached
> to TIKA-2473, and the detection now works for that file.
>
> I completely understand the problems you're having and appreciate your
> attempted workarounds, but you're right, Tika should _just work_. So, give
> 2.7.0 a try.
>
> That said, charset detection is not always perfect, and charset detection on
> short files is notoriously challenging.
>
> On Fri, Apr 28, 2023 at 11:43 AM Medea Springmeier
> <medea.springme...@raytion.com> wrote:
> >
> > Hi,
> >
> > I want Tika to detect a text file with shift_jis as its character encoding.
> >
> > I found this issue:
> >
> > https://github.com/dadoonet/fscrawler/issues/400
> >
> > (and there is also a ticket for the problem in the Tika Jira:
> > https://issues.apache.org/jira/browse/TIKA-2437)
> >
> > So, I also put the filename in my request to give Tika a hint. When I make
> > a PUT request everything is fine (I get back the "Content-Type":
> > "text/plain; charset=Shift_JIS" and also the shift_jis text I want to
> > have). But when I make a POST request I run into the problem that I cannot
> > add a Content-Disposition header in the POST body without also adding a
> > Content-Type header (I use Java and the MultipartEntityBuilder for my
> > request to Tika Server (2.6.0)). However, when I add a Content-Type header,
> > then Tika uses it for its detection, even when it is set to a wildcard. So
> > all I get in this situation is "Content-Type":
> > "application/octet-stream" without any detected text and the information
> > that Tika used the EmptyParser.
> >
> > I don't want to add the "Content-Type": "text/plain" in the request (this
> > would work) because I do not have only text files. And I do not want to
> > guess the Content-Type from the filename myself. I would expect Tika to be
> > able to do that.
> >
> > I want to use Tika with POST requests. Is there any way to use it in this
> > way and to detect shift_jis encoded text files?
> >
> > Is there perhaps a way to tell Tika to use only MIME magic and the
> > filename, and not the Content-Type, when guessing the MIME type?