The file appears to work for me (if Google Translate is any indication)
with curl and tika-server 2.7.0.

curl commands that work:
curl -T 20161110_20161017_shiftjis.txt -H "Content-Disposition: attachment; filename=blah.txt" http://localhost:9998/tika
curl -F upload=@20161110_20161017_shiftjis.txt -H "Content-Disposition: attachment; filename=blah.txt" http://localhost:9998/tika/form
curl -F upload=@20161110_20161017_shiftjis.txt http://localhost:9998/tika/form
curl -F upload=@20161110_20161017_shiftjis.txt http://localhost:9998/rmeta/form

result for the /tika or /tika/form endpoints:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.csv.TextAndCSVParser"/>
<meta name="Content-Encoding" content="Shift_JIS"/>
<meta name="resourceName" content="blah.txt"/>
<meta name="Content-Length" content="378"/>
<meta name="Content-Type" content="text/plain; charset=Shift_JIS"/>
<title>
</title>
</head>
<body>
<p>全文検索の技術調査について2&#13;
&#13;
&#13;
 ■Apache Solr&#13;
   ライセンス:Apache Lisence 2.0&#13;
  開発環境:Java (サードパーティー提供の環境は Javascript、Ruby 等多数)&#13;
&#13;
    http://lucene.apache.org/solr/&#13;
&#13;
 ■Elasticsearch&#13;
   ライセンス:Apache Lisence 2.0&#13;
  開発環境:Java、Javascript、.Net Framework、Ruby 等多数&#13;
&#13;
    https://www.elastic.co/jp/products/elasticsearch&#13;
&#13;
</p>
</body>
</html>
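
For the Java side, one option that might be worth trying is to build the
multipart body by hand, so that the file part carries only a
Content-Disposition header and no per-part Content-Type. The following is a
minimal sketch under stated assumptions, not a tested fix: it uses the JDK's
java.net.http (Java 11+) instead of MultipartEntityBuilder, the class name,
method names and boundary value are hypothetical, and it has not been run
against tika-server. Whether tika-server then falls back to its own detection
for such a part is also an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class TikaFormPostSketch {

    // Builds a multipart/form-data body with one file part. The part has a
    // Content-Disposition header (field name and filename hint) and
    // deliberately no Content-Type header.
    static byte[] buildBody(String fieldName, String fileName,
                            byte[] content, String boundary) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n")
                .getBytes(StandardCharsets.UTF_8);
        out.write(head, 0, head.length);
        out.write(content, 0, content.length); // raw file bytes, unchanged
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    // Wraps the body in a POST request. Only the request-level Content-Type
    // (multipart/form-data with the boundary) is set.
    static HttpRequest buildRequest(String url, byte[] body, String boundary) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type",
                        "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

The request could then be sent to http://localhost:9998/tika/form with
HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).
The boundary must be a string that does not occur in the file content.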

On Mon, May 8, 2023 at 4:05 AM Medea Springmeier
<medea.springme...@raytion.com> wrote:
>
> Hi,
>
> thanks for the hint. I tested the new version of Tika (2.7.0), but I cannot
> see any difference (the detection of the shift_jis file still does not work).
> Did you test it as a server? I must use Tika as a server and with a POST
> request.
>
>
>
> -----Original Message-----
> From: Tim Allison <talli...@apache.org>
> Sent: Monday, May 1, 2023 16:47
> To: user@tika.apache.org
> Subject: Re: post request with shift_jis encoding and filename hint
>
> In Tika 2.7.0, we migrated to a living fork of the Universal Charset Detector 
> (TIKA-3213).  I just tried the main branch's detection of the file attached 
> to TIKA-2473, and the detection now works for that file.
>
> I completely understand the problems you're having and appreciate your 
> attempted workarounds, but you're right, Tika should _just work_.  So, give 
> 2.7.0 a try.
>
> That said, charset detection is not always perfect, and charset detection on 
> short files is notoriously challenging.
>
> On Fri, Apr 28, 2023 at 11:43 AM Medea Springmeier 
> <medea.springme...@raytion.com> wrote:
> >
> > Hi,
> >
> > I want Tika to detect a text file with shift_jis as its character encoding.
> >
> > I found this one here:
> >
> > https://github.com/dadoonet/fscrawler/issues/400
> >
> > (and there is also a ticket for the problem in the Jira of Tika:
> > https://issues.apache.org/jira/browse/TIKA-2437)
> >
> >
> >
> > So, I also put the filename in my request to give Tika a hint. When I make
> > a PUT request, everything is fine (I get back "Content-Type":
> > "text/plain; charset=Shift_JIS" and also the shift_jis text I want to
> > have). But when I make a POST request, I cannot add a Content-Disposition
> > header in the POST body without also adding a Content-Type header (I use
> > Java and the MultipartEntityBuilder for my request to Tika Server (2.6.0)).
> > However, when I add a Content-Type header, Tika uses it for its detection,
> > even when it is set to a wildcard. So, all I get in this situation is
> > "Content-Type": "application/octet-stream" without any detected text, and
> > the information that Tika used the EmptyParser.
> >
> >
> >
> > I don't want to add "Content-Type": "text/plain" to the request (this
> > would work) because I do not have only text files. And I do not want to
> > guess the Content-Type from the filename myself. I would expect Tika to be
> > able to do that.
> >
> >
> >
> > I want to use Tika with POST requests. Is there any way to use it this
> > way and still detect shift_jis encoded text files?
> >
> > Or is there a way to tell Tika to use only MIME magic and the filename,
> > and not the Content-Type, for guessing the MIME type?
> >
> >
> >
> >
