The headers were missing from the second attachment call.
Was that a mistake? You may be getting an invalid response because of a parsing error
on the server side.

On Jul 18, 2023, at 8:55 AM, Tim Allison <talli...@apache.org> wrote:


Hi Medea,
  I'm sorry for my delay.  To confirm some points:

1) It is difficult to identify text files, especially when they are short. Tika has a known weakness in identifying non-ASCII text files as text files vs. application/octet-stream.
2) It is difficult to identify charsets on short files.
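
To illustrate point 2, here is a minimal sketch (plain JDK, not Tika's detector) of why a short byte sequence says little about its charset. The class and method names are hypothetical, and it assumes the JDK in use ships the Shift_JIS charset. The same four bytes are a valid Shift_JIS string and also a valid ISO-8859-1 string, so validity alone cannot decide between them:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class ShortTextAmbiguity {

    /** Returns true if the bytes decode under the charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, String charsetName) {
        try {
            Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two kanji (U+691C U+7D22, "search") encoded as Shift_JIS; every byte is non-ASCII.
        byte[] bytes = "\u691C\u7D22".getBytes(Charset.forName("Shift_JIS"));
        for (String name : new String[] {"Shift_JIS", "ISO-8859-1", "UTF-8"}) {
            System.out.println(name + " decodes cleanly: " + decodesCleanly(bytes, name));
        }
    }
}
```

A real detector has to fall back on statistics over the byte patterns, and a few hundred bytes give it very little to work with.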

That said, if you are willing to use a file name hint, Tika will take that and correctly parse the file linked in your original email.
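
For reference, a rough Java counterpart of the `curl -T` command quoted further down in this thread, using the JDK's java.net.http client. This is only a sketch: the class and method names are hypothetical, the server URL and the "blah.txt" hint are taken from the curl examples, and the body bytes are a placeholder. It builds the request without sending it, so the headers can be inspected first:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class TikaPutWithFileNameHint {

    /** Builds (but does not send) a PUT to /tika that carries a file name hint and no Content-Type. */
    static HttpRequest buildPut(String serverUrl, byte[] body, String fileNameHint) {
        return HttpRequest.newBuilder(URI.create(serverUrl + "/tika"))
                // Same header as in the curl example: the hint is the file name only.
                .header("Content-Disposition", "attachment; filename=" + fileNameHint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder content; in practice this would be the bytes of the file to parse.
        byte[] body = "placeholder".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildPut("http://localhost:9998", body, "blah.txt");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```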

I completely understand your desire not to send in a media type, and you shouldn't have to, and for this file, you don't need to.



Based on your feedback on curl, I tried a plain Java client with a post request:

    // The part headers are deliberately left empty: no file name hint, no media type.
    Path p = Paths.get("/Users/allison/tools/tika/20161110_20161017_shiftjis");
    MetadataMap<String, String> headers = new MetadataMap<>();
    Attachment attachmentPart = new Attachment(headers, Files.newInputStream(p));

    Response response = WebClient.create(endPoint + TIKA_PATH + "/form")
            .type("multipart/form-data")
            .accept("text/plain")
            .post(attachmentPart);
    String responseMsg = getStringFromInputStream((InputStream) response.getEntity());
    System.out.println("response>" + responseMsg + "<");

And that worked!  I'm actually somewhat surprised that it worked without the file name hint, but so it goes.
The following (setting the id to something even if it ended in ".txt" and setting the mediaType to empty string) did NOT work:

    // Here the part gets an id and an empty media type.
    Path p = Paths.get("/Users/allison/tools/tika/20161110_20161017_shiftjis");
    Attachment attachmentPart = new Attachment("my-id", "", Files.newInputStream(p));
    Response response = WebClient.create(endPoint + TIKA_PATH + "/form")
            .type("multipart/form-data")
            .accept("text/plain")
            .post(attachmentPart);
    String responseMsg = getStringFromInputStream((InputStream) response.getEntity());
    System.out.println("response>" + responseMsg + "<");

On Mon, Jul 17, 2023 at 8:50 AM Medea Springmeier <medea.springme...@raytion.com> wrote:
Hi, a few weeks ago I sent this email, but I did not get an answer. I would still like to know whether this is a bug in Tika or how I can solve this problem. If something is unclear, please ask.

-----Original Message-----
From: Medea Springmeier
Sent: Monday, June 12, 2023 12:34
To: user@tika.apache.org
Subject: RE: post request with shift_jis encoding and filename hint

Hi, thank you for your curl examples. I tested a bit and I think I have now found the main problem: all of the POST requests also send a content-type that depends on the filename's extension. I think tools like curl and Postman add the content-type themselves. So in the examples you wrote, the content-type is automatically added as "text/plain" (without it being added explicitly).
I verified this with Wireshark, a tool that captures the requests that are sent, so you get a better understanding of what really happens in the background.
My guess is that Tika only works with the content-type and not with the filename hint. The problem is that in my project I don't want to implement logic to add a content-type. I could add a default content-type like application/octet-stream, but I don't want to guess the type before I send the file to Tika.
Is there a workaround or something like that? Or is this a bug in Tika?
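
One way to see exactly what goes over the wire, without a client library adding anything, is to assemble the multipart body by hand. The sketch below (hypothetical class and method names, plain JDK) writes a single file part that has a Content-Disposition header with a filename but no Content-Type header. Whether Tika Server then falls back to the filename and magic bytes for such a part is not confirmed anywhere in this thread, so treat it as an experiment rather than a fix:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class MultipartWithoutPartContentType {

    /** Builds a multipart/form-data body with one file part that has no Content-Type header. */
    static byte[] buildBody(String boundary, String fieldName, String fileName, byte[] content) {
        // Part header: only Content-Disposition, followed by the empty line that ends the headers.
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "\r\n").getBytes(StandardCharsets.UTF_8);
        // Closing delimiter that ends the multipart body.
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(content, 0, content.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder content; in practice this would be the raw bytes of the file.
        byte[] content = "placeholder".getBytes(StandardCharsets.UTF_8);
        byte[] body = buildBody("tika-boundary", "upload", "blah.txt", content);
        System.out.print(new String(body, StandardCharsets.ISO_8859_1));
    }
}
```

The request itself would still need the header `Content-Type: multipart/form-data; boundary=tika-boundary` so that the server can split the body into parts.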

-----Original Message-----
From: Tim Allison <talli...@apache.org>
Sent: Monday, May 8, 2023 15:59
To: user@tika.apache.org
Subject: Re: post request with shift_jis encoding and filename hint

The file looks to work for me (if google translate is any indication) with curl and tika-server 2.7.0.

curl commands that work:
curl -T 20161110_20161017_shiftjis.txt -H "Content-Disposition: attachment; filename=blah.txt" http://localhost:9998/tika
curl -F upload=@20161110_20161017_shiftjis.txt -H "Content-Disposition: attachment; filename=blah.txt" http://localhost:9998/tika/form
curl -F upload=@20161110_20161017_shiftjis.txt http://localhost:9998/tika/form
curl -F upload=@20161110_20161017_shiftjis.txt http://localhost:9998/rmeta/form

result for the /tika or /tika/form endpoints:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.csv.TextAndCSVParser"/>
<meta name="Content-Encoding" content="Shift_JIS"/>
<meta name="resourceName" content="blah.txt"/>
<meta name="Content-Length" content="378"/>
<meta name="Content-Type" content="text/plain; charset=Shift_JIS"/>
<title> </title>
</head>
<body>
<p>全文検索の技術調査について2&#13;
&#13;
&#13;
 ■Apache Solr&#13;
   ライセンス:Apache Lisence 2.0&#13;
  開発環境:Java (サードパーティー提供の環境は JavaScript、Ruby 等多数)&#13;
&#13;
    http://lucene.apache.org/solr/&#13;
&#13;
 ■Elasticsearch&#13;
   ライセンス:Apache Lisence 2.0&#13;
  開発環境:Java、JavaScript、.Net Framework、Ruby 等多数&#13;
&#13;
    https://www.elastic.co/jp/products/elasticsearch&#13;
&#13;
</p>
</body>
</html>

On Mon, May 8, 2023 at 4:05 AM Medea Springmeier <medea.springme...@raytion.com> wrote:
>
> Hi,
>
> thanks for the hint. I tested the new version of Tika (2.7.0), but I cannot see any difference (the detection of the shift_jis file does not work).
> Did you test it as a server? I must use Tika as a server and with a POST request.
>
>
>
> -----Original Message-----
> From: Tim Allison <talli...@apache.org>
> Sent: Monday, May 1, 2023 16:47
> To: user@tika.apache.org
> Subject: Re: post request with shift_jis encoding and filename hint
>
> In Tika 2.7.0, we migrated to a living fork of the Universal Charset Detector (TIKA-3213).  I just tried the main branch's detection of the file attached to TIKA-2473, and the detection now works for that file.
>
> I completely understand the problems you're having and appreciate your attempted workarounds, but you're right, Tika should _just work_.  So, give 2.7.0 a try.
>
> That said, charset detection is not always perfect, and charset detection on short files is notoriously challenging.
>
> On Fri, Apr 28, 2023 at 11:43 AM Medea Springmeier <medea.springme...@raytion.com> wrote:
> >
> > Hi,
> >
> > I want Tika to detect a text file that uses Shift_JIS as its character encoding.
> >
> > I found this one here:
> >
> > https://github.com/dadoonet/fscrawler/issues/400
> >
> > (and there is also a ticket for the problem in the Jira of Tika:
> > https://issues.apache.org/jira/browse/TIKA-2437)
> >
> >
> >
> > So I also put the filename in my request to give Tika a hint. When I make a PUT request, everything is fine (I get back "Content-Type": "text/plain; charset=Shift_JIS" and also the Shift_JIS text I want). But when I make a POST request, I cannot add a Content-Disposition header in the POST body without also adding a Content-Type header (I use Java and the MultipartEntityBuilder for my request to Tika Server 2.6.0). However, when I add a Content-Type header, Tika uses it for its detection, even when it is set to a wildcard. So all I get in this situation is "Content-Type": "application/octet-stream", without any detected text, and the information that Tika used the EmptyParser.
> >
> >
> >
> > I don't want to add "Content-Type": "text/plain" to the request (this would work) because I do not have only text files. And I do not want to guess the Content-Type from the filename myself. I would expect Tika to be able to do that.
> >
> >
> >
> > I want to use Tika with POST requests. Is there any way to use it this way and still detect Shift_JIS-encoded text files?
> >
> > Or is there a way to tell Tika to use only MIME magic and the filename, but not the Content-Type, for guessing the MIME type?
> >
> >
> >
> >
