Hey Lewis

Thanks for the reply.

I think the charset stuff is related to https://issues.apache.org/jira/browse/TIKA-431

Regarding your single-argument point, I'm certainly not a Tika expert and was just updating some of the OODT code based on what was recommended there. I shall investigate an alternative way of doing it, although I still think this should be looked at a bit more if it is indeed a bug.

I've written a small project to show what's going on:

https://github.com/buggtb/tikatest

Run the target/main-1.0-SNAPSHOT-jar-with-dependencies.jar after compiling that and you'll see the output.

bugg@bugg-pc:~/Projects/tikatest$ java -jar target/main-1.0-SNAPSHOT-jar-with-dependencies.jar
Detected Type is: text/html; charset=UTF-8
Registered Mime Type is :null
If I pass text/html :text/html
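
In the meantime, one possible workaround on the calling side is to strip the parameters from the detected string before doing the lookup. This is only a minimal sketch in plain Java: MediaTypeStrings and stripParameters are hypothetical names of my own, not Tika API, and it assumes that everything after the first ';' is a parameter such as charset. The stripped string is what would then be passed to getRegisteredMimeType.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: reduces "type/subtype; param=value"
// to the bare "type/subtype" so it can be used as a registry lookup key.
final class MediaTypeStrings {

    static String stripParameters(String mediaType) {
        if (mediaType == null) {
            return null;
        }
        // Assumption: parameters (e.g. charset) start at the first ';'.
        int semicolon = mediaType.indexOf(';');
        String base = semicolon >= 0 ? mediaType.substring(0, semicolon) : mediaType;
        // Media type names are case-insensitive, so normalise to lower case.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String detected = "text/html; charset=UTF-8";
        System.out.println("Detected type: " + detected);
        System.out.println("Lookup key:    " + stripParameters(detected));
    }
}
```

That only papers over the problem in the caller, of course; it doesn't answer whether the lookup itself should be normalising the string.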


Tom

On 08/09/14 19:17, Lewis John Mcgibbney wrote:
Hi Tom,

On Sun, Sep 7, 2014 at 1:28 PM, <dev-digest-h...@tika.apache.org> wrote:

now when parsing HTML files these days Tika adds the charset attribute to
the string.

Is this behaviour consistent with other MimeTypes?


I would have thought the normalize call was designed to remove this
because tika-mimetypes.xml surely isn't supposed to contain charset
matching tags?

That was also my own understanding.


Anyway if you do

Tika.detect(myurl)

This is a tricky one for me. If at all possible, I always try not to use
single argument methods within the Tika interface. I've found that the
detection algorithms work best when more information is provided as
parameters, e.g. using a Metadata object which has been populated with more
comprehensive values. I know that this doesn't help you much but if it is
at all possible then I would suggest you try the same.


followed by

MimeTypes.getRegisteredMimeType("text/html; charset=UTF-8");

It returns null because it doesn't strip the charset; without it, it's fine.

Bug/Feature/Misunderstanding?

I would say the former; however, I can't reproduce this right now.
Can you possibly provide some output from the method calls so that we can
see exactly what is returned?
Thanks
Lewis



--
*Tom Barber* | Technical Director

meteorite bi
*T:* +44 20 8133 3730
*W:* www.meteorite.bi | *Skype:* meteorite.consulting
*A:* Surrey Technology Centre, Surrey Research Park, Guildford, GU2 7YG, UK
