Hey Lewis
Thanks for the reply.
I think the charset stuff is related to
https://issues.apache.org/jira/browse/TIKA-431
Regarding your single-argument point, I'm certainly not a Tika expert and
was just updating some of the OODT code from what was recommended on
there. I shall investigate an alternative way of doing it, although I
still think this should probably be looked at a bit more if it is indeed a bug.
I've written a small project to show what's going on:
https://github.com/buggtb/tikatest
Run the target/main-1.0-SNAPSHOT-jar-with-dependencies.jar after
compiling that and you'll see the output.
bugg@bugg-pc:~/Projects/tikatest$ java -jar target/main-1.0-SNAPSHOT-jar-with-dependencies.jar
Detected Type is: text/html; charset=UTF-8
Registered Mime Type is :null
If I pass text/html :text/html
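In the meantime, a workaround along these lines might do: strip everything
from the first ';' before handing the string to getRegisteredMimeType. This
is only a rough sketch in plain Java. MimeNames and stripParameters are
names I made up for illustration, not part of Tika, and lower-casing the
result is my own assumption. If I remember right,
MediaType.parse(...).getBaseType() does much the same job inside Tika, but
I haven't checked that against the OODT code.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: reduces a detected type string
// such as "text/html; charset=UTF-8" to its base type "text/html".
final class MimeNames {
    private MimeNames() {}

    static String stripParameters(String detected) {
        if (detected == null) {
            return null;
        }
        // Everything after the first ';' is a parameter list (charset etc.).
        int semicolon = detected.indexOf(';');
        String base = semicolon < 0 ? detected : detected.substring(0, semicolon);
        // Assumption: registered names are matched in lower case.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Prints: text/html
        System.out.println(stripParameters("text/html; charset=UTF-8"));
    }
}
```

The stripped string is the one I'd then pass to getRegisteredMimeType,
which matches the "If I pass text/html" case in the output above.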
Tom
On 08/09/14 19:17, Lewis John Mcgibbney wrote:
Hi Tom,
On Sun, Sep 7, 2014 at 1:28 PM, <dev-digest-h...@tika.apache.org> wrote:
now when parsing HTML files these days Tika adds the charset attribute to
the string.
Is this behaviour consistent with other MimeTypes?
I would have thought the normalize call was designed to remove this
because tika-mimetypes.xml surely isn't supposed to contain charset
matching tags?
That was also my own understanding.
Anyway if you do
Tika.detect(myurl)
This is a tricky one for me. If at all possible, I always try not to use
single argument methods within the Tika interface. I've found that the
detection algorithms work best when more information is provided as
parameters, e.g. using a Metadata object which has been populated with more
comprehensive values. I know that this doesn't help you much but if it is
at all possible then I would suggest you try the same.
followed by
MimeTypes.getRegisteredMimeType("text/html; charset=UTF-8");
It returns null because it doesn't strip the charset; without it, it's fine.
Bug/Feature/Misunderstanding?
I would say the former however I can't reproduce this right now.
Can you possibly provide some output from the method calls so that we can
see exactly the output?
Thanks
Lewis
--
*Tom Barber* | Technical Director
meteorite bi
*T:* +44 20 8133 3730
*W:* www.meteorite.bi | *Skype:* meteorite.consulting
*A:* Surrey Technology Centre, Surrey Research Park, Guildford, GU2 7YG, UK