On Feb 14, 2012, at 2:34pm, Lewis John Mcgibbney wrote:

> It's in HTMLParser#sniffCharacterEncoding (a private static method returning String).
> 
> I'm still wondering where TikaParser gets the character encoding from though?

FYI, the individual Tika parsers have their own detection logic.

The HTML parser, for example, uses the response headers and metadata tags in 
addition to ICU's statistical method.

That's something I'm still working on cleaning up, but I haven't made much 
progress in the past few months.
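
To make the layering concrete, here is a minimal sketch of that kind of fallback chain. It is not the Tika or Nutch code: the class name EncodingSniffSketch, the order of the steps, the 8192-byte sniffing window, and the UTF-8 default are all assumptions for illustration. The statistical step is only a placeholder where something like ICU4J's CharsetDetector would run.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of layered charset detection; not the Tika or Nutch implementation.
class EncodingSniffSketch {

    // Matches "charset=..." in a Content-Type header value or inside a <meta> tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches a whole <meta ...> tag so charset is only read from inside one.
    private static final Pattern META =
        Pattern.compile("<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);

    // Assumed size of the prefix that is scanned for <meta> tags.
    private static final int SNIFF_BYTES = 8192;

    static String detect(String contentTypeHeader, byte[] body) {
        // Step 1: charset declared in the HTTP Content-Type response header.
        String fromHeader = firstSupportedCharset(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }

        // Step 2: charset declared in a <meta> tag near the start of the document.
        // Decoding as ISO-8859-1 maps every byte to one char, so ASCII markup survives.
        int len = Math.min(body.length, SNIFF_BYTES);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(head);
        while (meta.find()) {
            String fromMeta = firstSupportedCharset(meta.group());
            if (fromMeta != null) {
                return fromMeta;
            }
        }

        // Step 3: fall back to a guess based on the bytes themselves.
        return statisticalGuess(body);
    }

    // Returns the canonical name of the first declared charset the JVM supports, or null.
    private static String firstSupportedCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name).name();
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name; keep looking.
            }
        }
        return null;
    }

    // Placeholder only: a real implementation would run a statistical detector here.
    // This sketch simply assumes UTF-8.
    static String statisticalGuess(byte[] body) {
        return "UTF-8";
    }
}
```

For example, EncodingSniffSketch.detect("text/html; charset=ISO-8859-1", bytes) stops at the header, while a null header pushes the decision down to the meta tags and then to the placeholder guess.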

-- Ken

> Additionally, this doesn't look like something we check for in our JUnit 
> classes. If we don't, then I would like to write some tests for this.
> 
> I am working on Any23 tests first, so this provides the justification behind 
> my question.
> 
> Thanks
> 
> Lewis
> 
> On Tue, Feb 14, 2012 at 10:00 PM, Lewis John Mcgibbney 
> <lewis.mcgibb...@gmail.com> wrote:
> Hi,
> 
> I can't see anywhere within our parser plugins where we detect encoding of 
> documents. I've also begun looking through the o.a.n.p package but again I 
> can't see anything.
> 
> Can anyone provide some detail on this please?
> 
> Thank you
> 
> Lewis 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr