Re: Detecting Encoding with plugins

Ken Krugler Tue, 14 Feb 2012 16:10:36 -0800

On Feb 14, 2012, at 2:34pm, Lewis John Mcgibbney wrote:

> It's in HTMLParser#private static String sniffCharacterEncoding
> 
> I'm still wondering where TikaParser gets the character encoding from though?


FYI, the individual Tika parsers have their own detection logic.

The HTML parser, for example, uses the HTTP response headers and HTML meta 
tags in addition to ICU's statistical charset detection.

That's something I'm still working on cleaning up, but I haven't made much 
progress in the past few months.
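
To make the layering concrete, here is a rough, self-contained sketch of that 
kind of detection chain. It is not the actual Tika or Nutch code: the class 
and method names (EncodingSniffer, fromHeader, fromMeta, statisticalGuess) are 
made up for illustration, the precedence order (header, then meta tag, then 
statistical guess, then a caller-supplied default) is an assumption, and the 
statistical step is only a placeholder where an ICU-style detector would go.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Tika or Nutch implementation.
final class EncodingSniffer {
    // Matches "charset=..." in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG = Pattern.compile(
        "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Assumed limit: only look for meta tags near the top of the document.
    private static final int SNIFF_LIMIT = 8192;

    // True if the JVM knows this charset name; illegal names count as unknown.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Step 1: charset parameter of the HTTP Content-Type header, if any.
    static String fromHeader(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return (m.find() && isSupported(m.group(1))) ? m.group(1) : null;
    }

    // Step 2: charset declared in an HTML <meta> tag near the start.
    static String fromMeta(byte[] content) {
        int len = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher m = CHARSET.matcher(tag.group());
            if (m.find() && isSupported(m.group(1))) return m.group(1);
        }
        return null;
    }

    // Step 3: placeholder for a statistical detector (e.g. an ICU-style one).
    // Returns null here, meaning "no guess".
    static String statisticalGuess(byte[] content) {
        return null;
    }

    // Tries each source in the assumed order, then uses the caller's default.
    static String detect(String contentType, byte[] content, String fallback) {
        String enc = fromHeader(contentType);
        if (enc == null) enc = fromMeta(content);
        if (enc == null) enc = statisticalGuess(content);
        return enc != null ? enc : fallback;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
            .getBytes(StandardCharsets.US_ASCII);
        // The header carries no charset here, so the meta tag is consulted.
        System.out.println(detect("text/html", page, "windows-1252"));
    }
}
```

A real detector would also have to decide what to do when the header and the 
meta tag disagree; the sketch simply trusts whichever source answers first.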

-- Ken

> Additionally, this doesn't look like something we check for in our JUnit 
> classes? If we don't then I would like to write some tests to test for this.
> 
> I am working on Any23 tests first, so this provides the justification behind 
> my question.
> 
> Thanks
> 
> Lewis
> 
> On Tue, Feb 14, 2012 at 10:00 PM, Lewis John Mcgibbney 
> <[email protected]> wrote:
> Hi,
> 
> I can't see anywhere within our parser plugins where we detect encoding of 
> documents. I've also begun looking through the o.a.n.p package but again I 
> can't see anything.
> 
> Can anyone provide some detail on this please?
> 
> Thank you
> 
> Lewis 
> 
> 
> 
> -- 
> Lewis 
> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
