I've had reasonable success with ICU's charset
detection, but that's the only one I've tried, so
I can't compare it to any others.

--Thilo

On 12/9/2009 17:10, [email protected] wrote:
> Yeah, there are many uncertainties with regard to charset detection, and
> there is no 100% accurate method of interpreting the charset. It's more art
> than science. That said, I will hunt around for a decent library too.
> 
>> Hi Antoni,
>>
>> I tried many charset detection libraries while working on Nutch, but none
>> of them really worked.
>> I also took a look at the Mozilla charset detector, but it was really too
>> complicated to integrate into Nutch (or Tika).
>>
>> Best regards
>>
>> Jérôme
>>
>> 2009/12/9 Antoni Mylka <[email protected]>
>>
>>> Aperturians, Tika
>>>
>>> I was wondering if anyone has any experience with the jchardet library
>>> for charset detection. Does it work? What kinds of documents does it
>>> actually support?
>>>
>>> Christiaan has posted an idea to the Aperture tracker about how we could
>>> use jchardet to improve the plain text extractor, but it doesn't seem to
>>> work. Or maybe the Tika guys have figured it out already and I can just
>>> use Tika for this? :)
>>>
>>> Antoni Mylka
>>> [email protected]
>>>
>>
>> --
>> Jérôme Charron
>> Directeur Technique @ WebPulse
>> Tel: +33675742890 <= ** NEW **
>> eMail : [email protected]
>> http://www.webpulse.fr/
>> http://www.shopreflex.com/
>> http://www.staragora.com/
>> _______________________________________________
>> Aperture-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/aperture-devel
>>
