I've had reasonable success with ICU's charset detection, but it's the only library I've tried, so I can't compare it to any others.

--Thilo

On 12/9/2009 17:10, [email protected] wrote:
> Yeah, there are many indefinites with regards to charset detection and
> there is no 100% accurate method of interpreting the charset. Its more art
> than science. That said, I will hunt around for a decent library too.
>
>> Hi Antoni,
>>
>> I tried many charset detection libraries while working on Nutch but none
>> of them was really working.
>> I also tried to take a look at the mozilla charset detector, but it was
>> really too complicated to integrate into Nutch (or Tika).
>>
>> Best regards
>>
>> Jérôme
>>
>> 2009/12/9 Antoni Mylka <[email protected]>
>>
>>> Aperturians, Tika
>>>
>>> I was wondering if anyone has any experience with the jchardet library
>>> for charset detection. Does it work? What kinds of documents does it
>>> actually support.
>>>
>>> Christiaan has posted an idea to the Aperture tracker how we could use
>>> jchardet to improve the plain text extractor, but it doesn't seem to
>>> work. Or maybe the Tika guys have figured it out already and I can just
>>> use Tika for this? :)
>>>
>>> Antoni Mylka
>>> [email protected]
>>
>> --
>> Jérôme Charron
>> Directeur Technique @ WebPulse
>> Tel: +33675742890 <= ** NEW **
>> eMail : [email protected]
>> http://www.webpulse.fr/
>> http://www.shopreflex.com/
>> http://www.staragora.com/
