Hi Claude,

> On May 12, 2017, at 10:59am, Claude Garceau <claude.garceau.vi...@gmail.com> 
> wrote:
> 
> 
> Thank you Ken, really useful reply... I guess that a high false negative rate 
> (silence) does much more harm than a high false positive rate (noise).
> 
> I would say that more than 90% of the targeted documents are in French 
> although they might have some paragraphs in English but they are not 
> half-half French-English within the same document. And most of them have more 
> than 2 pages, so I guess (tell me if I am wrong) they have enough characters 
> for the detector to operate with fair precision?

Yes, that’s more than enough. An example of short text would be tweets (e.g. < 
100 bytes).

> Another question... can we apply Tika's French detector and English 
> detector at the same time to the same document, so that it is parsed with 
> both detectors on?

There’s only one detector, and it returns the “best” language. We currently 
don’t support paragraph-by-paragraph detection, though that would be very cool. 
The main problem is that we’d have to buffer up text before emitting it, so that 
we could send out the <p> element with the “lang” = <whatever> attribute before 
emitting the text.

If that’s important, though, it wouldn’t be hard to create your own version of 
the BodyHandler that does this.
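
As a rough illustration of the buffering idea, here is a minimal sketch of a SAX handler that holds each paragraph's text until the closing tag, then emits the <p> element with a "lang" attribute. It is not Tika code: the class name ParagraphLangHandler is hypothetical, and the detector is just a caller-supplied function from text to language code (it could presumably be backed by Tika's language detector, but that wiring is an assumption). It uses only the org.xml.sax classes from the JDK, forwards only element and character events, and drops inline markup nested inside a paragraph.

```java
import java.util.function.Function;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator: buffers each <p> and re-emits it with a lang attribute. */
class ParagraphLangHandler extends DefaultHandler {
    private final ContentHandler downstream;
    private final Function<String, String> detector; // text -> language code (assumed)
    private final StringBuilder buffer = new StringBuilder();
    private AttributesImpl pAttrs; // non-null while a <p> is being buffered
    private String pUri;
    private String pQName;
    private int nested;            // depth of elements nested inside the buffered <p>

    ParagraphLangHandler(ContentHandler downstream, Function<String, String> detector) {
        this.downstream = downstream;
        this.detector = detector;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (pAttrs != null) {      // sketch only: inline markup inside <p> is dropped
            nested++;
            return;
        }
        if ("p".equals(local)) {   // hold back the start tag until the language is known
            pAttrs = new AttributesImpl(atts);
            pUri = uri;
            pQName = qName;
            buffer.setLength(0);
            return;
        }
        downstream.startElement(uri, local, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (pAttrs != null) {
            buffer.append(ch, start, length);
        } else {
            downstream.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (pAttrs == null) {
            downstream.endElement(uri, local, qName);
            return;
        }
        if (nested > 0) {
            nested--;
            return;
        }
        // The paragraph is complete: detect, then emit start tag, text, end tag.
        String text = buffer.toString();
        pAttrs.addAttribute("", "lang", "lang", "CDATA", detector.apply(text));
        downstream.startElement(pUri, "p", pQName, pAttrs);
        downstream.characters(text.toCharArray(), 0, text.length());
        downstream.endElement(uri, local, qName);
        pAttrs = null;
    }
}
```

The cost is memory: nothing from a paragraph reaches the downstream handler until its closing tag arrives, which is the buffering trade-off mentioned above.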

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


