Hi Claude,

> On May 12, 2017, at 10:59am, Claude Garceau <claude.garceau.vi...@gmail.com> wrote:
>
> Thank you Ken, really useful reply... I guess that a high false negative rate
> (silence) does much more harm than a high false positive rate (noise).
>
> I would say that more than 90% of the targeted documents are in French,
> although they might have some paragraphs in English, but they are not
> half-half French-English within the same document. And most of them have more
> than 2 pages, so I guess (you can tell me if not) there are enough characters
> for the detector to operate with fair enough precision?

Yes, that’s more than enough. An example of short text would be tweets (e.g. < 100 bytes).

> Another question... can we assign, at the same time, Tika's French
> Detector and the English Detector to the same document being parsed, so it can
> be parsed with both detectors on?

There’s only one detector, and it returns the “best” language.

We currently don’t support paragraph-by-paragraph detection, though that would be very cool. The main problem is that we’d have to buffer up text before emitting it, so that we could send out the <p> element with the “lang” = <whatever> attribute before emitting the text.

If that’s important, though, it wouldn’t be hard to create your own version of the BodyHandler that does this.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
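
P.S. To make the buffering idea concrete, here is a minimal sketch in Python using the standard xml.sax module. It is not Tika code and does not use Tika's API: ParagraphLangHandler, toy_detect and annotate are hypothetical names, and toy_detect is a crude stopword counter standing in for a real language detector. The handler holds back every SAX event inside a <p>, runs detection on the buffered text when the paragraph closes, and only then replays the events downstream with a "lang" attribute on the opening tag. A Tika version would presumably apply the same pattern in a custom Java content handler.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Hypothetical stand-in for a real detector: counts a few common stopwords.
STOPWORDS = {
    "fr": {"le", "la", "les", "et", "est", "une", "des"},
    "en": {"the", "and", "is", "of", "a", "to", "in"},
}


def toy_detect(text):
    """Return the language whose stopwords occur most often (ties go to 'fr')."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)


class ParagraphLangHandler(ContentHandler):
    """Buffers each <p> element, then emits it with a detected lang attribute."""

    def __init__(self, downstream, detect):
        super().__init__()
        self.downstream = downstream
        self.detect = detect
        self.depth = 0    # nesting depth inside the current <p>; 0 means outside
        self.events = []  # buffered SAX events for the current paragraph
        self.text = []    # buffered character data, used for detection

    def startDocument(self):
        self.downstream.startDocument()

    def endDocument(self):
        self.downstream.endDocument()

    def startElement(self, name, attrs):
        if self.depth == 0 and name != "p":
            self.downstream.startElement(name, attrs)  # outside <p>: pass through
            return
        self.depth += 1
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        if self.depth == 0:
            self.downstream.characters(content)
            return
        self.events.append(("chars", content))
        self.text.append(content)

    def endElement(self, name):
        if self.depth == 0:
            self.downstream.endElement(name)
            return
        self.events.append(("end", name))
        self.depth -= 1
        if self.depth == 0:
            self._flush()

    def _flush(self):
        # The whole paragraph is now known, so the lang attribute can be set
        # on the buffered opening <p> before anything is emitted.
        self.events[0][2]["lang"] = self.detect("".join(self.text))
        for event in self.events:
            if event[0] == "start":
                self.downstream.startElement(event[1], AttributesImpl(event[2]))
            elif event[0] == "end":
                self.downstream.endElement(event[1])
            else:
                self.downstream.characters(event[1])
        self.events, self.text = [], []


def annotate(xml_text, detect=toy_detect):
    """Parse xml_text and return it with lang attributes added to each <p>."""
    out = io.StringIO()
    handler = ParagraphLangHandler(XMLGenerator(out, encoding="utf-8"), detect)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return out.getvalue()
```

Calling annotate() on a small document with one French and one English paragraph should tag each <p> separately. The cost is the one mentioned above: nothing inside a paragraph reaches the downstream handler until that paragraph has closed.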