Hi Ken,
Yes, exactly. I don’t care about single sentences, because those could give wrong results, but if it’s, let’s say, more than maybe 5 sentences — so basically something whose language can be detected without a problem — then it should show up in the results. This could probably be done by ignoring every language that’s less than … %, but for that to work I’d actually need the full list first.

Splitting into pages won’t work, because the detector already drops the second language if it’s just one sentence less: a page with 3 paragraphs in English and 1 paragraph in e.g. Greek always results in "en (0.999xxxx)". I don’t know how it calculates the probabilities, but with that example I’d expect the result:

en (0.75)
el (0.25)

In my 400-page example the French part is about 4 pages (= one chapter), so 1% of the full text, and I’m a bit confused why that isn’t mentioned in the result.

Splitting into paragraphs would probably be done with PDFBox, right? Looking at old questions on Stack Exchange it seems to be semi-easy to do (basically a split on "\n"?), but there’s no guarantee that it’ll actually find all paragraphs.

In the end I only care about the languages; the probabilities I’d only use to decide whether a specific language is even worth mentioning, should detection return more than one for longer text samples.

From: Ken Krugler <[email protected]>
Sent: Monday, February 1, 2021 17:29
To: [email protected]
Subject: Re: Detecting multiple languages in a long text

Hi Julia,

So the goal is to have detection results show some non-zero probability for the other languages, right?

In general, doing this for long runs of text is almost impossible using probabilistic models. What you need to do is break the text up into smaller units (by page, or even better by paragraph, for example) and then run detection separately on each chunk of text. Then, based on those results, you can decide how you want to report the actual content, which isn’t straightforward. E.g. what if only one paragraph (out of many) had a 10% chance of being Greek, because it contained one sentence in Greek, but everything else was English? Would you want to report the total document as English, or English with some Greek, or something else?

Regards,

— Ken

On Feb 1, 2021, at 5:39 AM, Julia Ruzicka <[email protected]> wrote:

Hello everyone!

I’m using Tika 1.25 to detect the language of a long text that I read from a PDF (using PDFBox 2.0.22):

LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();
List<LanguageResult> languages = detector.detectAll(text);

The text is about 400 pages long, and most of it is in English, with a couple of pages in French, a few paragraphs in Greek, and a couple of Arabic and German sentences. I know that language detection needs a longish text sample to work, so I’m fine with the short Arabic/German sentences not being detected.

Running the code above with just a short sample in French or Greek, the detector finds the right language, but if I use the whole text as input, the result is:

en (0.9999969) = English with a 99.99969% probability

It doesn’t list the other languages. If I give the detector a mixed sample, it only detects both languages if they’re about the same amount of text. If one part in e.g. French is 5 lines of text (~65 words) and the second in e.g. Greek is 7 lines of text (~80 words), the result is:

el (0.99999815) = Greek

With 55 words in French and 45 words in Greek the result is:

fr (0.5714264)
el (0.4285709)

I also tried the alternative way:

detector.setMixedLanguages(true);
detector.addText(text);
List<LanguageResult> languages = detector.detectAll();

This also lists only a single language with the full text and with my first French–Greek text sample.

How do I get the other languages (in my case: French & Greek) as a result too?
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
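For reference, the chunk-and-aggregate approach discussed in this thread could be sketched roughly as below. Note the assumptions: the blank-line paragraph split, the minimum chunk length, and the character-count weighting are all arbitrary choices for illustration, not Tika features, and `LanguageFractions`/`byParagraph` are hypothetical names. The detector is passed in as a plain function so the Tika call stays at the edge.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class LanguageFractions {

    /**
     * Splits text on blank lines, classifies each chunk with the supplied
     * detector function, and returns the fraction of total characters
     * attributed to each language. Chunks shorter than minChars are
     * skipped, since very short samples give unreliable detection.
     */
    public static Map<String, Double> byParagraph(String text,
                                                  Function<String, String> detectLanguage,
                                                  int minChars) {
        Map<String, Long> charsPerLang = new HashMap<>();
        long total = 0;
        // Naive paragraph split on blank lines; PDF text extraction
        // gives no guarantee that paragraph boundaries survive.
        for (String chunk : text.split("\\n\\s*\\n")) {
            String trimmed = chunk.trim();
            if (trimmed.length() < minChars) {
                continue;
            }
            charsPerLang.merge(detectLanguage.apply(trimmed),
                    (long) trimmed.length(), Long::sum);
            total += trimmed.length();
        }
        // Weight each language by its share of the classified characters.
        Map<String, Double> fractions = new TreeMap<>();
        for (Map.Entry<String, Long> e : charsPerLang.entrySet()) {
            fractions.put(e.getKey(), (double) e.getValue() / total);
        }
        return fractions;
    }
}
```

With Tika this would be called as something like `byParagraph(text, s -> detector.detect(s).getLanguage(), 50)`, after which any language whose fraction falls below the chosen cut-off can be dropped from the report.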
