Re: Detecting multiple languages in a long text

Ken Krugler Mon, 01 Feb 2021 09:54:56 -0800

Hi Julia,

So the goal is to have detection results show some non-zero probability for the 
other languages, right?


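In general, doing this for long runs of text is almost impossible with 
probabilistic models: evidence for the dominant language accumulates over the 
whole input and tends to push every other language's score toward zero.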

What you need to do is break the text up into some smaller units (by page or 
even better by paragraph, for example) and then do detection separately on each 
chunk of text.
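
A minimal sketch of that chunking step, under some assumptions: paragraphs are 
separated by blank lines, and the detector is passed in as a plain function 
from text to language code so the example runs without Tika on the classpath. 
The class and method names here are hypothetical, not part of Tika.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ChunkedDetection {

    // Split on blank lines (assumption: that is how paragraphs are separated
    // in the extracted text; PDF extraction may need a different rule).
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String p : text.split("\\R\\s*\\R")) {
            String trimmed = p.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    // Run the detector once per paragraph and sum the paragraph lengths
    // (in characters) for each language code it returns.
    static Map<String, Integer> charsByLanguage(String text,
                                                Function<String, String> detect) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : splitParagraphs(text)) {
            counts.merge(detect.apply(p), p.length(), Integer::sum);
        }
        return counts;
    }
}
```

With the detector from the original question, the function argument could 
probably be something like p -> detector.detect(p).getLanguage(), but check 
that against the Tika version you are using. Very short paragraphs will still 
give unreliable results, so merging tiny chunks with their neighbors may help.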

Then, based on those results, you can decide how you want to report the 
document's actual language mix, which isn’t straightforward.

E.g. what if only one paragraph (out of many) had a 10% chance of being Greek, 
because it contained one sentence in Greek, but everything else was English? 
Would you want to report the total document as English, or English with some 
Greek, or something else?
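
One possible reporting policy, sketched under assumptions: given the number of 
characters attributed to each language by per-chunk detection, keep only the 
languages whose share of the total reaches a cutoff. The cutoff value, the 
class name and the method name are all hypothetical, and whether a share-based 
rule is the right answer depends on what the report is for.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LanguageReport {

    // Return each language's share of the total character count, dropping
    // languages whose share is below minShare (a caller-chosen cutoff).
    static Map<String, Double> significantShares(Map<String, Integer> charsByLanguage,
                                                 double minShare) {
        long total = 0;
        for (int n : charsByLanguage.values()) {
            total += n;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        if (total == 0) {
            return shares; // nothing detected, nothing to report
        }
        for (Map.Entry<String, Integer> e : charsByLanguage.entrySet()) {
            double share = (double) e.getValue() / total;
            if (share >= minShare) {
                shares.put(e.getKey(), share);
            }
        }
        return shares;
    }
}
```

With a high cutoff the single Greek sentence disappears and the document is 
reported as English; with a low one it shows up as "English with some Greek". 
Neither is wrong, which is why the choice has to come from your use case.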

Regards,

— Ken


> On Feb 1, 2021, at 5:39 AM, Julia Ruzicka <[email protected]> wrote:
> 
> Hello everyone!
>  
> I’m using Tika 1.25 to detect the language of a long text that I read from a 
> PDF (using PDFBox 2.0.22):
>  
> LanguageDetector detector = new OptimaizeLangDetector();
> detector.loadModels();
> List<LanguageResult> languages = detector.detectAll(text);
>  
> The text is about 400 pages and most of it is in English, with a couple of 
> pages in French, a few paragraphs in Greek and a couple of Arabic and German 
> sentences.
> I know that language detection needs a long-ish text sample for the detection 
> to work, so I'm fine with the short Arabic/German sentences not being 
> detected. Running the code above with just a short sample in French or Greek, 
> the detector finds the right language but if I use the whole text as input, 
> the result is:
> en (0.9999969) = English with a 99.99969% probability
>  
> It doesn’t list the other languages.
>  
> If I give the detector a mixed sample, it only detects both languages if 
> they’re about the same amount of text.
> If one part in e.g. French is 5 lines of text (~65 words) and the second in 
> e.g. Greek is 7 lines of text (~80 words), the result is:
> el (0.99999815) = Greek
>  
> With 55 words in French and 45 words in Greek the result is:
> fr (0.5714264)
> el (0.4285709)
>  
> I also tried to do it the alternative way:
>  
> detector.setMixedLanguages(true);
> detector.addText(text);
> List<LanguageResult> languages = detector.detectAll();
>  
> This also only lists a single language with the full text and my first 
> French-Greek text sample.
>  
> How do I get the other languages (in my case: French & Greek) as a result too?

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
