Hi Ken,
Yes, exactly. I don’t care about single sentences, because those could give wrong results, but if it’s, let’s say, more than maybe 5 sentences — so basically something whose language can be detected without a problem — then it should show up in the results. This could probably be done by ignoring every language that’s less than … %, but for that to work I’d actually need the full list first.

Splitting into pages won’t work, because the detector already drops the second language if it’s just one sentence less: a page with 3 paragraphs in English and 1 paragraph in e.g. Greek always results in "en (0.999xxxx)". I don’t know how it calculates the probabilities, but with that example I’d expect the result:

en (0.75)
el (0.25)

In my 400-page example the French part is about 4 pages (= one chapter), so 1% of the full text, and I’m a bit confused why that isn’t mentioned in the result.

Splitting into paragraphs would probably be done with PDFBox, right? Looking at old questions on Stack Exchange it seems to be semi-easy to do (basically a split on "\n"?), but there’s no guarantee that it’ll actually find all paragraphs.

In the end I only care about the languages; the probabilities I’d only use to decide whether a specific language is even worth mentioning, should detection return more than one for longer text samples.

From: Ken Krugler <[email protected]>
Sent: Monday, February 1, 2021 17:29
To: [email protected]
Subject: Re: Detecting multiple languages in a long text

Hi Julia,

So the goal is to have detection results show some non-zero probability for the other languages, right?

In general, doing this for long runs of text is almost impossible using probabilistic models. What you need to do is break the text up into smaller units (by page, or even better by paragraph, for example) and then run detection separately on each chunk of text. Then, based on those results, you can decide how you want to report the actual content, which isn’t straightforward. E.g. what if only one paragraph (out of many) had a 10% chance of being Greek, because it contained one sentence in Greek, but everything else was English? Would you want to report the total document as English, or English with some Greek, or something else?

Regards,

— Ken

On Feb 1, 2021, at 5:39 AM, Julia Ruzicka <[email protected]> wrote:

Hello everyone!

I’m using Tika 1.25 to detect the language of a long text that I read from a PDF (using PDFBox 2.0.22):

LanguageDetector detector = new OptimaizeLangDetector();
detector.loadModels();
List<LanguageResult> languages = detector.detectAll(text);

The text is about 400 pages long, and most of it is in English, with a couple of pages in French, a few paragraphs in Greek, and a couple of Arabic and German sentences. I know that language detection needs a longish text sample to work, so I’m fine with the short Arabic/German sentences not being detected.

Running the code above with just a short sample in French or Greek, the detector finds the right language, but if I use the whole text as input, the result is:

en (0.9999969) = English with a 99.99969% probability

It doesn’t list the other languages. If I give the detector a mixed sample, it only detects both languages if they’re about the same amount of text. If one part in e.g. French is 5 lines of text (~65 words) and the second in e.g. Greek is 7 lines of text (~80 words), the result is:

el (0.99999815) = Greek

With 55 words in French and 45 words in Greek the result is:

fr (0.5714264)
el (0.4285709)

I also tried the alternative way:

detector.setMixedLanguages(true);
detector.addText(text);
List<LanguageResult> languages = detector.detectAll();

This also lists only a single language with the full text and with my first French–Greek text sample.

How do I get the other languages (in my case: French & Greek) as a result too?
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
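For reference, the chunk-and-aggregate approach discussed in this thread could be sketched roughly as below. Note the assumptions: the blank-line paragraph split, the minimum chunk length, and the character-count weighting are all arbitrary choices for illustration, not Tika features, and `LanguageFractions`/`byParagraph` are hypothetical names. The detector is passed in as a plain function so the Tika call stays at the edge.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class LanguageFractions {

    /**
     * Splits text on blank lines, classifies each chunk with the supplied
     * detector function, and returns the fraction of total characters
     * attributed to each language. Chunks shorter than minChars are
     * skipped, since very short samples give unreliable detection.
     */
    public static Map<String, Double> byParagraph(String text,
                                                  Function<String, String> detectLanguage,
                                                  int minChars) {
        Map<String, Long> charsPerLang = new HashMap<>();
        long total = 0;
        // Naive paragraph split on blank lines; PDF text extraction
        // gives no guarantee that paragraph boundaries survive.
        for (String chunk : text.split("\\n\\s*\\n")) {
            String trimmed = chunk.trim();
            if (trimmed.length() < minChars) {
                continue;
            }
            charsPerLang.merge(detectLanguage.apply(trimmed),
                    (long) trimmed.length(), Long::sum);
            total += trimmed.length();
        }
        // Weight each language by its share of the classified characters.
        Map<String, Double> fractions = new TreeMap<>();
        for (Map.Entry<String, Long> e : charsPerLang.entrySet()) {
            fractions.put(e.getKey(), (double) e.getValue() / total);
        }
        return fractions;
    }
}
```

With Tika this would be called as something like `byParagraph(text, s -> detector.detect(s).getLanguage(), 50)`, after which any language whose fraction falls below the chosen cut-off can be dropped from the report.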
