Yes. AFAIK, open source (at least the ecosystem I'm working in) hasn't gotten there yet. Commercial tools might be the best option for multilingual OCR.
On Thu, Jan 26, 2023 at 4:03 PM שי ברק <[email protected]> wrote:
>
> So I guess another issue is that we can’t run OCR on a document that
> contains multiple languages…
>
> On Thu, 26 Jan 2023 at 22:58 Tim Allison <[email protected]> wrote:
>>
>> I sent offline the text extracted by tesseract when told the language
>> is "ara". The English is completely garbled. I can't evaluate the
>> quality of the Arabic.
>>
>> On Thu, Jan 26, 2023 at 3:53 PM Tim Allison <[email protected]> wrote:
>> >
>> > Ha. Cool. I was going to recommend that.
>> >
>> > This file does trigger OCR on my local dev environment. If you use
>> > the /rmeta endpoint on tika-server, you'll see something like:
>> > X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.DefaultParser
>> > X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.pdf.PDFParser
>> > X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.ocr.TesseractOCRParser
>> >
>> > There are two areas of bad news: 1) Arabic is not loaded by default in
>> > the tika-full docker container. 2) We don't have a good way of doing
>> > language detection to tell tesseract which language to apply by
>> > default.
>> >
>> > On Thu, Jan 26, 2023 at 3:28 PM שי ברק <[email protected]> wrote:
>> > >
>> > > I use the full docker image of Tika 2.6.
>> > > How can I check if I have it or not, and where am I supposed to see the
>> > > outcome of the OCR?
>> > >
>> > > On Thu, 26 Jan 2023 at 22:25 Tim Allison <[email protected]> wrote:
>> > >>
>> > >> If tesseract is installed on your system and callable as 'tesseract'
>> > >> and if you don't make any modifications via tika-config.xml, tesseract
>> > >> will be applied to images automatically and to pages of PDFs that
>> > >> a) have only a few characters (<10?) or b) have more than a handful of
>> > >> unmapped unicode characters.
>> > >>
>> > >> On Thu, Jan 26, 2023 at 3:17 PM שי ברק <[email protected]> wrote:
>> > >> >
>> > >> > Does Tika support OCR on PDFs? Is there an endpoint or header for this?
>> > >> >
>> > >> > On Thu, 26 Jan 2023 at 21:54 Tim Allison <[email protected]> wrote:
>> > >> >>
>> > >> >> Sorry, one more thing.
>> > >> >>
>> > >> >> If you use tika-eval's metadata filter, that will tell you that the
>> > >> >> out of vocabulary statistic (an indicator of "garbage") would likely
>> > >> >> be quite high for this file.
>> > >> >>
>> > >> >> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <[email protected]> wrote:
>> > >> >> >
>> > >> >> > A user DM'd me with an example file that contained English and
>> > >> >> > Arabic. The Arabic that was extracted was gibberish/mojibake.
>> > >> >> > I wanted to archive my response on our user list.
>> > >> >> >
>> > >> >> > * Extracting text from PDFs is a challenge.
>> > >> >> > * For troubleshooting, see:
>> > >> >> > https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
>> > >> >> > * Text extracted by other tools is also gibberish: Foxit, pdftotext
>> > >> >> > and Mac's Preview
>> > >> >> > * PDFBox logs warnings about missing unicode mappings
>> > >> >> > * Tika reports that there are a bunch of unicode mappings missing
>> > >> >> > per page. The point of this is that integrators might choose to
>> > >> >> > run OCR on pages with high counts of missing unicode mappings.
>> > >> >> > From the metadata:
>> > >> >> > "pdf:charsPerPage":["1224","662"]
>> > >> >> > "pdf:unmappedUnicodeCharsPerPage":["620","249"]
>> > >> >> >
>> > >> >> > Finally, if you want a medium dive on some of the things that can
>> > >> >> > go wrong with text extraction in PDFs:
>> > >> >> > https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
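
For integrators who want to act on the per-page counts mentioned in the thread, here is a minimal sketch in Python. It is not how Tika itself decides when to run OCR. The function name pages_needing_ocr and the 10% threshold are assumptions made for illustration; only the two metadata keys and their sample values come from the thread.

```python
from typing import Dict, List


def pages_needing_ocr(
    metadata: Dict[str, List[str]],
    max_unmapped_ratio: float = 0.1,  # assumed threshold, tune for your corpus
) -> List[int]:
    """Return zero-based indexes of pages whose share of unmapped
    unicode characters is above max_unmapped_ratio."""
    # Tika reports these per-page counts as lists of strings.
    chars = [int(n) for n in metadata.get("pdf:charsPerPage", [])]
    unmapped = [int(n) for n in metadata.get("pdf:unmappedUnicodeCharsPerPage", [])]

    flagged = []
    for page, (total, missing) in enumerate(zip(chars, unmapped)):
        # Pages with no characters are skipped here to avoid dividing by zero.
        if total > 0 and missing / total > max_unmapped_ratio:
            flagged.append(page)
    return flagged


# Sample values quoted in the thread.
metadata = {
    "pdf:charsPerPage": ["1224", "662"],
    "pdf:unmappedUnicodeCharsPerPage": ["620", "249"],
}
print(pages_needing_ocr(metadata))  # [0, 1]
```

With the values above, roughly 51% and 38% of the characters on the two pages are unmapped, so both pages are flagged. A ratio is used rather than a raw count so that long and short pages are treated alike; that is a design choice of this sketch, not something the thread prescribes.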
