Sorry, one more thing. If you run this file through tika-eval's metadata filter, the out-of-vocabulary statistic (an indicator of "garbage" text) will likely be quite high.
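
To make that concrete, here is a rough sketch of applying the filter programmatically. Caveat: the class name (org.apache.tika.eval.core.metadata.TikaEvalMetadataFilter) and the "tika-eval:*" metadata keys are from memory, so please check them against the tika-eval-core javadocs for your Tika version before relying on this:

    import org.apache.tika.eval.core.metadata.TikaEvalMetadataFilter;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;

    public class OovCheck {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            // The filter works on extracted text stored on the Metadata object,
            // the way the RecursiveParserWrapper/"/rmeta" output looks.
            metadata.set(TikaCoreProperties.TIKA_CONTENT, "the extracted page text ...");

            new TikaEvalMetadataFilter().filter(metadata);

            // Print whatever tika-eval added, e.g. an out-of-vocabulary
            // percentage, token counts and a detected language.
            for (String name : metadata.names()) {
                if (name.startsWith("tika-eval")) {
                    System.out.println(name + " = " + metadata.get(name));
                }
            }
        }
    }

In production you'd normally wire the filter into tika-config rather than call it directly, but the effect on the metadata is the same.
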
On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <[email protected]> wrote:
>
> A user dm'd me with an example file that contained English and Arabic.
> The Arabic that was extracted was gibberish/mojibake. I wanted to
> archive my response on our user list.
>
> * Extracting text from PDFs is a challenge.
> * For troubleshooting, see:
>   https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
> * Text extracted by other tools is also gibberish: Foxit, pdftotext and Mac's Preview.
> * PDFBox logs warnings about missing unicode mappings.
> * Tika reports that there are a bunch of unicode mappings missing per page.
>   The point of this is that integrators might choose to run OCR on pages with
>   high counts of missing unicode mappings. From the metadata:
>   "pdf:charsPerPage":["1224","662"]
>   "pdf:unmappedUnicodeCharsPerPage":["620","249"]
>
> Finally, if you want a medium dive on some of the things that can go wrong
> with text extraction in PDFs:
> https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
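
If it helps, here is a rough sketch of that "run OCR on pages with high counts of missing unicode mappings" idea, reading the two per-page counters above after a normal Tika parse. The 10% threshold is just an arbitrary example, not a recommendation:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class UnmappedUnicodeCheck {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
                // -1 = no write limit on the extracted text
                new AutoDetectParser().parse(is, new BodyContentHandler(-1),
                        metadata, new ParseContext());
            }
            // Per-page counters written by the PDF parser, as in the example above
            String[] chars = metadata.getValues("pdf:charsPerPage");
            String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
            for (int page = 0; page < chars.length; page++) {
                double total = Double.parseDouble(chars[page]);
                double missing = Double.parseDouble(unmapped[page]);
                double ratio = total > 0 ? missing / total : 0.0;
                // e.g. the example file's pages: 620/1224 and 249/662
                if (ratio > 0.1) {
                    System.out.printf("page %d: %.0f%% unmapped -- OCR candidate%n",
                            page + 1, ratio * 100);
                }
            }
        }
    }
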
