Sorry, one more thing.

If you use tika-eval's metadata filter, it will show that the
out-of-vocabulary statistic (an indicator of "garbage") is likely
quite high for this file.
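
In case it's useful, here's a rough, untested sketch of running that
check from Java. The filter class, package and property names are from
memory (tika-eval-core in Tika 2.x), so treat them as assumptions and
double-check against the version you're running:

    import org.apache.tika.eval.core.metadata.TikaEvalMetadataFilter;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;

    public class OovSketch {
        public static void main(String[] args) throws Exception {
            // The tika-eval filter computes its statistics from the text
            // stored in the Metadata object (as tika-server's /rmeta
            // endpoint does when it puts content in the metadata).
            Metadata metadata = new Metadata();
            metadata.set(TikaCoreProperties.TIKA_CONTENT,
                    "...extracted text goes here...");
            new TikaEvalMetadataFilter().filter(metadata);
            // The filter adds token counts, a language guess and an
            // out-of-vocabulary percentage; a high value for the
            // "tika-eval:oov" key suggests garbage text.
            System.out.println(metadata.get("tika-eval:oov"));
        }
    }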

On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <[email protected]> wrote:
>
> A user dm'd me with an example file that contained English and Arabic.
> The Arabic that was extracted was gibberish/mojibake.  I wanted to
> archive my response on our user list.
>
> * Extracting text from PDFs is a challenge.
> * For troubleshooting, see:
> https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
> * Text extracted by other tools is also gibberish: Foxit, pdftotext
> and Mac's Preview
> * PDFBox logs warnings about missing unicode mappings
> * Tika reports the number of missing unicode mappings per page.  The
> point of this is that integrators might choose to run OCR on pages
> with high counts of missing unicode mappings. From the metadata:
> "pdf:charsPerPage":["1224","662"]
> "pdf:unmappedUnicodeCharsPerPage":["620","249"]
>
> Finally, if you want a medium dive on some of the things that can go
> wrong with text extraction in PDFs:
> https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
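
To make the OCR-triage idea above concrete, here's a rough sketch
using Tika's Java API (tika-parsers needs to be on the classpath).
The 10% threshold is purely illustrative, not a recommendation:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class UnmappedUnicodeCheck {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
                // Parse the PDF; we only care about the per-page metadata
                // that the PDF parser records along the way.
                new AutoDetectParser().parse(is, new BodyContentHandler(-1),
                        metadata, new ParseContext());
            }
            String[] chars = metadata.getValues("pdf:charsPerPage");
            String[] unmapped =
                    metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
            for (int i = 0; i < chars.length; i++) {
                int total = Integer.parseInt(chars[i]);
                int missing = Integer.parseInt(unmapped[i]);
                // Flag a page for OCR if more than 10% of its characters
                // had no unicode mapping (threshold is arbitrary here).
                if (total > 0 && missing / (double) total > 0.10) {
                    System.out.println("page " + (i + 1) + " may need OCR: "
                            + missing + "/" + total + " unmapped");
                }
            }
        }
    }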
