Re: Problem parsing DOCX

Tim Allison Fri, 08 Jan 2021 09:50:12 -0800

Yes. That means that somehow the OOXMLParser didn't make it onto your
classpath. I just added that docx and a (not really unit) test, and it seems
to work for me:
https://github.com/tballison/tika-2_0-client-examples/blob/master/src/test/java/TestRotation.java#L82
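
A quick way to confirm whether the parser class is visible at runtime is to try to load it by name. The following is a minimal sketch, not part of Tika; the class name ParserClasspathCheck and the method isOnClasspath are made up for illustration, and it assumes the check runs with the same classpath as the application:

```java
public class ParserClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but one of its own dependencies failed to link
            return false;
        }
    }

    public static void main(String[] args) {
        String ooxml = "org.apache.tika.parser.microsoft.ooxml.OOXMLParser";
        System.out.println(ooxml + " on classpath: " + isOnClasspath(ooxml));
    }
}
```

If this prints false, the dependency that ships the OOXML parser is probably missing from the build. If it prints true, the config is the more likely culprit.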


What does your config look like?

I'd recommend merging from main and rebuilding -- TIKA-3268 fixed a bug
that can silently prevent parsers from loading if there's a typo in a
parser-exclude's class name.
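
Until you've rebuilt, one way to rule out that kind of typo is to check that every excluded class name in the config actually resolves. This is a rough sketch using only the JDK; it assumes the config uses parser-exclude elements with a class attribute, the names ExcludeCheck and unresolvedExcludes are hypothetical, and it does not reproduce Tika's own config loading:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ExcludeCheck {

    /** Returns the class names in parser-exclude elements that cannot be loaded. */
    public static List<String> unresolvedExcludes(String configXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config XML", e);
        }
        NodeList excludes = doc.getElementsByTagName("parser-exclude");
        List<String> missing = new ArrayList<>();
        for (int i = 0; i < excludes.getLength(); i++) {
            String name = ((Element) excludes.item(i)).getAttribute("class");
            try {
                // Locate the class without initializing it
                Class.forName(name, false, ExcludeCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name); // likely a typo or a missing dependency
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative config only; the misspelled class name is deliberate
        String xml = "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParzer\"/>"
                + "</parser></parsers></properties>";
        System.out.println("Unresolved excludes: " + unresolvedExcludes(xml));
    }
}
```

Any name this reports is worth a second look, whether it turns out to be a misspelling or a class that simply isn't on the classpath.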

On Fri, Jan 8, 2021 at 11:38 AM Peter Kronenberg <[email protected]>
wrote:

> Trying to parse the attached Word file.  No matter what it does with the
> images, I would expect to at least see the extracted text.  I realize that
> the PDF options have no bearing here.  But here is all I’m getting.  Also
> note that it does not even identify it as a Word document.  Only as
> Office.  And there is hardly any other metadata.  Only X-Parsed-By and
> Content-Type
>
> I’m sure the EmptyParser is a clue.  Am I not including the correct parser?
>
>
>
>
>
> checking: [c:\Program Files (x86)\Tesseract-OCR-4.0.0\tesseract.exe]
>
> [main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract OCR
> is installed and will be automatically applied to image files unless
>
> you've excluded the TesseractOCRParser from the default parser.
>
> Tesseract may dramatically slow down content extraction (TIKA-2359).
>
> As of Tika 1.15 (and prior versions), Tesseract is automatically called.
>
> In future versions of Tika, users may need to turn the TesseractOCRParser
> on via TikaConfig.
>
> [main] INFO org.torchai.TikaOCRParser - Tesseract path: c:\Program Files
> (x86)\Tesseract-OCR-4.0.0\, exists: true
>
> [main] INFO org.torchai.TikaOCRParser - Tessdata path:  c:\Program Files
> (x86)\Tesseract-OCR-4.0.0\tessdata\, exists: true
>
> [main] INFO org.torchai.TikaOCRParser - Image Magick path: c:\Program
> Files\ImageMagick-7.0.10-Q16-HDRI\, exists: true
>
> [main] INFO org.torchai.TikaOCRParser - Python path: c:\python39\, exists:
> true
>
> [main] INFO org.torchai.TikaOCRParser - enableImageProcessing: true
>
> [main] INFO org.torchai.TikaOCRParser - apply rotation: false
>
> [main] INFO org.torchai.TikaOCRParser - PDF Extract inline images: true
>
> [main] INFO org.torchai.TikaOCRParser - PDF OCR Strategy: AUTO
>
> [main] INFO org.torchai.TikaOCRParser - PDF OCR DPI: 100
>
> [main] INFO org.torchai.TikaOCRParser - PDF Detect angles: true
>
> [main] INFO org.torchai.TikaOCRParser - calling parse on
> c:\testFiles\Skewed Dickens.docx
>
> [main] INFO org.torchai.TikaOCRParser - mimeType = application/x-tika-ooxml
>
> [main] INFO org.torchai.TikaOCRParser - X-Parsed-By:
> org.apache.tika.parser.EmptyParser
>
> [main] INFO org.torchai.TikaOCRParser - Content-Type:
> application/x-tika-ooxml
>
> Text: <html xmlns="http://www.w3.org/1999/xhtml">
>
> <head>
>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser" />
>
> <meta name="Content-Type" content="application/x-tika-ooxml" />
>
> <title></title>
>
> </head>
>
> <body /></html>
>
