Tim Allison created TIKA-4256:
---------------------------------
Summary: Allow inlining of ocr'd text in container document
Key: TIKA-4256
URL: https://issues.apache.org/jira/browse/TIKA-4256
Project: Tika
Issue Type: Task
Reporter: Tim Allison
In legacy Tika, we inline all content from embedded files, including the OCR
content of embedded images.
However, with the RecursiveParserWrapper (the /rmeta endpoint and the -J
option), users have to stitch the OCR text of inline images back into the
container file's content themselves.
For example, if a docx has an image in it and Tesseract is invoked, the
structure will notionally be:
[
{ "type":"docx", "content": "main content of the file"},
{ "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]
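To make the stitching burden concrete, here is a minimal sketch of what a client has to do today with that notional structure. The keys "type", "content" and "embeddedType" are the simplified ones from the example above, not the actual Tika metadata keys, and the helper name stitch_inline_text is hypothetical. It also assumes the first record is the container, as in the example.

```python
def stitch_inline_text(records):
    """Append the text of every INLINE embedded record to the container's text.

    Assumes records[0] is the container document (as in the notional example)
    and uses the simplified keys from this issue, not real Tika metadata keys.
    """
    container, *embedded = records
    parts = [container.get("content", "")]
    for rec in embedded:
        # Only inline objects (e.g. images in the body) belong in the parent text.
        if rec.get("embeddedType") == "INLINE" and rec.get("content"):
            parts.append(rec["content"])
    return "\n".join(parts)


records = [
    {"type": "docx", "content": "main content of the file"},
    {"type": "jpeg", "content": "ocr'd content", "embeddedType": "INLINE"},
]
print(stitch_inline_text(records))
```

Note that this simple concatenation loses the position of the image within the parent document, which is part of why doing the inlining inside Tika would be preferable.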
It would be useful to allow an option to inline the extracted text in the
parent document. I think we want to keep the embedded inline object so that we
don't lose metadata from it. So I propose this kind of output:
[
{ "type":"docx", "content": "<body>main content of the file <div type=\"ocr\">ocr'd content</div></body>"},
{ "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]
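As a rough illustration of the proposed output (not how it would be implemented inside Tika), the following sketch transforms the notional current structure into the proposed one. The function name inline_ocr is hypothetical, the keys are the simplified ones used in this issue, and the first record is assumed to be the container. The embedded record is kept unchanged so its metadata is not lost.

```python
from html import escape


def inline_ocr(records):
    """Return a new list in which the container's content also holds the
    OCR text of INLINE embedded records, each wrapped in <div type="ocr">.

    Assumes records[0] is the container; keys are the notional ones
    from this issue, not real Tika metadata keys.
    """
    container, *embedded = records
    pieces = [escape(container.get("content", ""), quote=False)]
    for rec in embedded:
        if rec.get("embeddedType") == "INLINE" and rec.get("content"):
            text = escape(rec["content"], quote=False)
            pieces.append(f'<div type="ocr">{text}</div>')
    body = "<body>" + " ".join(pieces) + "</body>"
    # Copy the container instead of mutating it; embedded records are kept
    # as they are so that their metadata survives.
    return [{**container, "content": body}, *embedded]


records = [
    {"type": "docx", "content": "main content of the file"},
    {"type": "jpeg", "content": "ocr'd content", "embeddedType": "INLINE"},
]
result = inline_ocr(records)
print(result[0]["content"])
```

In a real implementation the div would presumably be emitted at the image's position in the parent's XHTML rather than appended at the end; appending is only a simplification for this sketch.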
This will make search more intuitive for users who are not file forensics
specialists, and it will be more consistent with what we do for PDFs when
rendering a page and then running OCR on it is configured.