Andrea Cosentino created CAMEL-23457:
----------------------------------------
Summary: camel-docling: OCR fails to detect footer regions in
scanned images
Key: CAMEL-23457
URL: https://issues.apache.org/jira/browse/CAMEL-23457
Project: Camel
Issue Type: Bug
Components: camel-docling
Reporter: Andrea Cosentino
When OCR is enabled and applied to a scanned image with a clearly visible
footer, the OCR result does not include the footer text. This is captured as an
open issue in {{OcrExtractionIT.java}} (around line 181): the test contains a
TODO noting "footer is not found by the ocr by Camel docling".
h3. Reproduction
# Send a scanned PDF or image with a known footer (e.g., page number, copyright
line) to a docling endpoint with {{enableOCR=true}}
# Inspect the extracted text and check whether the footer text appears
h3. Expected behavior
The footer text is present in the OCR output, possibly with positional/layout
information when {{includeLayoutInfo=true}}.
h3. Actual behavior
Footer text is missing from the OCR output. The TODO in
{{OcrExtractionIT.java}} acknowledges this gap.
h3. Investigation hints
* Determine whether the issue is in docling's OCR pipeline (e.g., region detection cuts
off the page bottom) or in how camel-docling configures the OCR call
* Check whether different {{ocrEngine}} values change the result
* Check whether {{forceOcr=true}} or {{doOcr=true}} produces a different outcome
* Confirm against the latest docling-serve / docling CLI version
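The option checks above could be run as a small matrix. The sketch below only builds one endpoint URI string per combination; it does not call Camel. The operation name {{CONVERT_TO_MARKDOWN}}, the engine names, and the class {{OcrVariantMatrix}} are assumptions for illustration and should be checked against the component's actual options. If the component expects {{doOcr}} rather than {{forceOcr}}, swap the parameter name.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrVariantMatrix {

    // Assumption: engine names as used by docling itself; confirm which
    // values camel-docling accepts for ocrEngine before relying on them.
    static final List<String> ENGINES = List.of("easyocr", "tesseract", "rapidocr");

    // Builds one endpoint URI per (engine, forceOcr) combination,
    // always with enableOCR=true as in the reproduction steps.
    static List<String> endpointUris(String operation) {
        List<String> uris = new ArrayList<>();
        for (String engine : ENGINES) {
            for (boolean force : new boolean[] {false, true}) {
                uris.add("docling:" + operation
                        + "?enableOCR=true"
                        + "&ocrEngine=" + engine
                        + "&forceOcr=" + force);
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        // Hypothetical operation name, used only to show the URI shape.
        endpointUris("CONVERT_TO_MARKDOWN").forEach(System.out::println);
    }
}
```

Each URI can then be sent the same scanned document so that the outputs are compared on the footer text alone.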
h3. Acceptance criteria
* Footer regions are reliably included in OCR output for typical document
layouts
* The TODO in {{OcrExtractionIT.java}} is removed and the test asserts on
footer text
* If the issue turns out to be upstream-only, file an upstream issue and
document the workaround/limitation in {{docling-component.adoc}}
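For the second criterion, the footer assertion might be built on a whitespace- and case-tolerant comparison, since OCR output often differs from the source in spacing and casing. This is a minimal sketch, not the existing test code: {{FooterCheck}} and the sample strings are hypothetical, and the real assertion in {{OcrExtractionIT.java}} would use the test document's actual footer.

```java
import java.util.Locale;

public class FooterCheck {

    // Collapses whitespace runs to a single space and lowercases the text,
    // so OCR spacing and casing noise does not cause false negatives.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // True when the expected footer occurs anywhere in the OCR output
    // after both strings are normalized.
    static boolean containsFooter(String ocrOutput, String expectedFooter) {
        return normalize(ocrOutput).contains(normalize(expectedFooter));
    }

    public static void main(String[] args) {
        // Hypothetical OCR output with irregular spacing in the footer line.
        String ocrOutput = "Body text\n\nPage 1   of 3\n";
        System.out.println(containsFooter(ocrOutput, "Page 1 of 3"));
    }
}
```

A plain substring match does not verify that the text came from the bottom of the page; when {{includeLayoutInfo=true}} is available, the positional data would be a stronger check.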