Andrea Cosentino created CAMEL-23457:
----------------------------------------

             Summary: camel-docling: OCR fails to detect footer regions in 
scanned images
                 Key: CAMEL-23457
                 URL: https://issues.apache.org/jira/browse/CAMEL-23457
             Project: Camel
          Issue Type: Bug
          Components: camel-docling
            Reporter: Andrea Cosentino


When OCR is enabled and applied to a scanned image with a clearly visible 
footer, the OCR result does not include the footer text. This is captured as an 
open issue in {{OcrExtractionIT.java}} (around line 181): the test contains a 
TODO noting "footer is not found by the ocr by Camel docling".

h3. Reproduction
# Send a scanned PDF or image with a known footer (e.g., page number, copyright 
line) to a docling endpoint with {{enableOCR=true}}
# Inspect the extracted text
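
The steps above can be sketched as the endpoint URI a reproduction would target. This is a minimal sketch, not code from the component: the {{DoclingReproUri}} class is hypothetical, and the {{docling:convert}} scheme and path are an assumed placeholder that should be checked against {{docling-component.adoc}}. Only the {{enableOCR}} and {{includeLayoutInfo}} option names come from this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not part of camel-docling): assembles the endpoint URI
// used to reproduce the missing-footer behaviour.
final class DoclingReproUri {

    // Assumption: placeholder scheme and path; verify against the component docs.
    static final String BASE = "docling:convert";

    // Joins the options into a query string, preserving the map's iteration order.
    static String build(Map<String, Object> options) {
        if (options.isEmpty()) {
            return BASE;
        }
        String query = options.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return BASE + "?" + query;
    }

    // The configuration described in the reproduction steps of this issue.
    static String reproductionUri() {
        Map<String, Object> options = new LinkedHashMap<>();
        options.put("enableOCR", true);         // option name taken from this issue
        options.put("includeLayoutInfo", true); // option name taken from this issue
        return build(options);
    }
}
```

The resulting string could then be handed to a Camel {{ProducerTemplate}} together with the scanned document. Which body type the endpoint expects is not stated in this issue, so that part is left out.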

h3. Expected behavior
The footer text is present in the OCR output, possibly with positional/layout 
information when {{includeLayoutInfo=true}}.

h3. Actual behavior
Footer text is missing from the OCR output. The TODO in 
{{OcrExtractionIT.java}} acknowledges this gap.

h3. Investigation hints
* Verify whether the issue is in docling's OCR pipeline (region detection cuts 
off page bottom) or in how camel-docling configures the OCR call
* Check whether different {{ocrEngine}} values change the result
* Check whether {{forceOcr=true}} or {{doOcr=true}} produces a different outcome
* Confirm against the latest docling-serve / docling CLI version
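
One way to work through the hints above systematically is to enumerate the option combinations and record which of them recover the footer. The sketch below is hypothetical: {{OcrOptionMatrix}} is an illustrative name, the {{extractor}} function stands in for whatever actually runs the conversion (a Camel route, docling-serve, or the docling CLI), and the engine names passed in are assumptions until checked against the values {{ocrEngine}} accepts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical investigation helper; not part of camel-docling.
final class OcrOptionMatrix {

    // Builds every combination of OCR engine and forceOcr flag, in a stable order:
    // for each engine, forceOcr=false first, then forceOcr=true.
    static List<Map<String, String>> combinations(List<String> engines) {
        List<Map<String, String>> out = new ArrayList<>();
        for (String engine : engines) {
            for (boolean force : new boolean[] {false, true}) {
                Map<String, String> options = new LinkedHashMap<>();
                options.put("enableOCR", "true");
                options.put("ocrEngine", engine);
                options.put("forceOcr", Boolean.toString(force));
                out.add(options);
            }
        }
        return out;
    }

    // Runs the supplied extractor once per combination and records whether the
    // expected footer text appears verbatim in the returned OCR output.
    static Map<Map<String, String>, Boolean> footerFound(
            List<Map<String, String>> combos,
            Function<Map<String, String>, String> extractor,
            String footer) {
        Map<Map<String, String>, Boolean> results = new LinkedHashMap<>();
        for (Map<String, String> options : combos) {
            String text = extractor.apply(options);
            results.put(options, text != null && text.contains(footer));
        }
        return results;
    }
}
```

If no combination finds the footer, that points towards docling's region detection rather than towards how camel-docling configures the call. If only some do, the difference between those option sets narrows down the configuration problem.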

h3. Acceptance criteria
* Footer regions are reliably included in OCR output for typical document 
layouts
* The TODO in {{OcrExtractionIT.java}} is removed and the test asserts on 
footer text
* If the issue turns out to be upstream-only, file an upstream issue and 
document the workaround/limitation in {{docling-component.adoc}}
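
For the second criterion, the assertion that replaces the TODO could look roughly like the sketch below. It is not the existing test code: {{FooterAssertions}} is a hypothetical helper, and normalizing whitespace and case before comparing is an assumption about how much OCR noise the test should tolerate.

```java
import java.util.Locale;

// Hypothetical test helper; not taken from OcrExtractionIT.java.
final class FooterAssertions {

    // Collapses whitespace runs to single spaces and lowercases the text, so that
    // OCR line breaks and casing differences do not cause false negatives.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // True when the normalized OCR output contains the normalized footer text.
    static boolean containsFooter(String ocrOutput, String expectedFooter) {
        if (ocrOutput == null) {
            return false;
        }
        return normalize(ocrOutput).contains(normalize(expectedFooter));
    }

    // Fails with a descriptive message when the footer is missing.
    static void assertFooterPresent(String ocrOutput, String expectedFooter) {
        if (!containsFooter(ocrOutput, expectedFooter)) {
            throw new AssertionError(
                    "Footer text not found in OCR output: " + expectedFooter);
        }
    }
}
```

A stricter variant could require the footer to appear near the end of the output, but that depends on the output format and is not something this issue specifies.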




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
