[ 
https://issues.apache.org/jira/browse/CAMEL-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrea Cosentino reassigned CAMEL-23457:
----------------------------------------

    Assignee: Andrea Cosentino

> camel-docling: OCR fails to detect footer regions in scanned images
> -------------------------------------------------------------------
>
>                 Key: CAMEL-23457
>                 URL: https://issues.apache.org/jira/browse/CAMEL-23457
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-docling
>            Reporter: Andrea Cosentino
>            Assignee: Andrea Cosentino
>            Priority: Major
>             Fix For: 4.21.0
>
>
> When OCR is enabled and applied to a scanned image with a clearly visible 
> footer, the OCR result does not include the footer text. This is captured as 
> an open issue in {{OcrExtractionIT.java}} (around line 181): the test 
> contains a TODO noting "footer is not found by the ocr by Camel docling".
> h3. Reproduction
> # Send a scanned PDF or image with a known footer (e.g., page number, 
> copyright line) to a docling endpoint with {{enableOCR=true}}
> # Inspect the extracted text
> h3. Expected behavior
> The footer text is present in the OCR output, possibly with positional/layout 
> information when {{includeLayoutInfo=true}}.
> h3. Actual behavior
> Footer text is missing from the OCR output. The TODO in 
> {{OcrExtractionIT.java}} acknowledges this gap.
> h3. Investigation hints
> * Determine whether the issue is in docling's OCR pipeline (region detection 
> cuts off page bottom) or in how camel-docling configures the OCR call
> * Check whether different {{ocrEngine}} values change the result
> * Check whether {{forceOcr=true}} or {{doOcr=true}} produces a different 
> outcome
> * Confirm against the latest docling-serve / docling CLI version
> h3. Acceptance criteria
> * Footer regions are reliably included in OCR output for typical document 
> layouts
> * The TODO in {{OcrExtractionIT.java}} is removed and the test asserts on 
> footer text
> * If the issue turns out to be upstream-only, file an upstream issue and 
> document the workaround/limitation in {{docling-component.adoc}}
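
A minimal sketch of what the footer assertion from the acceptance criteria could look like, written as a standalone helper so that it does not depend on the Camel or docling APIs. The class name FooterCheck, the method containsFooter, and the sample strings in the usage note are hypothetical and are not taken from OcrExtractionIT.java. The sketch assumes OCR may split a footer across lines or change its letter case, so both sides are normalized before comparing.

```java
import java.util.Locale;

/** Hypothetical helper; not part of camel-docling. */
public final class FooterCheck {

    private FooterCheck() {
    }

    /** Collapses whitespace runs to single spaces, trims, and lowercases. */
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns true when the expected footer appears in the OCR output,
     * ignoring line breaks, repeated spaces, and letter case.
     */
    public static boolean containsFooter(String ocrOutput, String expectedFooter) {
        if (ocrOutput == null || expectedFooter == null || expectedFooter.isBlank()) {
            return false;
        }
        return normalize(ocrOutput).contains(normalize(expectedFooter));
    }
}
```

In a test, the extracted text would be passed as the first argument and the known footer of the fixture (for example a page number or copyright line) as the second; the assertion then fails while the footer is missing and passes once it is detected.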



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
