[ 
https://issues.apache.org/jira/browse/CAMEL-23456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrea Cosentino reassigned CAMEL-23456:
----------------------------------------

    Assignee: Andrea Cosentino

> camel-docling: Document title and type metadata not extracted reliably
> ----------------------------------------------------------------------
>
>                 Key: CAMEL-23456
>                 URL: https://issues.apache.org/jira/browse/CAMEL-23456
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-docling
>            Reporter: Andrea Cosentino
>            Assignee: Andrea Cosentino
>            Priority: Major
>
> The {{EXTRACT_METADATA}} operation does not reliably populate the {{title}}
> and {{documentType}} fields on the returned {{DocumentMetadata}}. The gap is
> already flagged in {{MetadataExtractionIT.java}} (lines 74-75): the
> integration test asserts on these fields, but a TODO comment notes that the
> values are missing or incorrect.
> h3. Reproduction
> # Configure a docling-serve endpoint with {{operation=EXTRACT_METADATA}}
> # Send a document that has a clear title and a known type (e.g., a PDF with 
> explicit metadata)
> # Inspect the {{DocumentMetadata}} object returned in the body
> h3. Expected behavior
> {{title}} reflects the document's title metadata; {{documentType}} reflects 
> the document type as detected by docling.
> h3. Actual behavior
> Both fields are empty or null; the test in {{MetadataExtractionIT.java}} 
> carries a TODO acknowledging this.
> h3. Investigation hints
> * The metadata is parsed from docling's JSON output in 
> {{DoclingProducer.handleExtractMetadata()}}; verify the JSON path used to 
> read these fields against the current docling-serve schema
> * The field names or nesting may have changed in a recent docling release
> h3. Acceptance criteria
> * {{title}} and {{documentType}} are populated when present in the source 
> document
> * The TODO comments in {{MetadataExtractionIT.java}} are removed and the 
> assertions pass
> * If the upstream docling format does not expose these fields, document the 
> limitation and remove the fields from {{DocumentMetadata}} (rather than 
> silently leaving them empty)
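
One way to approach the investigation hint about verifying the JSON path is to
probe a parsed response against several candidate paths and see which one
actually yields a value. The sketch below is only an illustration under stated
assumptions: the class MetadataPathProbe, its methods, the sample payload, and
every key name in the candidate paths are hypothetical and are not taken from
camel-docling or from the docling-serve schema. It also assumes the JSON has
already been parsed into nested java.util.Map instances.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for investigation only; not part of camel-docling.
final class MetadataPathProbe {

    private MetadataPathProbe() {
    }

    /** Follows one path of nested object keys and returns the non-blank text found there, if any. */
    static Optional<String> textAt(Map<String, Object> root, List<String> path) {
        Object node = root;
        for (String key : path) {
            if (!(node instanceof Map)) {
                return Optional.empty(); // the path goes deeper than the document does
            }
            node = ((Map<?, ?>) node).get(key);
        }
        if (node instanceof String) {
            String text = (String) node;
            if (!text.trim().isEmpty()) {
                return Optional.of(text);
            }
        }
        return Optional.empty(); // missing, blank, or not a string
    }

    /** Tries each candidate path in order and returns the first non-blank text. */
    static Optional<String> firstText(Map<String, Object> root, List<List<String>> candidates) {
        for (List<String> path : candidates) {
            Optional<String> value = textAt(root, path);
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up payload; the real docling-serve response layout must be checked separately.
        Map<String, Object> response = Map.of(
                "document", Map.of("metadata", Map.of("title", "Quarterly Report")));

        // Assumed candidate locations for the title, ordered from flattest to most nested.
        List<List<String>> titlePaths = List.of(
                List.of("title"),
                List.of("metadata", "title"),
                List.of("document", "metadata", "title"));

        // Prints "Quarterly Report" because only the third path matches the sample payload.
        System.out.println(firstText(response, titlePaths).orElse("<not found>"));
    }
}
```

Running such a probe against a captured docling-serve response would show
whether the path currently read in DoclingProducer.handleExtractMetadata()
still matches. Returning Optional.empty() instead of an empty string keeps
"field not exposed" distinguishable from "field present but blank", which
matters for the third acceptance criterion.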



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
