[
https://issues.apache.org/jira/browse/CAMEL-23456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrea Cosentino reassigned CAMEL-23456:
----------------------------------------
Assignee: Andrea Cosentino
> camel-docling: Document title and type metadata not extracted reliably
> ----------------------------------------------------------------------
>
> Key: CAMEL-23456
> URL: https://issues.apache.org/jira/browse/CAMEL-23456
> Project: Camel
> Issue Type: Bug
> Components: camel-docling
> Reporter: Andrea Cosentino
> Assignee: Andrea Cosentino
> Priority: Major
>
> The {{EXTRACT_METADATA}} operation does not reliably populate the {{title}}
> and {{documentType}} fields on the returned {{DocumentMetadata}}. The gap is
> already noted in {{MetadataExtractionIT.java}} (lines 74-75): the integration
> test asserts on these fields, but a TODO comment there indicates that the
> values are missing or incorrect.
> h3. Reproduction
> # Configure a docling-serve endpoint with {{operation=EXTRACT_METADATA}}
> # Send a document that has a clear title and a known type (e.g., a PDF with
> explicit metadata)
> # Inspect the {{DocumentMetadata}} object returned in the body
> h3. Expected behavior
> {{title}} reflects the document's title metadata; {{documentType}} reflects
> the document type as detected by docling.
> h3. Actual behavior
> Both fields are empty or null; the test in {{MetadataExtractionIT.java}}
> carries a TODO acknowledging this.
> h3. Investigation hints
> * The metadata is parsed from docling's JSON output in
> {{DoclingProducer.handleExtractMetadata()}}; verify the JSON path used to
> read these fields against the current docling-serve schema
> * The field names or nesting may have changed in a recent docling release
> h3. Acceptance criteria
> * {{title}} and {{documentType}} are populated when present in the source
> document
> * The TODO comments in {{MetadataExtractionIT.java}} are removed and the
> assertions pass
> * If the upstream docling format does not expose these fields, document the
> limitation and remove the fields from {{DocumentMetadata}} (rather than
> silently leaving them empty)
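
As a starting point for the first investigation hint, the sketch below probes several candidate JSON paths and returns the first non-blank value. It assumes the docling-serve response has already been parsed into nested maps (for example by a JSON library). The class name {{MetadataPathProbe}} and every key name shown in the usage note are hypothetical placeholders, not the confirmed docling-serve schema and not the code currently in {{DoclingProducer.handleExtractMetadata()}}.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for investigation only; not part of camel-docling.
final class MetadataPathProbe {

    // Walks a dotted path such as "origin.mimetype" through nested maps.
    // Returns empty if any segment is missing or the leaf is not a non-blank string.
    static Optional<String> readPath(Map<String, Object> root, String dottedPath) {
        Object current = root;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return Optional.empty();
            }
            current = ((Map<?, ?>) current).get(key);
        }
        if (current instanceof String && !((String) current).isBlank()) {
            return Optional.of((String) current);
        }
        return Optional.empty();
    }

    // Tries each candidate path in order and returns the first value found.
    static Optional<String> firstPresent(Map<String, Object> root, List<String> candidates) {
        for (String path : candidates) {
            Optional<String> value = readPath(root, path);
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }
}
```

A call such as {{MetadataPathProbe.firstPresent(parsed, List.of("title", "name"))}} would show which of the assumed paths actually carries the title for a given response. Returning an {{Optional}} keeps "field absent upstream" distinguishable from "field present but empty", which matters for the third acceptance criterion.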