Andrea Cosentino created CAMEL-23456:
----------------------------------------
Summary: camel-docling: Document title and type metadata not
extracted reliably
Key: CAMEL-23456
URL: https://issues.apache.org/jira/browse/CAMEL-23456
Project: Camel
Issue Type: Bug
Components: camel-docling
Reporter: Andrea Cosentino
The {{EXTRACT_METADATA}} operation does not reliably populate the {{title}} and
{{documentType}} fields on the returned {{DocumentMetadata}}. The problem is
already flagged in {{MetadataExtractionIT.java}} (lines 74-75): the integration
test asserts on these fields, but a TODO comment notes that the values are
missing or incorrect.
h3. Reproduction
# Configure a camel-docling endpoint backed by docling-serve with {{operation=EXTRACT_METADATA}}
# Send a document that has a clear title and a known type (e.g., a PDF with
explicit metadata)
# Inspect the {{DocumentMetadata}} object returned in the body
h3. Expected behavior
{{title}} reflects the document's title metadata; {{documentType}} reflects the
document type as detected by docling.
h3. Actual behavior
Both fields are empty or null; the test in {{MetadataExtractionIT.java}}
carries a TODO acknowledging this.
h3. Investigation hints
* The metadata is parsed from docling's JSON output in
{{DoclingProducer.handleExtractMetadata()}}; verify the JSON path used to read
these fields against the current docling-serve schema
* It is possible the field names or nesting changed in a recent docling release
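
If the nesting did change, one way to make the lookup tolerant is to try several candidate JSON paths in order and take the first non-blank string. The following is a minimal sketch of that idea, not the current {{DoclingProducer}} code. It assumes the docling JSON has already been parsed into nested {{Map}} objects. The class name {{MetadataFieldLookup}}, the candidate paths, and the sample response shape are all hypothetical and must be checked against the real docling-serve schema.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MetadataFieldLookup {

    /** Walks a nested map along the given keys; empty if any step is missing or not a map. */
    public static Optional<Object> walk(Map<String, Object> root, List<String> path) {
        Object current = root;
        for (String key : path) {
            if (!(current instanceof Map)) {
                return Optional.empty();
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return Optional.ofNullable(current);
    }

    /** Returns the first non-blank string found under any of the candidate paths, trimmed. */
    public static Optional<String> firstNonBlank(Map<String, Object> root,
                                                 List<List<String>> candidatePaths) {
        for (List<String> path : candidatePaths) {
            Optional<String> value = walk(root, path)
                    .filter(String.class::isInstance)
                    .map(String.class::cast)
                    .map(String::trim)
                    .filter(s -> !s.isEmpty());
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical response shape, for illustration only.
        Map<String, Object> response = Map.of(
                "document", Map.of("origin", Map.of("title", "Quarterly Report")));

        // Hypothetical candidate paths, ordered from the old layout to newer ones.
        List<List<String>> titlePaths = List.of(
                List.of("title"),
                List.of("metadata", "title"),
                List.of("document", "origin", "title"));

        System.out.println(firstNonBlank(response, titlePaths).orElse("<not found>"));
    }
}
```

Returning an {{Optional}} keeps "field absent upstream" distinct from "field read from the wrong path", which is the distinction the acceptance criteria depend on.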
h3. Acceptance criteria
* {{title}} and {{documentType}} are populated when present in the source
document
* The TODO comments in {{MetadataExtractionIT.java}} are removed and the
assertions pass
* If the upstream docling format does not expose these fields, document the
limitation and remove the fields from {{DocumentMetadata}} (rather than
silently leaving them empty)