[
https://issues.apache.org/jira/browse/TIKA-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nicholas DiPiazza updated TIKA-4722:
------------------------------------
Description:
h2. Summary
Add a {{parse_context_json}} field to the gRPC {{FetchAndParseRequest}} message
that lets callers configure any registered ParseContext component on a
per-request basis.
h2. Motivation
Downstream users of tika-grpc need to control the content output format (e.g.,
HTML output for further processing into Markdown) on individual requests, not
just globally. The generic {{parse_context_json}} field enables this and also
exposes other ParseContext components such as timeout limits, embedded document
limits, and output limits.
h2. Changes
* *{{tika.proto}}*: add {{string parse_context_json = 5;}} to
{{FetchAndParseRequest}}
* *{{TikaGrpcServerImpl.java}}*: when {{parse_context_json}} is non-empty,
iterate its top-level JSON fields and call
{{parseContext.setJsonConfig(componentName, valueJson)}} for each entry
* *{{HandlerTypeTest.java}}* (e2e): new test verifying HTML vs TEXT output via
{{parse_context_json}}
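The server-side step listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from the PR: {{JsonConfigTarget}} is a hypothetical stand-in for the ParseContext, and the hand-rolled field splitter only keeps the example dependency-free (it assumes well-formed input and does no validation); a real implementation would presumably use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseContextJsonSketch {

    /** Hypothetical stand-in for the ParseContext method named in this issue. */
    interface JsonConfigTarget {
        void setJsonConfig(String componentName, String valueJson);
    }

    /** Returns the index of the quote that closes the string opened at {@code open}. */
    static int closingQuote(String s, int open) {
        int i = open + 1;
        while (s.charAt(i) != '"') {
            i += (s.charAt(i) == '\\') ? 2 : 1; // skip escaped characters
        }
        return i;
    }

    /** Returns the index where the value starting at {@code from} ends (a depth-0 comma or {@code limit}). */
    static int valueEnd(String s, int from, int limit) {
        int depth = 0;
        int i = from;
        while (i < limit) {
            char c = s.charAt(i);
            if (c == '"') {
                i = closingQuote(s, i); // jump over string contents
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                return i;
            }
            i++;
        }
        return limit;
    }

    /** Splits a well-formed JSON object into top-level fields, keeping each value as raw JSON text. */
    static Map<String, String> topLevelFields(String json) {
        Map<String, String> fields = new LinkedHashMap<>();
        String s = json.trim();
        int end = s.length() - 1; // index of the closing brace
        int i = 1;                // skip the opening brace
        while (i < end) {
            int keyStart = s.indexOf('"', i);
            if (keyStart < 0 || keyStart >= end) {
                break;
            }
            int keyEnd = closingQuote(s, keyStart);
            int valueStart = s.indexOf(':', keyEnd) + 1;
            int stop = valueEnd(s, valueStart, end);
            fields.put(s.substring(keyStart + 1, keyEnd), s.substring(valueStart, stop).trim());
            i = stop + 1; // continue after the comma
        }
        return fields;
    }

    /** Applies each top-level entry to the target; does nothing when the field is empty. */
    static void apply(String parseContextJson, JsonConfigTarget target) {
        if (parseContextJson == null || parseContextJson.trim().isEmpty()) {
            return; // field not set: keep server defaults
        }
        topLevelFields(parseContextJson).forEach(target::setJsonConfig);
    }

    public static void main(String[] args) {
        String json = "{\"basic-content-handler-factory\": {\"type\": \"HTML\"}}";
        // Prints: basic-content-handler-factory -> {"type": "HTML"}
        apply(json, (name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Passing each value through as raw JSON text leaves its interpretation to the component that owns the configuration, which matches the {{(componentName, valueJson)}} shape of the call described above.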
h2. Usage Example
{code:language=json}
{
  "fetch_key": "my-doc.pdf",
  "fetcher_id": "myFetcher",
  "parse_context_json": "{\"basic-content-handler-factory\": {\"type\": \"HTML\"}}"
}
{code}
Available {{basic-content-handler-factory}} types: TEXT, HTML, XML, BODY,
IGNORE, MARKDOWN
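Because {{parse_context_json}} is a string field, the configuration JSON has to be escaped when it is embedded in a JSON request body like the one above. A minimal sketch of that escaping follows; {{quote}} and {{buildRequest}} are hypothetical helpers written for this illustration, not part of tika-grpc. With the generated Java builder no manual escaping should be needed, since the setter takes the raw string (by protobuf naming convention it would presumably be {{setParseContextJson}}).

```java
public class RequestJsonSketch {

    /** Escapes a raw string as a JSON string literal (hypothetical helper). */
    static String quote(String raw) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // remaining control characters need a numeric escape
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds a request body with the three fields used in the example above. */
    static String buildRequest(String fetcherId, String fetchKey, String parseContextJson) {
        return "{"
                + "\"fetch_key\": " + quote(fetchKey) + ", "
                + "\"fetcher_id\": " + quote(fetcherId) + ", "
                + "\"parse_context_json\": " + quote(parseContextJson)
                + "}";
    }

    public static void main(String[] args) {
        // Select Markdown output for this request only.
        String config = "{\"basic-content-handler-factory\": {\"type\": \"MARKDOWN\"}}";
        System.out.println(buildRequest("myFetcher", "my-doc.pdf", config));
    }
}
```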
h2. PR
https://github.com/apache/tika/pull/2797
was:
h2. Summary
Add a {{handler_type}} field to the {{FetchAndParseRequest}} gRPC message so
callers can specify the output content handler type (e.g., {{html}}, {{text}},
{{xml}}) on a per-request basis, without needing to change the server-level
configuration.
h2. Background
The {{tika-grpc}} server currently creates a bare {{ParseContext}} for every
parse request with no way for clients to control the output format. The
underlying {{tika-pipes-core}} infrastructure already supports per-request
{{ContentHandlerFactory}} via {{ParseContext}} (see
{{ParseHandler.getContentHandlerFactory()}}), but this capability is not
exposed through the gRPC API.
A downstream user (Atolio) implemented this in their fork and uses it to get
HTML output from Tika, which they then convert to Markdown. Apache Tika's gRPC
API should expose this natively.
h2. Proposed Change
# Add {{handler_type}} (string, field 5) to {{FetchAndParseRequest}} in
{{tika.proto}}
# In {{TikaGrpcServerImpl.fetchAndParseImpl()}}, if {{handler_type}} is set,
resolve it to a {{BasicContentHandlerFactory}} and place it in the
{{ParseContext}}
h2. Valid handler_type Values
* {{text}} (default) - plain text
* {{html}} - structured HTML output
* {{xml}} - XHTML output
* {{body}} - HTML body only
* {{ignore}} - no content
* {{markdown}} - Markdown output
h2. Example Usage (Java gRPC client)
{code:java}
FetchAndParseRequest request = FetchAndParseRequest.newBuilder()
.setFetcherId("my-fetcher")
.setFetchKey("document.pdf")
.setHandlerType("html")
.build();
{code}
Summary: Add parse_context_json to FetchAndParseRequest for per-request
ParseContext configuration (was: tika-grpc: Add handler_type field to
FetchAndParseRequest for per-request content handler configuration)
> Add parse_context_json to FetchAndParseRequest for per-request ParseContext
> configuration
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-4722
> URL: https://issues.apache.org/jira/browse/TIKA-4722
> Project: Tika
> Issue Type: New Feature
> Reporter: Nicholas DiPiazza
> Assignee: Nicholas DiPiazza
> Priority: Major
>