[ 
https://issues.apache.org/jira/browse/TIKA-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-4722:
------------------------------------
    Description: 
h2. Summary

Add a {{parse_context_json}} field to the gRPC {{FetchAndParseRequest}} message 
that lets callers configure any registered ParseContext component on a 
per-request basis.

h2. Motivation

Downstream users of tika-grpc need to control the content output format (e.g., 
HTML output for further processing into Markdown) on individual requests, not 
just globally. The generic {{parse_context_json}} field enables this and also 
exposes other ParseContext components such as timeout limits, embedded document 
limits, and output limits.

h2. Changes

* *{{tika.proto}}*: add {{string parse_context_json = 5;}} to 
{{FetchAndParseRequest}}
* *{{TikaGrpcServerImpl.java}}*: when {{parse_context_json}} is non-empty, 
iterate the top-level fields of the JSON object and call 
{{parseContext.setJsonConfig(componentName, valueJson)}} for each entry
* *{{HandlerTypeTest.java}}* (e2e): new test verifying HTML vs TEXT output via 
{{parse_context_json}}
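
The server-side step described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: {{JsonConfigTarget}} is a hypothetical stand-in for the {{setJsonConfig(componentName, valueJson)}} call named above, and the hand-rolled {{splitTopLevel}} helper exists only so the sketch runs on the JDK alone (a real implementation would more likely use a JSON library).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseContextJsonDispatch {

    /** Hypothetical stand-in for the setJsonConfig call named in this issue. */
    interface JsonConfigTarget {
        void setJsonConfig(String componentName, String valueJson);
    }

    /** Forwards each top-level entry to the target; does nothing when the field is empty. */
    static void apply(String parseContextJson, JsonConfigTarget target) {
        if (parseContextJson == null || parseContextJson.isBlank()) {
            return; // field not set: leave the defaults untouched
        }
        splitTopLevel(parseContextJson).forEach(target::setJsonConfig);
    }

    /**
     * Splits a JSON object into (key, raw value JSON) pairs.
     * Lenient sketch, not a validator; assumes keys contain no escape sequences.
     */
    static Map<String, String> splitTopLevel(String json) {
        String s = json.trim();
        if (s.length() < 2 || s.charAt(0) != '{' || s.charAt(s.length() - 1) != '}') {
            throw new IllegalArgumentException("expected a JSON object");
        }
        Map<String, String> out = new LinkedHashMap<>();
        int end = s.length() - 1; // index of the closing brace
        int i = 1;
        while (true) {
            i = skipWhitespace(s, i);
            if (i >= end) {
                break;
            }
            if (s.charAt(i) != '"') {
                throw new IllegalArgumentException("expected a key at index " + i);
            }
            int keyEnd = endOfString(s, i);
            String key = s.substring(i + 1, keyEnd);
            i = skipWhitespace(s, keyEnd + 1);
            if (i >= end || s.charAt(i) != ':') {
                throw new IllegalArgumentException("expected ':' after key " + key);
            }
            i = skipWhitespace(s, i + 1);
            int valueStart = i;
            int depth = 0;
            while (i < end) {
                char c = s.charAt(i);
                if (c == '"') {
                    i = endOfString(s, i); // jump to the closing quote
                } else if (c == '{' || c == '[') {
                    depth++;
                } else if (c == '}' || c == ']') {
                    depth--;
                } else if (c == ',' && depth == 0) {
                    break; // end of this top-level value
                }
                i++;
            }
            out.put(key, s.substring(valueStart, i).trim());
            if (i < end) {
                i++; // skip the comma
            }
        }
        return out;
    }

    private static int skipWhitespace(String s, int i) {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }

    /** Returns the index of the quote that closes the string opened at {@code open}. */
    private static int endOfString(String s, int open) {
        int i = open + 1;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i += 2; // skip the escaped character
            } else if (c == '"') {
                return i;
            } else {
                i++;
            }
        }
        throw new IllegalArgumentException("unterminated string");
    }

    public static void main(String[] args) {
        apply("{\"basic-content-handler-factory\": {\"type\": \"HTML\"}}",
                (name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Each value is passed on as raw JSON text, so the component that receives it stays responsible for interpreting its own configuration.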

h2. Usage Example

{code:language=json}
{
  "fetch_key": "my-doc.pdf",
  "fetcher_id": "myFetcher",
  "parse_context_json": "{\"basic-content-handler-factory\": {\"type\": \"HTML\"}}"
}
{code}

Available {{basic-content-handler-factory}} types: TEXT, HTML, XML, BODY, 
IGNORE, MARKDOWN
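
Because {{parse_context_json}} is a string field that itself carries JSON, the inner document has to be escaped when the request is written as JSON. The following is a minimal client-side sketch of that step using only the JDK. The class and helper names are hypothetical, and the type check is a convenience based on the list above, not a statement about what the server validates.

```java
import java.util.List;

public class ParseContextJsonExample {

    // Handler type names as listed in this issue.
    static final List<String> HANDLER_TYPES =
            List.of("TEXT", "HTML", "XML", "BODY", "IGNORE", "MARKDOWN");

    /** Builds the inner JSON document carried by parse_context_json. */
    static String handlerConfig(String type) {
        if (!HANDLER_TYPES.contains(type)) {
            throw new IllegalArgumentException("Unknown handler type: " + type);
        }
        return "{\"basic-content-handler-factory\": {\"type\": \"" + type + "\"}}";
    }

    /** Escapes a string so it can be embedded as a JSON string literal. */
    static String toJsonStringLiteral(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters need a \\uXXXX escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        String inner = handlerConfig("HTML");
        String request = "{\n"
                + "  \"fetch_key\": \"my-doc.pdf\",\n"
                + "  \"fetcher_id\": \"myFetcher\",\n"
                + "  \"parse_context_json\": " + toJsonStringLiteral(inner) + "\n"
                + "}";
        System.out.println(request);
    }
}
```

Running {{main}} prints a request body equivalent to the usage example above, with the inner JSON kept on a single line so the string literal stays valid.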

h2. PR

https://github.com/apache/tika/pull/2797

  was:
h2. Summary

Add a {{handler_type}} field to the {{FetchAndParseRequest}} gRPC message so 
callers can specify the output content handler type (e.g., {{html}}, {{text}}, 
{{xml}}) on a per-request basis, without needing to change the server-level 
configuration.

h2. Background

The {{tika-grpc}} server currently creates a bare {{ParseContext}} for every 
parse request with no way for clients to control the output format. The 
underlying {{tika-pipes-core}} infrastructure already supports per-request 
{{ContentHandlerFactory}} via {{ParseContext}} (see 
{{ParseHandler.getContentHandlerFactory()}}), but this capability is not 
exposed through the gRPC API.

A downstream user (Atolio) implemented this in their fork and uses it to get 
HTML output from Tika, which they then convert to Markdown. Apache Tika's gRPC 
API should expose this natively.

h2. Proposed Change

# Add {{handler_type}} (string, field 5) to {{FetchAndParseRequest}} in 
{{tika.proto}}
# In {{TikaGrpcServerImpl.fetchAndParseImpl()}}, if {{handler_type}} is set, 
resolve it to a {{BasicContentHandlerFactory}} and place it in the 
{{ParseContext}}

h2. Valid handler_type Values

* {{text}} (default) - plain text
* {{html}} - structured HTML output
* {{xml}} - XHTML output
* {{body}} - HTML body only
* {{ignore}} - no content
* {{markdown}} - Markdown output

h2. Example Usage (Java gRPC client)

{code:java}
FetchAndParseRequest request = FetchAndParseRequest.newBuilder()
    .setFetcherId("my-fetcher")
    .setFetchKey("document.pdf")
    .setHandlerType("html")
    .build();
{code}


        Summary: Add parse_context_json to FetchAndParseRequest for per-request 
ParseContext configuration (was: tika-grpc: Add handler_type field to 
FetchAndParseRequest for per-request content handler configuration)

> Add parse_context_json to FetchAndParseRequest for per-request ParseContext 
> configuration
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-4722
>                 URL: https://issues.apache.org/jira/browse/TIKA-4722
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Nicholas DiPiazza
>            Assignee: Nicholas DiPiazza
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
