[ 
https://issues.apache.org/jira/browse/TIKA-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077635#comment-18077635
 ] 

ASF GitHub Bot commented on TIKA-4722:
--------------------------------------

nddipiazza opened a new pull request, #2797:
URL: https://github.com/apache/tika/pull/2797

   ## Summary
   Adds a `handler_type` field to the `FetchAndParseRequest` gRPC message, 
allowing clients to specify the output content format on a per-request basis 
without changing server configuration.
   
   JIRA: https://issues.apache.org/jira/browse/TIKA-4722
   
   ## Changes
   - **`tika.proto`**: Added `handler_type` (string, field 5) to 
`FetchAndParseRequest`
   - **`TikaGrpcServerImpl.java`**: When `handler_type` is set, creates a 
`BasicContentHandlerFactory` with the requested type and places it in the 
`ParseContext`
   - **`HandlerTypeTest.java`**: New e2e test verifying HTML output contains 
markup tags and differs from text output
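
The server-side change described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: `PerRequestHandlerSketch`, `Context`, `HandlerFactory`, and `contextFor` are hypothetical stand-ins for `TikaGrpcServerImpl`, `ParseContext`, and `BasicContentHandlerFactory`. It assumes that "set" means "non-empty", since a plain proto3 `string` field reads as the empty string when the client leaves it out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-request handler selection; not Tika's actual classes. */
final class PerRequestHandlerSketch {

    /** Stand-in for a content handler factory that remembers the requested type. */
    static final class HandlerFactory {
        final String type;

        HandlerFactory(String type) {
            this.type = type;
        }
    }

    /** Stand-in for a class-keyed parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when no entry was stored for the key.
            return key.cast(entries.get(key));
        }
    }

    /** Builds the context for one request; an empty handler type leaves it bare. */
    static Context contextFor(String handlerType) {
        Context context = new Context();
        // Assumption: an unset proto3 string arrives as "", so only non-empty values count as set.
        if (handlerType != null && !handlerType.isEmpty()) {
            context.set(HandlerFactory.class, new HandlerFactory(handlerType));
        }
        return context;
    }

    public static void main(String[] args) {
        System.out.println(contextFor("html").get(HandlerFactory.class).type);
        System.out.println(contextFor("").get(HandlerFactory.class));
    }
}
```

Leaving the context bare when the field is empty keeps the previous behavior for clients that never send `handler_type`.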
   
   ## Review Focus Areas
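   - Proto backward compatibility: adding field 5 is wire-compatible in proto3 
as long as number 5 is not already used or reserved in `FetchAndParseRequest`; 
older clients omit the field and the server reads the empty-string default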
   - `BasicContentHandlerFactory.parseHandlerType()` handles unrecognized 
values by falling back to TEXT
   - The `BasicContentHandlerFactory` placed in `ParseContext` is picked up by 
`ParseHandler.getContentHandlerFactory()` on the forked server side
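
The fallback behavior named in the second bullet can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the real `BasicContentHandlerFactory.parseHandlerType()`: the `HandlerTypeFallbackSketch` class, its `HandlerType` enum, and the `parse` method are hypothetical, the accepted names are taken from the value list in the Jira issue, and the case-insensitive matching is an assumption.

```java
import java.util.Locale;

/** Hypothetical stand-in for the TEXT fallback; not Tika's actual implementation. */
final class HandlerTypeFallbackSketch {

    /** Handler types as listed in the Jira issue. */
    enum HandlerType { TEXT, HTML, XML, BODY, IGNORE, MARKDOWN }

    /** Maps a requested name to a handler type, returning defaultType when unrecognized. */
    static HandlerType parse(String name, HandlerType defaultType) {
        if (name == null) {
            return defaultType;
        }
        // Assumption: names are matched case-insensitively.
        switch (name.toLowerCase(Locale.ROOT)) {
            case "text":
                return HandlerType.TEXT;
            case "html":
                return HandlerType.HTML;
            case "xml":
                return HandlerType.XML;
            case "body":
                return HandlerType.BODY;
            case "ignore":
                return HandlerType.IGNORE;
            case "markdown":
                return HandlerType.MARKDOWN;
            default:
                // Unrecognized (or empty) values fall back instead of failing the request.
                return defaultType;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("HTML", HandlerType.TEXT));
        System.out.println(parse("pdf", HandlerType.TEXT));
    }
}
```

A silent fallback keeps a typo in `handler_type` from failing the request, at the cost of the client receiving plain text without being told why, which may be worth weighing during review.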
   
   ## Critical Files
   - `tika-grpc/src/main/proto/tika.proto`
   - `tika-grpc/src/main/java/org/apache/tika/pipes/grpc/TikaGrpcServerImpl.java`
   - `tika-e2e-tests/tika-grpc/src/test/java/org/apache/tika/pipes/filesystem/HandlerTypeTest.java`
   
   ## Testing Instructions
   ```bash
   cd tika-e2e-tests/tika-grpc
   mvn test -Dtest=HandlerTypeTest -Dtika.e2e.useLocalServer=true
   ```
   
   ## Review Checklist
   - [ ] proto field number does not conflict with existing fields
   - [ ] Falls back gracefully to TEXT for unrecognized handler_type values
   - [ ] E2E test verifies that HTML and text output differ
   
   ## Potential Concerns
   - The `ContentHandlerFactory` in `ParseContext` is serialized across the IPC 
boundary to the forked `PipesServer` in Jackson Smile format. This works 
because `BasicContentHandlerFactory` is a registered parse-context component 
(`basic-content-handler-factory` in `parse-context.idx`).




> tika-grpc: Add handler_type field to FetchAndParseRequest for per-request 
> content handler configuration
> -------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4722
>                 URL: https://issues.apache.org/jira/browse/TIKA-4722
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Nicholas DiPiazza
>            Assignee: Nicholas DiPiazza
>            Priority: Major
>
> h2. Summary
> Add a {{handler_type}} field to the {{FetchAndParseRequest}} gRPC message so 
> callers can specify the output content handler type (e.g., {{html}}, 
> {{text}}, {{xml}}) on a per-request basis, without needing to change the 
> server-level configuration.
> h2. Background
> The {{tika-grpc}} server currently creates a bare {{ParseContext}} for every 
> parse request with no way for clients to control the output format. The 
> underlying {{tika-pipes-core}} infrastructure already supports per-request 
> {{ContentHandlerFactory}} via {{ParseContext}} (see 
> {{ParseHandler.getContentHandlerFactory()}}), but this capability is not 
> exposed through the gRPC API.
> A downstream user (Atolio) implemented this in their fork and uses it to get 
> HTML output from Tika, which they then convert to Markdown. Apache Tika's 
> gRPC API should expose this natively.
> h2. Proposed Change
> # Add {{handler_type}} (string, field 5) to {{FetchAndParseRequest}} in 
> {{tika.proto}}
> # In {{TikaGrpcServerImpl.fetchAndParseImpl()}}, if {{handler_type}} is set, 
> resolve it to a {{BasicContentHandlerFactory}} and place it in the 
> {{ParseContext}}
> h2. Valid handler_type Values
> * {{text}} (default) - plain text
> * {{html}} - structured HTML output
> * {{xml}} - XHTML output
> * {{body}} - HTML body only
> * {{ignore}} - no content
> * {{markdown}} - Markdown output
> h2. Example Usage (Java gRPC client)
> {code:java}
> FetchAndParseRequest request = FetchAndParseRequest.newBuilder()
>     .setFetcherId("my-fetcher")
>     .setFetchKey("document.pdf")
>     .setHandlerType("html")
>     .build();
> {code}



