[ 
https://issues.apache.org/jira/browse/TIKA-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092666#comment-18092666
 ] 

ASF GitHub Bot commented on TIKA-4735:
--------------------------------------

nddipiazza opened a new pull request, #2919:
URL: https://github.com/apache/tika/pull/2919

   ## Summary
   
   Fixes two bugs reported in 
[TIKA-4735](https://issues.apache.org/jira/browse/TIKA-4735).
   
   ## Changes
   
   ### Bug 1 — `-h` option conflict (`TikaAsyncCLI.java`)
   
   `-h` was bound to `--help`, so using it as a handler shorthand (as 
documented in the CLI docs) triggered the help menu instead of setting the 
handler type. Fix: removed the `-h` short option from `--help`; `--help` still 
works as a long option.
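   
   A minimal sketch of the conflict, for illustration only. It assumes a 
   lookup table with one slot per short flag; the class name, the 
   `buildOptions` and `resolve` helpers, and the registration order are all 
   hypothetical and are not taken from `TikaAsyncCLI.java`.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical stand-in for a CLI option table; not the real TikaAsyncCLI code.
   class ShortOptionConflictSketch {

       // Maps each short flag to the long option it stands for.
       static Map<String, String> buildOptions(boolean helpClaimsShortH) {
           Map<String, String> shortToLong = new LinkedHashMap<>();
           shortToLong.put("-i", "--input");
           shortToLong.put("-o", "--output");
           shortToLong.put("-h", "--handler");
           if (helpClaimsShortH) {
               // Assumption: a second registration under the same short flag
               // replaces the first, so "-h" no longer reaches "--handler".
               shortToLong.put("-h", "--help");
           }
           return shortToLong;
       }

       // Long options pass through unchanged; short flags are looked up.
       static String resolve(Map<String, String> table, String arg) {
           return arg.startsWith("--") ? arg : table.getOrDefault(arg, "unknown");
       }

       public static void main(String[] args) {
           System.out.println(resolve(buildOptions(true), "-h"));
           System.out.println(resolve(buildOptions(false), "-h"));
           System.out.println(resolve(buildOptions(false), "--help"));
       }
   }
   ```
   
   In this toy model, dropping the short flag from `--help` leaves `-h` free 
   for the handler while `--help` remains reachable by its long name.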
   
   ### Bug 2 — `--content-only` produced JSON output instead of raw content 
(`ParseHandler.java`)
   
   `ParseHandler.parseWithStream()` resolved `ParseMode` by falling back to 
`defaultParseMode` (loaded from the pipes config), but it never wrote the 
resolved mode back into `ParseContext`. `EmitHandler.emit()` reads 
`parseContext.get(ParseMode.class)` directly, with no fallback, so it always 
received `null` when the parse mode was set only as a config default (e.g. via 
`--content-only` on the CLI) and fell through to the full JSON emit path.
   
   **Fix:** one line in `ParseHandler.parseWithStream()`:
   
   ```java
   // Ensure EmitHandler sees the resolved mode
   parseContext.set(ParseMode.class, parseMode);
   ```
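   
   The effect of the write-back can be sketched in isolation. Everything 
   below is a hypothetical stand-in: `Context` mimics a class-keyed 
   `ParseContext`, the `ParseMode` enum values are placeholders, and 
   `resolve` and `emit` only approximate the roles of `ParseHandler` and 
   `EmitHandler` described above.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Toy model of the bug; none of these types are the real Tika classes.
   class ParseModeWriteBackSketch {

       // Placeholder values, assumed for illustration.
       enum ParseMode { RMETA, CONTENT_ONLY }

       // Hypothetical stand-in for ParseContext: a map keyed by class.
       static final class Context {
           private final Map<Class<?>, Object> values = new HashMap<>();

           <T> void set(Class<T> key, T value) {
               values.put(key, value);
           }

           <T> T get(Class<T> key) {
               return key.cast(values.get(key)); // returns null when absent
           }
       }

       // Resolver role: use the context value, else the config default.
       static ParseMode resolve(Context ctx, ParseMode defaultMode, boolean writeBack) {
           ParseMode mode = ctx.get(ParseMode.class);
           if (mode == null) {
               mode = defaultMode;
           }
           if (writeBack) {
               ctx.set(ParseMode.class, mode); // the one-line fix
           }
           return mode;
       }

       // Emitter role: reads the context directly, with no fallback.
       static String emit(Context ctx) {
           ParseMode mode = ctx.get(ParseMode.class);
           return mode == ParseMode.CONTENT_ONLY ? "raw content" : "full JSON";
       }

       // Runs resolve then emit with the mode set only as a config default.
       static String run(boolean writeBack) {
           Context ctx = new Context();
           resolve(ctx, ParseMode.CONTENT_ONLY, writeBack);
           return emit(ctx);
       }

       public static void main(String[] args) {
           System.out.println("without write-back: " + run(false));
           System.out.println("with write-back:    " + run(true));
       }
   }
   ```
   
   Without the write-back the emitter in this sketch sees `null` and takes 
   the JSON path; with it, the emitter sees the resolved mode.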
   
   ## Critical Files
   
   - `tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/ParseHandler.java` (core fix)
   - `tika-pipes/tika-async-cli/src/main/java/org/apache/tika/async/cli/TikaAsyncCLI.java` (remove `-h` short option)
   
   ## Testing Instructions
   
   ```bash
   mvn test -pl tika-pipes/tika-pipes-core,tika-pipes/tika-async-cli
   ```
   
   ## Review Checklist
   
   - [x] Reproducer tests added for both bugs
   - [x] Integration test `AsyncProcessorTest.testContentOnlyFromConfigDefault` 
confirms the runtime fix
   - [x] All 65 tests in affected modules pass
   - [x] No unrelated changes
   
   ## Potential Concerns
   
   The `ParseMode` is now always written into `ParseContext` inside 
`ParseHandler.parseWithStream()`. This is safe because `ParseHandler` is the 
authoritative resolver of parse mode (it already has the `defaultParseMode` 
fallback), and writing it into the context makes the state explicit for all 
downstream consumers (`EmitHandler`, `PipesWorker`, etc.).




> tika-4.0.0-alpha1 - batch output contains JSON wrapper and metadata with 
> --content-only
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-4735
>                 URL: https://issues.apache.org/jira/browse/TIKA-4735
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>         Environment: Windows 11 with Java 17
>            Reporter: Adrian Bird
>            Priority: Major
>
> The [Basic Batch Usage 
> Documentation|https://tika.apache.org/docs/4.0.0-SNAPSHOT/using-tika/cli/index.html#_basic_batch_usage]
>  has this example:
> {noformat}
> java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m 
> --content-only{noformat}
> and description:
> This produces .md files in the output directory containing just the extracted 
> markdown content — no JSON wrappers, no metadata fields.
> The example doesn't work because -h means help. -h is listed in the options 
> section.
> The help that was produced just lists '--handler' for the option.
> My actual issue is with the output of the batch processing. My example:
> {noformat}
> %JAVA_HOME%\bin\java -jar %TIKA_JAR%  -i Input -o Output --handler m 
> --content-only{noformat}
> creates a .md file but it has a JSON wrapper and metadata fields and the 
> content isn't plain text.
> I get a JSON wrapper and metadata for all the --handler formats.
> Also, if I remove the --content-only argument I get a .json file and not a 
> .md file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
