[
https://issues.apache.org/jira/browse/TIKA-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092706#comment-18092706
]
ASF GitHub Bot commented on TIKA-4735:
--------------------------------------
tballison commented on PR #2919:
URL: https://github.com/apache/tika/pull/2919#issuecomment-4846044029
Claude makes these two points. They sound reasonable?
```
Two things worth knowing before it merges
1. The ParseHandler change is redundant (harmless). The PR re-adds
parseContext.set(ParseMode.class, parseMode) for "Root cause 1," but that's
already fixed on main/rc1: PipesWorker.setupParseContext() (commit 0b38268d4,
the earlier TIKA-4735 fix) already writes the resolved mode back into the same
context object before EmitHandler reads it. That's why, on current rc1, you see
a filtered [{"X-TIKA:content":"…"}] (filter applied → ParseMode was visible)
rather than full metadata.
So only "Root cause 2" was actually still open. The ParseHandler line is
belt-and-suspenders: fine to keep, just not load-bearing.
2. ⚠️ The tests don't actually lock in this regression. This is the one thing
I'd push on. None of the added/changed tests exercise the DYNAMIC-passback path
that was the bug:
- AsyncCliParserTest: tests arg parsing only (no emit).
- TikaConfigAsyncWriterTest: tests that the generated config has
  parseMode=CONTENT_ONLY (no emit).
- The existing AsyncProcessorTest#testContentOnlyFromConfigDefault it cites
  uses a config with emitStrategy: EMIT_ALL hardcoded. That test passed before
  this fix too, because EMIT_ALL never hits the passback branch.
```
> tika-4.0.0-alpha1 - batch output contains JSON wrapper and metadata with
> --content-only
> ---------------------------------------------------------------------------------------
>
> Key: TIKA-4735
> URL: https://issues.apache.org/jira/browse/TIKA-4735
> Project: Tika
> Issue Type: Bug
> Affects Versions: 4.0.0
> Environment: Windows 11 with Java 17
> Reporter: Adrian Bird
> Priority: Major
>
> The [Basic Batch Usage
> Documentation|https://tika.apache.org/docs/4.0.0-SNAPSHOT/using-tika/cli/index.html#_basic_batch_usage]
> has this example:
> {noformat}
> java -jar tika-async-cli.jar -i /path/to/input -o /path/to/output -h m --content-only
> {noformat}
> and description:
> This produces .md files in the output directory containing just the extracted
> markdown content — no JSON wrappers, no metadata fields.
> The example doesn't work because -h means help. -h is listed in the options
> section.
> The help that was produced just lists '--handler' for the option.
> My actual issue is with the output of the batch processing. My example:
> {noformat}
> %JAVA_HOME%\bin\java -jar %TIKA_JAR% -i Input -o Output --handler m --content-only
> {noformat}
> creates a .md file but it has a JSON wrapper and metadata fields and the
> content isn't plain text.
> I get a JSON wrapper and metadata for all the --handler formats.
> Also, if I remove the --content-only argument I get a .json file and not a
> .md file.
>