Adrian Bird created TIKA-4739:
---------------------------------
Summary: tika-4.0.0-alpha1 - configuration file issues
Key: TIKA-4739
URL: https://issues.apache.org/jira/browse/TIKA-4739
Project: Tika
Issue Type: Bug
Reporter: Adrian Bird
I've got some issues with the configuration and I've put them all in here.
*1.* Error in Tika-App Integration Test 20
The
[test|https://tika.apache.org/docs/4.0.0-SNAPSHOT/advanced/integration-testing/tika-app.html#_test_20_create_custom_config_file]
has a custom tika-config.json file. When I tried it I got the following error:
{code:java}
Exception in thread "main"
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized
field "timeoutMillis" (class org.apache.tika.pipes.core.PipesConfig), not
marked as ignorable (25 known properties: "sleepOnStartupTimeoutMillis",
"shutdownClientAfterMillis", "numClients", "emitWithinMillis",
"configStoreParams", "emitStrategy", "heartbeatIntervalMs",
"startupTimeoutMillis", "numEmitters", "staleFetcherTimeoutSeconds",
"maxFilesProcessedPerProcess", "useSharedServer", "queueSize",
"socketTimeoutMs", "parseMode", "stopOnlyOnFatal", "tempDirectory",
"onParseException", "forkedJvmArgs", "maxWaitForClientMillis", "javaPath",
"staleFetcherDelaySeconds", "configStoreType", "emitMaxEstimatedBytes",
"emitIntermediateResults")
at [Source: UNKNOWN; byte offset: #UNKNOWN] (through reference chain:
org.apache.tika.pipes.core.PipesConfig["timeoutMillis"]) {code}
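For illustration, a minimal sketch of a pre-flight check that flags such keys before Tika is started. It relies only on the 25 property names listed in the exception above; the function name and the sample values are hypothetical, and where the pipes settings sit inside tika-config.json is not assumed here (the caller passes in that mapping).

```python
# Property names copied from the "25 known properties" list in the exception above.
# This list reflects that one error message only; other builds may differ.
KNOWN_PIPES_PROPERTIES = {
    "sleepOnStartupTimeoutMillis", "shutdownClientAfterMillis", "numClients",
    "emitWithinMillis", "configStoreParams", "emitStrategy", "heartbeatIntervalMs",
    "startupTimeoutMillis", "numEmitters", "staleFetcherTimeoutSeconds",
    "maxFilesProcessedPerProcess", "useSharedServer", "queueSize",
    "socketTimeoutMs", "parseMode", "stopOnlyOnFatal", "tempDirectory",
    "onParseException", "forkedJvmArgs", "maxWaitForClientMillis", "javaPath",
    "staleFetcherDelaySeconds", "configStoreType", "emitMaxEstimatedBytes",
    "emitIntermediateResults",
}


def unknown_pipes_keys(pipes_section):
    """Return, sorted, the keys of a pipes settings mapping that are not in the known list."""
    return sorted(key for key in pipes_section if key not in KNOWN_PIPES_PROPERTIES)


# Hypothetical sample values; only the key names matter for the check.
sample = {"numClients": 2, "timeoutMillis": 60000}
print(unknown_pipes_keys(sample))  # prints ['timeoutMillis']
```

With the Test 20 settings this would report 'timeoutMillis', matching the field named in the exception.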
*2.* parsers '_exclude' doesn't seem to work
Using the config file from Test 20 above, with 'startupTimeoutMillis' used in
place of 'timeoutMillis' to get past the error in 1., I tried excluding a parser.
I really wanted to exclude Tesseract but decided PDF was an easier test case.
I removed the 'pdf-parser' section from the config and added this:
{code:java}
{
  "default-parser": {
    "_exclude": ["pdf-parser"]
  }
},
{code}
When I ran Tika it produced the same output as before and still processed my PDF
file.
*2a.* There is a documentation example that has 'exclude' rather than
'_exclude'
[vlm-pdf-parsing.json|https://github.com/apache/tika/blob/main/docs/modules/ROOT/examples/vlm-pdf-parsing.json]
*3.* [Getting Started with Tika
Pipes|https://tika.apache.org/docs/4.0.0-SNAPSHOT/pipes/getting-started.html#_json_configuration]
JSON Configuration example doesn't work.
When I try the example using the JSON Configuration I get the following:
{code:java}
INFO [pool-2-thread-1] 08:52:11,748
org.apache.tika.pipes.core.server.FetchHandler Couldn't initialize fetcher for
fetch id=MyTestFile.pdf
org.apache.tika.pipes.api.fetcher.FetcherNotFoundException: Can't find fetcher
for id=fsf. Available: []{code}
I assume it is because there is no 'pipes-iterator' in the configuration and it
is picking up a default.
In my tika-config.json I changed the ids to 'fsf' and 'fse' and got the same
error.
I noticed that the structure of the 'fetchers' and 'emitters' in this example
differs from the one in 1. above.
This example uses an array whose entries carry an 'id' key/value pair, whereas
the one in 1. above uses a map with the id as the key.
I changed the structure to reflect what is in 1. above and it worked (if I left
the 'id' key in there I got an error saying 'id' wasn't valid).
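The manual change described above can be sketched as a small conversion helper. This is only an illustration under assumptions: it treats each array entry as a flat object that carries its own 'id', which may not match how the real example nests its settings, and the function name and sample values are hypothetical.

```python
def entries_to_id_map(entries):
    """Convert [{"id": "fsf", ...}, ...] into {"fsf": {...}, ...}.

    The 'id' key is dropped from each value, since leaving it in place
    was reported above to be rejected as invalid.
    """
    result = {}
    for entry in entries:
        if "id" not in entry:
            raise ValueError(f"entry has no 'id': {entry!r}")
        entry_id = entry["id"]
        if entry_id in result:
            raise ValueError(f"duplicate id: {entry_id!r}")
        # Copy every setting except 'id' under the id used as the map key.
        result[entry_id] = {k: v for k, v in entry.items() if k != "id"}
    return result


# Hypothetical array-style entry; 'basePath' value is a placeholder.
array_style = [{"id": "fsf", "basePath": "input-dir"}]
print(entries_to_id_map(array_style))  # prints {'fsf': {'basePath': 'input-dir'}}
```

Whether the array form is meant to be supported at all is exactly question a. below, so this helper should be read as a workaround sketch, not as the intended configuration format.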
I noticed a lot of test files in the repository that have the format listed in
the Getting Started section.
*My questions are:*
a. what structure(s) of the 'fetchers' and 'emitters' are supported?
b. what should the example configuration be?
*3a.* There is a note below the command to run the config file: ??'The -i and
-o flags override the basePath values in the config when used with tika-app.'??
I'm not seeing this. The values used come from the 'basePath'. If neither the
'-i' directory given on the command line nor the 'basePath' directory in the
config file exists, I get this message about the value in the config file:
Exception in thread "main" java.lang.RuntimeException:
java.lang.IllegalArgumentException: "basePath" directory does not exist:
L:\Apache-Tika\batch-inputxxx