[ https://issues.apache.org/jira/browse/TIKA-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas DiPiazza updated TIKA-4547:
------------------------------------
Description:
Plan: Enable Distributed State Management for Tika Pipes Clustering
The current Tika Pipes architecture stores Fetcher/Emitter/PipesIterator
configurations in local memory (ExpiringFetcherStore using synchronized
HashMaps), making it impossible to create a fetcher on one server and use it on
another. This plan introduces a pluggable distributed state abstraction to
enable true clustering for both gRPC and REST servers.
was:
Plan: Enable Distributed State Management for Tika Pipes Clustering
The current Tika Pipes architecture stores Fetcher/Emitter/PipesIterator
configurations in local memory (ExpiringFetcherStore using synchronized
HashMaps), making it impossible to create a fetcher on one server and use it on
another. This plan introduces a pluggable distributed state abstraction to
enable true clustering for both gRPC and REST servers.
* Create StateStore abstraction in tika-pipes-api as an interface with methods
put(String key, byte[] value), get(String key), delete(String key), list(), and
lifecycle operations, allowing pluggable implementations (in-memory, Apache
Ignite, Redis, Hazelcast, etc.).
* Refactor ExpiringFetcherStore to use StateStore in TikaGrpcServerImpl.java,
replacing Collections.synchronizedMap with StateStore API calls for fetchers,
fetcherConfigs, and fetcherLastAccessed maps to enable cross-server state
sharing.
* Create parallel EmitterStore and PipesIteratorStore abstractions mirroring
the ExpiringFetcherStore pattern in tika-pipes-core, applying the same
StateStore-backed approach for Emitters and PipesIterators to achieve full
component distribution.
* Add a StateStoreFactory plugin system in tika-pipes-core using the PF4J
pattern (similar to FetcherManager and EmitterManager), loading implementations
from the Tika config's stateStore section with a default in-memory
implementation.
* Update PipesConfig to include state store configuration in PipesConfig.java
with fields like stateStoreClass and stateStoreParams, ensuring backward
compatibility with local-only deployments via sensible defaults.
* Make PipesClient and PipesServer state-aware by injecting StateStore references
in PipesClient.java and PipesServer.java, enabling forked processes to retrieve
fetcher/emitter configs from distributed store rather than requiring XML
rewrites.
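
For illustration only, the StateStore abstraction from the first bullet might look roughly like the sketch below. The method names put, get, delete, and list come from the plan; everything else is an assumption made for the example: the return types, the use of close() as the only lifecycle operation, the null-on-miss convention, and the InMemoryStateStore class name. None of this is existing Tika code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the StateStore interface named in the plan.
// Signatures beyond the method names are assumptions, not Tika API.
interface StateStore extends AutoCloseable {
    // Stores the serialized component config under the given key.
    void put(String key, byte[] value);

    // Returns the stored bytes, or null if the key is absent (assumed convention).
    byte[] get(String key);

    // Removes the key; returns true if an entry was actually removed.
    boolean delete(String key);

    // Returns all known keys in sorted order.
    List<String> list();

    // Lifecycle hook; a distributed backend would release connections here.
    @Override
    void close();
}

// Assumed default implementation for local-only deployments.
// A Redis-, Ignite-, or Hazelcast-backed class would implement the same interface.
final class InMemoryStateStore implements StateStore {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    @Override
    public void put(String key, byte[] value) {
        // Defensive copy so later changes to the caller's array do not leak in.
        // Null keys or values are rejected with a NullPointerException.
        entries.put(key, value.clone());
    }

    @Override
    public byte[] get(String key) {
        byte[] value = entries.get(key);
        return value == null ? null : value.clone();
    }

    @Override
    public boolean delete(String key) {
        return entries.remove(key) != null;
    }

    @Override
    public List<String> list() {
        List<String> keys = new ArrayList<>(entries.keySet());
        Collections.sort(keys);
        return keys;
    }

    @Override
    public void close() {
        entries.clear();
    }
}
```

Under this sketch, a refactored ExpiringFetcherStore would serialize each fetcher config to bytes and call put with a key such as "fetcher:my-fetcher-id" (the key scheme is also an assumption), so that any server sharing the same backend could read it back with get.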
> Update tika pipes so that it can be properly clustered
> ------------------------------------------------------
>
> Key: TIKA-4547
> URL: https://issues.apache.org/jira/browse/TIKA-4547
> Project: Tika
> Issue Type: Task
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> Plan: Enable Distributed State Management for Tika Pipes Clustering
> The current Tika Pipes architecture stores Fetcher/Emitter/PipesIterator
> configurations in local memory (ExpiringFetcherStore using synchronized
> HashMaps), making it impossible to create a fetcher on one server and use it
> on another. This plan introduces a pluggable distributed state abstraction to
> enable true clustering for both gRPC and REST servers.
>