I wanted to get some preliminary feedback before filing this proposal as one or 
more Jira issues:

Package the Solr Data Import Handler (DIH) and Solr Cell as standalone jars 
with command-line interfaces, so that they run as separate processes. This 
would promote more efficient distributed processing, both by moving them out 
of the Solr JVM and by allowing multiple instances to run in parallel on 
multiple machines. It would also make it easier for mere mortals to customize 
the ingestion code without diving deep into core Solr.

There are four motivations:

1. DIH and SolrCell are both cumbersome to use in the form of Solr request 
handlers.

2. Too much compute-intensive and memory-intensive code, such as Tika, PDFBox, 
etc., runs in the Solr server JVM.

3. A desire to exploit the parallelism of modern computing clouds, and possibly 
to co-locate the ingestion process near the data source, separate from Solr 
itself.

4. Make it easier to customize complex ingestion processes without overloading 
and compromising “core” Solr processing.

Whether new Solr APIs would be needed to enable this separation is unknown for 
the moment.

Are there any reasons not to file this proposal in Jira, or any strong 
objections to doing so?

Whether DIH and Solr Cell would be removed from the Solr server would be an 
independent issue. The goal here is simply to be able to run any number of DIH 
and Solr Cell processes separate from the Solr server.

One way to think of these separate jars would be as “grown up” siblings of the 
current SimplePostTool.
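
To make the shape of such a separate process concrete, here is a rough sketch 
(in Python only for brevity; the actual tools would be Java jars). It assumes 
Solr's JSON update handler at /solr/<collection>/update; the function names, 
the batch size, and the commit-on-last-batch policy are all hypothetical and 
are not part of DIH, Solr Cell, or SimplePostTool.

```python
import json
import urllib.request


def build_update_request(solr_url, collection, docs, commit=False):
    """Build (but do not send) an HTTP POST for Solr's JSON update handler."""
    url = f"{solr_url.rstrip('/')}/{collection}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def batches(docs, size):
    """Yield successive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def ingest(solr_url, collection, docs, batch_size=100):
    """Post documents in batches, committing only with the final batch.

    Hypothetical driver: a real tool would add retries, error handling,
    and the actual extraction work (database reads, Tika parsing) up front.
    """
    chunks = list(batches(docs, batch_size))
    for i, chunk in enumerate(chunks):
        last = i == len(chunks) - 1
        req = build_update_request(solr_url, collection, chunk, commit=last)
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

Since all of the heavy work would happen before the HTTP call, any number of 
these processes could run on other machines, each handling its own slice of 
the source data, while the Solr JVM only sees ordinary update requests.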

-- Jack Krupansky
