I wanted to get some preliminary feedback before filing this proposal as one or more Jira issues:

Package the Solr Data Import Handler (DIH) and Solr Cell as standalone jars with command-line interfaces, so that they can run as separate processes. This would promote more efficient distributed processing, both by separating them from the Solr JVM and by allowing multiple instances to run in parallel on multiple machines. It would also make it easier for mere mortals to customize the ingestion code without diving deep into core Solr.

There are four motivations:

1. DIH and Solr Cell are both cumbersome to use in the form of Solr request handlers.
2. Too much compute-intensive and memory-intensive code runs in the Solr server JVM, such as Tika, PDFBox, etc.
3. A desire to exploit the parallelism of modern computing clouds, and possibly to co-locate the ingestion process near the data source, separated from Solr itself.
4. A desire to make it easier to customize complex ingestion processes without overloading and compromising “core” Solr processing.

Whether new Solr APIs would be needed to enable the separation is unknown for the moment.

Are there any reasons or strong objections not to file this proposal as one or more Jira issues?

Whether DIH and Solr Cell would be removed from the Solr server would be an independent issue. The goal here is simply to be able to run any number of DIH and Solr Cell processes separate from the Solr server.

One way to think of these separate jars would be as “grown up” siblings of the current SimplePostTool.

-- Jack Krupansky
