Hello, aanno2.

Don't start it. Threading in DIH was fixed to a certain extent in 3.6.1 under https://issues.apache.org/jira/browse/SOLR-3360, but right after that threads were dropped from DIH altogether, for overall sanity, under https://issues.apache.org/jira/browse/SOLR-3262.

If you really need a certain level of concurrency, declare multiple DataImportHandlers in solrconfig.xml and submit multiple subrequests in parallel, each restricted to its own slice of the data by an explicit filter.
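
To make the parallel-subrequest idea concrete, here is a minimal sketch of the client side, not a tested recipe. The core URL, the handler names (/dataimport0 ... /dataimport3) and the custom request parameters part and parts are all hypothetical; each handler's entity query would have to use them as its explicit filter, for example by reading ${dataimporter.request.part}. Only command, clean and commit are standard DIH request parameters.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions (not from this thread): the core URL, the handler names, and
# the custom "part"/"parts" request parameters are hypothetical.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"
HANDLERS = ["/dataimport0", "/dataimport1", "/dataimport2", "/dataimport3"]


def build_import_urls(core_url, handlers):
    """Return one full-import URL per handler, each tagged with its slice."""
    urls = []
    for part, handler in enumerate(handlers):
        params = urlencode({
            "command": "full-import",
            "clean": "false",        # keep documents written by the other slices
            "commit": "true",
            "part": part,            # hypothetical: index of this slice
            "parts": len(handlers),  # hypothetical: total number of slices
        })
        urls.append(f"{core_url}{handler}?{params}")
    return urls


def http_get(url):
    """Fetch a URL and return the HTTP status code."""
    with urlopen(url) as response:
        return response.status


def start_imports(urls, fetch=http_get):
    """Send all import requests concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(fetch, urls))
```

Calling start_imports(build_import_urls(SOLR_CORE_URL, HANDLERS)) would fire all four requests at once. Note clean=false: a full-import cleans the index by default, so without it each slice would delete what the others had indexed. A full-import request normally returns before the import has finished, so progress still has to be polled on each handler with command=status.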
Good luck. You'd be better off trying a full-fledged ETL tool rather than band-aiding DIH.

On Tue, Jan 7, 2020 at 1:50 PM aanno.trash <[email protected]> wrote:

> Hello,
>
> I looked a bit into the code of DIH (solr dataimporthandler and
> dataimporthandler-extra). I wonder what the state of this code is. It is
> in a 'contrib' folder and seems to work (and to be maintained). But is
> there ongoing development (e.g. additional features)?
>
> The reason I'm asking is that I'm in a project where DIH is used.
> However, the import is very slow, especially into a solr cluster. I
> glanced over the code for my case and it looks like DIH is only
> single-threaded. I guess that changing DIH to support multi-threading on
> the 'root' (top level) entity should result in a dramatic performance
> boost.
>
> Hence I hacked DIH a bit. To get started, I concentrated on the 'tika'
> example case with a bunch of private PDFs and only for a 'full-import'.
> From this (dirty) experiment, a multi-threaded DIH seems to be possible.
> However, some bigger code changes are needed. This is an incomplete list:
>
> * Make VariableResolver immutable and change its interface/contract
> * All EntityProcessors seem to be written with only a single thread in
>   mind. I circumvented the problem by (a) supporting a clone operation
>   and (b) cloning the EntityProcessors for each thread.
> * To make the code easier to work with, I introduced several interfaces
>   where only abstract classes existed before (Context, DataSource,
>   DIHProperties, EntityProcessor, ...). Perhaps this is not strictly
>   needed, but it has simplified the refactoring substantially.
>
> So this is my question: Would you consider the contribution of a BIG DIH
> change for merging into the project? Or is DIH just dead and should go
> away soon? And if you would consider the contribution, would it be best
> with several small changes or with a 'big-bang' pull request? Would you
> consider the contribution even if some features of DIH are dropped?
> (From my experiment, a very hot candidate to drop is the
> XPathEntityProcessor.)
>
> Kind regards,
>
> aanno2

--
Sincerely yours
Mikhail Khludnev
