Hello,

I looked a bit into the code of DIH (Solr dataimporthandler and dataimporthandler-extra), and I wonder what the state of this code is. It lives in a 'contrib' folder, and it seems to work and to be maintained. But is there ongoing development (e.g. additional features)?
The reason I'm asking is that I'm in a project where DIH is used. However, the import is very slow, especially into a Solr cluster. I glanced over the code for my case, and it looks like DIH is single-threaded only. I guess that changing DIH to support multi-threading on the 'root' (top-level) entity should result in a dramatic performance boost.

Hence I hacked DIH a bit. To get started, I concentrated on the 'tika' example case with a bunch of private PDFs, and only on 'full-import'.

From this (dirty) experiment, a multi-threaded DIH seems to be possible. However, some bigger code changes are needed. This is an incomplete list:

* Make VariableResolver immutable and change its interface/contract.
* All EntityProcessors seem to be written with only a single thread in mind. I circumvented the problem by (a) supporting a clone operation and (b) cloning the EntityProcessors for each thread.
* To make the code easier to work with, I introduced several interfaces where only abstract classes existed before (Context, DataSource, DIHProperties, EntityProcessor, ...). Perhaps this is not absolutely needed, but it has simplified the refactoring substantially.

So these are my questions:

Would you consider the contribution of a BIG DIH change for merging into the project? Or is DIH just dead and going away soon?

And if you would consider the contribution, would it be best as several small changes or as a 'big-bang' pull request?

Would you consider the contribution even if some features of DIH are dropped? (From my experiment, a very hot candidate to drop is the XPathEntityProcessor.)

Kind regards,

aanno2
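
P.S. The clone-per-thread idea from the list above could look roughly like the following sketch. It is not the actual DIH change and does not use the real DIH classes: RowProcessor, CountingProcessor and ParallelImportSketch are hypothetical names invented for illustration, and the 'root entity' is reduced to a plain list of strings. The only point is that each worker thread gets its own copy of a processor that is not thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for an EntityProcessor; NOT the real DIH interface.
interface RowProcessor {
    RowProcessor copy();              // fresh instance sharing no mutable state
    String process(String rootItem);  // turn one root-entity item into a "document"
}

// Toy processor with per-instance mutable state, so sharing one
// instance between threads would be unsafe.
final class CountingProcessor implements RowProcessor {
    private int seen = 0; // deliberately not thread-safe

    @Override
    public RowProcessor copy() {
        return new CountingProcessor();
    }

    @Override
    public String process(String rootItem) {
        seen++;
        return "doc:" + rootItem;
    }
}

final class ParallelImportSketch {
    // Processes all root items with `threads` workers; each worker owns one clone.
    static List<String> importAll(List<String> rootItems, RowProcessor prototype, int threads) {
        Queue<String> work = new ConcurrentLinkedQueue<>(rootItems);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                RowProcessor own = prototype.copy(); // one clone per worker thread
                futures.add(pool.submit(() -> {
                    List<String> docs = new ArrayList<>();
                    String item;
                    // Workers pull root items from the shared queue until it is empty.
                    while ((item = work.poll()) != null) {
                        docs.add(own.process(item));
                    }
                    return docs;
                }));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // waits for the worker to finish
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("import failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape, the order of the resulting documents depends on thread scheduling, so a real change would also have to decide whether import order matters.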
