https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-14066



On Tue, 7 Jan, 2020, 4:20 PM aanno.trash, <[email protected]> wrote:

> Hello,
>
> I looked a bit into the code of DIH (solr dataimporthandler and
> dataimporthandler-extra). I wonder what the state of this code is. It is
> in a 'contrib' folder and seems to work (and to be maintained). But is there
> ongoing development (e.g. additional features)?
>
> The reason I'm asking is that I'm in a project where DIH is used.
> However, the import is very slow, especially into a Solr cluster. I
> glanced over the code for my case and it looks like DIH is only
> single-threaded. I guess that changing DIH to support multi-threading on
> the 'root' (top level) entity should result in a dramatic performance
> boost.
>
> Hence I hacked DIH a bit. To get started, I concentrated on the 'tika'
> example case with a bunch of private PDFs and only for a 'full-import'.
> From this (dirty) experiment, a multi-threaded DIH seems to be possible.
> However, some bigger code changes are needed. This is an incomplete list:
>
> * Make VariableResolver immutable and change its interface/contract
> * All EntityProcessors seem to be written with only a single thread in
> mind. I circumvented the problem by (a) supporting a clone operation and
> (b) cloning the EntityProcessors for each thread.
> * To make the code easier to handle, I introduced several interfaces where
> only complete abstract classes had been around before (Context, DataSource,
> DIHProperties, EntityProcessor, ...). Perhaps this is not absolutely
> needed, but it has simplified the refactoring substantially.
>
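
For illustration only, the clone-per-thread idea from the list above might
look roughly like the sketch below. None of these classes are the real DIH
types: ImmutableResolver, StatefulProcessor, copy() and importAll() are
hypothetical stand-ins, and the round-robin split of rows across threads is
an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ClonePerThreadSketch {

    // Hypothetical stand-in for an immutable variable resolver:
    // all state is fixed at construction, so it can be shared freely.
    static final class ImmutableResolver {
        private final Map<String, String> vars;

        ImmutableResolver(Map<String, String> vars) {
            this.vars = Collections.unmodifiableMap(new HashMap<>(vars));
        }

        String resolve(String key) {
            return vars.get(key);
        }
    }

    // Hypothetical stand-in for a processor written with one thread in mind.
    static final class StatefulProcessor {
        private int rowsSeen = 0; // mutable state, unsafe to share across threads

        // The "clone operation": each worker gets a fresh, private instance.
        StatefulProcessor copy() {
            return new StatefulProcessor();
        }

        String process(String row, ImmutableResolver resolver) {
            rowsSeen++;
            return resolver.resolve("prefix") + row;
        }
    }

    // Splits the top-level rows round-robin across worker threads.
    static List<String> importAll(List<String> rows, int threads) throws Exception {
        Map<String, String> vars = new HashMap<>();
        vars.put("prefix", "doc-");
        final ImmutableResolver resolver = new ImmutableResolver(vars);
        StatefulProcessor prototype = new StatefulProcessor();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                final StatefulProcessor own = prototype.copy(); // one clone per worker
                Callable<List<String>> task = () -> {
                    List<String> out = new ArrayList<>();
                    for (int i = offset; i < rows.size(); i += threads) {
                        out.add(own.process(rows.get(i), resolver));
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // collected per worker, not in input order
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

The shared resolver needs no locking because it never changes, and the
mutable counter stays private to each clone. Results come back grouped by
worker rather than in input order.
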
> So this is my question: Would you consider the contribution of a BIG DIH
> change for merging into the project? Or is DIH just dead and should go
> away soon? And if you would consider the contribution, would it be best
> with several small changes or with a 'big-bang' pull request? Would you
> consider the contribution even if some features of DIH are dropped?
> (From my experiment, a very hot candidate to drop is the
> XPathEntityProcessor.)
>
> Kind regards,
>
> aanno2
