A multithreaded DIH is on my top-priority list. There are multiple approaches:

1) create multiple dataImporter instances in the same DIH
instance, run them in parallel, and commit when all of them are done
2) run the add() calls to Solr in a dedicated thread
3) make DIH automatically multithreaded. This is much harder to implement.
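
To make approach 1 concrete, here is a rough Java sketch, not DIH code. The names ParallelImportSketch, BatchImporter, and runImport are made up for illustration, and the importer and commit actions are plain callbacks standing in for whatever DIH would actually do. It assumes the source can be split by offset and limit, as in the MySQL case quoted below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names are DIH or Solr APIs.
class ParallelImportSketch {

    /** Stand-in for "import rows [offset, offset + limit) and add them to the index". */
    interface BatchImporter {
        int importBatch(long offset, long limit) throws Exception;
    }

    /**
     * Splits totalRows into batches, imports them on a fixed pool, waits for
     * every batch to finish, then runs the commit action exactly once.
     * Returns the total number of documents the importers reported.
     */
    static long runImport(long totalRows, long batchSize, int threads,
                          BatchImporter importer, Runnable commit) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (long offset = 0; offset < totalRows; offset += batchSize) {
                final long start = offset;
                final long limit = Math.min(batchSize, totalRows - start);
                tasks.add(() -> importer.importBatch(start, limit));
            }
            long imported = 0;
            // invokeAll blocks until all tasks complete, so no status polling is needed.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                imported += result.get(); // a failed batch surfaces here as ExecutionException
            }
            commit.run(); // single commit, reached only if every batch succeeded
            return imported;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long total = runImport(1_000, 300, 4,
                (offset, limit) -> (int) limit,       // pretend every row became a document
                () -> System.out.println("commit"));
        System.out.println("imported " + total);
    }
}
```

Because the commit runs only after invokeAll returns, no searcher is reopened while other batches are still importing.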

But #1 and #2 can be implemented with ease. It does not have to
be another implementation called ParallelDataImportHandler. I believe
it can be done in DIH itself.
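
Approach 2 (the add() calls on a dedicated thread) could look roughly like the following. Again this is only a sketch under assumptions: DedicatedAddThreadSketch and importAll are invented names, and the addToIndex callback stands in for the real add() call. The reading side puts documents on a bounded queue while one thread drains it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch: addToIndex stands in for the real add() call.
class DedicatedAddThreadSketch {

    // Sentinel telling the writer thread that no more documents will arrive.
    private static final Object END = new Object();

    /** Reads documents on the calling thread and adds them on one dedicated thread. */
    static <T> void importAll(Iterable<T> source, int queueCapacity, Consumer<T> addToIndex)
            throws InterruptedException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicReference<RuntimeException> failure = new AtomicReference<>();

        Thread writer = new Thread(() -> {
            try {
                for (Object item = queue.take(); item != END; item = queue.take()) {
                    if (failure.get() != null) {
                        continue; // keep draining so the reader never blocks forever
                    }
                    try {
                        @SuppressWarnings("unchecked")
                        T doc = (T) item;
                        addToIndex.accept(doc);
                    } catch (RuntimeException e) {
                        failure.set(e);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "add-thread");
        writer.start();

        for (T doc : source) {
            queue.put(doc); // blocks while the queue is full (back-pressure on the reader)
        }
        queue.put(END);
        writer.join();

        if (failure.get() != null) {
            throw failure.get(); // report the first add() failure to the caller
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> added = new ArrayList<>();
        importAll(Arrays.asList("doc-1", "doc-2", "doc-3"), 2, added::add);
        System.out.println(added); // [doc-1, doc-2, doc-3]
    }
}
```

The bounded queue keeps memory in check: if adding is slower than reading, the reader simply waits.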

You may not need to create a project on Google Code. You can open a
JIRA issue and start posting patches, and we can put it back into Solr.


On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh<avl...@gmail.com> wrote:
> In my quest to improve indexing time (in a multi-core environment), I tried
> writing a Solr RequestHandler called ParallelDataImportHandler.
> I had a few lame questions to begin with, which Noble and Shalin answered
> here -
> http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> As the name suggests, the handler, when invoked, tries to execute multiple
> DIH instances on the same core in parallel. Of course, the catch here is
> that only those data sources that can be batched can benefit from this
> handler. In my case, I am writing this for import from a MySQL database. So,
> I have a single data-config.xml, in which the query has to add placeholders
> for "limit" and "offset". Each DIH instance uses the same data-config file,
> and substitutes its own values for the limit and offset (which are in fact
> supplied by the parent ParallelDataImportHandler).
> I am achieving this by making my handler SolrCoreAware, and creating
> maxNumberOfDIHInstances (configurable) in the inform method. These instances
> are then initialized and  registered with the core. Whenever a request comes
> in, the ParallelDataImportHandler delegates the task to these instances,
> schedules the remainder and aggregates responses from each of these
> instances to return back to the user.
> Thankfully, all of this worked, and preliminary benchmarking with 5 million
> records indicated a 50% decrease in re-indexing time. Moreover, all my cores
> (Solr in my case is hosted on a quad-core machine), indicated above 70% CPU
> utilization. All that I could have asked for!
> With respect to this whole thing, I have a few questions -
>   1. Is something similar available out of the box?
>   2. Is the idea flawed? Is the approach fundamentally correct?
>   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
>   age. I need to know if a DIH instance is done with its task (mostly the
>   "commit" operation). I could not figure out a clean way. As a hack, I keep
>   pinging the DIH instances with command=status at regular intervals (in a
>   separate thread) to figure out if it is free to be assigned some task. This
>   works, but obviously with an overhead of unnecessary wasted CPU cycles. Is
>   there a better approach?
>   4. I could better the time taken even further if there was a way for me to
>   tell a DIH instance not to open a new IndexSearcher. In the current scheme
>   of things, as soon as one DIH instance is done committing, a new searcher is
>   opened. This is blocking for other DIH instances (which were active) and
>   they cannot continue without the searcher being initialized. Is there a way
>   I can implement a single commit once all these DIH instances are done with
>   their tasks? I tried each DIH instance with commit=false without luck.
>   5. Can this implementation be extended to support other data-sources
>   supported in DIH (HTTP, File, URL etc)?
>   6. If the utility is worth it, can I host this on Google code as an open
>   source contrib?
> Any help will be deeply acknowledged and appreciated. While suggesting,
> please don't forget that I am using Solr 1.3. If it all goes well, I don't
> mind writing one for Solr 1.4.
> Cheers
> Avlesh

Noble Paul | Principal Engineer| AOL | http://aol.com
