Re: Queries regarding a "ParallelDataImportHandler"

Avlesh Singh Sun, 02 Aug 2009 08:10:12 -0700

>
> run the add() calls to Solr in a dedicated thread

Makes absolute sense. This would actually mean that DIH sits on top of all the
add/update operations, making it easier to implement a multi-threaded DIH.
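
To make the idea concrete, here is a rough sketch of "add() calls in a dedicated thread", written as plain Java rather than against any Solr or DIH class. Everything in it is an assumption for illustration: the class name DedicatedAddThread, the queue size of 1000, and the addToIndex callback that stands in for whatever actually hands a document to Solr.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical illustration only: none of these names exist in Solr or DIH.
class DedicatedAddThread {
    // Unique marker object; queueing it tells the writer thread to stop.
    private static final Map<String, Object> END = new HashMap<>();

    // Bounded queue, so fast producers block instead of exhausting memory.
    private final BlockingQueue<Map<String, Object>> queue = new LinkedBlockingQueue<>(1000);
    private final Thread writer;

    // addToIndex stands in for the real "add this document" call.
    DedicatedAddThread(Consumer<Map<String, Object>> addToIndex) {
        writer = new Thread(() -> {
            try {
                // Take documents one at a time until the END marker arrives.
                for (Map<String, Object> doc = queue.take(); doc != END; doc = queue.take()) {
                    addToIndex.accept(doc);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "add-writer");
        writer.start();
    }

    // Called by any number of import threads.
    void submit(Map<String, Object> doc) {
        try {
            queue.put(doc);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing a document", e);
        }
    }

    // Signals the end of input and waits until every queued document was added.
    void finish() {
        try {
            queue.put(END);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the writer", e);
        }
    }
}
```

Import threads would call submit() for each row they build and a coordinator would call finish() before issuing a commit. Error handling inside the writer thread is left out of the sketch.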


I will create a JIRA issue right away.
However, I would still love to see responses to the problems I described,
which stem from limitations in 1.3.

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>

> a multithreaded DIH is in my top priority list. There are multiple
> approaches:
>
> 1) create multiple dataImporter instances in the same DIH
> instance, run them in parallel, and commit when all of them are done
> 2) run the add() calls to Solr in a dedicated thread
> 3) make DIH automatically multithreaded. This is much harder to implement.
>
> but #1 and #2 can be implemented with ease. It does not have to
> be another implementation called ParallelDataImportHandler. I believe
> it can be done in DIH itself
>
> you may not need to create a project in google code. you can open a
> JIRA issue and start posting patches and we can put it back into Solr.
>
> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh<avl...@gmail.com> wrote:
> > In my quest to improve indexing time (in a multi-core environment), I
> > tried writing a Solr RequestHandler called ParallelDataImportHandler.
> > I had a few lame questions to begin with, which Noble and Shalin answered
> > here -
> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >
> > As the name suggests, the handler, when invoked, tries to execute
> > multiple DIH instances on the same core in parallel. Of course, the catch
> > here is that only data sources that can be batched can benefit from this
> > handler. In my case, I am writing this for import from a MySQL database.
> > So I have a single data-config.xml, in which the query has to include
> > placeholders for "limit" and "offset". Each DIH instance uses the same
> > data-config file and substitutes its own values for the limit and offset
> > (which are in fact supplied by the parent ParallelDataImportHandler).
> >
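
The limit/offset batching described above can be sketched in a few lines of plain Java. This is not the handler's actual code: the class OffsetWindows, the ${limit} and ${offset} placeholder syntax, and the table and column names in the usage note are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not taken from the handler discussed in this thread.
class OffsetWindows {
    // One batch of rows: start at 'offset' and read at most 'limit' rows.
    static final class Window {
        final long offset;
        final long limit;

        Window(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    // Splits totalRows into consecutive windows of at most batchSize rows each.
    static List<Window> split(long totalRows, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Window> windows = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += batchSize) {
            windows.add(new Window(offset, Math.min(batchSize, totalRows - offset)));
        }
        return windows;
    }

    // Fills the assumed ${limit} and ${offset} placeholders of a query template.
    static String render(String queryTemplate, Window w) {
        return queryTemplate
                .replace("${limit}", Long.toString(w.limit))
                .replace("${offset}", Long.toString(w.offset));
    }
}
```

For example, split(10, 4) yields the windows (offset 0, limit 4), (offset 4, limit 4) and (offset 8, limit 2), and render("SELECT id FROM item ORDER BY id LIMIT ${limit} OFFSET ${offset}", ...) produces one query per window. Note that MySQL only guarantees non-overlapping batches when the query has a stable ORDER BY.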
> > I am achieving this by making my handler SolrCoreAware, and creating
> > maxNumberOfDIHInstances (configurable) DIH instances in the inform
> > method. These instances are then initialized and registered with the
> > core. Whenever a request comes in, the ParallelDataImportHandler
> > delegates the task to these instances, schedules the remainder and
> > aggregates responses from each of these instances to return to the user.
> >
> > Thankfully, all of this worked, and preliminary benchmarking with
> > 5 million records indicated a 50% decrease in re-indexing time. Moreover,
> > all my cores (Solr in my case is hosted on a quad-core machine) showed
> > above 70% CPU utilization. All that I could have asked for!
> >
> > With respect to this whole thing, I have a few questions -
> >
> >   1. Is something similar available out of the box?
> >   2. Is the idea flawed? Is the approach fundamentally correct?
> >   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
> >   age. I need to know if a DIH instance is done with its task (mostly
> >   the "commit" operation). I could not figure out a clean way. As a
> >   hack, I keep pinging the DIH instances with command=status at regular
> >   intervals (in a separate thread) to figure out whether each one is
> >   free to be assigned some task. This works, but obviously with the
> >   overhead of unnecessarily wasted CPU cycles. Is there a better
> >   approach?
> >   4. I could improve the time taken even further if there were a way
> >   for me to tell a DIH instance not to open a new IndexSearcher. In the
> >   current scheme of things, as soon as one DIH instance is done
> >   committing, a new searcher is opened. This blocks the other DIH
> >   instances (which were active), and they cannot continue until the
> >   searcher is initialized. Is there a way I can implement a single
> >   commit once all these DIH instances are done with their tasks? I
> >   tried each DIH instance with commit=false, without luck.
> >   5. Can this implementation be extended to support other data sources
> >   supported in DIH (HTTP, File, URL etc)?
> >   6. If the utility is worth it, can I host this on Google Code as an
> >   open source contrib?
> >
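
Questions 3 and 4 both come down to "wait until every worker is done, then commit once". Outside of DIH, that pattern can be sketched with java.util.concurrent as below. This is a generic illustration under assumptions, not something Solr 1.3 DIH offers: each Callable stands in for one import worker returning its document count, and the commit Runnable stands in for whatever issues the single commit.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: these names are not Solr or DIH APIs.
class CommitOnce {
    // Runs every batch on a fixed pool, waits for all of them without polling,
    // then calls commit exactly once. Returns the total reported by the batches.
    static int runAllThenCommit(List<Callable<Integer>> batches, int threads, Runnable commit) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int total = 0;
            // invokeAll blocks until every batch has finished, so no status pings are needed.
            for (Future<Integer> result : pool.invokeAll(batches)) {
                total += result.get();
            }
            commit.run();
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for batches", e);
        } catch (ExecutionException e) {
            // A failed batch skips the commit; the caller decides what to do next.
            throw new IllegalStateException("a batch failed, commit skipped", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the calling thread blocks inside invokeAll until the workers finish, there is no status-polling thread, and the commit (and therefore the new searcher) happens only once at the end.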
> > Any help will be deeply acknowledged and appreciated. While suggesting,
> > please don't forget that I am using Solr 1.3. If it all goes well, I
> > don't mind writing one for Solr 1.4.
> >
> > Cheers
> > Avlesh
> >
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>
