I have one more question w.r.t. the multi-threaded DIH - what would be the
logic behind distributing tasks to threads?

I am sorry to have not mentioned this earlier - in my case, I take a "count
query" parameter as a configuration element. Based on this count and
maxNumberOfDIHInstances, task-assignment scheduling is done by "injecting"
limit and offset values into the import query for each DIH instance.
This is one of the reasons why I call it a ParallelDataImportHandler.
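
To make that concrete, the scheduling essentially boils down to slicing the
row count into (limit, offset) pairs. This is a simplified sketch (the table
name and numbers are illustrative, and I am glossing over how slices are
queued to free instances):

// Simplified sketch of the task-assignment logic: the result of the
// configured "count query" is divided across maxNumberOfDIHInstances,
// and each (limit, offset) pair is injected into one instance's import query.
public class SliceScheduler {
  public static void main(String[] args) {
    int totalRows = 5000000;            // e.g. result of "SELECT COUNT(*) FROM products"
    int maxNumberOfDIHInstances = 4;    // from the handler's configuration

    int batchSize = (totalRows + maxNumberOfDIHInstances - 1) / maxNumberOfDIHInstances;
    for (int i = 0; i < maxNumberOfDIHInstances; i++) {
      int offset = i * batchSize;
      int limit = Math.min(batchSize, totalRows - offset);
      // injected into instance i's query, e.g. "SELECT ... LIMIT <limit> OFFSET <offset>"
      System.out.println("instance " + i + ": LIMIT " + limit + " OFFSET " + offset);
    }
  }
}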

Cheers
Avlesh

On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh <avl...@gmail.com> wrote:

> run the add() calls to Solr in a dedicated thread
>
> Makes absolute sense. This would effectively mean that DIH sits on top of
> all the add/update operations, making it easier to implement a
> multi-threaded DIH.
>
> I will create a JIRA issue right away.
> However, I would still love to see responses to my problems due to the
> limitations in 1.3.
>
> Cheers
> Avlesh
>
> 2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>
>
>> a multithreaded DIH is on my top-priority list. There are multiple
>> approaches:
>>
>> 1) create multiple dataImporter instances in the same DIH instance, run
>> them in parallel, and commit when all of them are done
>> 2) run the add() calls to Solr in a dedicated thread
>> 3) make DIH automatically multithreaded. This is much harder to
>> implement.
>>
>> but #1 and #2 can be implemented with ease. It does not have to be
>> another implementation called ParallelDataImportHandler. I believe it
>> can be done in DIH itself.
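>>
>> for #2, something along these lines should do. This is an untested
>> sketch, not actual DIH code; SolrJ's SolrServer stands in for whatever
>> writer DIH really uses, and the end-of-stream marker is just one way to
>> signal completion:
>>
>> import java.util.concurrent.BlockingQueue;
>> import org.apache.solr.client.solrj.SolrServer;
>> import org.apache.solr.common.SolrInputDocument;
>>
>> // The import thread(s) put documents on the queue; this single thread is
>> // the only caller of add(), and it commits once when it sees the marker.
>> public class DedicatedAddThread extends Thread {
>>   private final BlockingQueue<SolrInputDocument> queue;
>>   private final SolrInputDocument endMarker;  // "poison pill" signalling end-of-stream
>>   private final SolrServer server;            // stand-in for DIH's internal writer
>>
>>   public DedicatedAddThread(BlockingQueue<SolrInputDocument> queue,
>>                             SolrInputDocument endMarker, SolrServer server) {
>>     this.queue = queue;
>>     this.endMarker = endMarker;
>>     this.server = server;
>>   }
>>
>>   public void run() {
>>     try {
>>       SolrInputDocument doc;
>>       while ((doc = queue.take()) != endMarker) {
>>         server.add(doc);   // all add() calls funnel through this one thread
>>       }
>>       server.commit();     // a single commit when the import is over
>>     } catch (Exception e) {
>>       // real code would log and abort/roll back the import
>>     }
>>   }
>> }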
>>
>> You may not need to create a project on Google Code. You can open a
>> JIRA issue and start posting patches, and we can put it back into Solr.
>>
>>
>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh<avl...@gmail.com> wrote:
>> > In my quest to improve indexing time (in a multi-core environment), I
>> > tried writing a Solr RequestHandler called ParallelDataImportHandler.
>> > I had a few lame questions to begin with, which Noble and Shalin
>> > answered here -
>> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
>> >
>> > As the name suggests, the handler, when invoked, tries to execute
>> > multiple DIH instances on the same core in parallel. Of course, the
>> > catch here is that only those data sources that can be batched can
>> > benefit from this handler. In my case, I am writing this for imports
>> > from a MySQL database. So I have a single data-config.xml in which the
>> > query has to add placeholders for "limit" and "offset". Each DIH
>> > instance uses the same data-config file and substitutes its own values
>> > for the limit and offset (which are in fact supplied by the parent
>> > ParallelDataImportHandler).
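>> >
>> > To make the placeholder bit concrete, this is roughly how one slice is
>> > handed to a DIH instance. Treat it as a simplified sketch rather than
>> > the actual handler code - the example query, the parameter names and
>> > the ${dataimporter.request.*} references are for illustration:
>> >
>> > import org.apache.solr.common.params.ModifiableSolrParams;
>> > import org.apache.solr.core.SolrCore;
>> > import org.apache.solr.handler.dataimport.DataImportHandler;
>> > import org.apache.solr.request.LocalSolrQueryRequest;
>> > import org.apache.solr.request.SolrQueryRequest;
>> > import org.apache.solr.request.SolrQueryResponse;
>> >
>> > // The shared data-config query would carry the placeholders, e.g.
>> > //   query="select * from products
>> > //          limit ${dataimporter.request.limit}
>> > //          offset ${dataimporter.request.offset}"
>> > class SliceRunner {
>> >   void runSlice(SolrCore core, DataImportHandler dih, int limit, int offset) {
>> >     ModifiableSolrParams params = new ModifiableSolrParams();
>> >     params.set("command", "full-import");
>> >     params.set("clean", "false");    // at most one instance should clean the index
>> >     params.set("commit", "false");   // see question 4 further down
>> >     params.set("limit", String.valueOf(limit));
>> >     params.set("offset", String.valueOf(offset));
>> >     SolrQueryRequest req = new LocalSolrQueryRequest(core, params);
>> >     core.execute(dih, req, new SolrQueryResponse());
>> >   }
>> > }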
>> >
>> > I am achieving this by making my handler SolrCoreAware and creating
>> > maxNumberOfDIHInstances (configurable) instances in the inform method.
>> > These instances are then initialized and registered with the core.
>> > Whenever a request comes in, the ParallelDataImportHandler delegates
>> > the task to these instances, schedules the remainder, and aggregates
>> > the responses from each of these instances to return to the user.
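>> >
>> > For completeness, the inform() wiring boils down to something like the
>> > snippet below. It is simplified from what I actually have; the pool
>> > class and the handler paths are made up, and I am assuming
>> > SolrCore.registerRequestHandler is the right way to register the
>> > instances:
>> >
>> > import java.util.ArrayList;
>> > import java.util.List;
>> > import org.apache.solr.common.util.NamedList;
>> > import org.apache.solr.core.SolrCore;
>> > import org.apache.solr.handler.dataimport.DataImportHandler;
>> >
>> > // Create the configured number of DIH instances, give each the same init
>> > // args (and hence the same data-config.xml), and register them with the core.
>> > class DIHInstancePool {
>> >   List<DataImportHandler> create(SolrCore core, NamedList initArgs, int maxNumberOfDIHInstances) {
>> >     List<DataImportHandler> instances = new ArrayList<DataImportHandler>();
>> >     for (int i = 0; i < maxNumberOfDIHInstances; i++) {
>> >       DataImportHandler dih = new DataImportHandler();
>> >       dih.init(initArgs);      // same configuration for every instance
>> >       dih.inform(core);        // lets the instance resolve its config against this core
>> >       core.registerRequestHandler("/dataimport-" + i, dih);  // path name is made up
>> >       instances.add(dih);
>> >     }
>> >     return instances;
>> >   }
>> > }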
>> >
>> > Thankfully, all of this worked, and preliminary benchmarking with 5
>> > million records indicated a 50% decrease in re-indexing time. Moreover,
>> > all my cores (Solr in my case is hosted on a quad-core machine) showed
>> > above 70% CPU utilization. All that I could have asked for!
>> >
>> > With respect to this whole thing, I have a few questions -
>> >
>> >   1. Is something similar available out of the box?
>> >   2. Is the idea flawed? Is the approach fundamentally correct?
>> >   3. I am using Solr 1.3, and DIH did not have "EventListeners" in the
>> >   stone age. I need to know when a DIH instance is done with its task
>> >   (mostly the "commit" operation), and I could not figure out a clean
>> >   way. As a hack, I keep pinging the DIH instances with command=status
>> >   at regular intervals (in a separate thread) to figure out whether an
>> >   instance is free to be assigned some task (see the sketch after this
>> >   list). This works, but with the overhead of unnecessary wasted CPU
>> >   cycles. Is there a better approach?
>> >   4. I could improve the indexing time even further if there were a way
>> >   for me to tell a DIH instance not to open a new IndexSearcher. In the
>> >   current scheme of things, as soon as one DIH instance is done
>> >   committing, a new searcher is opened. This blocks the other DIH
>> >   instances (which were still active), and they cannot continue until
>> >   the searcher is initialized. Is there a way I can implement a single
>> >   commit once all these DIH instances are done with their tasks (also
>> >   sketched after this list)? I tried each DIH instance with
>> >   commit=false, without luck.
>> >   5. Can this implementation be extended to support the other data
>> >   sources supported in DIH (HTTP, file, URL etc.)?
>> >   6. If the utility is worth it, can I host this on Google Code as an
>> >   open-source contrib?
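>> >
>> > For reference, the hack in 3 (and the single commit I am hoping for in
>> > 4) looks roughly like this. It is a simplified sketch rather than my
>> > real code; the "idle" check against the status response and the commit
>> > through the update handler are approximations:
>> >
>> > import java.util.List;
>> > import java.util.concurrent.Executors;
>> > import java.util.concurrent.ScheduledExecutorService;
>> > import java.util.concurrent.TimeUnit;
>> > import org.apache.solr.common.params.ModifiableSolrParams;
>> > import org.apache.solr.core.SolrCore;
>> > import org.apache.solr.handler.dataimport.DataImportHandler;
>> > import org.apache.solr.request.LocalSolrQueryRequest;
>> > import org.apache.solr.request.SolrQueryResponse;
>> > import org.apache.solr.update.CommitUpdateCommand;
>> >
>> > // Poll every instance with command=status; once all of them are idle
>> > // (and no slices are pending), issue one explicit commit for the run.
>> > class StatusPoller {
>> >   void start(final SolrCore core, final List<DataImportHandler> instances) {
>> >     final ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
>> >     poller.scheduleWithFixedDelay(new Runnable() {
>> >       public void run() {
>> >         boolean allIdle = true;
>> >         for (DataImportHandler dih : instances) {
>> >           ModifiableSolrParams p = new ModifiableSolrParams();
>> >           p.set("command", "status");
>> >           SolrQueryResponse rsp = new SolrQueryResponse();
>> >           core.execute(dih, new LocalSolrQueryRequest(core, p), rsp);
>> >           if (!"idle".equals(rsp.getValues().get("status"))) {
>> >             allIdle = false;   // still busy; a free instance would get the next slice here
>> >           }
>> >         }
>> >         if (allIdle) {
>> >           try {
>> >             core.getUpdateHandler().commit(new CommitUpdateCommand(false));  // single commit
>> >           } catch (Exception e) {
>> >             // log
>> >           }
>> >           poller.shutdown();
>> >         }
>> >       }
>> >     }, 5, 5, TimeUnit.SECONDS);
>> >   }
>> > }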
>> >
>> > Any help will be deeply acknowledged and appreciated. While suggesting,
>> > please don't forget that I am using Solr 1.3. If it all goes well, I
>> > don't mind writing one for Solr 1.4.
>> >
>> > Cheers
>> > Avlesh
>> >
>>
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>
>
>
