Re: Queries regarding a "ParallelDataImportHandler"

Avlesh Singh Sun, 02 Aug 2009 09:10:09 -0700

>
> There can be a batch command (which) will take in multiple commands in one
> http request.
>
You seem to be obsessed with this approach, Noble.
SOLR-1093 <http://issues.apache.org/jira/browse/SOLR-1093> also echoes
the same sentiment :)
I personally find this approach a bit restrictive and difficult to adapt to.
IMHO, it is better handled as configuration, i.e. the user tells us how the
single task can be "batched" (or 'sliced', as you call it) while configuring
the Parallel (or MultiThreaded) DIH inside solrconfig.


As an example, for non-JDBC data sources where batching might be difficult
to achieve in an abstract way, the user might choose to configure separate
data-config.xml files (for different DIH instances) altogether.
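
For the JDBC case, the slicing I have in mind is roughly the following. This
is only a sketch in Python, not the handler's actual Java code. The names
plan_slices and inject_slice, the "item" table and the query template are
made up for illustration, and it assumes the count query has already
returned the total row count and that one slice per DIH instance is enough.

```python
def plan_slices(total_rows, max_instances):
    """Split total_rows into at most max_instances contiguous (offset, limit) pairs.

    Hypothetical helper: total_rows stands in for the result of the
    "count query", max_instances for maxNumberOfDIHInstances.
    """
    if total_rows < 0 or max_instances < 1:
        raise ValueError("total_rows must be >= 0 and max_instances >= 1")
    if total_rows == 0:
        return []
    # Ceiling division, so the last slice absorbs any shortfall.
    per_slice = -(-total_rows // max_instances)
    slices = []
    offset = 0
    while offset < total_rows:
        limit = min(per_slice, total_rows - offset)
        slices.append((offset, limit))
        offset += limit
    return slices


def inject_slice(query_template, offset, limit):
    """Fill the {limit} and {offset} placeholders of an import query."""
    return query_template.format(limit=limit, offset=offset)


if __name__ == "__main__":
    # Assumed template; an ORDER BY keeps the slices stable across queries.
    template = "SELECT id, name FROM item ORDER BY id LIMIT {limit} OFFSET {offset}"
    for offset, limit in plan_slices(10, 4):
        print(inject_slice(template, offset, limit))
```

Each DIH instance would then run the same data-config with its own
(offset, limit) pair. A data source that cannot be cut this way is exactly
where separate data-config files come in.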

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>

> On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh<avl...@gmail.com> wrote:
> > I have one more question w.r.t. the MultiThreaded DIH - what would be the
> > logic behind distributing tasks to threads?
> >
> > I am sorry to have not mentioned this earlier - in my case, I take a
> > "count query" parameter as a configuration element. Based on this count
> > and the maxNumberOfDIHInstances, task assignment scheduling is done by
> > "injecting" limit and offset values in the import query for each DIH
> > instance. And this is one of the reasons why I call it a
> > ParallelDataImportHandler.
> There can be a batch command which will take in multiple commands in one
> http request. so it will be like invoking multiple DIH instances and
> the user will have to find ways to split up the whole task into
> multiple 'slices'. DIH in turn would fire up multiple threads and once
> all the threads have returned it should issue a commit
>
> this is a very dumb implementation but is a very easy path.
> >
> > Cheers
> > Avlesh
> >
> > On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >
> >> run the add() calls to Solr in a dedicated thread
> >>
> >> Makes absolute sense. This would actually mean DIH sits on top of all
> >> the add/update operations, making it easier to implement a
> >> multi-threaded DIH.
> >>
> >> I would create a JIRA issue, right away.
> >> However, I would still love to see responses to my problems due to
> >> limitations in 1.3
> >>
> >> Cheers
> >> Avlesh
> >>
> >> 2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>
> >>
> >>> a multithreaded DIH is in my top priority list. There are multiple
> >>> approaches
> >>>
> >>> 1) create multiple dataImporter instances in the same DIH
> >>> instance and run them in parallel and commit when all of them are done
> >>> 2) run the add() calls to Solr in a dedicated thread
> >>> 3) make DIH automatically multithreaded. This is much harder to
> >>> implement.
> >>>
> >>> but #1 and #2 can be implemented with ease. It does not have to
> >>> be another implementation called ParallelDataImportHandler. I believe
> >>> it can be done in DIH itself
> >>>
> >>> you may not need to create a project in google code. you can open a
> >>> JIRA issue and start posting patches and we can put it back into Solr.
> >>>
> >>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh<avl...@gmail.com> wrote:
> >>> > In my quest to improve indexing time (in a multi-core environment), I
> >>> > tried writing a Solr RequestHandler called ParallelDataImportHandler.
> >>> > I had a few lame questions to begin with, which Noble and Shalin
> >>> > answered here -
> >>> >
> >>> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >>> >
> >>> > As the name suggests, the handler, when invoked, tries to execute
> >>> > multiple DIH instances on the same core in parallel. Of course, the
> >>> > catch here is that only those data sources that can be batched can
> >>> > benefit from this handler. In my case, I am writing this for import
> >>> > from a MySQL database. So, I have a single data-config.xml, in which
> >>> > the query has to add placeholders for "limit" and "offset". Each DIH
> >>> > instance uses the same data-config file, and substitutes its own
> >>> > values for the limit and offset (which are in fact supplied by the
> >>> > parent ParallelDataImportHandler).
> >>> >
> >>> > I am achieving this by making my handler SolrCoreAware, and creating
> >>> > maxNumberOfDIHInstances (configurable) DIH instances in the inform
> >>> > method. These instances are then initialized and registered with the
> >>> > core. Whenever a request comes in, the ParallelDataImportHandler
> >>> > delegates the task to these instances, schedules the remainder and
> >>> > aggregates responses from each of these instances to return back to
> >>> > the user.
> >>> >
> >>> > Thankfully, all of this worked, and preliminary benchmarking with
> >>> > 5 million records indicated a 50% decrease in re-indexing time.
> >>> > Moreover, all my cores (Solr in my case is hosted on a quad-core
> >>> > machine) indicated above 70% CPU utilization. All that I could have
> >>> > asked for!
> >>> >
> >>> > With respect to this whole thing, I have a few questions -
> >>> >
> >>> >   1. Is something similar available out of the box?
> >>> >   2. Is the idea flawed? Is the approach fundamentally correct?
> >>> >   3. I am using Solr 1.3. DIH did not have "EventListeners" in the
> >>> >   stone age. I need to know if a DIH instance is done with its task
> >>> >   (mostly the "commit" operation). I could not figure out a clean
> >>> >   way. As a hack, I keep pinging the DIH instances with
> >>> >   command=status at regular intervals (in a separate thread), to
> >>> >   figure out if one is free to be assigned some task. This works,
> >>> >   but obviously with an overhead of unnecessary wasted CPU cycles.
> >>> >   Is there a better approach?
> >>> >   4. I can better the time taken even further if there was a way for
> >>> >   me to tell a DIH instance not to open a new IndexSearcher. In the
> >>> >   current scheme of things, as soon as one DIH instance is done
> >>> >   committing, a new searcher is opened. This is blocking for other
> >>> >   DIH instances (which were active) and they cannot continue without
> >>> >   the searcher being initialized. Is there a way I can implement a
> >>> >   single commit once all these DIH instances are done with their
> >>> >   tasks? I tried each DIH instance with commit=false without luck.
> >>> >   5. Can this implementation be extended to support other data
> >>> >   sources supported in DIH (HTTP, File, URL etc)?
> >>> >   6. If the utility is worth it, can I host this on Google Code as
> >>> >   an open source contrib?
> >>> >
> >>> > Any help will be deeply acknowledged and appreciated. While
> >>> > suggesting, please don't forget that I am using Solr 1.3. If it all
> >>> > goes well, I don't mind writing one for Solr 1.4.
> >>> >
> >>> > Cheers
> >>> > Avlesh
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> -----------------------------------------------------
> >>> Noble Paul | Principal Engineer| AOL | http://aol.com
> >>>
> >>
> >>
> >
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>
