> There can be a batch command (which) will take in multiple commands in one
> http request.

You seem to be obsessed with this approach, Noble. SOLR-1093 <http://issues.apache.org/jira/browse/SOLR-1093> also echoes the same sentiments :)
I personally find this approach a bit restrictive and difficult to adapt to. IMHO, it is better handled as configuration, i.e. the user tells us how the single task can be "batched" (or "sliced", as you call it) while configuring the Parallel (or MultiThreaded) DIH inside solrconfig.
As an example, for non-JDBC data sources where batching might be difficult to achieve in an abstract way, the user might choose to configure different data-config.xml files (for different DIH instances) altogether.

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>

> On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh <avl...@gmail.com> wrote:
> > I have one more question w.r.t. the MultiThreaded DIH - what would be the
> > logic behind distributing tasks to threads?
> >
> > I am sorry not to have mentioned this earlier - in my case, I take a "count
> > query" parameter as a configuration element. Based on this count and the
> > maxNumberOfDIHInstances, task assignment scheduling is done by "injecting"
> > limit and offset values in the import query for each DIH instance.
> > And this is one of the reasons why I call it a ParallelDataImportHandler.
>
> There can be a batch command that will take in multiple commands in one
> http request, so it will be like invoking multiple DIH instances, and
> the user will have to find ways to split up the whole task into
> multiple 'slices'. DIH in turn would fire up multiple threads, and once
> all the threads have returned it should issue a commit.
>
> This is a very dumb implementation but is a very easy path.
>
> > Cheers
> > Avlesh
> >
> > On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >
> >> run the add() calls to Solr in a dedicated thread
> >>
> >> Makes absolute sense. This would actually mean DIH sits on top of all the
> >> add/update operations, making it easier to implement a multi-threaded DIH.
> >>
> >> I will create a JIRA issue right away.
> >> However, I would still love to see responses to my problems due to
> >> limitations in 1.3.
> >>
> >> Cheers
> >> Avlesh
> >>
> >> 2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >>
> >>> A multithreaded DIH is on my top-priority list.
> >>> There are multiple approaches:
> >>>
> >>> 1) create multiple dataImporter instances in the same DIH
> >>>    instance, run them in parallel, and commit when all of them are done
> >>> 2) run the add() calls to Solr in a dedicated thread
> >>> 3) make DIH automatically multithreaded. This is much harder to
> >>>    implement.
> >>>
> >>> But #1 and #2 can be implemented with ease. It does not have to
> >>> be another implementation called ParallelDataImportHandler. I believe
> >>> it can be done in DIH itself.
> >>>
> >>> You may not need to create a project on Google Code. You can open a
> >>> JIRA issue and start posting patches, and we can put it back into Solr.
> >>>
> >>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >>> > In my quest to improve indexing time (in a multi-core environment), I tried
> >>> > writing a Solr RequestHandler called ParallelDataImportHandler.
> >>> > I had a few lame questions to begin with, which Noble and Shalin answered
> >>> > here -
> >>> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >>> >
> >>> > As the name suggests, the handler, when invoked, tries to execute multiple
> >>> > DIH instances on the same core in parallel. Of course, the catch here is
> >>> > that only those data sources that can be batched can benefit from this
> >>> > handler. In my case, I am writing this for import from a MySQL database. So
> >>> > I have a single data-config.xml, in which the query has to add placeholders
> >>> > for "limit" and "offset". Each DIH instance uses the same data-config file
> >>> > and substitutes its own values for the limit and offset (which are in fact
> >>> > supplied by the parent ParallelDataImportHandler).
> >>> >
> >>> > I am achieving this by making my handler SolrCoreAware and creating
> >>> > maxNumberOfDIHInstances (configurable) DIH instances in the inform method.
> >>> > These instances are then initialized and registered with the core. Whenever a
> >>> > request comes in, the ParallelDataImportHandler delegates the task to these
> >>> > instances, schedules the remainder, and aggregates the responses from each of
> >>> > these instances to return to the user.
> >>> >
> >>> > Thankfully, all of this worked, and preliminary benchmarking with 5 million
> >>> > records indicated a 50% decrease in re-indexing time. Moreover, all my cores
> >>> > (Solr in my case is hosted on a quad-core machine) indicated above 70% CPU
> >>> > utilization. All that I could have asked for!
> >>> >
> >>> > With respect to this whole thing, I have a few questions -
> >>> >
> >>> > 1. Is something similar available out of the box?
> >>> > 2. Is the idea flawed? Is the approach fundamentally correct?
> >>> > 3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
> >>> >    age. I need to know if a DIH instance is done with its task (mostly the
> >>> >    "commit" operation). I could not figure out a clean way. As a hack, I keep
> >>> >    pinging the DIH instances with command=status at regular intervals (in a
> >>> >    separate thread) to figure out whether each is free to be assigned a task.
> >>> >    This works, but obviously with the overhead of unnecessarily wasted CPU
> >>> >    cycles. Is there a better approach?
> >>> > 4. I could better the time taken even further if there was a way for me to
> >>> >    tell a DIH instance not to open a new IndexSearcher. In the current scheme
> >>> >    of things, as soon as one DIH instance is done committing, a new searcher is
> >>> >    opened. This is blocking for other DIH instances (which were active), and
> >>> >    they cannot continue without the searcher being initialized. Is there a way
> >>> >    I can implement a single commit once all these DIH instances are done with
> >>> >    their tasks?
> >>> >    I tried each DIH instance with commit=false, without luck.
> >>> > 5. Can this implementation be extended to support other data sources
> >>> >    supported in DIH (HTTP, File, URL etc.)?
> >>> > 6. If the utility is worth it, can I host this on Google Code as an open
> >>> >    source contrib?
> >>> >
> >>> > Any help will be deeply acknowledged and appreciated. While suggesting,
> >>> > please don't forget that I am using Solr 1.3. If it all goes well, I don't
> >>> > mind writing one for Solr 1.4.
> >>> >
> >>> > Cheers
> >>> > Avlesh
> >>>
> >>> --
> >>> -----------------------------------------------------
> >>> Noble Paul | Principal Engineer | AOL | http://aol.com
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer | AOL | http://aol.com
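
For readers following the thread, the limit/offset slicing and the "commit once, after every slice has finished" idea discussed in these messages could look roughly like the sketch below. It is only an illustration and not the actual handler or any DIH API: the class ParallelImportSketch, the Slice type, the ${limit}/${offset} placeholder syntax, the table name "item" and the importSlice stub are all assumptions made up for this example. Only the general technique (divide a row count across maxNumberOfDIHInstances, run the slices in parallel, commit a single time at the end) comes from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names exist in Solr or DIH.
class ParallelImportSketch {

    /** One slice of the import: rows [offset, offset + limit). */
    static final class Slice {
        final long offset;
        final long limit;
        Slice(long offset, long limit) { this.offset = offset; this.limit = limit; }
    }

    /** Splits totalRows into at most maxInstances contiguous, near-equal slices. */
    static List<Slice> slices(long totalRows, int maxInstances) {
        if (totalRows < 0 || maxInstances < 1) {
            throw new IllegalArgumentException("need totalRows >= 0 and maxInstances >= 1");
        }
        List<Slice> result = new ArrayList<>();
        long base = totalRows / maxInstances;   // rows every slice gets
        long extra = totalRows % maxInstances;  // first 'extra' slices get one more row
        long offset = 0;
        for (int i = 0; i < maxInstances; i++) {
            long limit = base + (i < extra ? 1 : 0);
            if (limit == 0) break;              // fewer rows than instances
            result.add(new Slice(offset, limit));
            offset += limit;
        }
        return result;
    }

    /** Fills the (made-up) placeholders in the query template for one slice. */
    static String queryFor(String template, Slice s) {
        return template.replace("${limit}", Long.toString(s.limit))
                       .replace("${offset}", Long.toString(s.offset));
    }

    /** Stand-in for a real import: pretends every row in the slice was indexed. */
    static long importSlice(String query, Slice s) {
        return s.limit;
    }

    /** Runs one task per slice in parallel and returns only after all have finished. */
    static long runAll(List<Slice> slices, String template) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, slices.size()));
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (Slice s : slices) {
                String query = queryFor(template, s);
                tasks.add(() -> importSlice(query, s));
            }
            long total = 0;
            // invokeAll blocks until every task is complete, so no status polling is needed.
            for (Future<Long> done : pool.invokeAll(tasks)) {
                total += done.get();
            }
            return total; // the caller would issue the single commit at this point
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for slices", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a slice failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String template = "SELECT * FROM item LIMIT ${limit} OFFSET ${offset}";
        List<Slice> parts = slices(5_000_000L, 4);
        for (Slice s : parts) {
            System.out.println(queryFor(template, s));
        }
        System.out.println("rows imported: " + runAll(parts, template));
    }
}
```

Waiting on the futures is one possible answer to the polling concern raised in question 3, and returning from runAll only after every slice has finished gives a natural place for the single commit asked about in question 4. Whether DIH in Solr 1.3 can be driven this way is not something the thread settles.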