Thanks all for sharing your thoughts! 
 
-----Original message-----
> From:Joel Bernstein <joels...@gmail.com>
> Sent: Friday 4th November 2016 1:28
> To: solr-user@lucene.apache.org
> Subject: Re: UpdateProcessor as a batch
> 
> This might be useful. In this scenario you load your content into Solr
> for staging and perform your ETL from Solr to Solr:
> 
> http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
> 
> Basically Solr becomes a text processing warehouse.
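> 
> Something like this streaming expression, for example (collection and
> field names are invented here; the /export handler requires docValues on
> the fields you select):
> 
>     update(collection2,
>            batchSize=250,
>            search(collection1,
>                   q="*:*",
>                   fl="id,title_s",
>                   sort="id asc",
>                   qt="/export"))
> 
> The inner search() streams documents out of the staging collection, and
> update() writes them into the destination collection in batches.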
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Thu, Nov 3, 2016 at 5:05 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
> 
> > How big a batch are we talking about?
> >
> > Because I believe you could accumulate the docs in the first URP in
> > processAdd and then do the batch lookup and the actual processing of
> > them in processCommit.
> >
> > The URPs are daisy-chained, so as long as you hold on to the docs
> > instead of passing them down the chain, the rest of the URPs don't run.
> >
> > Obviously you are relying on the commit here to trigger the final call.
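> >
> > A minimal sketch of such a buffering URP, assuming a hypothetical
> > remoteBatchLookup() call for the external backend (depending on your
> > Solr version you may need to buffer clones of the SolrInputDocuments,
> > since command objects can be reused by the caller):
> >
> >     import java.io.IOException;
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     import org.apache.solr.common.SolrInputDocument;
> >     import org.apache.solr.update.AddUpdateCommand;
> >     import org.apache.solr.update.CommitUpdateCommand;
> >     import org.apache.solr.update.processor.UpdateRequestProcessor;
> >
> >     public class BatchLookupProcessor extends UpdateRequestProcessor {
> >
> >       private final List<AddUpdateCommand> buffered = new ArrayList<>();
> >
> >       public BatchLookupProcessor(UpdateRequestProcessor next) {
> >         super(next);
> >       }
> >
> >       @Override
> >       public void processAdd(AddUpdateCommand cmd) {
> >         // Hold the document instead of passing it down the chain.
> >         buffered.add(cmd);
> >       }
> >
> >       @Override
> >       public void processCommit(CommitUpdateCommand cmd) throws IOException {
> >         List<SolrInputDocument> docs = new ArrayList<>();
> >         for (AddUpdateCommand add : buffered) {
> >           docs.add(add.getSolrInputDocument());
> >         }
> >         remoteBatchLookup(docs);     // one round trip for the whole batch
> >         for (AddUpdateCommand add : buffered) {
> >           super.processAdd(add);     // now release each doc down the chain
> >         }
> >         buffered.clear();
> >         super.processCommit(cmd);
> >       }
> >
> >       // Placeholder for the batched call to the external backend,
> >       // which would augment each document with the returned fields.
> >       private void remoteBatchLookup(List<SolrInputDocument> docs) {
> >       }
> >     }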
> >
> > Or you could use a two-collection sequence: index into the first
> > collection, query it for whatever you need for the batch lookup, and
> > then do a collection-to-collection enhanced copy.
> >
> > Regards,
> >    Alex.
> > ----
> > Solr Example reading group is starting November 2016, join us at
> > http://j.mp/SolrERG
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 4 November 2016 at 07:35, mike st. john <mstj...@gmail.com> wrote:
> > > Maybe introduce a distributed queue such as Apache Ignite, Hazelcast,
> > > or even Redis. Read from the queue in batches, do your lookup, then
> > > index the same batch.
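> > >
> > > A rough sketch of that pattern, using a plain BlockingQueue as a
> > > stand-in for Ignite/Hazelcast/Redis (remoteBatchLookup() and
> > > indexBatch() are placeholders):
> > >
> > >     import java.util.ArrayList;
> > >     import java.util.List;
> > >     import java.util.concurrent.BlockingQueue;
> > >
> > >     import org.apache.solr.common.SolrInputDocument;
> > >
> > >     public class QueueBatcher implements Runnable {
> > >
> > >       private static final int BATCH_SIZE = 500;
> > >       private final BlockingQueue<SolrInputDocument> queue;
> > >
> > >       public QueueBatcher(BlockingQueue<SolrInputDocument> queue) {
> > >         this.queue = queue;
> > >       }
> > >
> > >       @Override
> > >       public void run() {
> > >         List<SolrInputDocument> batch = new ArrayList<>();
> > >         try {
> > >           while (true) {
> > >             batch.add(queue.take());              // block for one doc
> > >             queue.drainTo(batch, BATCH_SIZE - 1); // grab the rest queued
> > >             remoteBatchLookup(batch); // one lookup for the whole batch
> > >             indexBatch(batch);        // then index the same batch
> > >             batch.clear();
> > >           }
> > >         } catch (InterruptedException e) {
> > >           Thread.currentThread().interrupt();
> > >         }
> > >       }
> > >
> > >       private void remoteBatchLookup(List<SolrInputDocument> docs) { }
> > >
> > >       private void indexBatch(List<SolrInputDocument> docs) { }
> > >     }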
> > >
> > > just a thought.
> > >
> > > Mike St. John.
> > >
> > > On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> > >
> > >> I thought we might be talking past each other...
> > >>
> > >> I think you're into "roll your own" territory here. Anything that
> > >> accumulates docs for a while, does a batch lookup
> > >> on the external system, and then passes on the docs
> > >> runs the risk of losing docs if the server is abnormally
> > >> shut down.
> > >>
> > >> I guess ideally you'd like to augment the list coming in
> > >> rather than the individual docs once they're removed from the
> > >> incoming batch and passed on, but I admit I have no
> > >> clue where to do that. Possibly in an update chain? If
> > >> so, you'd need to be careful to augment only after the docs
> > >> have reached their final shard leader, or all at once
> > >> before distribution to the shard leaders.
> > >>
> > >> Is the expense of the external lookup in doing the actual
> > >> lookups or in establishing the connection? Would
> > >> having some kind of shared connection to the external
> > >> source be worthwhile?
> > >>
> > >> FWIW,
> > >> Erick
> > >>
> > >> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> > >> <markus.jel...@openindex.io> wrote:
> > >> > Hi - I believe I did not explain myself well enough.
> > >> >
> > >> > Getting the data into Solr is not a problem: various sources index
> > >> > docs to Solr, all in proper batches, as everyone should. The thing
> > >> > is that I need to do some preprocessing before a document is
> > >> > indexed. Normally, UpdateProcessors are the way to go; I've made
> > >> > quite a few of them and they work fine.
> > >> >
> > >> > The problem is, I need to do a remote lookup for each document
> > >> > being indexed. Right now, I make an external connection for each
> > >> > doc being indexed in the current UpdateProcessor. This is still
> > >> > fast, but the remote backend supports batched lookups, which are
> > >> > faster.
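> > >> >
> > >> > In code, the current per-document approach looks roughly like this
> > >> > (a sketch of the processAdd of such an UpdateRequestProcessor;
> > >> > remoteLookup() and the "id" key are stand-ins for the real client
> > >> > call and lookup key):
> > >> >
> > >> >     @Override
> > >> >     public void processAdd(AddUpdateCommand cmd) throws IOException {
> > >> >       SolrInputDocument doc = cmd.getSolrInputDocument();
> > >> >       // One remote round trip per document: correct, but the
> > >> >       // request overhead is paid for every single doc.
> > >> >       Map<String, Object> extra =
> > >> >           remoteLookup((String) doc.getFieldValue("id"));
> > >> >       extra.forEach(doc::setField);
> > >> >       super.processAdd(cmd);
> > >> >     }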
> > >> >
> > >> > This is why I'd love to be able to buffer documents in an
> > >> > UpdateProcessor and, once there are enough, do one remote lookup
> > >> > for all of them, do some processing, and let them be indexed.
> > >> >
> > >> > Thanks,
> > >> > Markus
> > >> >
> > >> >
> > >> >
> > >> > -----Original message-----
> > >> >> From:Erick Erickson <erickerick...@gmail.com>
> > >> >> Sent: Thursday 3rd November 2016 19:18
> > >> >> To: solr-user <solr-user@lucene.apache.org>
> > >> >> Subject: Re: UpdateProcessor as a batch
> > >> >>
> > >> >> I _thought_ you'd been around long enough to know about the options I
> > >> >> mentioned ;).
> > >> >>
> > >> >> Right. I'd guess you're in UpdateHandler.addDoc, and there's
> > >> >> really no batching at that level that I know of. I'm pretty sure
> > >> >> that even indexing batches of 1,000 documents from, say, SolrJ
> > >> >> goes through this method.
> > >> >>
> > >> >> I don't think there's much to be gained by batching at this
> > >> >> level; it pretty much immediately tells Lucene to index the doc.
> > >> >>
> > >> >> FWIW
> > >> >> Erick
> > >> >>
> > >> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> > >> >> <markus.jel...@openindex.io> wrote:
> > >> >> > Erick - in this case the data can come from anywhere. There is
> > >> >> > one piece of code that all incoming documents, regardless of
> > >> >> > their origin, pass through: Solr's update handler and update
> > >> >> > processors.
> > >> >> >
> > >> >> > In my case that is the most convenient point to partially
> > >> >> > modify the documents, instead of moving that logic to separate
> > >> >> > places.
> > >> >> >
> > >> >> > I've seen the ContentStream in SolrQueryRequest and I probably
> > >> >> > could tear the incoming data apart and put it back together
> > >> >> > again, but that would not be as easy as working with already
> > >> >> > deserialized objects such as SolrInputDocument.
> > >> >> >
> > >> >> > UpdateHandler doesn't seem to work on a list of documents; it
> > >> >> > looks like it works on individual incoming documents, not a
> > >> >> > whole list. I've also looked at whether I could buffer a batch
> > >> >> > in an UpdateProcessor, work on the docs, and release them, but
> > >> >> > that seems impossible.
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Markus
> > >> >> >
> > >> >> > -----Original message-----
> > >> >> >> From:Erick Erickson <erickerick...@gmail.com>
> > >> >> >> Sent: Thursday 3rd November 2016 18:57
> > >> >> >> To: solr-user <solr-user@lucene.apache.org>
> > >> >> >> Subject: Re: UpdateProcessor as a batch
> > >> >> >>
> > >> >> >> Markus:
> > >> >> >>
> > >> >> >> How are you indexing? SolrJ has a
> > >> >> >> client.add(List<SolrInputDocument>) form, and post.jar lets
> > >> >> >> you add as many documents as you want in a batch....
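> > >> >> >>
> > >> >> >> For example, a minimal SolrJ sketch (the URL and collection
> > >> >> >> name are made up):
> > >> >> >>
> > >> >> >>     import java.util.ArrayList;
> > >> >> >>     import java.util.List;
> > >> >> >>
> > >> >> >>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > >> >> >>     import org.apache.solr.common.SolrInputDocument;
> > >> >> >>
> > >> >> >>     public class BatchIndexExample {
> > >> >> >>       public static void main(String[] args) throws Exception {
> > >> >> >>         try (HttpSolrClient client = new HttpSolrClient.Builder(
> > >> >> >>             "http://localhost:8983/solr/mycollection").build()) {
> > >> >> >>           List<SolrInputDocument> batch = new ArrayList<>();
> > >> >> >>           for (int i = 0; i < 1000; i++) {
> > >> >> >>             SolrInputDocument doc = new SolrInputDocument();
> > >> >> >>             doc.addField("id", Integer.toString(i));
> > >> >> >>             batch.add(doc);
> > >> >> >>           }
> > >> >> >>           client.add(batch); // one request for the whole batch
> > >> >> >>           client.commit();
> > >> >> >>         }
> > >> >> >>       }
> > >> >> >>     }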
> > >> >> >>
> > >> >> >> Best,
> > >> >> >> Erick
> > >> >> >>
> > >> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> > >> >> >> <markus.jel...@openindex.io> wrote:
> > >> >> >> > Hi - I need to process a batch of documents on update, but I
> > >> >> >> > cannot seem to find a point where I can hook in and process a
> > >> >> >> > list of SolrInputDocuments, neither in UpdateProcessor nor in
> > >> >> >> > UpdateHandler.
> > >> >> >> >
> > >> >> >> > For now I let it go and implemented it on a per-document
> > >> >> >> > basis; it is fast, but I'd prefer batches. Is that possible
> > >> >> >> > at all?
> > >> >> >> >
> > >> >> >> > Thanks,
> > >> >> >> > Markus
> > >> >> >>
> > >> >>
> > >>
> >
> 
