Re: Best Indexing Approaches - To max the throughput

Alessandro Benedetti Tue, 06 Oct 2015 09:50:56 -0700

Hi Walter,
can you explain your use case in more detail?
You index a batch of e-commerce products (Solr documents); if one fails,
you want to stop and invalidate the entire batch (using the almost never
used Solr rollback, or manual deletion?),
then log the exception and the indexing size,
and then re-index the whole batch of docs?


In this scenario, wouldn't the ConcurrentUpdateSolrClient be less than ideal?
Just curious.
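
To make the question concrete, the fail / rollback / notify flow could be sketched roughly like this. It is only an illustration under stated assumptions: `FakeIndex`, `index_all_or_nothing` and `notify` are made-up names, and the in-memory class merely stands in for a Solr core, so none of this is SolrJ API. Also keep in mind that a real Solr rollback discards every uncommitted update on the core, not just the docs from this batch, so the sketch assumes a single writer and no auto-commit in between.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


class FakeIndex:
    """In-memory stand-in for a Solr core (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.committed: List[Doc] = []
        self.pending: List[Doc] = []

    def add(self, doc: Doc) -> None:
        # Pretend the schema requires an "id" field.
        if "id" not in doc:
            raise ValueError(f"document has no id: {doc!r}")
        self.pending.append(doc)

    def commit(self) -> None:
        self.committed.extend(self.pending)
        self.pending.clear()

    def rollback(self) -> None:
        # Drop everything added since the last commit.
        self.pending.clear()


def index_all_or_nothing(
    index: FakeIndex, batch: List[Doc], notify: Callable[[str], None]
) -> bool:
    """Index the whole batch, or nothing: on the first error roll back and notify."""
    try:
        for doc in batch:
            index.add(doc)
    except Exception as exc:
        index.rollback()
        notify(f"batch of {len(batch)} docs rejected: {exc}")
        return False
    index.commit()
    return True
```

A caller would pass its own `notify` (a logger, a mail hook, ...) and re-submit the batch once the bad document has been fixed.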

Cheers

On 6 October 2015 at 17:29, Walter Underwood <wun...@wunderwood.org> wrote:

> It depends on the document. In an e-commerce search, you might want to fail
> immediately and be notified. That is what we do: fail, roll back, and notify.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Oct 6, 2015, at 7:58 AM, Alessandro Benedetti <benedetti.ale...@gmail.com> wrote:
> >
> > Hmm, one broken document in a batch should not break the entire batch,
> > right (whatever approach is used)?
> > Are you referring to the fact that you want to programmatically re-index
> > the broken docs?
> >
> > It would be interesting to return the ids of the broken docs along with
> > the Solr update response!
> >
> > Cheers
> >
> >
> > On 6 October 2015 at 15:30, Bill Dueber <b...@dueber.com> wrote:
> >
> >> Just to add... my informal tests show that batching has way more effect
> >> than SolrJ vs JSON.
> >>
> >> I haven't looked at CUSC (ConcurrentUpdateSolrClient) in a while; last
> >> time I looked it was impossible to do anything smart about error
> >> handling, so check that out before you get too deeply into it. We use a
> >> strategy of sending a batch of JSON documents and, if it returns an
> >> error, sending each record one at a time until we find the bad one and
> >> can log something useful.
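
The batch-then-retry strategy described above can be sketched as follows. It is a minimal illustration, not Bill's actual code: `index_with_fallback`, `send` and `make_fake_sender` are hypothetical names, and `send` stands for whatever posts a list of docs to /update and raises on an error response. It also assumes that re-sending a doc that was already accepted is harmless (for instance because docs are overwritten by unique id).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Doc = Dict[str, object]
Sender = Callable[[Sequence[Doc]], None]


def index_with_fallback(
    docs: Sequence[Doc], send: Sender
) -> List[Tuple[Doc, Exception]]:
    """Send docs as one batch; on failure, resend one by one to find the bad ones.

    Returns a (doc, error) pair for every doc that failed on its own.
    """
    try:
        send(docs)  # fast path: the whole batch in one request
        return []
    except Exception:
        pass  # fall through to the slow path

    failures: List[Tuple[Doc, Exception]] = []
    for doc in docs:
        try:
            send([doc])  # slow path: one request per doc
        except Exception as exc:
            failures.append((doc, exc))
    return failures


def make_fake_sender(accepted: List[Doc]) -> Sender:
    """Hypothetical sender for trying the sketch: rejects a batch if any doc lacks 'id'."""

    def send(batch: Sequence[Doc]) -> None:
        for doc in batch:
            if "id" not in doc:
                raise ValueError(f"missing id: {doc!r}")
        accepted.extend(batch)

    return send
```

Unlike the description, this sketch keeps going after the first bad record so that the rest of the batch still gets indexed; stopping at the first failure instead is a design choice left to the caller. The slow path only runs for batches that fail, so the common case keeps the throughput of batching.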
> >>
> >>
> >>
> >> On Mon, Oct 5, 2015 at 12:07 PM, Alessandro Benedetti <
> >> benedetti.ale...@gmail.com> wrote:
> >>
> >>> Thanks Erick,
> >>> you confirmed my impressions!
> >>> Thank you very much for the insights; any other opinion is welcome :)
> >>>
> >>> Cheers
> >>>
> >>> 2015-10-05 14:55 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> >>>
> >>>> SolrJ tends to be faster for several reasons, not the least of which
> >>>> is that it sends packets to Solr in a more efficient binary format.
> >>>>
> >>>> Batching is critical. I did some rough tests using SolrJ and sending
> >>>> docs one at a time gave a throughput of < 400 docs/second.
> >>>> Sending 10 gave 2,300 or so. Sending 100 at a time gave
> >>>> over 5,300 docs/second. Curiously, 1,000 at a time gave only
> >>>> marginal improvement over 100. This was with a single thread.
> >>>> YMMV of course.
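
A small helper for the batching described above might look like this. It is a sketch only: `chunked` is a hypothetical name, not something from SolrJ. Taking the rough figures above at face value, batches of 10 come out at least roughly 5.75x faster than single docs (2,300 / 400) and batches of 100 more than 13x (5,300 / 400 = 13.25), since the one-at-a-time rate was below 400.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))  # take up to `size` items, lazily
        if not chunk:
            return
        yield chunk
```

A feeder would then loop over `chunked(docs, 100)` and send each chunk in one request; 100 is just the point where the numbers above level off, so the right size for another setup has to be measured.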
> >>>>
> >>>> CloudSolrClient is definitely the better way to go with SolrCloud:
> >>>> it routes the docs to the correct leader instead of having the
> >>>> node you send the docs to do the routing.
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>> On Mon, Oct 5, 2015 at 4:57 AM, Alessandro Benedetti
> >>>> <abenede...@apache.org> wrote:
> >>>>> I was doing some studies and analysis, and was wondering which approach,
> >>>>> in your opinion, is the best one for indexing in Solr to reach the
> >>>>> highest possible throughput.
> >>>>> I know that a lot of factors affect indexing time, so let's focus only
> >>>>> on the feeding approach.
> >>>>> Let's isolate different scenarios:
> >>>>>
> >>>>> *Single Solr Infrastructure*
> >>>>>
> >>>>> 1) XML/JSON batch request to the /update IndexHandler (xml/json)
> >>>>>
> >>>>> 2) SolrJ ConcurrentUpdateSolrClient (javabin)
> >>>>> I expected this to be the fastest approach for a multi-threaded
> >>>>> indexing application, posting a batch of docs per request where
> >>>>> possible.
> >>>>>
> >>>>> *Solr Cloud*
> >>>>>
> >>>>> 1) XML/JSON batch request to the /update IndexHandler (xml/json)
> >>>>>
> >>>>> 2) SolrJ ConcurrentUpdateSolrClient (javabin)
> >>>>>
> >>>>> 3) CloudSolrClient (javabin)
> >>>>> This seems to be the best approach, according to these improvements [1]
> >>>>>
> >>>>> What are your opinions?
> >>>>>
> >>>>> A bonus observation would concern using some Map/Reduce big data
> >>>>> indexer, but let's assume we don't have a big cluster of CPUs, just
> >>>>> the average indexer server.
> >>>>>
> >>>>>
> >>>>> [1] https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
> >>>>>
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>>
> >>>>> --
> >>>>> --------------------------
> >>>>>
> >>>>> Benedetti Alessandro
> >>>>> Visiting card : http://about.me/alessandro_benedetti
> >>>>>
> >>>>> "Tyger, tyger burning bright
> >>>>> In the forests of the night,
> >>>>> What immortal hand or eye
> >>>>> Could frame thy fearful symmetry?"
> >>>>>
> >>>>> William Blake - Songs of Experience -1794 England
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Bill Dueber
> >> Library Systems Programmer
> >> University of Michigan Library
> >>
> >
>
>


-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
