Thanks All for your suggestions! Rgds, Mark.
On Thu, Feb 11, 2016 at 9:45 AM, Upayavira <u...@odoko.co.uk> wrote:

Your biggest issue here is likely to be HTTP connections. Making an HTTP connection to Solr is far more expensive than the task of adding a single document to the index. If you are expecting to add 24 billion docs per day, I'd suggest that merging those documents into batches before sending them to Solr will be necessary.

To my previous question - what do you gain by using Solr that you don't get from other solutions? To make this system really work, you are going to need a deep understanding of how Lucene works - segments, segment merges, deletions, and many other things - because when you start to work at that scale, the implementation details behind Lucene really start to matter and impact your ability to succeed.

What you are undertaking can certainly be done, but it is a substantial project.

Upayavira

On Wed, Feb 10, 2016, at 09:48 PM, Mark Robinson wrote:

Thanks everyone for your suggestions. Based on them I am planning to have one doc per event, with the sessionId common.

In that case, hopefully indexing each doc as and when it arrives would be okay? Or do we still need to batch before indexing to Solr?

Also, with 4M sessions a day and about 6,000 docs (events) per session, we can expect about 24 billion docs per day!

Will Solr still hold up? If so, could someone please recommend a sizing to cater to this level of data? The query rate is around 320 qps.

Thanks!
Mark

On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:

Hi Mark,
Appending session actions just to be able to return more than one session without retrieving a large number of results is not a good tradeoff.
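Upayavira's batching advice above can be sketched roughly as follows. This is a minimal Python illustration, not anything from the thread: the `send_batch` callable is a hypothetical stand-in for whatever actually POSTs one JSON array per request to Solr's `/update` endpoint.

```python
import json

def batched(docs, batch_size=1000):
    """Group an iterable of docs into lists of at most batch_size,
    so each HTTP request to Solr carries many documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def index_all(docs, send_batch, batch_size=1000):
    """send_batch is assumed to POST one JSON array per call,
    e.g. to http://host:8983/solr/<collection>/update."""
    requests_made = 0
    for batch in batched(docs, batch_size):
        send_batch(json.dumps(batch))
        requests_made += 1
    return requests_made

# One HTTP request per 1000 docs instead of one per doc:
events = [{"id": str(i), "sessionId": "s1"} for i in range(2500)]
print(index_all(events, send_batch=lambda payload: None))  # 3
```

At 24 billion docs per day, cutting requests by a factor of 1000 is the difference between ~278K and ~278 HTTP requests per second.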
Like Upayavira suggested, you should consider storing one action per doc and aggregating at read time, or pushing to Solr once the session ends and aggregating in some other layer.
If you think handling the infrastructure might be too much, you may consider using a logging service to hold the data. One such service is Sematext's Logsene (http://sematext.com/logsene).

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On 10.02.2016 03:22, Mark Robinson wrote:

Thanks for your replies and suggestions!

Why do I store all events related to a session under one doc?
Each session can have about 500 total entries (events), so retrieving a session's info can come back as around 500 records. With one compounded doc per session, I can retrieve more sessions at a time. For example, under a sessionId there is an array of eventA activities, eventB activities, etc. (using JSON). When another eventA activity occurs, we read all the data for that session, append the extra info to the eventA data, and push the whole session's data back (re-index) to Solr. This happens for many sessions in parallel.

Why NRT?
Many sessions are being written in parallel (4 million sessions, hence 4 million docs, per day). A person can do this querying at any time.

Is it just a lookup?
Yes. We just need to retrieve all the info for a session and pass it on to another system. We may also do some extra querying on data such as timestamps, page URL, etc. in the info added to a session.
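Emir's "one action per doc, aggregate at read time" alternative can be sketched like this. A minimal Python sketch, assuming the per-event docs have already been fetched from Solr (e.g. by a `q=sessionId:<id>` query); field names here are illustrative:

```python
from collections import defaultdict

def aggregate_sessions(event_docs):
    """Rebuild per-session views from individual event docs,
    instead of maintaining one ever-growing doc per session."""
    sessions = defaultdict(list)
    for doc in event_docs:
        sessions[doc["sessionId"]].append(doc)
    # Order each session's events by timestamp at read time.
    for events in sessions.values():
        events.sort(key=lambda d: d["timestamp"])
    return dict(sessions)

docs = [
    {"sessionId": "s1", "timestamp": 2, "event": "eventB"},
    {"sessionId": "s1", "timestamp": 1, "event": "eventA"},
    {"sessionId": "s2", "timestamp": 1, "event": "eventA"},
]
views = aggregate_sessions(docs)
print([d["event"] for d in views["s1"]])  # ['eventA', 'eventB']
```

The write path then stays append-only (no read-modify-write of a 3-20MB doc per action), which is the tradeoff Emir is pointing at.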
We are thinking of keeping the data separate from the actual Solr instance and specifying the location of the dataDir in solrconfig.

If Solr is not a good option, could you please suggest something that would satisfy this use case with minimal response time while querying?

Thanks!
Mark

On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com> wrote:

So as I understand your use case, it is effectively logging actions within a user session. Why do you have to do the update in NRT? Why not just log all the user session events (with some unique key, ensuring the session Id is in the document somewhere); then when you want to do the query, you join on the session id, and that gives you all the data records for that session. I don't really follow why it has to be one document (which you continually update). If you really need that aggregation, couldn't it happen offline?

I guess your one saving grace is that you query using the unique ID (in your scenario), so you could use the real-time get handler, since you aren't doing a complex query (strictly it's not a search, it's a raw key lookup).

But I would still question your use case. If you go the Solr route at that kind of scale of querying and indexing, you're going to have to throw a lot of hardware at it - as Jack says, probably on the order of hundreds of machines...

On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:

Bear in mind that Lucene is optimised towards high read, lower write. That is, it puts in a lot of effort at write time to make reading efficient.
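Daniel's real-time get suggestion is a raw key lookup against Solr's `/get` handler rather than a search; it returns the latest version of a doc by its uniqueKey even before a commit makes it searchable. A small sketch of building such a request (host, collection, and doc ID below are hypothetical examples):

```python
from urllib.parse import urlencode

def realtime_get_url(base_url, collection, doc_id):
    """Build a request URL for Solr's real-time get handler (/get),
    which looks a doc up by uniqueKey instead of running a query."""
    return "%s/%s/get?%s" % (base_url, collection, urlencode({"id": doc_id}))

url = realtime_get_url("http://localhost:8983/solr", "sessions", "session-42")
print(url)  # http://localhost:8983/solr/sessions/get?id=session-42
# The actual fetch would be something like:
#   import json, urllib.request
#   doc = json.load(urllib.request.urlopen(url))["doc"]
```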
It sounds like you are going to be doing far more writing than reading, and I wonder whether you are necessarily choosing the right tool for the job.

How would you later use this data, and what advantage is there to storing it in Solr?

Upayavira

On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:

Hi,
Thanks for all your suggestions. I took some time to get the details to be more accurate. Please find what I have gathered:

My data being indexed is something like this. I am basically capturing all data related to a user session. Inside a session I have categorized my actions like actionA, actionB, etc., per page. So each time an action pertaining to, say, actionA or actionB (on each page) happens, it is updated in Solr under that session (sessionId).

In short, there is only one doc per session (identified by sessionId) in my Solr index, and it is retrieved and updated whenever a new action under that session occurs. We expect up to 4 million sessions per day.

On average one session's doc has a size of 3MB to 20MB. So at 4 million sessions per day, with each session writing around 500 times to Solr, that is 2 billion writes (indexing operations) per day to Solr. As it is one doc per session, it is 4 million docs per day. This is around 80K docs indexed per second during peak hours and around 15K docs indexed per second during non-peak hours. The number of queries per second is around 320.

1. Average size of a doc
   3MB to 20MB
2. Query types:
   While a session is in progress, whatever data exists for that session so far is queried, the new action's details are appended to the data already captured, and the result is indexed back into Solr. So the longer the session, the more data each subsequent query retrieves for that session. Querying can also be done on the timestamp, page URL, etc. captured along with each action.
3. Are docs grouped somehow?
   All data related to a session is retrieved from Solr, updated, and indexed back based on sessionId. No other grouping.
4. Are they time sensitive (NRT or offline process)?
   As mentioned above, this is NRT. Each time a new user action in a session happens, we query the existing session info already captured, append the new data, and index it back to Solr.
5. Will they update, or is the index rebuilt every time?
   Each time a new user action occurs, the full data captured so far for that session is retrieved from Solr, the latest data for the new action is appended, and it is indexed back to Solr.
6. And the other thing you haven't told us is whether you plan on _adding_ 2B docs a day or whether that number is the total corpus size and you are re-indexing 2B docs/day. IOW, if you are adding 2B docs/day, 30 days later do you have 2B docs or 60B docs in your corpus?
   We are expecting around 4 million sessions per day (500 writes to Solr per session), which works out to 2B indexing operations per day. So after 30 days it would be 4 million x 30 docs in the index.
7. Is there any aging of docs?
   No, we always query against the whole corpus.
8. Is any doc deleted?
   No, all data remains in the index.

Any suggestion is very welcome!

Thanks!
Mark.

On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

Oops... at 100 qps for a single node you would need 120 nodes to get to 12K qps and 800 nodes to get to 80K qps, but that is just an extremely rough ballpark estimate, not some precise and firm number. And that's if all the queries can be evenly distributed throughout the cluster and don't require fanout to other shards, which effectively turns each incoming query into n queries, where n is the number of shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

So is there any aging or TTL (in database terminology) of older docs? And do all of your queries need to query all of the older documents all of the time, or is there a clear hierarchy of querying for aged documents, like past 24 hours vs. past week vs. past year vs. older than a year?
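The volumes and node counts quoted in the thread can be sanity-checked with quick arithmetic. The 100-qps-per-node figure below is Jack's rough assumption, not a measured number:

```python
import math

sessions_per_day = 4_000_000
writes_per_session = 500
writes_per_day = sessions_per_day * writes_per_session
print(writes_per_day)                    # 2000000000 writes/day
print(round(writes_per_day / 86_400))    # 23148 writes/sec on average

# Jack's back-of-the-envelope node counts, assuming 100 qps/node:
qps_per_node = 100
print(math.ceil(12_000 / qps_per_node))  # 120 nodes for non-peak query load
print(math.ceil(80_000 / qps_per_node))  # 800 nodes for peak query load
```

Note the daily average (~23K writes/sec) is well below Mark's quoted 80K docs/sec peak, so the peak rate, not the daily total, drives the sizing.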
Sure, you can always use a function query to boost by the inverse of document age, but Solr would be more efficient with filter queries or separate indexes for different time scales.

Are documents ever updated, or are they write-once?

Are documents explicitly deleted?

Technically you probably could meet those specs, but... how many organizations have the resources and the energy to do so?

As a back-of-the-envelope calculation, if Solr gave you 100 queries per second per node, you would need 1,200 nodes. It would also depend on whether those queries are narrow enough that a single node can execute them, or whether they require fanout to other shards and then aggregation of results from those other shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com> wrote:

Short form: you really have to prototype. Here's the long form:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

I've seen between 20M and 200M docs fit on a single piece of hardware, so you'll absolutely have to shard.
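Erick's 20M-200M docs-per-machine observation gives a rough shard-count range by document count. The corpus figure below uses Mark's own 4M docs/day over 30 days; everything here is a thread estimate, not a benchmark:

```python
import math

docs_per_day = 4_000_000
corpus_after_30_days = docs_per_day * 30       # 120M session docs
print(corpus_after_30_days)

for docs_per_shard in (20_000_000, 200_000_000):
    shards = math.ceil(corpus_after_30_days / docs_per_shard)
    print(docs_per_shard, "->", shards, "shard(s)")
# By doc count alone, 120M docs needs only 1-6 shards; but at 3-20MB
# per doc the raw index *size* (roughly 0.36-2.4 PB) will dominate
# the real shard count, which is why prototyping is essential.
```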
And the other thing you haven't told us is whether you plan on _adding_ 2B docs a day or whether that number is the total corpus size and you are re-indexing the 2B docs/day. IOW, if you are adding 2B docs/day, 30 days later do you have 2B docs or 60B docs in your corpus?

Best,
Erick

On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com> wrote:

Also consider whether the indexing of 2 billion docs will be NRT or offline (during off hours, etc.). For more accurate sizing you may also want to index, say, 10 million documents, which will give you an idea of your index size, and then extrapolate from that to come up with memory requirements.

Thanks,
Susheel

On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:

Hi Mark,
Can you give us a bit more detail: size of docs, query types, are docs grouped somehow, are they time sensitive, will they be updated or is the index rebuilt every time, etc.?
Thanks,
Emir

On 08.02.2016 16:56, Mark Robinson wrote:

Hi,
We have a requirement where we would need to index around 2 billion docs in a day.
Queries against this indexed data set can run around 80K queries per second during peak time and around 12K queries per second during non-peak hours.

Can Solr handle such huge volumes?

If so, assuming we have no budget constraints, what would be a recommended Solr setup (number of shards, number of Solr instances, etc.)?

Thanks!
Mark
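Susheel's extrapolate-from-a-sample suggestion can be sketched as follows. The 10-million-doc sample size is his; the 50 GB measured index size is a made-up placeholder, since no measurement appears in the thread:

```python
def extrapolate_index_size(sample_docs, sample_index_gb, target_docs):
    """Linearly extrapolate index size from a measured sample.
    Real indexes don't scale perfectly linearly (merges, stored
    fields, term dictionaries), so treat this as a ballpark only."""
    return sample_index_gb * (target_docs / sample_docs)

sample_docs = 10_000_000          # Susheel's suggested sample
sample_gb = 50                    # hypothetical measured index size
target_docs = 2_000_000_000       # one day's indexing at Mark's volumes
print(extrapolate_index_size(sample_docs, sample_gb, target_docs))  # 10000.0
```

With a real measured sample, the same ratio gives a first-cut estimate of disk and, from there, memory requirements per shard.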