Thanks everyone for your suggestions. Based on them, I am planning to have one doc per event.
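For reference, the one-doc-per-event scheme suggested in the thread could look roughly like the sketch below: each user action becomes its own small document keyed by session, and the session view is reassembled at read time by querying on that key. Field names such as `sessionId`, `eventType`, and `timestamp` are illustrative assumptions, not taken from the thread.

```python
import json
import uuid
from datetime import datetime, timezone

def make_event_doc(session_id, event_type, payload):
    """Build one small document per user action instead of one large,
    continually rewritten document per session."""
    return {
        "id": str(uuid.uuid4()),           # unique per event, not per session
        "sessionId": session_id,           # join key for read-time aggregation
        "eventType": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": json.dumps(payload),    # arbitrary event details
    }

def session_query(session_id, rows=500):
    """Query parameters to fetch all events of one session, sorted by time.
    Aggregating them into a single session view happens in the caller
    (or in a separate layer once the session ends)."""
    return {
        "q": "sessionId:%s" % session_id,
        "sort": "timestamp asc",
        "rows": rows,   # the thread mentions ~500 events per session
    }
```

This avoids the read-modify-write cycle on a 3-20 MB document for every action; each write is a small independent add.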
On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:

> Hi Mark,
> Appending session actions just to be able to return more than one session
> without retrieving a large number of results is not a good tradeoff. As
> Upayavira suggested, you should consider storing one action per doc and
> aggregating at read time, or pushing to Solr once the session ends and
> aggregating on some other layer.
> If you think handling the infrastructure might be too much, you may
> consider using one of the logging services to hold the data. One such
> service is Sematext's Logsene (http://sematext.com/logsene).
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> On 10.02.2016 03:22, Mark Robinson wrote:
>
>> Thanks for your replies and suggestions!
>>
>> Why do I store all events related to a session under one doc?
>> Each session can have about 500 total entries (events) corresponding to
>> it. So when I try to retrieve a session's info it can come back with
>> around 500 records. If it is one compound doc per session, I can
>> retrieve more sessions at a time.
>> E.g. under a sessionId, an array of eventA activities, eventB activities
>> (using JSON). When an eventA activity occurs again, we read all the data
>> for that session, append the extra info to the eventA data, and push the
>> whole session's data back (indexing) to Solr. Like this for many
>> sessions in parallel.
>>
>> Why NRT?
>> Many sessions are being written in parallel (4 million sessions, hence
>> 4 million docs, per day). A person can do this querying at any time.
>>
>> Is it just a lookup?
>> Yes. We just need to retrieve all info for a session and pass it on to
>> another system. We may even do some extra querying on some of the data
>> in that session's info, like timestamps, page URL etc.
>> Thinking of having the data separate from the actual Solr instance and
>> mentioning the location of the dataDir in solrconfig.
>>
>> If Solr is not a good option, could you please suggest something which
>> will satisfy this use case with minimum response time while querying.
>>
>> Thanks!
>> Mark
>>
>> On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com> wrote:
>>
>>> So as I understand your use case, it's effectively logging actions
>>> within a user session. Why do you have to do the update in NRT? Why not
>>> just log all the user session events (with some unique key, and ensuring
>>> the session Id is in the document somewhere), then when you want to do
>>> the query, you join on the session id, and that gives you all the data
>>> records for that session. I don't really follow why it has to be 1
>>> document (which you continually update). If you really need that
>>> aggregation, couldn't that happen offline?
>>>
>>> I guess your 1 saving grace is that you query using the unique ID (in
>>> your scenario), so you could use the real-time get handler, since you
>>> aren't doing a complex query (strictly it's not a search, it's a raw
>>> key lookup).
>>>
>>> But I would still question your use case. If you go the Solr route for
>>> that kind of scale with querying and indexing that much, you're going
>>> to have to throw a lot of hardware at it, as Jack says probably in the
>>> order of hundreds of machines...
>>>
>>> On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:
>>>
>>>> Bear in mind that Lucene is optimised towards high read, lower write.
>>>> That is, it puts in a lot of effort at write time to make reading
>>>> efficient. It sounds like you are going to be doing far more writing
>>>> than reading, and I wonder whether you are necessarily choosing the
>>>> right tool for the job.
>>>>
>>>> How would you later use this data, and what advantage is there to
>>>> storing it in Solr?
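Daniel's real-time get suggestion above amounts to a plain key lookup against Solr's `/get` handler, which serves the latest version of a document by its unique key (via the update log) without a full search. A minimal sketch of building such a request; the host and collection name `sessions` are assumptions for illustration:

```python
from urllib.parse import urlencode

# Assumed Solr base URL and collection name, purely for illustration.
SOLR = "http://localhost:8983/solr/sessions"

def rtg_url(doc_id):
    """Real-time get: fetch a doc by its unique key, including updates
    not yet visible to normal searches. Cheaper than a query because it
    is a raw key lookup, not a search."""
    return "%s/get?%s" % (SOLR, urlencode({"id": doc_id}))
```

This only helps the lookup-by-session-id path; any querying on timestamps or page URLs would still go through a normal search request.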
>>>> Upayavira
>>>>
>>>> On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
>>>>
>>>>> Hi,
>>>>> Thanks for all your suggestions. I took some time to get the details
>>>>> to be more accurate. Please find what I have gathered:
>>>>>
>>>>> My data being indexed is something like this.
>>>>> I am basically capturing all data related to a user session.
>>>>> Inside a session I have categorized my actions like actionA, actionB
>>>>> etc., per page.
>>>>> So each time an action pertaining to, say, actionA or actionB etc.
>>>>> (in each page) happens, it is updated in Solr under that session
>>>>> (sessionId).
>>>>>
>>>>> So in short there is only one doc pertaining to a single session
>>>>> (identified by sessionId) in my Solr index, and that is retrieved and
>>>>> updated whenever a new action under that session occurs.
>>>>> We expect up to 4 million sessions per day.
>>>>>
>>>>> On average *one session's* *doc has a size* of *3MB to 20MB*.
>>>>> So if it is *4 million sessions per day*, each session writing around
>>>>> *500 times to Solr*, it is *2 billion writes (indexing operations)
>>>>> per day to Solr*.
>>>>> As it is one doc per session, it is *4 million docs per day*.
>>>>> This is around *80K docs indexed per second* during *peak* hours and
>>>>> around *15K docs indexed per second* into Solr during *non-peak*
>>>>> hours.
>>>>> Number of queries per second is around *320 queries per second*.
>>>>>
>>>>> 1. Average size of a doc
>>>>>    3MB to 20MB
>>>>> 2. Query types:
>>>>>    While a session is in progress, whatever data exists for that
>>>>>    session so far is queried, the new action's details are appended
>>>>>    to the existing data already captured for that session, and it is
>>>>>    indexed back into Solr. So the longer the session, the more data
>>>>>    is retrieved by each subsequent query for that session.
>>>>>    Also querying can be done on timestamp etc., which is captured
>>>>>    along with each action.
>>>>> 3. Are docs grouped somehow?
>>>>>    All data related to a session is retrieved from Solr, updated, and
>>>>>    indexed back to Solr based on sessionId. No other grouping.
>>>>> 4. Are they time sensitive (NRT or offline process does this)?
>>>>>    As mentioned above, this is in NRT. Each time a new user action in
>>>>>    that session happens, we need to query the existing session info
>>>>>    already captured for that session, append the new data to the
>>>>>    existing info retrieved, and index it back to Solr.
>>>>> 5. Will they update or is it rebuilt every time, etc.?
>>>>>    Each time a new user action occurs, the full data captured so far
>>>>>    for that session is retrieved from Solr, the latest data for this
>>>>>    new action is appended, and it is indexed back to Solr.
>>>>> 6. And the other thing you haven't told us is whether you plan on
>>>>>    _adding_ 2B docs a day or whether that number is the total corpus
>>>>>    size and you are re-indexing the 2B docs/day. IOW, if you are
>>>>>    adding 2B docs/day, 30 days later do you have 2B docs or 60B docs
>>>>>    in your corpus?
>>>>>    We are expecting around 4 million sessions per day (per session
>>>>>    500 writes to Solr), which turns out to be 2B indexing operations
>>>>>    per day. So after 30 days it would be 4 million * 30 docs in the
>>>>>    index.
>>>>> 7. Is there any aging of docs?
>>>>>    No, we always query against the whole corpus present.
>>>>> 8. Is any doc deleted?
>>>>>    No, all data remains in the index.
>>>>>
>>>>> Any suggestion is very welcome!
>>>>>
>>>>> Thanks!
>>>>> Mark.
>>>>>
>>>>> On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>
>>>>>> Oops...
>>>>>> at 100 qps for a single node you would need 120 nodes to get to 12K
>>>>>> qps and 800 nodes to get to 80K qps, but that is just an extremely
>>>>>> rough ballpark estimate, not some precise and firm number. And that's
>>>>>> if all the queries can be evenly distributed throughout the cluster
>>>>>> and don't require fanout to other shards, which effectively turns
>>>>>> each incoming query into n queries, where n is the number of shards.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>
>>>>>>> So is there any aging or TTL (in database terminology) of older
>>>>>>> docs? And do all of your queries need to query all of the older
>>>>>>> documents all of the time, or is there a clear hierarchy of querying
>>>>>>> for aged documents, like past 24 hours vs. past week vs. past year
>>>>>>> vs. older than a year? Sure, you can always use a function query to
>>>>>>> boost by the inverse of document age, but Solr would be more
>>>>>>> efficient with filter queries or separate indexes for different
>>>>>>> time scales.
>>>>>>>
>>>>>>> Are documents ever updated or are they write-once?
>>>>>>>
>>>>>>> Are documents explicitly deleted?
>>>>>>>
>>>>>>> Technically you probably could meet those specs, but... how many
>>>>>>> organizations have the resources and the energy to do so?
>>>>>>>
>>>>>>> As a back-of-the-envelope calculation, if Solr gave you 100 queries
>>>>>>> per second per node, that would mean you would need 1,200 nodes.
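Jack's corrected ballpark numbers follow from simply dividing the target query rate by an assumed per-node rate. A trivial helper reproducing that arithmetic (the 100 qps/node figure is his rough assumption, not a measured Solr number):

```python
import math

def nodes_needed(target_qps, qps_per_node=100):
    """Back-of-the-envelope node count: assumes queries distribute
    evenly across the cluster and require no fanout to other shards,
    both of which Jack flags as optimistic."""
    return math.ceil(target_qps / qps_per_node)

# Per the thread: 12K qps -> 120 nodes, 80K qps -> 800 nodes.
```

With fanout, each incoming query multiplies into n shard-level queries, so the real node count would be higher still.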
>>>>>>> It would also depend on whether those queries are very narrow so
>>>>>>> that a single node can execute them, or if they require fanout to
>>>>>>> other shards and then aggregation of results from those other
>>>>>>> shards.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Short form: You really have to prototype. Here's the long form:
>>>>>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>>>>>>
>>>>>>>> I've seen between 20M and 200M docs fit on a single piece of
>>>>>>>> hardware, so you'll absolutely have to shard.
>>>>>>>>
>>>>>>>> And the other thing you haven't told us is whether you plan on
>>>>>>>> _adding_ 2B docs a day or whether that number is the total corpus
>>>>>>>> size and you are re-indexing the 2B docs/day. IOW, if you are
>>>>>>>> adding 2B docs/day, 30 days later do you have 2B docs or 60B docs
>>>>>>>> in your corpus?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Also consider whether you are expecting indexing of 2 billion docs
>>>>>>>>> as NRT or whether it will be offline (during off hours etc.). For
>>>>>>>>> more accurate sizing you may also want to index, say, 10 million
>>>>>>>>> documents, which may give you an idea of your index size, and then
>>>>>>>>> use that for extrapolation to come up with memory requirements.
>>>>>>>>> Thanks,
>>>>>>>>> Susheel
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>> Can you give us a bit more details: size of docs, query types,
>>>>>>>>>> are docs grouped somehow, are they time sensitive, will they
>>>>>>>>>> update or is it rebuilt every time, etc.?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Emir
>>>>>>>>>>
>>>>>>>>>> On 08.02.2016 16:56, Mark Robinson wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> We have a requirement where we would need to index around 2
>>>>>>>>>>> billion docs in a day.
>>>>>>>>>>> The queries against this indexed data set can be around 80K
>>>>>>>>>>> queries per second during peak time and around 12K queries per
>>>>>>>>>>> second during non-peak hours.
>>>>>>>>>>>
>>>>>>>>>>> Can Solr handle these huge volumes?
>>>>>>>>>>>
>>>>>>>>>>> If so, assuming we have no budget constraints, what would be a
>>>>>>>>>>> recommended Solr setup (number of shards, number of Solr
>>>>>>>>>>> instances, etc.)?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
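Susheel's sizing-by-extrapolation suggestion in the thread is simple proportional arithmetic: index a sample, measure the index size, and scale linearly to the full corpus. A sketch of that calculation; the sample and corpus figures in the example are placeholders, and real index growth should be verified with a second, larger sample since it is only roughly linear:

```python
def extrapolate_index_size(sample_docs, sample_size_gb, total_docs):
    """Linear extrapolation from a measured sample index to the full
    corpus size, per Susheel's suggestion. Treat the result as a
    starting point for memory/disk planning, not a guarantee."""
    return sample_size_gb * (total_docs / sample_docs)

# E.g. a 10M-doc sample measuring 15 GB suggests ~180 GB for 120M docs
# (4M docs/day * 30 days).
```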