Thanks everyone for your suggestions. Based on them, I am planning to have one doc per event.
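For reference, the one-doc-per-event scheme suggested in the thread could look roughly like the sketch below: each user action becomes its own small document keyed by session, and the session view is reassembled at read time by querying on that key. Field names such as `sessionId`, `eventType`, and `timestamp` are illustrative assumptions, not taken from the thread.

```python
import json
import uuid
from datetime import datetime, timezone

def make_event_doc(session_id, event_type, payload):
    """Build one small document per user action instead of one large,
    continually rewritten document per session."""
    return {
        "id": str(uuid.uuid4()),           # unique per event, not per session
        "sessionId": session_id,           # join key for read-time aggregation
        "eventType": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": json.dumps(payload),    # arbitrary event details
    }

def session_query(session_id, rows=500):
    """Query parameters to fetch all events of one session, sorted by time.
    Aggregating them into a single session view happens in the caller
    (or in a separate layer once the session ends)."""
    return {
        "q": "sessionId:%s" % session_id,
        "sort": "timestamp asc",
        "rows": rows,   # the thread mentions ~500 events per session
    }
```

This avoids the read-modify-write cycle on a 3-20 MB document for every action; each write is a small independent add.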
On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:

> Hi Mark,
> Appending session actions just to be able to return more than one session
> without retrieving a large number of results is not a good tradeoff. As
> Upayavira suggested, you should consider storing one action per doc and
> aggregating at read time, or pushing to Solr once the session ends and
> aggregating on some other layer.
> If you think handling the infrastructure might be too much, you may
> consider using one of the logging services to hold the data. One such
> service is Sematext's Logsene (http://sematext.com/logsene).
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
> On 10.02.2016 03:22, Mark Robinson wrote:
>
>> Thanks for your replies and suggestions!
>>
>> Why do I store all events related to a session under one doc?
>> Each session can have about 500 total entries (events) corresponding to
>> it. So when I try to retrieve a session's info it can come back with
>> around 500 records. If it is one compound doc per session, I can
>> retrieve more sessions at a time.
>> E.g. under a sessionId, an array of eventA activities, eventB activities
>> (using JSON). When an eventA activity occurs again, we read all the data
>> for that session, append the extra info to the eventA data, and push the
>> whole session's data back (indexing) to Solr. Like this for many
>> sessions in parallel.
>>
>> Why NRT?
>> Many sessions are being written in parallel (4 million sessions, hence
>> 4 million docs, per day). A person can do this querying at any time.
>>
>> Is it just a lookup?
>> Yes. We just need to retrieve all info for a session and pass it on to
>> another system. We may even do some extra querying on some of the data
>> in that session's info, like timestamps, page URL etc.
>> Thinking of having the data separate from the actual Solr instance and
>> mentioning the location of the dataDir in solrconfig.
>>
>> If Solr is not a good option, could you please suggest something which
>> will satisfy this use case with minimum response time while querying.
>>
>> Thanks!
>> Mark
>>
>> On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com> wrote:
>>
>>> So as I understand your use case, it's effectively logging actions
>>> within a user session. Why do you have to do the update in NRT? Why not
>>> just log all the user session events (with some unique key, and ensuring
>>> the session Id is in the document somewhere), then when you want to do
>>> the query, you join on the session id, and that gives you all the data
>>> records for that session. I don't really follow why it has to be 1
>>> document (which you continually update). If you really need that
>>> aggregation, couldn't that happen offline?
>>>
>>> I guess your 1 saving grace is that you query using the unique ID (in
>>> your scenario), so you could use the real-time get handler, since you
>>> aren't doing a complex query (strictly it's not a search, it's a raw
>>> key lookup).
>>>
>>> But I would still question your use case. If you go the Solr route for
>>> that kind of scale with querying and indexing that much, you're going
>>> to have to throw a lot of hardware at it, as Jack says probably in the
>>> order of hundreds of machines...
>>>
>>> On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:
>>>
>>>> Bear in mind that Lucene is optimised towards high read, lower write.
>>>> That is, it puts in a lot of effort at write time to make reading
>>>> efficient. It sounds like you are going to be doing far more writing
>>>> than reading, and I wonder whether you are necessarily choosing the
>>>> right tool for the job.
>>>>
>>>> How would you later use this data, and what advantage is there to
>>>> storing it in Solr?
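Daniel's real-time get suggestion above amounts to a plain key lookup against Solr's `/get` handler, which serves the latest version of a document by its unique key (via the update log) without a full search. A minimal sketch of building such a request; the host and collection name `sessions` are assumptions for illustration:

```python
from urllib.parse import urlencode

# Assumed Solr base URL and collection name, purely for illustration.
SOLR = "http://localhost:8983/solr/sessions"

def rtg_url(doc_id):
    """Real-time get: fetch a doc by its unique key, including updates
    not yet visible to normal searches. Cheaper than a query because it
    is a raw key lookup, not a search."""
    return "%s/get?%s" % (SOLR, urlencode({"id": doc_id}))
```

This only helps the lookup-by-session-id path; any querying on timestamps or page URLs would still go through a normal search request.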
>>>> Upayavira
>>>>
>>>> On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
>>>>
>>>>> Hi,
>>>>> Thanks for all your suggestions. I took some time to get the details
>>>>> to be more accurate. Please find what I have gathered:
>>>>>
>>>>> My data being indexed is something like this.
>>>>> I am basically capturing all data related to a user session.
>>>>> Inside a session I have categorized my actions like actionA, actionB
>>>>> etc., per page.
>>>>> So each time an action pertaining to, say, actionA or actionB etc.
>>>>> (in each page) happens, it is updated in Solr under that session
>>>>> (sessionId).
>>>>>
>>>>> So in short there is only one doc pertaining to a single session
>>>>> (identified by sessionId) in my Solr index, and that is retrieved and
>>>>> updated whenever a new action under that session occurs.
>>>>> We expect up to 4 million sessions per day.
>>>>>
>>>>> On average *one session's* *doc has a size* of *3MB to 20MB*.
>>>>> So if it is *4 million sessions per day*, each session writing around
>>>>> *500 times to Solr*, it is *2 billion writes (indexing operations)
>>>>> per day to Solr*.
>>>>> As it is one doc per session, it is *4 million docs per day*.
>>>>> This is around *80K docs indexed per second* during *peak* hours and
>>>>> around *15K docs indexed per second* into Solr during *non-peak*
>>>>> hours.
>>>>> Number of queries per second is around *320 queries per second*.
>>>>>
>>>>> 1. Average size of a doc
>>>>>    3MB to 20MB
>>>>> 2. Query types:
>>>>>    While a session is in progress, whatever data exists for that
>>>>>    session so far is queried, the new action's details are appended
>>>>>    to the existing data already captured for that session, and it is
>>>>>    indexed back into Solr. So the longer the session, the more data
>>>>>    is retrieved by each subsequent query for that session.
>>>>>    Also querying can be done on timestamp etc., which is captured
>>>>>    along with each action.
>>>>> 3. Are docs grouped somehow?
>>>>>    All data related to a session is retrieved from Solr, updated, and
>>>>>    indexed back to Solr based on sessionId. No other grouping.
>>>>> 4. Are they time sensitive (NRT or offline process does this)?
>>>>>    As mentioned above, this is in NRT. Each time a new user action in
>>>>>    that session happens, we need to query the existing session info
>>>>>    already captured for that session, append the new data to the
>>>>>    existing info retrieved, and index it back to Solr.
>>>>> 5. Will they update or is it rebuilt every time, etc.?
>>>>>    Each time a new user action occurs, the full data captured so far
>>>>>    for that session is retrieved from Solr, the latest data for this
>>>>>    new action is appended, and it is indexed back to Solr.
>>>>> 6. And the other thing you haven't told us is whether you plan on
>>>>>    _adding_ 2B docs a day or whether that number is the total corpus
>>>>>    size and you are re-indexing the 2B docs/day. IOW, if you are
>>>>>    adding 2B docs/day, 30 days later do you have 2B docs or 60B docs
>>>>>    in your corpus?
>>>>>    We are expecting around 4 million sessions per day (per session
>>>>>    500 writes to Solr), which turns out to be 2B indexing operations
>>>>>    per day. So after 30 days it would be 4 million * 30 docs in the
>>>>>    index.
>>>>> 7. Is there any aging of docs?
>>>>>    No, we always query against the whole corpus present.
>>>>> 8. Is any doc deleted?
>>>>>    No, all data remains in the index.
>>>>>
>>>>> Any suggestion is very welcome!
>>>>>
>>>>> Thanks!
>>>>> Mark.
>>>>>
>>>>> On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>
>>>>>> Oops...
>>>>>> at 100 qps for a single node you would need 120 nodes to get to 12K
>>>>>> qps and 800 nodes to get to 80K qps, but that is just an extremely
>>>>>> rough ballpark estimate, not some precise and firm number. And that's
>>>>>> if all the queries can be evenly distributed throughout the cluster
>>>>>> and don't require fanout to other shards, which effectively turns
>>>>>> each incoming query into n queries, where n is the number of shards.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>
>>>>>>> So is there any aging or TTL (in database terminology) of older
>>>>>>> docs? And do all of your queries need to query all of the older
>>>>>>> documents all of the time, or is there a clear hierarchy of querying
>>>>>>> for aged documents, like past 24 hours vs. past week vs. past year
>>>>>>> vs. older than a year? Sure, you can always use a function query to
>>>>>>> boost by the inverse of document age, but Solr would be more
>>>>>>> efficient with filter queries or separate indexes for different
>>>>>>> time scales.
>>>>>>>
>>>>>>> Are documents ever updated or are they write-once?
>>>>>>>
>>>>>>> Are documents explicitly deleted?
>>>>>>>
>>>>>>> Technically you probably could meet those specs, but... how many
>>>>>>> organizations have the resources and the energy to do so?
>>>>>>>
>>>>>>> As a back-of-the-envelope calculation, if Solr gave you 100 queries
>>>>>>> per second per node, that would mean you would need 1,200 nodes.
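Jack's corrected ballpark numbers follow from simply dividing the target query rate by an assumed per-node rate. A trivial helper reproducing that arithmetic (the 100 qps/node figure is his rough assumption, not a measured Solr number):

```python
import math

def nodes_needed(target_qps, qps_per_node=100):
    """Back-of-the-envelope node count: assumes queries distribute
    evenly across the cluster and require no fanout to other shards,
    both of which Jack flags as optimistic."""
    return math.ceil(target_qps / qps_per_node)

# Per the thread: 12K qps -> 120 nodes, 80K qps -> 800 nodes.
```

With fanout, each incoming query multiplies into n shard-level queries, so the real node count would be higher still.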
>>>>>>> It would also depend on whether those queries are very narrow so
>>>>>>> that a single node can execute them, or if they require fanout to
>>>>>>> other shards and then aggregation of results from those other
>>>>>>> shards.
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Short form: You really have to prototype. Here's the long form:
>>>>>>>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>>>>>>
>>>>>>>> I've seen between 20M and 200M docs fit on a single piece of
>>>>>>>> hardware, so you'll absolutely have to shard.
>>>>>>>>
>>>>>>>> And the other thing you haven't told us is whether you plan on
>>>>>>>> _adding_ 2B docs a day or whether that number is the total corpus
>>>>>>>> size and you are re-indexing the 2B docs/day. IOW, if you are
>>>>>>>> adding 2B docs/day, 30 days later do you have 2B docs or 60B docs
>>>>>>>> in your corpus?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Also consider whether you are expecting indexing of 2 billion docs
>>>>>>>>> as NRT or whether it will be offline (during off hours etc.). For
>>>>>>>>> more accurate sizing you may also want to index, say, 10 million
>>>>>>>>> documents, which may give you an idea of your index size, and then
>>>>>>>>> use that for extrapolation to come up with memory requirements.
>>>>>>>>> Thanks,
>>>>>>>>> Susheel
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>> Can you give us a bit more details: size of docs, query types,
>>>>>>>>>> are docs grouped somehow, are they time sensitive, will they
>>>>>>>>>> update or is it rebuilt every time, etc.?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Emir
>>>>>>>>>>
>>>>>>>>>> On 08.02.2016 16:56, Mark Robinson wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> We have a requirement where we would need to index around 2
>>>>>>>>>>> billion docs in a day.
>>>>>>>>>>> The queries against this indexed data set can be around 80K
>>>>>>>>>>> queries per second during peak time and around 12K queries per
>>>>>>>>>>> second during non-peak hours.
>>>>>>>>>>>
>>>>>>>>>>> Can Solr handle these huge volumes?
>>>>>>>>>>>
>>>>>>>>>>> If so, assuming we have no budget constraints, what would be a
>>>>>>>>>>> recommended Solr setup (number of shards, number of Solr
>>>>>>>>>>> instances, etc.)?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
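Susheel's sizing-by-extrapolation suggestion in the thread is simple proportional arithmetic: index a sample, measure the index size, and scale linearly to the full corpus. A sketch of that calculation; the sample and corpus figures in the example are placeholders, and real index growth should be verified with a second, larger sample since it is only roughly linear:

```python
def extrapolate_index_size(sample_docs, sample_size_gb, total_docs):
    """Linear extrapolation from a measured sample index to the full
    corpus size, per Susheel's suggestion. Treat the result as a
    starting point for memory/disk planning, not a guarantee."""
    return sample_size_gb * (total_docs / sample_docs)

# E.g. a 10M-doc sample measuring 15 GB suggests ~180 GB for 120M docs
# (4M docs/day * 30 days).
```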