Thanks All for your suggestions! Rgds, Mark.
On Thu, Feb 11, 2016 at 9:45 AM, Upayavira <u...@odoko.co.uk> wrote:

Your biggest issue here is likely to be HTTP connections. Making an HTTP connection to Solr is far more expensive than the task of adding a single document to the index. If you are expecting to add 24 billion docs per day, I'd suggest that merging those documents into batches before sending them to Solr will be necessary.

To my previous question - what do you gain by using Solr that you don't get from other solutions? To make this system really work, you are going to need a deep understanding of how Lucene works - segments, segment merges, deletions, and many other things - because when you start to work at that scale, the implementation details behind Lucene really start to matter and impact your ability to succeed.

What you are undertaking can certainly be done, but it is a substantial project.

Upayavira

On Wed, Feb 10, 2016, at 09:48 PM, Mark Robinson wrote:

Thanks everyone for your suggestions. Based on them I am planning to have one doc per event, with the sessionId common.

In that case, hopefully indexing each doc as and when it arrives would be okay? Or do we still need to batch before indexing to Solr?

Also, with 4M sessions a day and about 6,000 docs (events) per session, we can expect about 24 billion docs per day!

Will Solr still hold up? If so, could someone please recommend a sizing to cater to this level of data? The query rate is around 320 qps.

Thanks!
Mark

On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:

Hi Mark,
Appending session actions just to be able to return more than one session without retrieving a large number of results is not a good tradeoff.
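Upayavira's batching advice above can be sketched roughly as follows. This is a minimal Python illustration, not anything from the thread: the `send_batch` callable is a hypothetical stand-in for whatever actually POSTs one JSON array per request to Solr's `/update` endpoint.

```python
import json

def batched(docs, batch_size=1000):
    """Group an iterable of docs into lists of at most batch_size,
    so each HTTP request to Solr carries many documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def index_all(docs, send_batch, batch_size=1000):
    """send_batch is assumed to POST one JSON array per call,
    e.g. to http://host:8983/solr/<collection>/update."""
    requests_made = 0
    for batch in batched(docs, batch_size):
        send_batch(json.dumps(batch))
        requests_made += 1
    return requests_made

# One HTTP request per 1000 docs instead of one per doc:
events = [{"id": str(i), "sessionId": "s1"} for i in range(2500)]
print(index_all(events, send_batch=lambda payload: None))  # 3
```

At 24 billion docs per day, cutting requests by a factor of 1000 is the difference between ~278K and ~278 HTTP requests per second.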
Like Upayavira suggested, you should consider storing one action per doc and aggregating at read time, or pushing to Solr once the session ends and aggregating in some other layer.
If you think handling the infrastructure might be too much, you may consider using a logging service to hold the data. One such service is Sematext's Logsene (http://sematext.com/logsene).

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On 10.02.2016 03:22, Mark Robinson wrote:

Thanks for your replies and suggestions!

Why do I store all events related to a session under one doc?
Each session can have about 500 total entries (events), so retrieving a session's info can come back as around 500 records. With one compounded doc per session, I can retrieve more sessions at a time. For example, under a sessionId there is an array of eventA activities, eventB activities, etc. (using JSON). When another eventA activity occurs, we read all the data for that session, append the extra info to the eventA data, and push the whole session's data back (re-index) to Solr. This happens for many sessions in parallel.

Why NRT?
Many sessions are being written in parallel (4 million sessions, hence 4 million docs, per day). A person can do this querying at any time.

Is it just a lookup?
Yes. We just need to retrieve all the info for a session and pass it on to another system. We may also do some extra querying on data such as timestamps, page URL, etc. in the info added to a session.
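Emir's "one action per doc, aggregate at read time" alternative can be sketched like this. A minimal Python sketch, assuming the per-event docs have already been fetched from Solr (e.g. by a `q=sessionId:<id>` query); field names here are illustrative:

```python
from collections import defaultdict

def aggregate_sessions(event_docs):
    """Rebuild per-session views from individual event docs,
    instead of maintaining one ever-growing doc per session."""
    sessions = defaultdict(list)
    for doc in event_docs:
        sessions[doc["sessionId"]].append(doc)
    # Order each session's events by timestamp at read time.
    for events in sessions.values():
        events.sort(key=lambda d: d["timestamp"])
    return dict(sessions)

docs = [
    {"sessionId": "s1", "timestamp": 2, "event": "eventB"},
    {"sessionId": "s1", "timestamp": 1, "event": "eventA"},
    {"sessionId": "s2", "timestamp": 1, "event": "eventA"},
]
views = aggregate_sessions(docs)
print([d["event"] for d in views["s1"]])  # ['eventA', 'eventB']
```

The write path then stays append-only (no read-modify-write of a 3-20MB doc per action), which is the tradeoff Emir is pointing at.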
We are thinking of keeping the data separate from the actual Solr instance and specifying the location of the dataDir in solrconfig.

If Solr is not a good option, could you please suggest something that would satisfy this use case with minimal response time while querying?

Thanks!
Mark

On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com> wrote:

So as I understand your use case, it is effectively logging actions within a user session. Why do you have to do the update in NRT? Why not just log all the user session events (with some unique key, ensuring the session Id is in the document somewhere); then when you want to do the query, you join on the session id, and that gives you all the data records for that session. I don't really follow why it has to be one document (which you continually update). If you really need that aggregation, couldn't it happen offline?

I guess your one saving grace is that you query using the unique ID (in your scenario), so you could use the real-time get handler, since you aren't doing a complex query (strictly it's not a search, it's a raw key lookup).

But I would still question your use case. If you go the Solr route at that kind of scale of querying and indexing, you're going to have to throw a lot of hardware at it - as Jack says, probably on the order of hundreds of machines...

On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:

Bear in mind that Lucene is optimised towards high read, lower write. That is, it puts in a lot of effort at write time to make reading efficient.
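Daniel's real-time get suggestion is a raw key lookup against Solr's `/get` handler rather than a search; it returns the latest version of a doc by its uniqueKey even before a commit makes it searchable. A small sketch of building such a request (host, collection, and doc ID below are hypothetical examples):

```python
from urllib.parse import urlencode

def realtime_get_url(base_url, collection, doc_id):
    """Build a request URL for Solr's real-time get handler (/get),
    which looks a doc up by uniqueKey instead of running a query."""
    return "%s/%s/get?%s" % (base_url, collection, urlencode({"id": doc_id}))

url = realtime_get_url("http://localhost:8983/solr", "sessions", "session-42")
print(url)  # http://localhost:8983/solr/sessions/get?id=session-42
# The actual fetch would be something like:
#   import json, urllib.request
#   doc = json.load(urllib.request.urlopen(url))["doc"]
```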
It sounds like you are going to be doing far more writing than reading, and I wonder whether you are necessarily choosing the right tool for the job.

How would you later use this data, and what advantage is there to storing it in Solr?

Upayavira

On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:

Hi,
Thanks for all your suggestions. I took some time to get the details to be more accurate. Please find what I have gathered:

My data being indexed is something like this. I am basically capturing all data related to a user session. Inside a session I have categorized my actions like actionA, actionB, etc., per page. So each time an action pertaining to, say, actionA or actionB (on each page) happens, it is updated in Solr under that session (sessionId).

In short, there is only one doc per session (identified by sessionId) in my Solr index, and it is retrieved and updated whenever a new action under that session occurs. We expect up to 4 million sessions per day.

On average one session's doc has a size of 3MB to 20MB. So at 4 million sessions per day, with each session writing around 500 times to Solr, that is 2 billion writes (indexing operations) per day to Solr. As it is one doc per session, it is 4 million docs per day. This is around 80K docs indexed per second during peak hours and around 15K docs indexed per second during non-peak hours. The number of queries per second is around 320.

1. Average size of a doc
   3MB to 20MB
2. Query types:
   While a session is in progress, whatever data exists for that session so far is queried, the new action's details are appended to the data already captured, and the result is indexed back into Solr. So the longer the session, the more data each subsequent query retrieves for that session. Querying can also be done on the timestamp, page URL, etc. captured along with each action.
3. Are docs grouped somehow?
   All data related to a session is retrieved from Solr, updated, and indexed back based on sessionId. No other grouping.
4. Are they time sensitive (NRT or offline process)?
   As mentioned above, this is NRT. Each time a new user action in a session happens, we query the existing session info already captured, append the new data, and index it back to Solr.
5. Will they update, or is the index rebuilt every time?
   Each time a new user action occurs, the full data captured so far for that session is retrieved from Solr, the latest data for the new action is appended, and it is indexed back to Solr.
6. And the other thing you haven't told us is whether you plan on _adding_ 2B docs a day or whether that number is the total corpus size and you are re-indexing 2B docs/day. IOW, if you are adding 2B docs/day, 30 days later do you have 2B docs or 60B docs in your corpus?
   We are expecting around 4 million sessions per day (500 writes to Solr per session), which works out to 2B indexing operations per day. So after 30 days it would be 4 million x 30 docs in the index.
7. Is there any aging of docs?
   No, we always query against the whole corpus.
8. Is any doc deleted?
   No, all data remains in the index.

Any suggestion is very welcome!

Thanks!
Mark.

On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

Oops... at 100 qps for a single node you would need 120 nodes to get to 12K qps and 800 nodes to get to 80K qps, but that is just an extremely rough ballpark estimate, not some precise and firm number. And that's if all the queries can be evenly distributed throughout the cluster and don't require fanout to other shards, which effectively turns each incoming query into n queries, where n is the number of shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

So is there any aging or TTL (in database terminology) of older docs? And do all of your queries need to query all of the older documents all of the time, or is there a clear hierarchy of querying for aged documents, like past 24 hours vs. past week vs. past year vs. older than a year?
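The volumes and node counts quoted in the thread can be sanity-checked with quick arithmetic. The 100-qps-per-node figure below is Jack's rough assumption, not a measured number:

```python
import math

sessions_per_day = 4_000_000
writes_per_session = 500
writes_per_day = sessions_per_day * writes_per_session
print(writes_per_day)                    # 2000000000 writes/day
print(round(writes_per_day / 86_400))    # 23148 writes/sec on average

# Jack's back-of-the-envelope node counts, assuming 100 qps/node:
qps_per_node = 100
print(math.ceil(12_000 / qps_per_node))  # 120 nodes for non-peak query load
print(math.ceil(80_000 / qps_per_node))  # 800 nodes for peak query load
```

Note the daily average (~23K writes/sec) is well below Mark's quoted 80K docs/sec peak, so the peak rate, not the daily total, drives the sizing.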
Sure, you can always use a function query to boost by the inverse of document age, but Solr would be more efficient with filter queries or separate indexes for different time scales.

Are documents ever updated, or are they write-once?

Are documents explicitly deleted?

Technically you probably could meet those specs, but... how many organizations have the resources and the energy to do so?

As a back-of-the-envelope calculation, if Solr gave you 100 queries per second per node, you would need 1,200 nodes. It would also depend on whether those queries are narrow enough that a single node can execute them, or whether they require fanout to other shards and then aggregation of results from those other shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com> wrote:

Short form: you really have to prototype. Here's the long form:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

I've seen between 20M and 200M docs fit on a single piece of hardware, so you'll absolutely have to shard.
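Erick's 20M-200M docs-per-machine observation gives a rough shard-count range by document count. The corpus figure below uses Mark's own 4M docs/day over 30 days; everything here is a thread estimate, not a benchmark:

```python
import math

docs_per_day = 4_000_000
corpus_after_30_days = docs_per_day * 30       # 120M session docs
print(corpus_after_30_days)

for docs_per_shard in (20_000_000, 200_000_000):
    shards = math.ceil(corpus_after_30_days / docs_per_shard)
    print(docs_per_shard, "->", shards, "shard(s)")
# By doc count alone, 120M docs needs only 1-6 shards; but at 3-20MB
# per doc the raw index *size* (roughly 0.36-2.4 PB) will dominate
# the real shard count, which is why prototyping is essential.
```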
And the other thing you haven't told us is whether you plan on _adding_ 2B docs a day or whether that number is the total corpus size and you are re-indexing the 2B docs/day. IOW, if you are adding 2B docs/day, 30 days later do you have 2B docs or 60B docs in your corpus?

Best,
Erick

On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com> wrote:

Also consider whether the indexing of 2 billion docs will be NRT or offline (during off hours, etc.). For more accurate sizing you may also want to index, say, 10 million documents, which will give you an idea of your index size, and then extrapolate from that to come up with memory requirements.

Thanks,
Susheel

On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:

Hi Mark,
Can you give us a bit more detail: size of docs, query types, are docs grouped somehow, are they time sensitive, will they be updated or is the index rebuilt every time, etc.?
Thanks,
Emir

On 08.02.2016 16:56, Mark Robinson wrote:

Hi,
We have a requirement where we would need to index around 2 billion docs in a day.
Queries against this indexed data set can run around 80K queries per second during peak time and around 12K queries per second during non-peak hours.

Can Solr handle such huge volumes?

If so, assuming we have no budget constraints, what would be a recommended Solr setup (number of shards, number of Solr instances, etc.)?

Thanks!
Mark
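Susheel's extrapolate-from-a-sample suggestion can be sketched as follows. The 10-million-doc sample size is his; the 50 GB measured index size is a made-up placeholder, since no measurement appears in the thread:

```python
def extrapolate_index_size(sample_docs, sample_index_gb, target_docs):
    """Linearly extrapolate index size from a measured sample.
    Real indexes don't scale perfectly linearly (merges, stored
    fields, term dictionaries), so treat this as a ballpark only."""
    return sample_index_gb * (target_docs / sample_docs)

sample_docs = 10_000_000          # Susheel's suggested sample
sample_gb = 50                    # hypothetical measured index size
target_docs = 2_000_000_000       # one day's indexing at Mark's volumes
print(extrapolate_index_size(sample_docs, sample_gb, target_docs))  # 10000.0
```

With a real measured sample, the same ratio gives a first-cut estimate of disk and, from there, memory requirements per shard.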