This model is not efficient for this type of querying. You cannot do this in one query using this model, and the pre-processing you do now, plus a script filter that traverses every document, is very costly.
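As a rough illustration of why the per-document traversal is avoidable: whether a user reached /sale/B after /promo/A depends only on that user's earliest /promo/A timestamp and latest /sale/B timestamp. The sketch below is plain Python, not an Elasticsearch API. The function names are made up for this example, and it assumes timestamps in the "YYYY-MM-DD HH:MM:SS" format shown in the question, which compare correctly as strings.

```python
from typing import Dict, Iterable, Optional, Tuple

Summary = Dict[str, Tuple[Optional[str], Optional[str]]]


def project_users(events: Iterable[dict],
                  first: str = "/promo/A",
                  then: str = "/sale/B") -> Summary:
    """One pass over the raw events, in any order.

    Returns {userid: (earliest tstamp on `first`, latest tstamp on `then`)}.
    Hypothetical helper; assumes tstamp strings sort chronologically.
    """
    summary: Summary = {}
    for ev in events:
        path = ev["path"]
        if path != first and path != then:
            continue  # events on other paths do not affect the answer
        a, b = summary.get(ev["userid"], (None, None))
        ts = ev["tstamp"]
        if path == first and (a is None or ts < a):
            a = ts
        elif path == then and (b is None or ts > b):
            b = ts
        summary[ev["userid"]] = (a, b)
    return summary


def count_then_after_first(summary: Summary) -> int:
    """Count users whose earliest `first` visit precedes their latest `then` visit."""
    return sum(1 for a, b in summary.values()
               if a is not None and b is not None and a < b)


# The sample events from the question, reduced to the fields used here.
events = [
    {"userid": "xyz", "path": "/promo/A", "tstamp": "2013-04-01 00:01:01"},
    {"userid": "pdq", "path": "/page/1", "tstamp": "2013-04-01 00:02:11"},
    {"userid": "xyz", "path": "/promo/D", "tstamp": "2013-04-01 00:06:31"},
    {"userid": "abc", "path": "/page/23", "tstamp": "2013-04-01 00:08:00"},
    {"userid": "pdq", "path": "/page/2", "tstamp": "2013-04-01 00:08:55"},
    {"userid": "xyz", "path": "/sale/B", "tstamp": "2013-04-01 00:09:46"},
    {"userid": "abc", "path": "/promo/A", "tstamp": "2013-04-01 00:10:46"},
]

summary = project_users(events)
print(count_then_after_first(summary))  # 1 (only user "xyz")
```

A per-user summary like this is tied to one specific pair of paths, so it only helps if the pairs of interest are known at indexing time.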
Is it possible for you to index the data (even as a projection) into Elasticsearch using a different model, so you can use ES properly, with queries or the aggregations framework?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Author of RavenDB in Action <http://manning.com/synhershko/>

On Thu, Jun 5, 2014 at 12:04 AM, Zennet Wheatcroft <zwheatcr...@atypon.com> wrote:

> Hi,
>
> I am looking for an efficient way to do inter-document queries in
> Elasticsearch. Specifically, I want to count the number of users that went
> through an exit point B after visiting point A.
>
> In general terms, say we have some event log data about users' actions on a
> website:
> ....
> {"userid":"xyz", "machineid":"110530745", "path":"/promo/A", "country":"US", "tstamp":"2013-04-01 00:01:01"}
> {"userid":"pdq", "machineid":"110519774", "path":"/page/1", "country":"CN", "tstamp":"2013-04-01 00:02:11"}
> {"userid":"xyz", "machineid":"110530745", "path":"/promo/D", "country":"US", "tstamp":"2013-04-01 00:06:31"}
> {"userid":"abc", "machineid":"110527022", "path":"/page/23", "country":"DE", "tstamp":"2013-04-01 00:08:00"}
> {"userid":"pdq", "machineid":"110519774", "path":"/page/2", "country":"CN", "tstamp":"2013-04-01 00:08:55"}
> {"userid":"xyz", "machineid":"110530745", "path":"/sale/B", "country":"US", "tstamp":"2013-04-01 00:09:46"}
> {"userid":"abc", "machineid":"110527022", "path":"/promo/A", "country":"DE", "tstamp":"2013-04-01 00:10:46"}
> ....
> And we have 500+M such entries.
>
> We want a count of the number of userids that visited path=/sale/B after
> visiting path=/promo/A.
>
> What I did is to preprocess the data, sorting by <userid, tstamp>, then
> compacting all events by the same userid into the same document. Then I
> wrote a script filter which traverses the path array per document, and
> returns true if it finds any occurrence of A followed by B.
> This, however, is inefficient. Most of our queries take 1 or 2 seconds on
> 100+M events. This script filter query takes over 300 seconds. Specifically,
> it can process about 400K events per second. By comparison, I wrote a naive
> program that does a linear pass of the un-compacted data, and that processes
> 11M events per second. From this I conclude that Elasticsearch does not do
> well on this type of query.
>
> I am hoping someone can indicate a more efficient way to do this query in
> ES. Or else confirm that ES cannot do inter-document queries well.
>
> Thanks,
> Zennet
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/28c93f2d-e870-4347-8677-e9da41b6be62%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.