Yes. I can re-index the data or transform it in any way to make this query efficient.
What would you suggest?

On Wednesday, June 4, 2014 2:14:09 PM UTC-7, Itamar Syn-Hershko wrote:
>
> This model is not efficient for this type of querying. You cannot do this
> in one query using this model, and the pre-processing work you do now +
> traversing all documents is very costly.
>
> Is it possible for you to index the data (even as a projection) into
> Elasticsearch using a different model, so you can use ES properly using
> queries or the aggregations framework?
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant
> Author of RavenDB in Action <http://manning.com/synhershko/>
>
>
> On Thu, Jun 5, 2014 at 12:04 AM, Zennet Wheatcroft <zwhea...@atypon.com> wrote:
>
>> Hi,
>>
>> I am looking for an efficient way to do inter-document queries in
>> Elasticsearch. Specifically, I want to count the number of users that went
>> through an exit point B after visiting point A.
>>
>> In general terms, say we have some event log data about users' actions on
>> a website:
>> ....
>> {"userid":"xyz", "machineid":"110530745", "path":"/promo/A", "country":"US", "tstamp":"2013-04-01 00:01:01"}
>> {"userid":"pdq", "machineid":"110519774", "path":"/page/1", "country":"CN", "tstamp":"2013-04-01 00:02:11"}
>> {"userid":"xyz", "machineid":"110530745", "path":"/promo/D", "country":"US", "tstamp":"2013-04-01 00:06:31"}
>> {"userid":"abc", "machineid":"110527022", "path":"/page/23", "country":"DE", "tstamp":"2013-04-01 00:08:00"}
>> {"userid":"pdq", "machineid":"110519774", "path":"/page/2", "country":"CN", "tstamp":"2013-04-01 00:08:55"}
>> {"userid":"xyz", "machineid":"110530745", "path":"/sale/B", "country":"US", "tstamp":"2013-04-01 00:09:46"}
>> {"userid":"abc", "machineid":"110527022 ", "path":"/promo/A", "country":"DE", "tstamp":"2013-04-01 00:10:46"}
>> ....
>> And we have 500+M such entries.
>>
>> We want a count of the number of userids that visited path=/sale/B after
>> visiting path=/promo/A.
>>
>> What I did is to preprocess the data, sorting by <userid, tstamp>, then
>> compacting all events by the same userid into the same document. Then I
>> wrote a script filter which traverses the path array per document, and
>> returns true if it finds any occurrence of A followed by B. This, however,
>> is inefficient. Most of our queries take 1 or 2 seconds on 100+M events.
>> This script filter query takes over 300 seconds. Specifically, it can
>> process events at about 400K events per second. By comparison, I wrote a
>> naive program that does a linear pass of the un-compacted data, and that
>> processes 11M events per second. From this I conclude that Elasticsearch
>> does not do well on this type of query.
>>
>> I am hoping someone can indicate a more efficient way to do this query in
>> ES. Or else confirm that ES cannot do inter-document queries well.
>>
>> Thanks,
>> Zennet
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/28c93f2d-e870-4347-8677-e9da41b6be62%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
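
For reference, the count described in the original question (userids that visited /sale/B after visiting /promo/A) can be sketched as a single linear pass over the raw event log. This is only an illustration and not the actual program mentioned in the thread: the function name count_a_then_b and the variable names are hypothetical, and it assumes the events arrive as one JSON object per line, already ordered by tstamp.

```python
import json
from typing import Iterable, Set


def count_a_then_b(lines: Iterable[str],
                   first: str = "/promo/A",
                   then: str = "/sale/B") -> int:
    """Count distinct userids that hit `then` after having hit `first`.

    Assumption: `lines` are JSON events in ascending tstamp order.
    """
    seen_first: Set[str] = set()   # users who have already visited `first`
    converted: Set[str] = set()    # users who later visited `then`
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        user, path = event["userid"], event["path"]
        if path == first:
            seen_first.add(user)
        elif path == then and user in seen_first:
            converted.add(user)
    return len(converted)


# Trimmed copy of the sample events from the question (userid, path, tstamp only).
sample = [
    '{"userid":"xyz", "path":"/promo/A", "tstamp":"2013-04-01 00:01:01"}',
    '{"userid":"pdq", "path":"/page/1", "tstamp":"2013-04-01 00:02:11"}',
    '{"userid":"xyz", "path":"/promo/D", "tstamp":"2013-04-01 00:06:31"}',
    '{"userid":"abc", "path":"/page/23", "tstamp":"2013-04-01 00:08:00"}',
    '{"userid":"pdq", "path":"/page/2", "tstamp":"2013-04-01 00:08:55"}',
    '{"userid":"xyz", "path":"/sale/B", "tstamp":"2013-04-01 00:09:46"}',
    '{"userid":"abc", "path":"/promo/A", "tstamp":"2013-04-01 00:10:46"}',
]

result = count_a_then_b(sample)
print(result)
```

On the sample events only userid xyz visits /promo/A and later /sale/B, so the sketch counts one user. Memory grows with the number of distinct userids that reach /promo/A, which is the main cost of doing this outside the index.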