This model is not efficient for this type of query. You cannot do it in a
single query with this model, and the pre-processing you do now, plus
traversing all documents, is very costly.

Is it possible for you to index the data (even as a projection) into
Elasticsearch using a different model, so that you can use ES properly,
through queries or the aggregations framework?
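
To make "a different model" a bit more concrete, here is a minimal sketch in
plain Python of one possible projection. It is only an illustration under
stated assumptions, not a tested Elasticsearch setup: it makes no ES calls,
and the names build_user_projections, count_visited_after, first_seen and
last_seen are hypothetical. The idea is to keep one summary per userid holding
the earliest and latest visit time per path. A user visited B after A exactly
when their earliest A visit is before their latest B visit, so the question
becomes a per-user comparison of two values instead of an array traversal.
How such summaries would be mapped and queried in Elasticsearch is left open.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_user_projections(events):
    """Fold raw events into one summary per userid (hypothetical model).

    Each summary keeps the earliest and the latest visit time per path.
    """
    projections = defaultdict(lambda: {"first_seen": {}, "last_seen": {}})
    for event in events:
        ts = datetime.strptime(event["tstamp"], TS_FORMAT)
        doc = projections[event["userid"]]
        path = event["path"]
        if path not in doc["first_seen"] or ts < doc["first_seen"][path]:
            doc["first_seen"][path] = ts
        if path not in doc["last_seen"] or ts > doc["last_seen"][path]:
            doc["last_seen"][path] = ts
    return dict(projections)


def count_visited_after(projections, first_path, later_path):
    """Count users with some visit to later_path after a visit to first_path."""
    count = 0
    for doc in projections.values():
        earliest_first = doc["first_seen"].get(first_path)
        latest_later = doc["last_seen"].get(later_path)
        if (earliest_first is not None and latest_later is not None
                and earliest_first < latest_later):
            count += 1
    return count


# The sample events from the original message, trimmed to the fields used here.
events = [
    {"userid": "xyz", "path": "/promo/A", "tstamp": "2013-04-01 00:01:01"},
    {"userid": "pdq", "path": "/page/1", "tstamp": "2013-04-01 00:02:11"},
    {"userid": "xyz", "path": "/promo/D", "tstamp": "2013-04-01 00:06:31"},
    {"userid": "abc", "path": "/page/23", "tstamp": "2013-04-01 00:08:00"},
    {"userid": "pdq", "path": "/page/2", "tstamp": "2013-04-01 00:08:55"},
    {"userid": "xyz", "path": "/sale/B", "tstamp": "2013-04-01 00:09:46"},
    {"userid": "abc", "path": "/promo/A", "tstamp": "2013-04-01 00:10:46"},
]

projections = build_user_projections(events)
print(count_visited_after(projections, "/promo/A", "/sale/B"))  # prints 1
```

Only xyz reaches /sale/B after /promo/A in the sample, hence the count of 1.
The trade-off is that the summary has to be updated at index time whenever a
new event arrives for a user.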

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Author of RavenDB in Action <http://manning.com/synhershko/>


On Thu, Jun 5, 2014 at 12:04 AM, Zennet Wheatcroft <zwheatcr...@atypon.com>
wrote:

> Hi,
>
> I am looking for an efficient way to do inter-document queries in
> Elasticsearch. Specifically, I want to count the number of users that went
> through an exit point B after visiting point A.
>
> In general terms, say we have some event log data about user actions on a
> website:
> ....
> {"userid":"xyz", "machineid":"110530745", "path":"/promo/A", "country":
> "US", "tstamp":"2013-04-01 00:01:01"}
> {"userid":"pdq", "machineid":"110519774", "path":"/page/1", "country":"CN"
> , "tstamp":"2013-04-01 00:02:11"}
> {"userid":"xyz", "machineid":"110530745", "path":"/promo/D", "country":
> "US", "tstamp":"2013-04-01 00:06:31"}
> {"userid":"abc", "machineid":"110527022", "path":"/page/23", "country":
> "DE", "tstamp":"2013-04-01 00:08:00"}
> {"userid":"pdq", "machineid":"110519774", "path":"/page/2", "country":"CN"
> , "tstamp":"2013-04-01 00:08:55"}
> {"userid":"xyz", "machineid":"110530745", "path":"/sale/B", "country":"US"
> , "tstamp":"2013-04-01 00:09:46"}
> {"userid":"abc", "machineid":"110527022 ", "path":"/promo/A", "country":
> "DE", "tstamp":"2013-04-01 00:10:46"}
> ....
> And we have 500+M such entries.
>
> We want a count of the number of userids that visited path=/sale/B after
> visiting path=/promo/A.
>
> What I did is to preprocess the data, sorting by <userid, tstamp>, then
> compacting all events by the same userid into the same document. Then I
> wrote a script filter which traverses the path array per document, and
> returns true if it finds any occurrence of A followed by B. This, however, is
> inefficient. Most of our queries take 1 or 2 seconds on 100+M events. This
> script filter query takes over 300 seconds. Specifically, it processes
> about 400K events per second. By comparison, I wrote a naive
> program that does a linear pass of the un-compacted data, and it processes
> 11M events per second. By which I conclude that Elasticsearch does not do
> well on this type of query.
>
> I am hoping someone can indicate a more efficient way to do this query in
> ES. Or else confirm that ES cannot do inter-document queries well.
>
> Thanks,
> Zennet
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/28c93f2d-e870-4347-8677-e9da41b6be62%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/28c93f2d-e870-4347-8677-e9da41b6be62%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

