Together with Zennet we brainstormed a solution building on top of Itamar's 
proposal. 

In one string field we append the current path to all the previous ones; since 
we are talking about funnels, we only need to store this field on the last 
event/document generated, e.g. SessionEndedEvent.
Then we can use regex pattern matching to check whether the sequence of steps 
appears anywhere in the stored paths string. This solution appears to be 
extremely fast. 
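A minimal sketch of the idea in Python (the field value and funnel steps are illustrative, not our production schema):

```python
import re

# All paths a user visited in a session, appended in order into a single
# string field stored on the last event/document (e.g. a SessionEndedEvent).
stored_paths = "/promo/A /page/1 /promo/D /page/23 /sale/B"

# A funnel is an ordered list of steps; ".*" between steps permits any
# number of intermediate page views.
funnel = ["/promo/A", "/sale/B"]
pattern = ".*".join(re.escape(step) for step in funnel)

# True only if the steps occur in this order somewhere in the string.
print(bool(re.search(pattern, stored_paths)))   # True
```

The same pattern can be handed to an Elasticsearch regexp filter against the stored field, so the ordering check happens at query time without a script.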

On Wednesday, June 11, 2014 1:14:59 AM UTC+3, Zennet Wheatcroft wrote:
>
> I simplified the actual problem in order to avoid explaining the 
> domain-specific details. Allow me to add back more detail.
>
> We want to be able to search for multiple points of user action along a 
> conversion funnel, and to condition on multiple fields. Let's add another 
> field (response) to the above model:
> {.., "path":"/promo/A", "response": 200, ..}
> {.., "path":"/page/1", "response": 401, ..}
> {.., "path":"/promo/D","response": 200, ..}
> {.., "path":"/page/23", "response": 301, ..}
> {.., "path":"/page/2", "response": 418, ..}
> Let's say we define three points through the conversion funnel:
> A: Visited path=/page/1
> B: Got response=401 from some path
> C: Exited at path=/sale/C
>
> And we want to know how many users did steps A-B-C in that order. If we 
> add an array prev_response, as we did for prev_path, then we can use a 
> term filter to find documents with path=/sale/C, prev_path=/page/1, and 
> prev_response=401. But this will not distinguish between A->B->C and 
> B->A->C. Perhaps I could use the script filter for the "last mile" and, 
> from the term-filtered results, throw out B-A-C; it should run more quickly 
> because of the reduced document set.
>
> Is there another way to implement this query?
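One way to see what an ordered check has to do here: encode each event as a hypothetical "path|response" token (illustrative, not an existing field), concatenate them in timestamp order, and match the three steps in order with a regex. A minimal Python sketch:

```python
import re

# Illustrative encoding: one "path|response" token per event, concatenated
# in timestamp order on the session's last document.
trail_abc = "/page/1|200 /promo/D|401 /sale/C|200"   # A -> B -> C
trail_bac = "/promo/D|401 /page/1|200 /sale/C|200"   # B -> A -> C

# A: visited /page/1; B: got response 401 from some path; C: exited at /sale/C
pattern = r"/page/1\|.*\|401.*/sale/C\|"

print(bool(re.search(pattern, trail_abc)))  # True  (correct order)
print(bool(re.search(pattern, trail_bac)))  # False (wrong order)
```

Unlike a pair of term filters on path and prev_response, the pattern only matches when the 401 occurs after the /page/1 visit, so B->A->C is rejected.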
>
> Zennet
>
>
> On Wednesday, June 4, 2014 5:01:19 PM UTC-7, Itamar Syn-Hershko wrote:
>>
>> You need to be able to form buckets that can be reduced again, either 
>> using the aggregations framework or a query. One model that will allow you 
>> to do that is something like this:
>>
>> { "userid": "xyz", "path":"/sale/B", "previous_paths":[...], 
>> "tstamp":"...", ... }
>>
>> So whenever you add a new path, you denormalize and add previous paths 
>> that could be relevant. This might bloat your storage a bit and be slower 
>> on writes, but it is very optimized for reads since now you can do an 
>> aggregation that queries for the desired "path" and buckets on the user. To 
>> check the condition of the previous path you should be able to bucket again 
>> using a script, or maybe even with a query on a nested type.
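A rough Python sketch of that write-time denormalization (the in-memory session store here is an assumption for illustration; in practice the previous paths would come from wherever session state lives):

```python
# Each new event document carries the user's previous paths with it,
# trading storage and write cost for cheap read-time queries.
previous_by_user = {}   # stand-in for real session state

def build_event_doc(userid, path, tstamp):
    prev = previous_by_user.get(userid, [])
    doc = {"userid": userid, "path": path,
           "previous_paths": list(prev), "tstamp": tstamp}
    previous_by_user[userid] = prev + [path]
    return doc

build_event_doc("xyz", "/promo/A", "2013-04-01 00:01:01")
doc = build_event_doc("xyz", "/sale/B", "2013-04-01 00:09:46")
print(doc["previous_paths"])   # ['/promo/A']
```

With documents shaped like this, a query for path=/sale/B with previous_paths containing /promo/A answers the funnel question directly, and an aggregation can bucket the hits by userid.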
>>
>> This is just off the top of my head, but it should definitely work if you 
>> can get to that model.
>>
>> --
>>
>> Itamar Syn-Hershko
>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>> Freelance Developer & Consultant
>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>
>>
>> On Thu, Jun 5, 2014 at 2:36 AM, Zennet Wheatcroft <zwhea...@atypon.com> 
>> wrote:
>>
>>> Yes. I can re-index the data or transform it in any way to make this 
>>> query efficient. 
>>>
>>> What would you suggest?
>>>
>>>
>>>
>>> On Wednesday, June 4, 2014 2:14:09 PM UTC-7, Itamar Syn-Hershko wrote:
>>>
>>>> This model is not efficient for this type of querying. You cannot do 
>>>> this in one query using this model, and the pre-processing work you do now 
>>>> + traversing all documents is very costly.
>>>>
>>>> Is it possible for you to index the data (even as a projection) into 
>>>> Elasticsearch using a different model, so you can use ES properly using 
>>>> queries or the aggregations framework?
>>>>
>>>> --
>>>>
>>>> Itamar Syn-Hershko
>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>> Freelance Developer & Consultant
>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>
>>>>
>>>> On Thu, Jun 5, 2014 at 12:04 AM, Zennet Wheatcroft <zwhea...@atypon.com
>>>> > wrote:
>>>>
>>>>>  Hi,
>>>>>
>>>>> I am looking for an efficient way to do inter-document queries in 
>>>>> Elasticsearch. Specifically, I want to count the number of users that 
>>>>> went 
>>>>> through an exit point B after visiting point A.
>>>>>
>>>>> In general terms, say we have some event log data about users actions 
>>>>> on a website:
>>>>> ....
>>>>> {"userid":"xyz", "machineid":"110530745", "path":"/promo/A", "country":"US", "tstamp":"2013-04-01 00:01:01"}
>>>>> {"userid":"pdq", "machineid":"110519774", "path":"/page/1", "country":"CN", "tstamp":"2013-04-01 00:02:11"}
>>>>> {"userid":"xyz", "machineid":"110530745", "path":"/promo/D", "country":"US", "tstamp":"2013-04-01 00:06:31"}
>>>>> {"userid":"abc", "machineid":"110527022", "path":"/page/23", "country":"DE", "tstamp":"2013-04-01 00:08:00"}
>>>>> {"userid":"pdq", "machineid":"110519774", "path":"/page/2", "country":"CN", "tstamp":"2013-04-01 00:08:55"}
>>>>> {"userid":"xyz", "machineid":"110530745", "path":"/sale/B", "country":"US", "tstamp":"2013-04-01 00:09:46"}
>>>>> {"userid":"abc", "machineid":"110527022", "path":"/promo/A", "country":"DE", "tstamp":"2013-04-01 00:10:46"}
>>>>> ....
>>>>> And we have 500+M such entries.
>>>>>
>>>>> We want a count of the number of userids that visited path=/sale/B 
>>>>> after visiting path=/promo/A.
>>>>>
>>>>> What I did is to preprocess the data, sorting by <userid, tstamp>, 
>>>>> then compacting all events with the same userid into the same document. 
>>>>> Then I wrote a script filter that traverses the path array per document 
>>>>> and returns true if it finds any occurrence of A followed by B. This, 
>>>>> however, is inefficient. Most of our queries take 1 or 2 seconds on 
>>>>> 100+M events; this script filter query takes over 300 seconds. 
>>>>> Specifically, it processes events at about 400K events per second. By 
>>>>> comparison, I wrote a naive program that does a linear pass over the 
>>>>> un-compacted data, and it processes 11M events per second. From this I 
>>>>> conclude that Elasticsearch does not do well on this type of query.
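The naive linear pass amounts to something like the following Python sketch (toy data taken from the example above; not the actual program):

```python
# Single pass over events in timestamp order: count users who hit
# /sale/B at some point after having hit /promo/A.
events = [
    ("xyz", "/promo/A"), ("pdq", "/page/1"), ("xyz", "/promo/D"),
    ("abc", "/page/23"), ("pdq", "/page/2"), ("xyz", "/sale/B"),
    ("abc", "/promo/A"),
]

seen_a = set()       # users who have visited /promo/A so far
converted = set()    # users who visited /sale/B after /promo/A

for userid, path in events:
    if path == "/promo/A":
        seen_a.add(userid)
    elif path == "/sale/B" and userid in seen_a:
        converted.add(userid)

print(len(converted))   # 1 (only user "xyz")
```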
>>>>>
>>>>> I am hoping someone can suggest a more efficient way to do this query 
>>>>> in ES, or else confirm that ES cannot handle inter-document queries well. 
>>>>>
>>>>> Thanks,
>>>>> Zennet
>>>>>
>>>>>
>>>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/ad876869-e280-4b5d-b405-7aa8e88c6094%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
