Hi all,

We are trying to move some of our offline data analytics from the Hadoop/Hive
stack to Elasticsearch, but have run into some issues.

We have daily events. In Hive we use partitions (HDFS directories) to store
them. For instance, the HDFS directory layout of the event table looks like
this:

event/dt=20141112
event/dt=20141113 

User retention tracks whether a user who produced an event (activity) on one
day also produced an event on another day. The SQL looks like this:

SELECT count(*)
FROM event-log-20141112 AS l
JOIN event-log-20141113 AS r
ON l.user_id = r.user_id
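
To make the retention definition concrete, here is a minimal sketch in plain
Python, independent of Hive or Elasticsearch. The sample events and the
function names active_users and retained_users are made up for illustration.
It counts distinct users via a set intersection, which differs from a plain
join count whenever a user has several events on the same day.

```python
# Hypothetical sample data: (user_id, day) pairs standing in for daily event logs.
events = [
    ("u1", "20141112"),
    ("u2", "20141112"),
    ("u1", "20141113"),
    ("u3", "20141113"),
    ("u1", "20141113"),  # a second event by u1 on the same day
]


def active_users(events, day):
    """Return the set of user ids that produced at least one event on `day`."""
    return {user_id for user_id, dt in events if dt == day}


def retained_users(events, first_day, second_day):
    """Return the users active on both days (set intersection)."""
    return active_users(events, first_day) & active_users(events, second_day)


retained = retained_users(events, "20141112", "20141113")
print(len(retained))
```

Here only u1 is active on both days, so exactly one user is retained, even
though u1 has two events on the second day.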

According to the Elasticsearch documentation, we can build one index per
day, like log-20141112/event and log-20141113/event. But it seems that a join
across different indices cannot be as fast as one over data co-located
through routing. If we instead store all the events in one index, with each
type representing one day's events, there still seems to be no way to run
the user retention query.

Alternatively, we could group all the events by user id: maintain a parent
table that stores user information, including the user id, and have each
day's event table declare the user table as its parent. The layout would
look like this:

event/user
event/log-20141112
event/log-20141113

All of those tables can be routed by user_id, so they will be co-located,
and a join between them needs no data shuffling. However, it seems that
Elasticsearch currently can't run a query that joins multiple child tables;
it only supports parent-child joins, right?
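
The co-location idea can be sketched roughly as below. This is only an
illustration of hash-based routing, not Elasticsearch's actual
implementation: the shard count, the shard_for helper and the use of
zlib.crc32 as the hash are all assumptions made for the example.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Map a routing key to a shard number.

    zlib.crc32 is a stand-in hash chosen for this sketch; Elasticsearch
    uses its own hash function internally.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


# One user document and two daily event documents, all routed by user_id.
docs = [
    {"type": "user", "user_id": "u1"},
    {"type": "log-20141112", "user_id": "u1"},
    {"type": "log-20141113", "user_id": "u1"},
]

# Every document with the same routing key lands on the same shard.
shards = {shard_for(doc["user_id"]) for doc in docs}
print(len(shards) == 1)
```

Because the shard depends only on the routing key, the user document and all
of that user's daily events end up together, which is why a per-user join
would not need to move data between shards.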

Can anyone help me with this? Or is there another way to do this in
Elasticsearch?

Min

