Hi,

We have an analytics usecase where we are collecting user click logs. The
data can be considered as hierarchical with 3 type of logs -
User (attributes like userId, emailId)
- Session (attributes like sessionId, device, OS, browser, city etc.)
- - PageView (attributes like url, referrer, page-type etc.)

userId is present in every session and pageView log
sessionId is present in every pageView log.

*Objective*: To store data in a query-able format to run queries like -
"select city, unique(user), count(session), count(pageView), group by city"
To get number of (unique) users, sessions and of pageviews per city.
This is just an example, but you get the idea that a query may span across
the hierarchical nature of data, and should allow for select, aggregate,
groupby and where clause.

*RDD approach:*
* Read all the logs as single RDD<Log>
* Convert to pair RDD<SessionId, Log>
* GroupByKey for RDD<SessionId, Iterator<Log>>
* Shift the key and flat-map to pair RDD<City, <Tuple : (userId, sessionId,
pageViewId) >
* Reduce By Key to RDD<City, unique(userId), count(sessionId),
count(pageViewId) >

As I understand, with catalyst and tungsten, Dataframes are lot more
optimized than vanilla RDDs. So can anyone suggest on how can I support
such queries using DataFrames (purely or atleast more optimally)?

One way which I thought was to create different DataFrames for user,
session and page-view logs. However, then I would have to join and it will
cause network shuffle.

If I keep all the logs originally partitioned on user, read them as a
single df, split to 2 df (based on log type = session and log type =
pageview), and do a join, will it still cause network shuffle ?


Thanks,

Kapil

Reply via email to