Does it make sense to just use a UDF for each dimension? So for instance, if there are 2 dimensions:
1. geo/network
2. visitor

we write 2 UDFs that convert the query parameters into the respective format, which then gets stored in 2 separate files, one per dimension. I am thinking UDFs would give more control over how we process it than using maps.

On Fri, Jun 15, 2012 at 3:34 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> We just use the Java Map class, with the restriction that the key must be a
> String. There are some helper methods in trunk to work with maps, and you
> can use # to dereference, i.e. map#'key'
>
> 2012/6/15 Mohit Anchlia <mohitanch...@gmail.com>
>
> > On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates <ga...@hortonworks.com> wrote:
> >
> > > This seems reasonable, except it seems like it would make more sense to
> > > convert query parameters to maps. By definition a query parameter is
> > > key=value. And a map is easier to work with in general than a bag, since
> > > there's no need to flatten them.
> >
> > I've never used them. Is this Map format in Hadoop?
> >
> > > Alan.
> > >
> > > On Jun 11, 2012, at 10:55 AM, Mohit Anchlia wrote:
> > >
> > > > I am looking at how to parse URLs with query parameters to process
> > > > clickstream data. Are there any examples I can look at? The steps that
> > > > I envision are:
> > > >
> > > > 1) Read lines and convert the query parameters into bags, where each
> > > > bag is a group of fields for a particular dimension table. So if Geo is
> > > > one of the dimensions, group all the geo-related information from that
> > > > URL as a bag. In the end it would look like {{92122,CA},{Unix,FireFox}}.
> > > > In this example the first bag is the Geo dimension and the second is
> > > > the Browser dimension.
> > > > 2) Load these into an OLAP staging database
> > > > 3) Populate the star schema from the staging tables
> > > >
> > > > I am sure other people might already be doing this, so I thought I'd
> > > > check whether this makes sense.
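For what it's worth, the per-dimension conversion discussed above could be sketched in plain Java, roughly the logic a Pig UDF's exec() body would carry out. This is only an illustration: the query-parameter names (zip, state, os, browser) and the grouping of keys into a "geo" and a "browser" dimension are made-up assumptions, not anything from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: split a URL query string into a key -> value map,
// then project out the keys belonging to one dimension so each dimension
// can be written to its own file, as proposed above.
public class QueryParams {

    // Parse "k1=v1&k2=v2" into an ordered map. Keys without '=' get "".
    public static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq >= 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            } else {
                params.put(pair, "");
            }
        }
        return params;
    }

    // Keep only the keys that belong to one dimension (e.g. the geo fields).
    public static Map<String, String> projectDimension(Map<String, String> all,
                                                       String... keys) {
        Map<String, String> dim = new LinkedHashMap<>();
        for (String k : keys) {
            if (all.containsKey(k)) {
                dim.put(k, all.get(k));
            }
        }
        return dim;
    }

    public static void main(String[] args) {
        Map<String, String> all =
            parse("zip=92122&state=CA&os=Unix&browser=FireFox");
        System.out.println(projectDimension(all, "zip", "state"));   // geo dimension
        System.out.println(projectDimension(all, "os", "browser"));  // browser dimension
    }
}
```

Wrapping parse() in a Pig EvalFunc and emitting the projected map would give each dimension its own relation, which can then be stored separately; returning a map (rather than a bag) also keeps the map#'key' dereference syntax available downstream, as Jonathan notes.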