On 7/14/11 3:03 PM, Dmitriy Ryaboy wrote:
In the DW world, using a single table with null as an "all" marker is the standard thing to do

But I imagine that in the DW world, the cube results would get stored in such a way that you can efficiently retrieve the results of specific group-bys (partitions?). That would be similar to storing the results of different group-by operations in different output files.

On the other hand, it's possible that the results of most cube operations are small enough that you could do the rest of the processing in an Excel spreadsheet! (so partitioning does not matter)

In my UDF I actually allow an optional string to be passed to the constructor
to denote "all" if null is a valid value... I'll post the UDF shortly, it's a
prerequisite to LOCube.
If the results of all group-bys are stored together, I think some such feature to indicate whether a value is actually a null or a '*' (the 'all' marker symbol used in Arnab's presentation) will be essential.
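For concreteness, usage of such a marker might look like this (a rough sketch only; it assumes the optional-string constructor Dmitriy describes, and the '*' value and the CubeDims alias are placeholders):

-- hypothetical: pass a marker string so that a real null in the data
-- stays distinguishable from the 'all' level produced by the cube
define CubeDims CubeDimensions('*');
cubed = foreach rel generate flatten(CubeDims(a, b)) as (a, b);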


I suspect the case of splitting out the agg levels is actually more rare, and 
can easily be accomplished with a SPLIT operator.
The other nice thing about the UDF is how much code it saves, esp. for larger
numbers of dimensions.

The UDF code saving is important if the script is being written manually. But if Pig is doing automatic translation (and assuming multiple output files is what makes sense), translating into multiple group-by statements might be more efficient, as it can avoid the filtering that SPLIT would need to do. But I agree that implementing this feature using the UDF is going to be easier; any changes to make it more efficient can be done later.
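For example, splitting a combined cube relation into per-level outputs might look like this (a rough sketch; the marker column lvl and its values are hypothetical):

-- hypothetical: route each grouping combination to its own output;
-- note every record is tested against every condition, which is the
-- filtering cost mentioned above
split cube into by_ab if lvl == 'ab', by_a if lvl == 'a', by_b if lvl == 'b', by_all if lvl == 'all';
store by_ab into 'cube_out/ab';
store by_a into 'cube_out/a';
store by_b into 'cube_out/b';
store by_all into 'cube_out/all';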


Perhaps my sample script generated 4 jobs because I had 3 dimensions?


I doubt it is because of the number of dimensions; I think there might have been something else in the query that prevented the group-bys from being combined together. Do you still have the original script? Can you send it (and maybe the explain output)?

Thanks,
Thejas




On Jul 14, 2011, at 4:10 PM, Thejas Nair <the...@hortonworks.com> wrote:

+1 to what Gianmarco said about the place to do it.  See sample_clause in 
LogicalPlanGenerator.g.

I tried the expanded query (2 dimensions) with 0.8; it results in only 2 MR
jobs, with all the computation done in the 1st MR job. The 2nd MR job just
concatenates the outputs into one file. See http://pastebin.com/aarBELC2.
I got an exception in 0.9 for the same query; I have created a jira
(PIG-2164) to address that.

The CubeDimensions UDF would be a nice way to get around a combiner issue, but
the combiner issue (if any) should actually get fixed.

In the example, you are putting all records into the same file. That would lead
to a problem, because it will not be possible to distinguish the records from
(group by (a,b)) that have a null value of b from the records from
(group by (a,null)). If all outputs go into the same file, each record would
need a marker column to indicate which group-by it belongs to.
I think in most cases people would read the results of different group-by
combinations separately, so it makes sense to have different output files
(e.g., 8 files if there are 3 dimensions). I.e., a split on the marker column
might have to be introduced.
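To make the marker column concrete, the manual formulation quoted below could tag each grouping combination; a rough sketch (the lvl column and its values are hypothetical):

-- hypothetical: tag each grouping level so records in a combined
-- output stay distinguishable, and a later split has a column to test
ab     = foreach (group rel by (a, b))       generate 'ab'  as lvl, flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
a_only = foreach (group rel by (a, null))    generate 'a'   as lvl, flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
b_only = foreach (group rel by (null, b))    generate 'b'   as lvl, flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
total  = foreach (group rel by (null, null)) generate 'all' as lvl, flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
cube   = union ab, a_only, b_only, total;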



Thanks,
Thejas




On 7/13/11 6:05 PM, Dmitriy Ryaboy wrote:
Arnab gave a really interesting presentation at the post-Hadoop-Summit
Pig meeting about how cubing could work in Map-Reduce, and suggested a
straightforward path to integrating it into Pig. Arnab, do you have the
presentation posted somewhere?

In any case, I started mucking around a little with this, trying to
hack in the naive solution.

So far, one interesting result, followed by a question:

I manually cubed by writing a bunch of group-bys, like so (using Pig 0.8):

-- one group-by per grouping combination (4 combinations for 2 dimensions)
ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
a_only = foreach (group rel by (a, null)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
b_only = foreach (group rel by (null, b)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
total = foreach (group rel by (null, null)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
cube = union ab, a_only, b_only, total;
store cube ....

Except, for extra fun, I did this with 3 dimensions and therefore 8
groupings. This generated 4 MR jobs, the first of which moved all the
data across the wire despite the fact that COUNT_STAR is algebraic. On
my test dataset, the work took 18 minutes.

I then wrote a UDF that, given a tuple, creates all the cube dimensions
of the tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
(null, b), (null, null) }, and this works on any number of dimensions.
The naive cube then simply becomes this:

cubed = foreach rel generate flatten(CubeDimensions(a, b)) as (a, b);
cube = foreach (group cubed by (a, b)) generate flatten(group) as (a, b),
COUNT_STAR(cubed) as cnt;

On the same dataset, this generated only 1 MR job, and ran in 3
minutes because we were able to take advantage of the combiners!

Assuming algebraic aggregations, this is actually pretty good given
how little work it involves.

I looked at adding a new operator that would be (for now) syntactic
sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
insert the operators equivalent to the code above.
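To spell that out, the desugaring might look roughly like this (a sketch only; CUBE is the proposed surface syntax, not something Pig supports today):

-- proposed sugar:
cubed = CUBE rel BY (a, b, c);
-- would be rewritten into operators roughly equivalent to:
dims = foreach rel generate flatten(CubeDimensions(a, b, c)) as (a, b, c);
cubed = foreach (group dims by (a, b, c)) generate flatten(group) as (a, b, c), COUNT_STAR(dims) as cnt;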

I can muddle my way through the grammar. What's the appropriate place
to put the translation logic? Logical to physical compiler? Optimizer?
The LogicalPlanBuilder?

D

