Re: Issues with group as an alias

Chris Olston Mon, 23 Jun 2008 13:15:59 -0700

I vote for (2), because just about every new user gets tripped up by the implicit column names. Take for example:


A = load ...;
B = load ...;
C = cogroup A by url, B by url;
D = foreach C {
             A = order A by $0;
             generate flatten(A), flatten(B.$2);
          };

Users think that the "order A by $0" is referring to the *outer table* called A, not the A that is a column of C. There are two reasons for this confusion, I think:

1. Rather than writing C.A (absolute path), the user must write A (relative path) -- the prefix of the path is implicit.
2. The fact that C has a field called "A" is not evident from inspecting the script -- C's schema does not appear anywhere.


Forcing the user to write:

C = cogroup A by url, B by url as (url, A, B);

(or whatever names they want to give the fields of C) would fix #2.

(We may also want to think about #1, but that's a separate issue from the current line of discussion.)
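The confusion described above can be modelled with a toy name lookup. This is only a sketch in Python, not Pig's actual resolver: the function `resolve`, the field lists, and the names `a_rows` and `b_rows` are all hypothetical and exist only for illustration. It assumes that inside the foreach, fields of the tuple being iterated are consulted before top-level relations.

```python
def resolve(name, tuple_fields, relations):
    """Toy lookup: fields of the iterated tuple shadow top-level relations.

    This is an assumed model for illustration, not Pig's implementation.
    """
    if name in tuple_fields:
        return ("field", name)
    if name in relations:
        return ("relation", name)
    raise KeyError(name)


# Top-level relations defined by the script.
relations = {"A", "B", "C"}

# Schema of C with implicit names: the bags are named after the inputs.
implicit_c = ["group", "A", "B"]

# Schema of C if the user had to name the fields with AS (hypothetical names).
explicit_c = ["url", "a_rows", "b_rows"]

# With implicit names, "A" silently binds to the column of C.
shadowed = resolve("A", implicit_c, relations)

# With explicit names, no column of C is called "A", so nothing is shadowed.
unshadowed = resolve("A", explicit_c, relations)
```

In this model `shadowed` is a field while `unshadowed` is a relation, which is the surprise users run into. What Pig should do with a reference to an outer relation inside a foreach is outside the scope of the sketch.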


-Chris


On Jun 17, 2008, at 7:30 AM, Benjamin Reed wrote:

I agree with Pi. +1 for (1).

ben

On Tuesday 17 June 2008 03:41:13 pi song wrote:

If it's confusing because our model is different, people just have to learn. If it's confusing because it is misleading, it has to be fixed.


As far as we can explain "why" logically, I think it should be ok.
I vote (1) for this.

On Tue, Jun 17, 2008 at 8:03 AM, Chris Olston <[EMAIL PROTECTED]inc.com> wrote:

Oh -- sorry I misunderstood.

That's a valid question and now is the right time to revisit it. Does anybody see any natural naming convention *other than* naming them after the input tables (pig's current practice)? If so, let's discuss. If not, it seems the only two choices are: (1) leave it as-is, or (2) do not assign any name, and force user to use "AS" (this is what Jaql does I believe).

-Chris


On Jun 16, 2008, at 1:29 PM, Olga Natkovich wrote:

 Chris,

What I meant to ask was what do we do with the rest of the fields in the group tuples. Currently, we name those fields with the names of the corresponding tables. I was asking if we want to continue that. I know that people find it confusing to see fields named after relations.

Olga

 -----Original Message-----

From: Chris Olston [mailto:[EMAIL PROTECTED]
Sent: Monday, June 16, 2008 12:54 PM
To: [email protected]
Subject: Re: Issues with group as an alias

Olga,

The idea is that when there is just one field with one name,
we use that name for the group key. In all other cases we do
*not* supply an automatic name (the user can assign their own
name using "as").

I believe this solution: (1) is very simple and unambiguous,
and (2) makes common cases very natural (e.g, BAR = group FOO
by URL; foreach BAR generate URL, ...).

-Chris

On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:

What about naming the rest of the fields in the group? Do we want to continue naming them with the names of the corresponding tables? I think users find that confusing as well.

Olga

 -----Original Message-----

From: Alan Gates [mailto:[EMAIL PROTECTED]
Sent: Monday, June 16, 2008 11:32 AM
To: [email protected]
Subject: Re: Issues with group as an alias

I would like to propose a slight modification:

I think that we should continue to support 'group' as the alias name for some transition period (3 or maybe 6 months).

We can remove all references to group as an alias from the documentation and print a warning when users use it. But I don't think we should drop it immediately, as we'll break many scripts.
Other than that I'm fine with the proposal.

Alan.

Chris Olston wrote:
No.

The standing proposal for Option III is:

1. If you are (CO)Grouping on a *single* field AND in the case of co-group all field names are the same (e.g., cogroup A by url, B by url), then give the group key that name (e.g., "url").
2. Else, do *not* automatically assign any name. The user can refer to it as $0 and/or use "AS" to give it a name manually.

(To be clear, even in case #1, the user has the option to override the automatically-assigned name using "AS" if s/he chooses.)
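The two-case rule above can be sketched as a small function. This is a Python model of the proposal only, not Pig code: the name `group_key_name` and the way inputs are represented (one list of key field names per grouped relation, with `None` standing for a positional key such as $0) are assumptions made for illustration.

```python
def group_key_name(keys_per_input):
    """Return the automatic group-key name under the Option III proposal.

    keys_per_input holds one list of key field names per input relation;
    None in a list means the key was given positionally (e.g. $0).
    Returns None when no name should be assigned automatically.
    """
    names = set()
    for keys in keys_per_input:
        # Case 2: multi-field keys or unnamed keys get no automatic name.
        if len(keys) != 1 or keys[0] is None:
            return None
        names.add(keys[0])
    # Case 1: a single field whose name is the same in every input.
    if len(names) == 1:
        return names.pop()
    return None


single = group_key_name([["x"]])               # group A by x
same = group_key_name([["url"], ["url"]])      # cogroup A by url, B by url
differ = group_key_name([["x"], ["t"]])        # cogroup A by x, B by t
multi = group_key_name([["x", "y"]])           # grouping on two fields
positional = group_key_name([[None]])          # group B by $0
```

Only `single` and `same` receive a name ("x" and "url"); the other three come back as `None`, meaning the user would refer to the key as $0 or name it with "AS". An explicit "AS" would override the result in every case, which the sketch does not model.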

-Chris


On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:

I completely agree. It does start getting confusing. Especially if we try to deal with multi field keys.

A = load 'somefile1' USING PigStorage() AS (B, C, Z);
B = load 'somefile2' USING PigStorage() AS (A, C, Y);
C = load 'somefile3' USING PigStorage() AS (A, B);

G1 = COGROUP A by (B,C), B by (A, C);
G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);

What is the schema for G2?

ben

On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:

So what is the conclusion here ?

group key alias == the first variables group by field ?


What happens in a case like this then :

--
A = load 'somefile1' USING PigStorage() AS (B, C);
B = load 'somefile2' USING PigStorage() AS (A, C);
C = load 'somefile3' USING PigStorage() AS (A, B);

G1 = COGROUP A by B, B by A;
G2 = COGROUP A by C, C by A;
...
--

A slightly contrived example for sure, but imo grammar has to be as clearly specified as possible.

A reserved keyword as group alias implies we don't hit this problem (group or groupkey or grpkey)... and also the fact that we are backwardly compatible.

[I never liked inferred schema prefix section in the schemas doc (which is applied selectively) - makes it extremely tough to generate pig scripts]


Regards,
Mridul

Alan Gates wrote:

Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field (or set of fields) that are grouped on are given the alias 'group'.

This has a couple of issues:

1)  It's confusing.  'group' is now a keyword and an alias.
2)  We don't currently allow 'group' as an alias in an AS.  It is strange to have an alias that can only be assigned by the language and never by the user.

Possible solutions:

I) Status quo. We could fix it so that group is allowed to be assigned as an alias in AS.

Pros:  Backward compatibility
Cons: a) will make the parser more complicated
    b) see 1) above.


II) Don't give an implicit alias to the group key(s).  If users want an alias, they can assign it using AS.

Pros:  Simplicity
Cons:  We do assign aliases to grouped bags.  That is, if we have C = GROUP B by $0 the resulting schema of C is (group, B).  So if we don't assign an alias to the group key, we now have a schema ($0, B).  This seems strange.  And worse yet, if users want to alias the group key(s), they'll be forced to alias all the grouped bags as well.

III) Carry the alias (if any) that the field had before.  So if we had a script like:

A = load 'myfile' as (x, y, z);
B = group A by x;

Then the schema of B would be (x, A).  This is quite natural for grouping of single columns.  But it turns nasty when you group on multiple columns.  Do we then append the names together?  So if you have

B = group A by (x, y);

is the resulting schema (x_y, A)?  Ugh.

In this case there is also the question of what to do in the case of cogroups, where the key may be named differently in different relations.

A = load 'myfile' as (x, y, z);
B = load 'myotherfile' as (t, u, v);
C = cogroup A by x, B by t;

Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This could be resolved by either saying first one always wins, or allowing either.

Pros: Very natural for the users, their fields maintain names through the query.
Cons: Quickly gets burdensome in the case of multi-key groups.

IV) Assign a non-keyword alias to the group key, like grp or groupkey or grpkey (or some other suitable choice).

Pros: Least disruptive change. Users only have to go through their scripts and find places where they use the group alias and change it to grp (or whatever).

Cons:  Still leaves us with a situation where we are assigning a name to a field arbitrarily, leaving users confused as to how their fields got named that.

V) Remove GROUP as a keyword.  It is just short for COGROUP of one relation anyway.

Pros:  Smaller syntax in a language is always good.
Cons:  Will break a lot of scripts, and confuse a lot of users who only think in terms of GROUP and JOIN and never use COGROUP explicitly.

One could also conceive of combinations of these.  For example, we always assign a name like grpkey to the group key(s), and in the single key case we also carry forward the alias that the field already had, if any.

Thoughts?  Other possibilities?

Alan.


--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research

