III has open questions. Can we make a concrete proposal for III?

thanx
ben

On Friday 13 June 2008 22:24:01 Ted Dunning wrote:
> I think that I am convinced III is best.
>
> On Fri, Jun 13, 2008 at 7:26 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> > All,
> >
> > I too will vote for III, with the caveat that we don't give names to
> > multi-field grouping keys. We need to make sure we support AS to allow
> > the user to name their grouping keys if they want.
> >
> > So far, the vote totals are:
> > I: 1
> > II: 0
> > III: 3
> > IV: 0
> > V: 0
> >
> > I'd like to make a decision and move forward by mid next week. If you
> > haven't voted and you'd like to, please do so now. If you feel
> > passionately about one of the options that is losing, please make your
> > arguments now.
> >
> > Alan.
> >
> > Alan Gates wrote:
> >> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
> >> (or set of fields) that is grouped on is given the alias 'group'.
> >> This has a couple of issues:
> >>
> >> 1) It's confusing. 'group' is now a keyword and an alias.
> >> 2) We don't currently allow 'group' as an alias in an AS. It is
> >> strange to have an alias that can only be assigned by the language and
> >> never by the user.
> >>
> >> Possible solutions:
> >>
> >> I) Status quo. We could fix it so that group is allowed to be assigned
> >> as an alias in AS.
> >>
> >> Pros: Backward compatibility
> >> Cons: a) will make the parser more complicated
> >>       b) see 1) above.
> >>
> >>
> >> II) Don't give an implicit alias to the group key(s). If users want an
> >> alias, they can assign it using AS.
> >>
> >> Pros: Simplicity
> >> Cons: We do assign aliases to grouped bags. That is, if we have C =
> >> GROUP B by $0 the resulting schema of C is (group, B). So if we don't
> >> assign an alias to the group key, we now have a schema ($0, B). This
> >> seems strange. And worse yet, if users want to alias the group key(s),
> >> they'll be forced to alias all the grouped bags as well.
> >>
> >> III) Carry the alias (if any) that the field had before. So if we had a
> >> script like:
> >>
> >> A = load 'myfile' as (x, y, z);
> >> B = group A by x;
> >>
> >> Then the schema of B would be (x, A). This is quite natural for grouping
> >> of single columns. But it turns nasty when you group on multiple
> >> columns. Do we then append the names together? So if you have
> >>
> >> B = group A by x, y;
> >>
> >> is the resulting schema (x_y, A)? Ugh.
> >>
> >> In this case there is also the question of what to do in the case of
> >> cogroups, where the key may be named differently in different relations.
> >>
> >> A = load 'myfile' as (x, y, z);
> >> B = load 'myotherfile' as (t, u, v);
> >> C = cogroup A by x, B by t;
> >>
> >> Is the resulting schema (x, A, B) or (t, A, B) or are both valid? This
> >> could be resolved by either saying first one always wins, or allowing
> >> either.
> >>
> >> Pros: Very natural for the users, their fields maintain names through
> >> the query.
> >> Cons: Quickly gets burdensome in the case of multi-key groups.
> >>
> >> IV) Assign a non-keyword alias to the group key, like grp or groupkey or
> >> grpkey (or some other suitable choice).
> >> Pros: Least disruptive change. Users only have to go through their
> >> scripts and find places where they use the group alias and change it to
> >> grp (or whatever).
> >> Cons: Still leaves us with a situation where we are assigning a name to
> >> a field arbitrarily, leaving users confused as to how their fields got
> >> named that.
> >>
> >> V) Remove GROUP as a keyword. It is just short for COGROUP of one
> >> relation anyway.
> >>
> >> Pros: Smaller syntax in a language is always good.
> >> Cons: Will break a lot of scripts, and confuse a lot of users who only
> >> think in terms of GROUP and JOIN and never use COGROUP explicitly.
> >>
> >> One could also conceive of combinations of these. For example, we
> >> always assign a name like grpkey to the group key(s), and in the single
> >> key case we also carry forward the alias that the field already had, if
> >> any.
> >>
> >> Thoughts? Other possibilities?
> >>
> >> Alan.
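
To make the open questions around III easier to discuss, here is a minimal sketch in Python (not Pig Latin, and not taken from the Pig code base) of one possible reading of option III combined with Alan's caveat. Everything in it is an assumption made for illustration: the function name group_schema, the use of "$0" to stand for an unnamed key, and the choice of "first relation wins" for cogroups, which is only one of the two resolutions mentioned in the thread.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical model of option III; none of these names exist in Pig.
Relation = Tuple[str, Sequence[Optional[str]]]  # (relation name, key aliases)


def group_schema(inputs: Sequence[Relation],
                 as_alias: Optional[str] = None) -> List[str]:
    """Return the field names of a (CO)GROUP result under the assumed rules.

    inputs:   one (relation_name, key_aliases) pair per grouped relation;
              a key alias of None means the key field had no alias.
    as_alias: the name the user gave the key with AS, if any.
    """
    if not inputs:
        raise ValueError("need at least one relation")
    arity = len(inputs[0][1])
    if arity == 0 or any(len(keys) != arity for _, keys in inputs):
        raise ValueError("every relation must be grouped on the same number of keys")

    if as_alias is not None:
        # Assumed rule: an explicit AS always overrides any carried alias.
        key_name = as_alias
    elif arity == 1 and inputs[0][1][0] is not None:
        # Single key: carry the alias from the first relation
        # ("first one always wins" for cogroups).
        key_name = inputs[0][1][0]
    else:
        # Multi-field key, or no alias to carry: leave the key unnamed.
        key_name = "$0"

    return [key_name] + [name for name, _ in inputs]


print(group_schema([("A", ["x"])]))                # B = group A by x;
print(group_schema([("A", ["x", "y"])]))           # B = group A by x, y;
print(group_schema([("A", ["x"]), ("B", ["t"])]))  # C = cogroup A by x, B by t;
```

Under these assumptions the three scripts from the original mail would give (x, A), ($0, A) and (x, A, B). Allowing either x or t for the cogroup key would need the model to return a set of valid aliases instead of a single name, which is one of the questions a concrete proposal would have to settle.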
