> On Sept. 29, 2014, 12:56 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java, line 49
> > <https://reviews.apache.org/r/25564/diff/2/?file=707974#file707974line49>
> >
> >     Hmm, something just occurred to me. This does not currently provide
> >     the output schema. So this is one problem. But how do we determine the
> >     output schema? If the output value is decided dynamically, then it can
> >     vary. One way to address this is to require that all the other values of
> >     the tuple are of the same type. Then you just take the schema from the
> >     first value. In your example they are all chararray. But this does limit
> >     the uses of this UDF.
>
> Russell Jurney wrote:
>     In practice, this is not an issue. The UDF is used this way, and you can
>     cast it to what you want.
>
>     with_value_substitution = FOREACH with_group GENERATE
>         FLATTEN(ChooseFieldByValue(groupField, *)) AS groupValue:chararray,
>         *,
>         (int)$period AS periodSeconds:int;
>
>     However, I don't see why I can't detect the schema of the field selected
>     and return that?
>
> Matthew Hayes wrote:
>     The schema can't be dynamic like that. I'll have to think about this
>     some more. I don't like that we have to cast it like this. One way we can
>     make this better is to have the UDF pick the schema that is the best fit
>     for the types provided. For example, if all the fields are of the same
>     type, like chararray, then the resulting type is chararray. Otherwise make
>     the type bytearray and you can cast however you want. I'd like to hear
>     what other people think about this. How about emailing datafu dev?
>
> Russell Jurney wrote:
>     I will bring it up on the list, but I don't think returning a tuple is
>     weird at all. It is highly convenient, and 'just works.'

I'm not saying that returning a tuple is weird. What is weird to me is not
defining the schema of the tuple being returned by the UDF.


- Matthew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25564/#review54788
-----------------------------------------------------------


On Oct. 2, 2014, 4:19 p.m., Russell Jurney wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25564/
> -----------------------------------------------------------
>
> (Updated Oct. 2, 2014, 4:19 p.m.)
>
>
> Review request for DataFu, Jonathan Coveney, Jakob Homan, Matthew Hayes, and
> Sam Shah.
>
>
> Repository: datafu
>
>
> Description
> -------
>
> Example use:
>
> group_fields = LOAD '/e8/smalldata/group_fields.txt' AS
>     (groupField:chararray);
> with_group = CROSS group_fields, hour_rounded;
> with_group = FOREACH with_group GENERATE
>     group_fields::groupField AS groupField,
>     hour_rounded::sourceNameOrIp AS sourceNameOrIp,
>     hour_rounded::destinationNameOrIp AS destinationNameOrIp,
>     ...;
> with_value_substitution = FOREACH with_group GENERATE
>     ChooseFieldByValue(groupField, *) AS groupValue:tuple(value:chararray), *;
> with_value_substitution = FOREACH with_value_substitution GENERATE
>     FLATTEN(groupValue) AS groupValue:chararray,
>     groupField,
>     foo,
>     bar,
>     ...;
> all_success = FOREACH (GROUP with_value_substitution BY (groupField,
>     groupValue, day)) GENERATE
>     FLATTEN(group) AS (seriesType, groupValue, day),
>     (int)COUNT_STAR(with_value_substitution) AS connections:int;
>
>
> Diffs
> -----
>
>   datafu-pig/src/main/java/datafu/pig/util/SelectFieldByName.java
>     PRE-CREATION
>   datafu-pig/src/test/java/datafu/test/pig/util/SelectFieldByNameTest.java
>     PRE-CREATION
>
> Diff: https://reviews.apache.org/r/25564/diff/
>
>
> Testing
> -------
>
> This UDF was used to replace a very inefficient pig script where macros that
> did many individual GROUP BY's took many minutes to plan.
>
> Testing: unit tests and used on real data on a cluster.
>
>
> Thanks,
>
> Russell Jurney
>
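
For reference, the "best fit" rule proposed in the thread (use the common type
when every candidate field has the same type, otherwise fall back to bytearray)
can be sketched in plain Java. This is only an illustration, not the code under
review: the class name BestFitType and the method bestFitType are hypothetical,
and type names are modeled as strings so the sketch runs without Pig on the
classpath. A real UDF would presumably apply the same rule inside
outputSchema() using the constants in org.apache.pig.data.DataType.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the "best fit" output type rule.
final class BestFitType {
    // Stand-in for Pig's bytearray type name (assumption: types as strings).
    static final String BYTEARRAY = "bytearray";

    // Returns the shared type when all candidate fields agree,
    // otherwise bytearray so the caller can cast as needed.
    static String bestFitType(List<String> candidateTypes) {
        if (candidateTypes.isEmpty()) {
            return BYTEARRAY; // nothing to infer from
        }
        String first = candidateTypes.get(0);
        for (String type : candidateTypes) {
            if (!type.equals(first)) {
                return BYTEARRAY; // mixed types: fall back
            }
        }
        return first;
    }

    public static void main(String[] args) {
        // All candidate fields are chararray, as in the example script.
        System.out.println(bestFitType(Arrays.asList("chararray", "chararray")));
        // Mixed candidate types fall back to bytearray.
        System.out.println(bestFitType(Arrays.asList("chararray", "int")));
    }
}
```

With this rule the example script would not need an explicit cast, since all
of its candidate fields are chararray; scripts mixing types would still cast
the bytearray result themselves.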