[
https://issues.apache.org/jira/browse/PIG-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gianmarco De Francisci Morales updated PIG-1387:
------------------------------------------------
Attachment: PIG-1387.1.patch
Very first approach to this feature in PIG-1387.1.patch.
For now I am focusing on TOBAG; I assume the other TO* functions would be similar.
TOTUPLE might be a problem because it starts with LEFT_PAREN, which is
also used by other rules.
I am not sure this is the direction we want to go.
Basically, I am doing the translation in the parser.
When I encounter a LEFT_CURLY in a GENERATE statement, I process it as if it
were a function call: I generate a FUNC_EVAL virtual token and process the rest
of the expression as the arguments to the function.
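To make the idea concrete, here is a minimal sketch of the same rewrite done on raw text instead of parser tokens. It is not the code in the patch: the class name BagSugarSketch and the method desugar are hypothetical, and it assumes that braces only need to be skipped inside single-quoted string literals (escaped quotes are not handled).

```java
final class BagSugarSketch {

    /**
     * Hypothetical helper: rewrites every {a, b, ...} in a GENERATE
     * expression list into TOBAG(a, b, ...). Nested braces become
     * nested TOBAG calls.
     */
    static String desugar(String expr) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false; // inside a '...' literal (assumption: no escaped quotes)
        int depth = 0;           // number of currently open curly braces

        for (int i = 0; i < expr.length(); i++) {
            char c = expr.charAt(i);
            if (c == '\'') {
                inQuote = !inQuote;
                out.append(c);
            } else if (!inQuote && c == '{') {
                // Opening brace: emit the function name and its left parenthesis.
                depth++;
                out.append("TOBAG(");
            } else if (!inQuote && c == '}') {
                // Closing brace: close the argument list of the function call.
                if (depth == 0) {
                    throw new IllegalArgumentException("unbalanced '}' at index " + i);
                }
                depth--;
                out.append(')');
            } else {
                out.append(c);
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("unbalanced '{'");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(desugar("{$0,$1}")); // prints TOBAG($0,$1)
    }
}
```

A text-level rewrite like this is only an illustration; working on tokens in the parser avoids having to special-case string literals at all.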
I see some issues with this approach. For example, anyone who changes the
grammar of the function call must remember to change the grammar of the
GENERATE statement as well, because there is an implicit dependence between them.
Also, I feel it might be complicated to properly parse the arguments in some
extreme cases (but I have no example at hand).
The advantage is that the change is extremely easy to make and purely
syntactic, which reduces the chance of bugs.
I would like to have the opinion of the community before going on.
Some small examples:
{code}
grunt> cat a.txt
1 11
2 3
3 10
4 11
5 10
6 15
grunt> a = load 'a.txt' as (id,num); b = foreach a generate TOBAG($0);
grunt> dump b
({(1)})
({(2)})
({(3)})
({(4)})
({(5)})
({(6)})
grunt> a = load 'a.txt' as (id,num); b = foreach a generate {$0};
grunt> dump b
({(1)})
({(2)})
({(3)})
({(4)})
({(5)})
({(6)})
grunt> a = load 'a.txt' as (id,num); b = foreach a generate TOBAG($0,$1);
grunt> dump b
({(1),(11)})
({(2),(3)})
({(3),(10)})
({(4),(11)})
({(5),(10)})
({(6),(15)})
grunt> a = load 'a.txt' as (id,num); b = foreach a generate {$0,$1};
grunt> dump b
({(1),(11)})
({(2),(3)})
({(3),(10)})
({(4),(11)})
({(5),(10)})
({(6),(15)})
{code}
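For readers unfamiliar with TOBAG, the output above can be modelled without Pig as follows. This is only a sketch of the behaviour visible in the dumps (each argument wrapped in its own single-field tuple); the class ToBagSketch and its methods are hypothetical, and plain lists stand in for Pig's DataBag and Tuple.

```java
import java.util.ArrayList;
import java.util.List;

final class ToBagSketch {

    /** Wraps each field in its own one-element tuple and collects the tuples in a bag. */
    static List<List<Object>> toBag(Object... fields) {
        List<List<Object>> bag = new ArrayList<>();
        for (Object field : fields) {
            List<Object> tuple = new ArrayList<>();
            tuple.add(field);
            bag.add(tuple);
        }
        return bag;
    }

    /** Renders a bag in the {(a),(b)} notation used by dump. */
    static String render(List<List<Object>> bag) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < bag.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('(');
            List<Object> tuple = bag.get(i);
            for (int j = 0; j < tuple.size(); j++) {
                if (j > 0) {
                    sb.append(',');
                }
                sb.append(tuple.get(j));
            }
            sb.append(')');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(render(toBag("1", "11"))); // prints {(1),(11)}
    }
}
```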
And this is the logical plan generated:
{code}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
b: (Name: LOStore Schema: #191:bag{#192:tuple(#193:bytearray)})
|
|---b: (Name: LOForEach Schema: #191:bag{#192:tuple(#193:bytearray)})
| |
| (Name: LOGenerate[false] Schema: #191:bag{#192:tuple(#193:bytearray)})
| | |
| | (Name: UserFunc(org.apache.pig.builtin.TOBAG) Type: bag Uid: 191)
| | |
| | |---id:(Name: Project Type: bytearray Uid: 183 Input: 0 Column: (*))
| | |
| | |---num:(Name: Project Type: bytearray Uid: 184 Input: 1 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: id#183:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: num#184:bytearray)
|
|---a: (Name: LOLoad Schema: id#183:bytearray,num#184:bytearray)RequiredFields:null
{code}
> Syntactical Sugar for PIG-1385
> ------------------------------
>
> Key: PIG-1387
> URL: https://issues.apache.org/jira/browse/PIG-1387
> Project: Pig
> Issue Type: Wish
> Components: grunt
> Affects Versions: 0.6.0
> Reporter: hc busy
> Labels: gsoc2011
> Fix For: 0.10
>
> Attachments: PIG-1387.1.patch
>
>
> From this conversation: extend PIG-1385 so that, instead of calling a UDF, Pig
> uses built-in behavior when the (), {}, [] groupings are encountered.
> > > What about making them part of the language using symbols?
> > >
> > > instead of
> > >
> > > foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
> > >
> > > have language support
> > >
> > > foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
> > >
> > > or even:
> > >
> > > foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
> > >
> > >
> > > Is there a reason not to do the second or third, other than it being more
> > > complicated?
> > >
> > > Certainly I'd volunteer to put the top implementation into the util
> > > package and submit it for the builtins, but the latter syntactic sugar
> > > seems more natural.
> > >
> > >
> > >
> > > On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates <[email protected]> wrote:
> > >
> > >> The grouping package in piggybank is left over from back when Pig allowed
> > >> users to define grouping functions (0.1). Functions like these should go in
> > >> evaluation.util.
> > >>
> > >> However, I'd consider putting these in builtin (in main Pig) instead.
> > >> These are things everyone asks for and they seem like a reasonable addition
> > >> to the core engine. This will be more of a burden to write (as we'll hold
> > >> them to a higher standard) but of more use to people as well.
> > >>
> > >> Alan.
> > >>
> > >>
> > >> On Apr 19, 2010, at 12:53 PM, hc busy wrote:
> > >>
> > >>> Sometimes I wonder... I mean, somebody went to the trouble of making a
> > >>> path called
> > >>>
> > >>> org.apache.pig.piggybank.grouping
> > >>>
> > >>> (where it seems like this code belongs), but didn't check any Java code
> > >>> into that package.
> > >>>
> > >>>
> > >>> Any comment about where to put this kind of utility class?
> > >>>
> > >>>
> > >>>
> > >>> On Mon, Apr 19, 2010 at 12:07 PM, Andrey S <[email protected]> wrote:
> > >>>
> > >>> 2010/4/19 hc busy <[email protected]>
> > >>>>
> > >>>> That's just the way it is right now, you can't make bags or tuples
> > >>>>> directly... Maybe we should have some UDF's in piggybank for these:
> > >>>>>
> > >>>>> toBag()
> > >>>>> toTuple(); --which is kinda like exec(Tuple in){return in;}
> > >>>>> TupleToBag(); --some times you need it this way for some reason.
> > >>>>>
> > >>>>>
> > >>>> OK. I'm posting my current code here; maybe later I'll make a patch (if
> > >>>> such an implementation is acceptable, of course).
> > >>>>
> > >>>> import org.apache.pig.EvalFunc;
> > >>>> import org.apache.pig.data.BagFactory;
> > >>>> import org.apache.pig.data.DataBag;
> > >>>> import org.apache.pig.data.Tuple;
> > >>>> import org.apache.pig.data.TupleFactory;
> > >>>>
> > >>>> import java.io.IOException;
> > >>>>
> > >>>> /**
> > >>>>  * Converts a sequence of fields to a bag of tuples with a specified
> > >>>>  * number of fields per tuple.<br>
> > >>>>  * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
> > >>>>  * Output: for count=2, { (fld1, fld2), (fld3, fld4) ... }
> > >>>>  *
> > >>>>  * @author astepachev
> > >>>>  */
> > >>>> public class ToBag extends EvalFunc<DataBag> {
> > >>>>     public BagFactory bagFactory;
> > >>>>     public TupleFactory tupleFactory;
> > >>>>
> > >>>>     public ToBag() {
> > >>>>         bagFactory = BagFactory.getInstance();
> > >>>>         tupleFactory = TupleFactory.getInstance();
> > >>>>     }
> > >>>>
> > >>>>     @Override
> > >>>>     public DataBag exec(Tuple input) throws IOException {
> > >>>>         // Nothing to convert without at least the count field.
> > >>>>         if (input == null || input.size() == 0)
> > >>>>             return null;
> > >>>>         final DataBag bag = bagFactory.newDefaultBag();
> > >>>>         // First field: number of fields per output tuple.
> > >>>>         final Integer counter = (Integer) input.get(0);
> > >>>>         // A non-positive count would divide by zero below.
> > >>>>         if (counter == null || counter <= 0)
> > >>>>             return null;
> > >>>>         Tuple tuple = null;
> > >>>>         for (int i = 0; i < input.size() - 1; i++) {
> > >>>>             // Start a new tuple every `counter` fields.
> > >>>>             if (i % counter == 0) {
> > >>>>                 tuple = tupleFactory.newTuple();
> > >>>>                 bag.add(tuple);
> > >>>>             }
> > >>>>             tuple.append(input.get(i + 1));
> > >>>>         }
> > >>>>         return bag;
> > >>>>     }
> > >>>> }
> > >>>>
> > >>>> import org.apache.pig.ExecType;
> > >>>> import org.apache.pig.PigServer;
> > >>>> import org.junit.Before;
> > >>>> import org.junit.Test;
> > >>>>
> > >>>> import java.io.IOException;
> > >>>> import java.net.URISyntaxException;
> > >>>> import java.net.URL;
> > >>>>
> > >>>> import static org.junit.Assert.assertTrue;
> > >>>>
> > >>>> /**
> > >>>> * @author astepachev
> > >>>> */
> > >>>> public class ToBagTest {
> > >>>>     PigServer pigServer;
> > >>>>     URL inputTxt;
> > >>>>
> > >>>>     @Before
> > >>>>     public void init() throws IOException, URISyntaxException {
> > >>>>         pigServer = new PigServer(ExecType.LOCAL);
> > >>>>         inputTxt = this.getClass().getResource("bagTest.txt").toURI().toURL();
> > >>>>     }
> > >>>>
> > >>>>     @Test
> > >>>>     public void testSimple() throws IOException {
> > >>>>         pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() +
> > >>>>                 "' using PigStorage(',') " +
> > >>>>                 "as (id:int, a:chararray, b:chararray, c:chararray, d:chararray);");
> > >>>>         pigServer.registerQuery("last = foreach a generate flatten(" +
> > >>>>                 ToBag.class.getName() + "(2, id, a, id, b, id, c));");
> > >>>>
> > >>>>         pigServer.deleteFile("target/pigtest/func1.txt");
> > >>>>         pigServer.store("last", "target/pigtest/func1.txt");
> > >>>>         assertTrue(pigServer.fileSize("target/pigtest/func1.txt") > 0);
> > >>>>     }
> > >>>> }
> > >>>>
> > >>>>
> > >>
> > >
> >
> This is a candidate project for Google Summer of Code 2011. More information
> about the program can be found at http://wiki.apache.org/pig/GSoc2011