[jira] [Updated] (PIG-3527) Allow PigProcessor to handle multiple inputs
[ https://issues.apache.org/jira/browse/PIG-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Wagner updated PIG-3527: - Attachment: PIG-3527.1.patch Here's an initial patch. There are some things that I need to clean up, and I've marked these with TODOs. I've posted a review at https://reviews.apache.org/r/15194/. One interesting thing to note is that after attaching inputs directly to the operator pipeline, I observed a ~40% speedup. I believe this is because there are fewer calls returning STATUS_EOP, but I haven't tested this. > Allow PigProcessor to handle multiple inputs > > > Key: PIG-3527 > URL: https://issues.apache.org/jira/browse/PIG-3527 > Project: Pig > Issue Type: Sub-task > Components: tez >Reporter: Mark Wagner >Assignee: Mark Wagner > Fix For: tez-branch > > Attachments: PIG-3527.1.patch > > > The PigProcessor needs to be able to handle multiple distinct inputs. These > can come in a variety of flavors including multiple "file" inputs (Merge > join), multiple shuffle inputs (Hash Join / Co-group), and a mix (Replicated > Join). -- This message was sent by Atlassian JIRA (v6.1#6144)
Re: Review Request 15194: Support multiple inputs for PigProcessor
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15194/
---

(Updated Nov. 2, 2013, 1:17 a.m.)

Review request for pig, Cheolsoo Park, Daniel Dai, and Rohini Palaniswamy.

Bugs: PIG-3527
    https://issues.apache.org/jira/browse/PIG-3527

Repository: pig-git

Description
---

Adds support for multiple LogicalInputs to the PigProcessor. This is done by adding a new TezLoad interface which PhysicalOperators may implement. On the backend, any operators implementing this interface will have the LogicalInput attached to them. 2 implementations are included:
* POSimpleTezLoad which consumes a single MRInput
* POShuffleTezLoad which consumes one or more ShuffledMergedInputs.

The POShuffleTezLoad does a k-way merge of the shuffle inputs to package for the operator pipeline. This required a change to the comparators used so that the sort order remained consistent.

There is also a fix to POForEach where it was using the incorrect status code for signaling (although it produced the same end result in the MR pipeline).
Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigDecimalRawComparator.java ddea99e
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigIntegerRawComparator.java 5ea3fc7
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBooleanRawComparator.java dfd4ebf
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBytesRawComparator.java 09397e5
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDateTimeRawComparator.java a87161f
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDoubleRawComparator.java cbf457f
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigFloatRawComparator.java 1d86e3f
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigIntRawComparator.java bb6c9df
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigLongRawComparator.java b3ded76
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSecondaryKeyComparator.java 5ad334b
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTextRawComparator.java 022f37b
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleDefaultRawComparator.java 866c39d
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleSortComparator.java 9724b9f
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/POSimpleTezLoad.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/TezLoad.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java eb9f62a
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 86314d9
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackageLite.java c200715
  src/org/apache/pig/backend/hadoop/executionengine/tez/FileInputHandler.java d29e330
  src/org/apache/pig/backend/hadoop/executionengine/tez/InputHandler.java d2298ca
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java ebb3145
  src/org/apache/pig/backend/hadoop/executionengine/tez/ShuffledInputHandler.java d7b42b8
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0
  src/org/apache/pig/data/BinInterSedes.java b3ec51e
  src/org/apache/pig/data/DefaultTuple.java 2e7ca5f
  test/e2e/pig/tests/tez.conf 24af8d3

Diff: https://reviews.apache.org/r/15194/diff/

Testing
---

Manual testing was done, and an e2e test has been added. Because of the comparator change, some of the tests fail because of bag ordering.

Thanks,
Mark Wagner
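The k-way merge of sorted shuffle inputs described in the review request above can be illustrated with a minimal sketch. This is not the code from the patch: it is written in Python rather than Java, and the names `tag` and `merge_and_package` are hypothetical. It assumes each input is a stream of (key, value) pairs already sorted by key, and it shows why all inputs must share one sort order, which is presumably what motivated the comparator change.

```python
import heapq
from itertools import groupby


def tag(stream, index):
    """Attach the index of the originating input to every (key, value) record."""
    for key, value in stream:
        yield key, index, value


def merge_and_package(inputs):
    """Yield (key, bags), where bags[i] holds the values for key from input i."""
    tagged = [tag(stream, i) for i, stream in enumerate(inputs)]
    # heapq.merge is a k-way merge. It is only correct when every input is
    # already sorted under the same key ordering.
    merged = heapq.merge(*tagged, key=lambda record: record[0])
    for key, records in groupby(merged, key=lambda record: record[0]):
        bags = [[] for _ in inputs]
        for _, index, value in records:
            bags[index].append(value)
        yield key, bags


# Two hypothetical shuffle inputs, each sorted by key.
left = [(1, "a"), (2, "b"), (2, "c")]
right = [(2, "x"), (3, "y")]
packaged = list(merge_and_package([left, right]))
```

Each emitted item pairs a key with one bag per input, which is roughly the shape a package step would hand to the rest of an operator pipeline for a join or co-group.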
Review Request 15194: Support multiple inputs for PigProcessor
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15194/
---

Review request for pig, Cheolsoo Park and Daniel Dai.

Bugs: PIG-3527
    https://issues.apache.org/jira/browse/PIG-3527

Repository: pig-git

Description
---

Adds support for multiple LogicalInputs to the PigProcessor. This is done by adding a new TezLoad interface which PhysicalOperators may implement. On the backend, any operators implementing this interface will have the LogicalInput attached to them. 2 implementations are included:
* POSimpleTezLoad which consumes a single MRInput
* POShuffleTezLoad which consumes one or more ShuffledMergedInputs.

The POShuffleTezLoad does a k-way merge of the shuffle inputs to package for the operator pipeline. This required a change to the comparators used so that the sort order remained consistent.

There is also a fix to POForEach where it was using the incorrect status code for signaling (although it produced the same end result in the MR pipeline).
Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigDecimalRawComparator.java ddea99e
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigIntegerRawComparator.java 5ea3fc7
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBooleanRawComparator.java dfd4ebf
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBytesRawComparator.java 09397e5
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDateTimeRawComparator.java a87161f
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDoubleRawComparator.java cbf457f
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigFloatRawComparator.java 1d86e3f
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigIntRawComparator.java bb6c9df
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigLongRawComparator.java b3ded76
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSecondaryKeyComparator.java 5ad334b
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTextRawComparator.java 022f37b
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleDefaultRawComparator.java 866c39d
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleSortComparator.java 9724b9f
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/POSimpleTezLoad.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/TezLoad.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java eb9f62a
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 86314d9
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackageLite.java c200715
  src/org/apache/pig/backend/hadoop/executionengine/tez/FileInputHandler.java d29e330
  src/org/apache/pig/backend/hadoop/executionengine/tez/InputHandler.java d2298ca
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java PRE-CREATION
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java ebb3145
  src/org/apache/pig/backend/hadoop/executionengine/tez/ShuffledInputHandler.java d7b42b8
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0
  src/org/apache/pig/data/BinInterSedes.java b3ec51e
  src/org/apache/pig/data/DefaultTuple.java 2e7ca5f
  test/e2e/pig/tests/tez.conf 24af8d3

Diff: https://reviews.apache.org/r/15194/diff/

Testing
---

Manual testing was done, and an e2e test has been added. Because of the comparator change, some of the tests fail because of bag ordering.

Thanks,
Mark Wagner
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (11 issues)

Subscriber: pigdaily

Key         Summary
PIG-3556    Fix tez branch compilation with Hadoop 1.0
            https://issues.apache.org/jira/browse/PIG-3556
PIG-3553    HadoopJobHistoryLoader fails to load job history on hadoop v 1.2
            https://issues.apache.org/jira/browse/PIG-3553
PIG-3507    It fails to run pig in local mode on a Kerberos enabled Hadoop cluster
            https://issues.apache.org/jira/browse/PIG-3507
PIG-3505    Make AvroStorage sync interval take default from io.file.buffer.size
            https://issues.apache.org/jira/browse/PIG-3505
PIG-3478    Make StreamingUDF work for Hadoop 2
            https://issues.apache.org/jira/browse/PIG-3478
PIG-3453    Implement a Storm backend to Pig
            https://issues.apache.org/jira/browse/PIG-3453
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441
PIG-3388    No support for Regex for row filter in org.apache.pig.backend.hadoop.hbase.HBaseStorage
            https://issues.apache.org/jira/browse/PIG-3388
PIG-3347    Store invocation in local mode brings side effect
            https://issues.apache.org/jira/browse/PIG-3347
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-2629    Wrong Usage of Scalar which is null causes high namenode operation
            https://issues.apache.org/jira/browse/PIG-2629

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
[jira] [Resolved] (PIG-3522) Remove shock from pig
[ https://issues.apache.org/jira/browse/PIG-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-3522. - Resolution: Fixed Hadoop Flags: Reviewed Patch committed to trunk. > Remove shock from pig > - > > Key: PIG-3522 > URL: https://issues.apache.org/jira/browse/PIG-3522 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3522-1.patch > > > It is only used by very old Hadoop versions, which use HOD as the resource manager. > Current Pig code does not use it. This includes the entire lib-src/shock > directory and jsch.jar. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3522) Remove shock from pig
[ https://issues.apache.org/jira/browse/PIG-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811774#comment-13811774 ] Thejas M Nair commented on PIG-3522: +1 > Remove shock from pig > - > > Key: PIG-3522 > URL: https://issues.apache.org/jira/browse/PIG-3522 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3522-1.patch > > > It is only used by very old Hadoop versions, which use HOD as the resource manager. > Current Pig code does not use it. This includes the entire lib-src/shock > directory and jsch.jar. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811752#comment-13811752 ] Daniel Dai commented on PIG-3558: - Also, there is a binary file that cannot be included in the patch. Copy http://svn.apache.org/viewvc/hive/trunk/ql/src/test/resources/orc-file-11-format.orc?revision=1519868&view=co to test/org/apache/pig/builtin/orc. > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811751#comment-13811751 ] Jacob Perkins commented on PIG-3453: [~dvryaboy] You're right, I'd honestly prefer to keep working on the git branch since I'm more comfortable/familiar with the workflow. I've already been merging changes from apache trunk as I go. It's no big deal to make patches. I went with Trident originally because it's a very simple abstraction that's fairly straightforward to map to pig constructs. I'm not opposed to going directly to storm if that makes sense from a performance perspective, but I imagine it'd be quite a bit more complicated and involve more code. Worth looking into further, I suppose. And no, I have not looked at throughput numbers yet. Any suggestions for the best way to do that, e.g., comparing a trident topology to a lean storm topology? > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811744#comment-13811744 ] Cheolsoo Park commented on PIG-3453: [~dvryaboy], I have no problem with that. We should even consider migrating Pig to git. But if Jacob wants to merge it into trunk at some point, and more contributors want to collaborate, having an official branch in Apache is better than keeping it in his personal repo. Do you have any problem with creating a branch for the Storm backend? > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811736#comment-13811736 ] Daniel Dai commented on PIG-3558: - The patch depends on HIVE-5728, which provides the InputFormat/OutputFormat Pig needs. > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (PIG-3558) ORC support for Pig
[ https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3558: Attachment: PIG-3558-1.patch > ORC support for Pig > --- > > Key: PIG-3558 > URL: https://issues.apache.org/jira/browse/PIG-3558 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.13.0 > > Attachments: PIG-3558-1.patch > > > Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (PIG-3558) ORC support for Pig
Daniel Dai created PIG-3558: --- Summary: ORC support for Pig Key: PIG-3558 URL: https://issues.apache.org/jira/browse/PIG-3558 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.13.0 Adding LoadFunc and StoreFunc for ORC. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811726#comment-13811726 ] Dmitriy V. Ryaboy commented on PIG-3453: I don't see why Jacob can't keep working in a github branch... easier to look at what's changing, and he can keep merging the (read-only) git mirror from apache to keep up with changes. Jacob I see you are using Trident. Have you looked at your throughput numbers, vs going directly to storm? > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811622#comment-13811622 ] Cheolsoo Park commented on PIG-3453: Yes, that's correct. I can create a branch for you. Let me do it perhaps tomorrow. If anyone has objections, please chime in. > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (PIG-3557) Implement optimizations for LIMIT
[ https://issues.apache.org/jira/browse/PIG-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Bain updated PIG-3557:
---
    Issue Type: Sub-task (was: Bug)
    Parent: PIG-3446

> Implement optimizations for LIMIT
> ---
>
> Key: PIG-3557
> URL: https://issues.apache.org/jira/browse/PIG-3557
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Affects Versions: tez-branch
> Reporter: Alex Bain
> Assignee: Alex Bain
>
> Implement optimizations for LIMIT when other parts of Pig-on-Tez are more mature. Some of the optimizations mentioned by Daniel include:
> 1. If the previous stage uses 1 reducer, there is no need to add one more vertex.
> 2. If the limitplan is null (i.e., not the "limited order by" case), we might not need a shuffle edge; a pass-through edge should be enough if possible.
> 3. Similar to PIG-1270, we can push limit to InputHandler.
> 4. We also need to think through the "limited order by" case once "order by" is implemented.

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (PIG-3557) Implement optimizations for LIMIT
Alex Bain created PIG-3557:
---

Summary: Implement optimizations for LIMIT
Key: PIG-3557
URL: https://issues.apache.org/jira/browse/PIG-3557
Project: Pig
Issue Type: Bug
Components: tez
Affects Versions: tez-branch
Reporter: Alex Bain
Assignee: Alex Bain

Implement optimizations for LIMIT when other parts of Pig-on-Tez are more mature. Some of the optimizations mentioned by Daniel include:
1. If the previous stage uses 1 reducer, there is no need to add one more vertex.
2. If the limitplan is null (i.e., not the "limited order by" case), we might not need a shuffle edge; a pass-through edge should be enough if possible.
3. Similar to PIG-1270, we can push limit to InputHandler.
4. We also need to think through the "limited order by" case once "order by" is implemented.

--
This message was sent by Atlassian JIRA (v6.1#6144)
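The third optimization in the issue above, pushing the limit down to the input side, can be illustrated with a small sketch. It is only an illustration of the general idea in Python, not Pig's InputHandler API: `read_records` and `limit` are hypothetical names, and the point is that a limit applied at the reader stops pulling records as soon as enough have been produced.

```python
from itertools import islice


def read_records(source, stats):
    """Hypothetical record reader that counts how many records it pulls."""
    for record in source:
        stats["read"] += 1
        yield record


def limit(records, n):
    """Pass through at most n records, then stop pulling from upstream."""
    return islice(records, n)


# With the limit applied directly to the reader, only 3 of the 1000
# available records are ever read.
stats = {"read": 0}
first_three = list(limit(read_records(range(1000), stats), 3))
```

Without the pushdown, a downstream operator would enforce the limit only after the reader had already produced every record.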
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811548#comment-13811548 ] Jacob Perkins commented on PIG-3453: [~cheolsoo] Yes. That makes a lot of sense. So, if I understand correctly, you'll make a feature branch. Then I can just work off that feature branch. I'll create a sub task called something like 'word count' or proof-of-concept or some such, submit this first patch (against the feature branch, not trunk) for it, and we'll go from there? > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811537#comment-13811537 ] Cheolsoo Park commented on PIG-3453: Usually, we create a feature branch for a big feature and merge it to trunk after it is fully developed and tested. Although it's totally possible to develop it in your personal repo and post a giant patch in one shot, the bigger the patch is, the longer it takes to be reviewed. So I recommend creating subtasks and incrementally committing small patches. To do that, you will need an svn branch because you can't resolve jiras w/o committing patches. The Pig git repo is a read-only mirror of the svn repo. So unfortunately, patches need to be posted in jiras to get committed. Since you don't have commit access to the svn repo, it will be helpful to have at least one committer in the loop. Does this make sense? > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811497#comment-13811497 ] Jacob Perkins commented on PIG-3453: [~cheolsoo] I've got it on a separate branch in my github fork of apache pig (http://github.com/thedatachef/pig/tree/storm-integration). I just wasn't sure what the best way to say "hey, here's a storm execution engine" was, other than a patch :) Can you direct me to the dev mailing list? Also, and maybe this is a question for the dev mailing list, this is the first Apache project I've contributed to. I'm not sure how closely it's integrated with git/github other than as a convenient mirror. If you create a branch called storm under apache/pig, what's the best way for me to push changes to it? A pull request, or is there another preferred method? > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811487#comment-13811487 ] Cheolsoo Park commented on PIG-3453: [~thedatachef], this is exciting! I am wondering whether we should create a branch for the storm backend, like the tez branch we already have. Since the backend interfaces including ExecutionEngine, Launcher, and PigStats are evolving now, it will probably be easier for you to maintain your work in a branch. Feel free to send an email on the dev mailing list. I am happy to help you create a branch and commit your work. > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacob Perkins updated PIG-3453: --- Attachment: storm-integration.patch > Implement a Storm backend to Pig > > > Key: PIG-3453 > URL: https://issues.apache.org/jira/browse/PIG-3453 > Project: Pig > Issue Type: New Feature >Affects Versions: 0.13.0 >Reporter: Pradeep Gollakota >Assignee: Jacob Perkins > Labels: storm > Fix For: 0.13.0 > > Attachments: storm-integration.patch > > > There is a lot of interest around implementing a Storm backend to Pig for > streaming processing. The proposal and initial discussions can be found at > https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Perkins updated PIG-3453:
---
    Fix Version/s: 0.13.0
    Assignee: Jacob Perkins
    Affects Version/s: 0.13.0
    Status: Patch Available (was: Open)

Whew. Here's a patch that demonstrates running, e2e, a word count. It's quite hefty, so here are some high-level points:

* Implemented two new operators, 'tap' and 'sink', with corresponding logical operators LOTap and LOSink and interfaces SinkFunc and TapFunc. I did the best I could to keep them general enough to work beyond the scope of simply storm. It may make sense to split just this part out into its own jira and patch.
* Implemented LocalFileTap and LocalFileSink (which really shouldn't be used for more than simple testing) to demonstrate the TapFunc and SinkFunc.
* LogToTopologyTranslationVisitor - Much like LogToPhyTranslationVisitor for the physical plan, it walks the logical plan and creates a TridentTopology.
* LOForEach - I more or less copied exactly what's being done in the LogToPhyTranslationVisitor. Since POForEach is serializable, rather than parsing the logical expression plans myself, I simply create the POForEach and wrap it with a storm trident BaseFunction. It seemed a reasonably pragmatic approach for now.
* LOCogroup - I took a similar approach to LOForEach except that, since POPackage is tied so closely to Hadoop Writables, I implemented StreamPackageFunction, which does something similar to what POPackage is doing.
* TridentExecutionEngine - This is probably the hackiest part. I'm not sure what the best way to create a stats object for this is. The topology runs continuously; it doesn't 'succeed'. I don't want to fake POStores.
* Building and classpath. I did the best I could to avoid a dependency nightmare scenario. After applying the patch to trunk it should build fine. To run, you'll want zookeeper-3.3.3.jar (no other version works) and storm-core-0.9.0-rc2.jar in your class path.
* test script:

{code:title=wordcount.pig|borderStyle=solid}
set storm.executionengine.stream.batch.size 1
data = tap '$sometext' using org.apache.pig.backend.storm.tap.LocalFileTap('line') as (line:chararray);
tokens = foreach data generate flatten(TOKENIZE(line)) as (token:chararray);
counts = foreach (group tokens by token) generate group as token, COUNT(tokens) as num;
sink counts into '$output' using org.apache.pig.backend.storm.sink.LocalFileSink('token');
{code}

I'm sure there are more details than this. Again, it's a large patch and, rather than continuing to polish it, I think it's time for feedback.

> Implement a Storm backend to Pig
> ---
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.13.0
> Reporter: Pradeep Gollakota
> Assignee: Jacob Perkins
> Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
> There is a lot of interest around implementing a Storm backend to Pig for streaming processing. The proposal and initial discussions can be found at https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal

--
This message was sent by Atlassian JIRA (v6.1#6144)
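As a rough, non-streaming illustration of what the wordcount.pig test script in the message above computes, the same dataflow (read lines, tokenize, group by token, count) might look like this in Python. This is a sketch under stated assumptions, not a model of the Trident topology: `tokenize` and `word_count` are hypothetical names, splitting on whitespace only is a simplification of Pig's TOKENIZE, and batching and continuous execution are ignored entirely.

```python
from collections import Counter


def tokenize(line):
    """Split a line into tokens. Simplification: whitespace only."""
    return line.split()


def word_count(lines):
    """Flatten the tokens of every line, then count occurrences per token."""
    tokens = (token for line in lines for token in tokenize(line))
    return Counter(tokens)


# Hypothetical input standing in for the lines read by the tap.
counts = word_count(["a b a", "b c"])
```

In the Pig script, the generator expression corresponds to `flatten(TOKENIZE(line))` and the `Counter` to the `group ... COUNT` step.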