[jira] [Updated] (PIG-3527) Allow PigProcessor to handle multiple inputs

2013-11-01 Thread Mark Wagner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Wagner updated PIG-3527:
-

Attachment: PIG-3527.1.patch

Here's an initial patch. There are some things that I need to clean up, and 
I've marked these with TODOs. I've posted a review at 
https://reviews.apache.org/r/15194/. One interesting thing to note is that 
after attaching inputs directly to the operator pipeline, I observed a ~40% 
speedup. I believe this is because there are fewer calls returning 
STATUS_EOP, but I haven't tested this.

> Allow PigProcessor to handle multiple inputs
> 
>
> Key: PIG-3527
> URL: https://issues.apache.org/jira/browse/PIG-3527
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Mark Wagner
>Assignee: Mark Wagner
> Fix For: tez-branch
>
> Attachments: PIG-3527.1.patch
>
>
> The PigProcessor needs to be able to handle multiple distinct inputs. These 
> can come in a variety of flavors including multiple "file" inputs (Merge 
> join), multiple shuffle inputs (Hash Join / Co-group), and a mix (Replicated 
> Join).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


Re: Review Request 15194: Support multiple inputs for PigProcessor

2013-11-01 Thread Mark Wagner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15194/
---

(Updated Nov. 2, 2013, 1:17 a.m.)


Review request for pig, Cheolsoo Park, Daniel Dai, and Rohini Palaniswamy.


Bugs: PIG-3527
https://issues.apache.org/jira/browse/PIG-3527


Repository: pig-git


Description
---

Adds support for multiple LogicalInputs to the PigProcessor. This is done by 
adding a new TezLoad interface which PhysicalOperators may implement. On the 
backend, any operator implementing this interface will have the LogicalInput 
attached to it. Two implementations are included:
* POSimpleTezLoad, which consumes a single MRInput
* POShuffleTezLoad, which consumes one or more ShuffledMergedInputs
POShuffleTezLoad does a k-way merge of the shuffle inputs to package for 
the operator pipeline. This required a change to the comparators used so that 
the sort order remains consistent. There is also a fix to POForEach, which 
was using the incorrect status code for signaling (although it produced the 
same end result in the MR pipeline).
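
For readers unfamiliar with the technique, the k-way merge mentioned above can 
be sketched in plain Java as follows. This is only an illustration, not the 
code in the patch: the class and method names are hypothetical, and in-memory 
lists stand in for the shuffle inputs. It also shows why every input has to be 
sorted with the same ordering, since the heap compares heads across inputs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a k-way merge; not the POShuffleTezLoad code.
final class KWayMergeSketch {
    /** The current head element of one input, plus the iterator it came from. */
    private static final class Head<T extends Comparable<T>> implements Comparable<Head<T>> {
        final T value;
        final Iterator<T> source;

        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }

        @Override
        public int compareTo(Head<T> other) {
            return value.compareTo(other.value);
        }
    }

    /** Merges inputs that are each already sorted into one sorted list. */
    static <T extends Comparable<T>> List<T> merge(List<List<T>> sortedInputs) {
        PriorityQueue<Head<T>> heap = new PriorityQueue<>();
        // Seed the heap with the first element of every non-empty input.
        for (List<T> input : sortedInputs) {
            Iterator<T> it = input.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        // Repeatedly emit the smallest head and refill from the same input.
        while (!heap.isEmpty()) {
            Head<T> smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.source.hasNext()) {
                heap.add(new Head<>(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }
}
```

If one input were sorted with a different comparator than the others, the 
merged output would no longer be globally sorted, which is presumably why the 
comparators had to be made consistent.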


Diffs
-

  
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigDecimalRawComparator.java ddea99e 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigIntegerRawComparator.java 5ea3fc7 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBooleanRawComparator.java dfd4ebf 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBytesRawComparator.java 09397e5 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDateTimeRawComparator.java a87161f 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDoubleRawComparator.java cbf457f 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigFloatRawComparator.java 1d86e3f 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigIntRawComparator.java bb6c9df 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigLongRawComparator.java b3ded76 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSecondaryKeyComparator.java 5ad334b 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTextRawComparator.java 022f37b 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleDefaultRawComparator.java 866c39d 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleSortComparator.java 9724b9f 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/POSimpleTezLoad.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/TezLoad.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java eb9f62a 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 86314d9 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackageLite.java c200715 
  src/org/apache/pig/backend/hadoop/executionengine/tez/FileInputHandler.java d29e330 
  src/org/apache/pig/backend/hadoop/executionengine/tez/InputHandler.java d2298ca 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java ebb3145 
  src/org/apache/pig/backend/hadoop/executionengine/tez/ShuffledInputHandler.java d7b42b8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0 
  src/org/apache/pig/data/BinInterSedes.java b3ec51e 
  src/org/apache/pig/data/DefaultTuple.java 2e7ca5f 
  test/e2e/pig/tests/tez.conf 24af8d3 

Diff: https://reviews.apache.org/r/15194/diff/


Testing
---

Manual testing was done, and an e2e test has been added. Because of the 
comparator change, some of the tests fail due to differences in bag ordering.


Thanks,

Mark Wagner



Review Request 15194: Support multiple inputs for PigProcessor

2013-11-01 Thread Mark Wagner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15194/
---

Review request for pig, Cheolsoo Park and Daniel Dai.


Bugs: PIG-3527
https://issues.apache.org/jira/browse/PIG-3527


Repository: pig-git


Description
---

Adds support for multiple LogicalInputs to the PigProcessor. This is done by 
adding a new TezLoad interface which PhysicalOperators may implement. On the 
backend, any operator implementing this interface will have the LogicalInput 
attached to it. Two implementations are included:
* POSimpleTezLoad, which consumes a single MRInput
* POShuffleTezLoad, which consumes one or more ShuffledMergedInputs
POShuffleTezLoad does a k-way merge of the shuffle inputs to package for 
the operator pipeline. This required a change to the comparators used so that 
the sort order remains consistent. There is also a fix to POForEach, which 
was using the incorrect status code for signaling (although it produced the 
same end result in the MR pipeline).
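
The attachment step described above, where operators that implement a marker 
interface receive their inputs, might look roughly like the sketch below. All 
types here are hypothetical stand-ins: InputSource stands in for a Tez 
LogicalInput, AttachableLoad for the TezLoad interface, and Operator for 
PhysicalOperator. The actual method names and signatures in the patch are not 
shown in this thread and may differ.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Tez LogicalInput; not the real Tez type.
interface InputSource {
    String name();
}

// Hypothetical stand-in for the TezLoad interface described in the review.
interface AttachableLoad {
    void attachInputs(Map<String, InputSource> inputs);
}

// Hypothetical stand-in for PhysicalOperator.
abstract class Operator {
}

// A load operator that picks the single input it was configured to read.
class SimpleLoad extends Operator implements AttachableLoad {
    final String wantedInput;
    InputSource attached;

    SimpleLoad(String wantedInput) {
        this.wantedInput = wantedInput;
    }

    @Override
    public void attachInputs(Map<String, InputSource> inputs) {
        attached = inputs.get(wantedInput);
    }
}

// An operator that does not read an input directly, so it is skipped.
class FilterOp extends Operator {
}

final class AttachSketch {
    /** Hands the inputs to every operator that opts in; returns how many did. */
    static int attachAll(List<Operator> plan, Map<String, InputSource> inputs) {
        int attached = 0;
        for (Operator op : plan) {
            if (op instanceof AttachableLoad) {
                ((AttachableLoad) op).attachInputs(inputs);
                attached++;
            }
        }
        return attached;
    }
}
```

The appeal of this shape is that the processor does not need to know which 
operators read which inputs; each operator selects what it needs.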


Diffs
-

  
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigDecimalRawComparator.java ddea99e 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBigIntegerRawComparator.java 5ea3fc7 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBooleanRawComparator.java dfd4ebf 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigBytesRawComparator.java 09397e5 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDateTimeRawComparator.java a87161f 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigDoubleRawComparator.java cbf457f 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigFloatRawComparator.java 1d86e3f 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigIntRawComparator.java bb6c9df 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigLongRawComparator.java b3ded76 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigSecondaryKeyComparator.java 5ad334b 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTextRawComparator.java 022f37b 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleDefaultRawComparator.java 866c39d 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigTupleSortComparator.java 9724b9f 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/POSimpleTezLoad.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/TezLoad.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java eb9f62a 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 86314d9 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackageLite.java c200715 
  src/org/apache/pig/backend/hadoop/executionengine/tez/FileInputHandler.java d29e330 
  src/org/apache/pig/backend/hadoop/executionengine/tez/InputHandler.java d2298ca 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java ebb3145 
  src/org/apache/pig/backend/hadoop/executionengine/tez/ShuffledInputHandler.java d7b42b8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 45e47b0 
  src/org/apache/pig/data/BinInterSedes.java b3ec51e 
  src/org/apache/pig/data/DefaultTuple.java 2e7ca5f 
  test/e2e/pig/tests/tez.conf 24af8d3 

Diff: https://reviews.apache.org/r/15194/diff/


Testing
---

Manual testing was done, and an e2e test has been added. Because of the 
comparator change, some of the tests fail due to differences in bag ordering.


Thanks,

Mark Wagner



[jira] Subscription: PIG patch available

2013-11-01 Thread jira
Issue Subscription
Filter: PIG patch available (11 issues)

Subscriber: pigdaily

Key         Summary
PIG-3556    Fix tez branch compilation with Hadoop 1.0
            https://issues.apache.org/jira/browse/PIG-3556
PIG-3553    HadoopJobHistoryLoader fails to load job history on hadoop v 1.2
            https://issues.apache.org/jira/browse/PIG-3553
PIG-3507    It fails to run pig in local mode on a Kerberos enabled Hadoop cluster
            https://issues.apache.org/jira/browse/PIG-3507
PIG-3505    Make AvroStorage sync interval take default from io.file.buffer.size
            https://issues.apache.org/jira/browse/PIG-3505
PIG-3478    Make StreamingUDF work for Hadoop 2
            https://issues.apache.org/jira/browse/PIG-3478
PIG-3453    Implement a Storm backend to Pig
            https://issues.apache.org/jira/browse/PIG-3453
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441
PIG-3388    No support for Regex for row filter in org.apache.pig.backend.hadoop.hbase.HBaseStorage
            https://issues.apache.org/jira/browse/PIG-3388
PIG-3347    Store invocation in local mode brings side effect
            https://issues.apache.org/jira/browse/PIG-3347
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-2629    Wrong Usage of Scalar which is null causes high namenode operation
            https://issues.apache.org/jira/browse/PIG-2629

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Resolved] (PIG-3522) Remove shock from pig

2013-11-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-3522.
-

  Resolution: Fixed
Hadoop Flags: Reviewed

Patch committed to trunk.

> Remove shock from pig
> -
>
> Key: PIG-3522
> URL: https://issues.apache.org/jira/browse/PIG-3522
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.13.0
>
> Attachments: PIG-3522-1.patch
>
>
> Shock is only used with very old Hadoop versions that use HOD as the resource 
> manager. Current Pig code does not use it. The removal includes the entire 
> lib-src/shock directory and jsch.jar.





[jira] [Commented] (PIG-3522) Remove shock from pig

2013-11-01 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811774#comment-13811774
 ] 

Thejas M Nair commented on PIG-3522:


+1

> Remove shock from pig
> -
>
> Key: PIG-3522
> URL: https://issues.apache.org/jira/browse/PIG-3522
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.13.0
>
> Attachments: PIG-3522-1.patch
>
>
> Shock is only used with very old Hadoop versions that use HOD as the resource 
> manager. Current Pig code does not use it. The removal includes the entire 
> lib-src/shock directory and jsch.jar.





[jira] [Commented] (PIG-3558) ORC support for Pig

2013-11-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811752#comment-13811752
 ] 

Daniel Dai commented on PIG-3558:
-

Also, there is a binary file that cannot be included in the patch. Copy 
http://svn.apache.org/viewvc/hive/trunk/ql/src/test/resources/orc-file-11-format.orc?revision=1519868&view=co
 to test/org/apache/pig/builtin/orc. 

> ORC support for Pig
> ---
>
> Key: PIG-3558
> URL: https://issues.apache.org/jira/browse/PIG-3558
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.13.0
>
> Attachments: PIG-3558-1.patch
>
>
> Adding LoadFunc and StoreFunc for ORC.





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Jacob Perkins (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811751#comment-13811751
 ] 

Jacob Perkins commented on PIG-3453:


[~dvryaboy] You're right, I'd honestly prefer to keep working on the git branch 
since I'm more comfortable/familiar with the workflow. I've already been 
merging changes from apache trunk as I go. It's no big deal to make patches.

I went with Trident originally because it's a very simple abstraction that's 
fairly straightforward to map to pig constructs. I'm not opposed to going 
directly to storm if that makes sense from a performance perspective, but I 
imagine it'd be quite a bit more complicated and involve more code. Worth 
looking into further, I suppose. And no, I have not looked at throughput 
numbers yet. Any suggestions for the best way to do that, e.g. comparing a 
trident topology to a lean storm topology?



> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811744#comment-13811744
 ] 

Cheolsoo Park commented on PIG-3453:


[~dvryaboy], I have no problem with that. We should even consider migrating Pig 
to git.

But if Jacob wants to merge it into trunk at some point, and more contributors 
want to collaborate, having an official branch in Apache is better than keeping 
it in his personal repo. Do you have any problem with creating a branch for 
Storm backend?

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Commented] (PIG-3558) ORC support for Pig

2013-11-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811736#comment-13811736
 ] 

Daniel Dai commented on PIG-3558:
-

The patch depends on HIVE-5728, which provides the InputFormat/OutputFormat 
that Pig needs.

> ORC support for Pig
> ---
>
> Key: PIG-3558
> URL: https://issues.apache.org/jira/browse/PIG-3558
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.13.0
>
> Attachments: PIG-3558-1.patch
>
>
> Adding LoadFunc and StoreFunc for ORC.





[jira] [Updated] (PIG-3558) ORC support for Pig

2013-11-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3558:


Attachment: PIG-3558-1.patch

> ORC support for Pig
> ---
>
> Key: PIG-3558
> URL: https://issues.apache.org/jira/browse/PIG-3558
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.13.0
>
> Attachments: PIG-3558-1.patch
>
>
> Adding LoadFunc and StoreFunc for ORC.





[jira] [Created] (PIG-3558) ORC support for Pig

2013-11-01 Thread Daniel Dai (JIRA)
Daniel Dai created PIG-3558:
---

 Summary: ORC support for Pig
 Key: PIG-3558
 URL: https://issues.apache.org/jira/browse/PIG-3558
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.13.0


Adding LoadFunc and StoreFunc for ORC.





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811726#comment-13811726
 ] 

Dmitriy V. Ryaboy commented on PIG-3453:


I don't see why Jacob can't keep working in a github branch. It's easier to 
look at what's changing, and he can keep merging the (read-only) git mirror 
from apache to keep up with changes.

Jacob, I see you are using Trident. Have you compared your throughput numbers 
with going directly to storm?

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811622#comment-13811622
 ] 

Cheolsoo Park commented on PIG-3453:


Yes, that's correct. I can create a branch for you. Let me do it perhaps 
tomorrow.

If anyone has objections, please chime in.

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Updated] (PIG-3557) Implement optimizations for LIMIT

2013-11-01 Thread Alex Bain (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Bain updated PIG-3557:
---

Issue Type: Sub-task  (was: Bug)
Parent: PIG-3446

> Implement optimizations for LIMIT
> -
>
> Key: PIG-3557
> URL: https://issues.apache.org/jira/browse/PIG-3557
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Affects Versions: tez-branch
>Reporter: Alex Bain
>Assignee: Alex Bain
>
> Implement optimizations for LIMIT when other parts of Pig-on-Tez are more 
> mature. Some of the optimizations mentioned by Daniel include:
> 1. If the previous stage uses 1 reducer, there is no need to add one more 
> vertex
> 2. If the limitplan is null (i.e., not the "limited order by" case), we might 
> not need a shuffle edge; a pass-through edge should be enough if possible
> 3. Similar to PIG-1270, we can push limit to InputHandler
> 4. We also need to think through the "limited order by" case once "order by" 
> is implemented





[jira] [Created] (PIG-3557) Implement optimizations for LIMIT

2013-11-01 Thread Alex Bain (JIRA)
Alex Bain created PIG-3557:
--

 Summary: Implement optimizations for LIMIT
 Key: PIG-3557
 URL: https://issues.apache.org/jira/browse/PIG-3557
 Project: Pig
  Issue Type: Bug
  Components: tez
Affects Versions: tez-branch
Reporter: Alex Bain
Assignee: Alex Bain


Implement optimizations for LIMIT when other parts of Pig-on-Tez are more 
mature. Some of the optimizations mentioned by Daniel include:

1. If the previous stage uses 1 reducer, there is no need to add one more vertex
2. If the limitplan is null (i.e., not the "limited order by" case), we might 
not need a shuffle edge; a pass-through edge should be enough if possible
3. Similar to PIG-1270, we can push limit to InputHandler
4. We also need to think through the "limited order by" case once "order by" is 
implemented
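
As a rough illustration of item 3, pushing a limit down to the input side 
means the reader stops pulling records once the limit is reached, rather than 
reading everything and discarding the surplus later. The sketch below uses a 
hypothetical LimitedIterator wrapper over a plain Java iterator. It is not the 
InputHandler API, whose methods are not shown in this thread.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper that stops reading from the source after `limit` records.
final class LimitedIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private long remaining;

    LimitedIterator(Iterator<T> delegate, long limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.delegate = delegate;
        this.remaining = limit;
    }

    @Override
    public boolean hasNext() {
        // Short-circuits once the limit is hit, so the source is not touched again.
        return remaining > 0 && delegate.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return delegate.next();
    }
}
```

Wrapping a source of five records with a limit of two yields the first two 
records and leaves the remaining three unread in the source.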





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Jacob Perkins (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811548#comment-13811548
 ] 

Jacob Perkins commented on PIG-3453:


[~cheolsoo] Yes. That makes a lot of sense. So, if I understand correctly, 
you'll make a feature branch. Then I can just work off that feature branch. 
I'll create a sub task called something like 'word count' or proof-of-concept 
or some such, submit this first patch (against the feature branch, not trunk) 
for it, and we'll go from there?

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811537#comment-13811537
 ] 

Cheolsoo Park commented on PIG-3453:


Usually, we create a feature branch for a big feature and merge it to trunk 
after it is fully developed and tested. Although it's totally possible to 
develop it in your personal repo and post a giant patch in one shot, the bigger 
the patch is, the longer it takes to review. So I recommend creating subtasks 
and incrementally committing small patches. To do that, you will need an svn 
branch because you can't resolve jiras w/o committing patches. 

The Pig git repo is a read-only mirror of the svn repo. So unfortunately, 
patches need to be posted in jiras to get committed. Since you don't have 
commit access to the svn repo, it will be helpful to have at least one 
committer in the loop. Does this make sense?

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Jacob Perkins (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811497#comment-13811497
 ] 

Jacob Perkins commented on PIG-3453:


[~cheolsoo] I've got it on a separate branch in my github fork of apache pig 
(http://github.com/thedatachef/pig/tree/storm-integration). I just wasn't sure 
what the best way to say "hey, here's a storm execution engine" was, other than 
a patch :) Can you direct me to the dev mailing list? Also, and maybe this is a 
question for the dev mailing list, this is the first apache project I've 
contributed to. I'm not sure how closely it's integrated with git/github other 
than as a convenient mirror. If you create a branch called storm under 
apache/pig, what's the best way for me to push changes to it? A pull request, 
or is there another preferred method?

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811487#comment-13811487
 ] 

Cheolsoo Park commented on PIG-3453:


[~thedatachef], this is exciting! I am wondering whether we should create a 
branch for the storm backend like we have for tez. Since the backend 
interfaces, including ExecutionEngine, Launcher, and PigStats, are evolving 
now, it will probably be easier for you to maintain your work in a branch. Feel 
free to send an email to the dev mailing list. I am happy to help you create a 
branch and commit your work.

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Updated] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Jacob Perkins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Perkins updated PIG-3453:
---

Attachment: storm-integration.patch

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal





[jira] [Updated] (PIG-3453) Implement a Storm backend to Pig

2013-11-01 Thread Jacob Perkins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Perkins updated PIG-3453:
---

Fix Version/s: 0.13.0
 Assignee: Jacob Perkins
Affects Version/s: 0.13.0
   Status: Patch Available  (was: Open)

Whew. Here's a patch that demonstrates running a word count end to end. It's 
quite hefty, so here are some high-level points:

* Implemented two new operators 'tap' and 'sink' with corresponding logical 
operators LOTap and LOSink and interfaces SinkFunc and TapFunc. I did the best 
I could to keep them general enough to work beyond the scope of storm alone. 
It may make sense to split just this part out into its own jira and patch.

* Implemented LocalFileTap and LocalFileSink (which really shouldn't be used 
for more than simple testing) to demonstrate the TapFunc and SinkFunc.

* LogToTopologyTranslationVisitor - Much like LogToPhyTranslationVisitor for 
the physical plan, it walks the logical plan and creates a TridentTopology.

* LOForEach - I more or less copied exactly what's being done in the 
LogToPhyTranslationVisitor. Since POForEach is serializable, rather than 
parsing the logical expression plans myself I simply create the POForEach and 
wrap it with a storm trident BaseFunction. It seemed a reasonably pragmatic 
approach for now.

* LOCogroup - I took a similar approach to LOForEach except, since POPackage is 
tied so closely to Hadoop Writables, I implemented something similar to what 
POPackage is doing, in StreamPackageFunction.

* TridentExecutionEngine - This is probably the hackiest part. I'm not sure 
what the best way to create a stats object for this is. The topology runs 
continuously, it doesn't 'succeed'. I don't want to fake POStores.

* Building and classpath. I did the best I could to avoid a dependency 
nightmare. After applying the patch to trunk, it should build fine. To run it, 
you'll want zookeeper-3.3.3.jar (no other version works) and 
storm-core-0.9.0-rc2.jar in your classpath.

* test script:

{code: title=wordcount.pig|borderStyle=solid}
set storm.executionengine.stream.batch.size 1

data = tap '$sometext' using 
org.apache.pig.backend.storm.tap.LocalFileTap('line') as (line:chararray);

tokens = foreach data generate flatten(TOKENIZE(line)) as (token:chararray);

counts = foreach (group tokens by token) generate
 group as token,
 COUNT(tokens) as num;

sink counts into '$output' using 
org.apache.pig.backend.storm.sink.LocalFileSink('token');
{code}
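
For readers less familiar with Pig Latin, the dataflow in the script above 
(tokenize each line, group by token, count) corresponds roughly to the plain 
Java sketch below. It is only an illustration of what the script computes, not 
of how the Trident topology executes it. The class name is hypothetical, 
splitting on whitespace is a simplification of Pig's TOKENIZE, and the sketch 
processes a finite batch, whereas the topology runs continuously.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical batch equivalent of wordcount.pig; not part of the patch.
final class WordCountSketch {
    /** Returns token -> number of occurrences across all lines, sorted by token. */
    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // "foreach data generate flatten(TOKENIZE(line))", simplified to whitespace.
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // a blank line splits into one empty string
                }
                // "group tokens by token" followed by COUNT(tokens).
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```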

I'm sure there are more details than this. Again, it's a large patch and, rather 
than continuing to polish it, I think it's time for feedback.

> Implement a Storm backend to Pig
> 
>
> Key: PIG-3453
> URL: https://issues.apache.org/jira/browse/PIG-3453
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.13.0
>Reporter: Pradeep Gollakota
>Assignee: Jacob Perkins
>  Labels: storm
> Fix For: 0.13.0
>
> Attachments: storm-integration.patch
>
>
> There is a lot of interest around implementing a Storm backend to Pig for 
> streaming processing. The proposal and initial discussions can be found at 
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal


