[jira] [Updated] (PIG-3764) Compile physical operators to bytecode

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3764:


Description: 
I started a prototype here:
https://github.com/julienledem/pig/compare/trunk...compile_physical_plan

The current physical plan is relatively inefficient at evaluating expressions.
In the context of a better execution engine (Tez, Spark, ...), compiling 
expressions to bytecode would be a significant speedup.
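
As a rough illustration of the idea (not the prototype's approach), the sketch below contrasts evaluating an expression tree by walking it for every row with translating the tree once into a directly callable form. It composes Java closures rather than emitting real bytecode, which a library such as ASM would be needed for. All names here (ExprSketch, Column, Const, Add) are hypothetical and do not correspond to Pig's physical operators.

```java
import java.util.function.ToLongFunction;

final class ExprSketch {
    /** A tiny expression tree over rows of long values. */
    interface Expr {
        /** Interpreted evaluation: walks the tree again for every row. */
        long eval(long[] row);

        /** Translates the tree once into nested closures (a stand-in for generated bytecode). */
        ToLongFunction<long[]> compile();
    }

    /** Reads one column of the row. */
    static final class Column implements Expr {
        final int index;
        Column(int index) { this.index = index; }
        public long eval(long[] row) { return row[index]; }
        public ToLongFunction<long[]> compile() {
            final int i = index;            // resolved once, at compile time
            return row -> row[i];
        }
    }

    /** A constant value. */
    static final class Const implements Expr {
        final long value;
        Const(long value) { this.value = value; }
        public long eval(long[] row) { return value; }
        public ToLongFunction<long[]> compile() {
            final long v = value;
            return row -> v;
        }
    }

    /** Adds two sub-expressions. */
    static final class Add implements Expr {
        final Expr left;
        final Expr right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public long eval(long[] row) { return left.eval(row) + right.eval(row); }
        public ToLongFunction<long[]> compile() {
            // Children are translated once; the returned closure no longer touches the tree.
            final ToLongFunction<long[]> l = left.compile();
            final ToLongFunction<long[]> r = right.compile();
            return row -> l.applyAsLong(row) + r.applyAsLong(row);
        }
    }

    public static void main(String[] args) {
        Expr expr = new Add(new Column(0), new Add(new Column(1), new Const(10)));
        ToLongFunction<long[]> compiled = expr.compile();   // done once per plan
        long[][] rows = { {1, 2}, {3, 4} };
        for (long[] row : rows) {
            // Both paths must agree on every row.
            System.out.println(expr.eval(row) + " " + compiled.applyAsLong(row));
        }
    }
}
```

Actual bytecode generation would go further by emitting a single method body for the whole expression, so that the JIT sees straight-line code instead of a chain of calls.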

This is a candidate project for Google Summer of Code 2014. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2014

  was:
I started a prototype here:
https://github.com/julienledem/pig/compare/trunk...compile_physical_plan

The current physical plan is relatively inefficient at evaluating expressions.
In the context of a better execution engine (Tez, Spark, ...), compiling 
expressions to bytecode would be a significant speedup.


> Compile physical operators to bytecode
> --
>
> Key: PIG-3764
> URL: https://issues.apache.org/jira/browse/PIG-3764
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Julien Le Dem
>  Labels: GSOC2014
>
> I started a prototype here:
> https://github.com/julienledem/pig/compare/trunk...compile_physical_plan
> The current physical plan is relatively inefficient at evaluating expressions.
> In the context of a better execution engine (Tez, Spark, ...), compiling 
> expressions to bytecode would be a significant speedup.
> This is a candidate project for Google Summer of Code 2014. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2014



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (PIG-2784) Framework for dynamic query optimization

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2784:


Description: 
We need a framework to implement dynamic query optimization, i.e. changing the 
query plan at runtime. Currently we support estimating the number of reducers 
dynamically, which works well as a first step but was not perfectly 
implemented. In the near future, we'll support more dynamic optimizations, such 
as [removing the sample job for 
order-by|https://issues.apache.org/jira/browse/PIG-483], [removing the limit 
job|https://issues.apache.org/jira/browse/PIG-2675], dynamically detecting skew 
and using skew-join, etc.

Currently, estimating the number of reducers is implemented in 
JobControlCompiler, after MRCompiler compiles all the MapReduceOperators and 
generates the complete MRPlan. One place (discussed with Thejas) to implement 
the framework is in MRCompiler, where the MRPlan would be generated in batches 
and adjusted dynamically.
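
To make the reducer estimate concrete, here is a rough sketch of a size-based estimate: divide the total input size by a target number of bytes per reducer, round up, and cap the result. The class name, the method name, and the values used in main (1 GB per reducer, at most 999 reducers) are assumptions for illustration; they are not taken from this issue or from JobControlCompiler.

```java
final class ReducerEstimate {
    /**
     * Estimates a reducer count from the input size.
     *
     * @param totalInputBytes total size of the job's input, in bytes
     * @param bytesPerReducer target amount of input handled by one reducer (must be positive)
     * @param maxReducers     upper bound on the estimate (must be positive)
     * @return a value between 1 and maxReducers
     */
    static int estimate(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("bytesPerReducer and maxReducers must be positive");
        }
        if (totalInputBytes <= 0) {
            return 1; // always run at least one reducer
        }
        // Ceiling division written to avoid overflow for very large inputs.
        long needed = totalInputBytes / bytesPerReducer
                + (totalInputBytes % bytesPerReducer == 0 ? 0 : 1);
        return (int) Math.min(needed, (long) maxReducers);
    }

    public static void main(String[] args) {
        long bytesPerReducer = 1_000_000_000L; // assumed target: 1 GB per reducer
        int maxReducers = 999;                 // assumed cap
        System.out.println(estimate(2_500_000_000L, bytesPerReducer, maxReducers));
    }
}
```

A static estimate like this is computed before the job runs. The framework proposed here would instead revisit such decisions as intermediate results become available.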

Any comments?

This is a candidate project for Google Summer of Code 2014. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2014

  was:
We need a framework to implement dynamic query optimization, i.e. changing the 
query plan at runtime. Currently we support estimating the number of reducers 
dynamically, which works well as a first step but was not perfectly 
implemented. In the near future, we'll support more dynamic optimizations, such 
as [removing the sample job for 
order-by|https://issues.apache.org/jira/browse/PIG-483], [removing the limit 
job|https://issues.apache.org/jira/browse/PIG-2675], dynamically detecting skew 
and using skew-join, etc.

Currently, estimating the number of reducers is implemented in 
JobControlCompiler, after MRCompiler compiles all the MapReduceOperators and 
generates the complete MRPlan. One place (discussed with Thejas) to implement 
the framework is in MRCompiler, where the MRPlan would be generated in batches 
and adjusted dynamically.

Any comments?


> Framework for dynamic query optimization
> 
>
> Key: PIG-2784
> URL: https://issues.apache.org/jira/browse/PIG-2784
> Project: Pig
>  Issue Type: New Feature
>Reporter: Jie Li
>Assignee: Aniket Mokashi
>  Labels: GSOC2014
>
> We need a framework to implement dynamic query optimization, i.e. changing 
> the query plan at runtime. Currently we support estimating the number of 
> reducers dynamically, which works well as a first step but was not 
> perfectly implemented. In the near future, we'll support more dynamic 
> optimizations, such as [removing the sample job for 
> order-by|https://issues.apache.org/jira/browse/PIG-483], [removing the limit 
> job|https://issues.apache.org/jira/browse/PIG-2675], dynamically detecting 
> skew and using skew-join, etc.
> Currently, estimating the number of reducers is implemented in 
> JobControlCompiler, after MRCompiler compiles all the MapReduceOperators and 
> generates the complete MRPlan. One place (discussed with Thejas) to implement 
> the framework is in MRCompiler, where the MRPlan would be generated in 
> batches and adjusted dynamically.
> Any comments?
> This is a candidate project for Google Summer of Code 2014. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2014





[jira] [Updated] (PIG-2599) Mavenize Pig

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2599:


Description: 
Switch Pig build system from ant to maven.

This is a candidate project for Google Summer of Code 2014. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2014

  was:
Switch Pig build system from ant to maven.

This is a candidate project for Google Summer of Code 2013. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2013


> Mavenize Pig
> 
>
> Key: PIG-2599
> URL: https://issues.apache.org/jira/browse/PIG-2599
> Project: Pig
>  Issue Type: New Feature
>  Components: build
>Reporter: Daniel Dai
>  Labels: gsoc2014
> Fix For: 0.13.0
>
> Attachments: maven-pig.1.zip
>
>
> Switch Pig build system from ant to maven.
> This is a candidate project for Google Summer of Code 2014. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2014





[jira] [Updated] (PIG-2597) Move grunt from javacc to ANTRL

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2597:


Description: 
Currently, the parser for queries is in ANTLR, but Grunt is still in javacc. 
The javacc parser is very difficult to work with, and next to impossible to 
understand or modify. ANTLR provides a much cleaner, more standard way to 
generate parsers/lexers/ASTs/etc., and moving Grunt from javacc to ANTLR would 
be huge as we continue to add features to Pig.

This is a candidate project for Google Summer of Code 2014. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2014

  was:
Currently, the parser for queries is in ANTLR, but Grunt is still in javacc. 
The javacc parser is very difficult to work with, and next to impossible to 
understand or modify. ANTLR provides a much cleaner, more standard way to 
generate parsers/lexers/ASTs/etc., and moving Grunt from javacc to ANTLR would 
be huge as we continue to add features to Pig.

This is a candidate project for Google Summer of Code 2013. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2013


> Move grunt from javacc to ANTRL
> ---
>
> Key: PIG-2597
> URL: https://issues.apache.org/jira/browse/PIG-2597
> Project: Pig
>  Issue Type: Improvement
>Reporter: Jonathan Coveney
>  Labels: gsoc2014
> Attachments: pig02.diff
>
>
> Currently, the parser for queries is in ANTLR, but Grunt is still in javacc. 
> The javacc parser is very difficult to work with, and next to impossible to 
> understand or modify. ANTLR provides a much cleaner, more standard way to 
> generate parsers/lexers/ASTs/etc., and moving Grunt from javacc to ANTLR 
> would be huge as we continue to add features to Pig.
> This is a candidate project for Google Summer of Code 2014. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2014





[jira] [Updated] (PIG-2597) Move grunt from javacc to ANTRL

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2597:


Labels: gsoc2014  (was: gsoc2013)

> Move grunt from javacc to ANTRL
> ---
>
> Key: PIG-2597
> URL: https://issues.apache.org/jira/browse/PIG-2597
> Project: Pig
>  Issue Type: Improvement
>Reporter: Jonathan Coveney
>  Labels: gsoc2014
> Attachments: pig02.diff
>
>
> Currently, the parser for queries is in ANTLR, but Grunt is still in javacc. 
> The javacc parser is very difficult to work with, and next to impossible to 
> understand or modify. ANTLR provides a much cleaner, more standard way to 
> generate parsers/lexers/ASTs/etc., and moving Grunt from javacc to ANTLR 
> would be huge as we continue to add features to Pig.
> This is a candidate project for Google Summer of Code 2013. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013





[jira] Subscription: PIG patch available

2014-02-20 Thread jira
Issue Subscription
Filter: PIG patch available (10 issues)

Subscriber: pigdaily

Key         Summary
PIG-3757    Make scalar work
            https://issues.apache.org/jira/browse/PIG-3757
PIG-3737    Bundle dependent jars in distribution in %PIG_HOME%/lib folder
            https://issues.apache.org/jira/browse/PIG-3737
PIG-3735    UDF to data cleanse the dirty data with expected pattern
            https://issues.apache.org/jira/browse/PIG-3735
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3635    Fix e2e tests for Hadoop 2.X on Windows
            https://issues.apache.org/jira/browse/PIG-3635
PIG-3613    UDF for SimilarityMatching between strings with matching scores
            https://issues.apache.org/jira/browse/PIG-3613
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-3456    Reduce threadlocal conf access in backend for each record
            https://issues.apache.org/jira/browse/PIG-3456
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441
PIG-3373    XMLLoader returns non-matching nodes when a tag name spans through the block boundary
            https://issues.apache.org/jira/browse/PIG-3373

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Updated] (PIG-3774) Piggybank Over UDF get wrong result

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3774:


  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to trunk and 0.12 branch.

> Piggybank Over UDF get wrong result
> ---
>
> Key: PIG-3774
> URL: https://issues.apache.org/jira/browse/PIG-3774
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12.1, 0.13.0
>
> Attachments: PIG-3774-1.patch
>
>






[jira] [Commented] (PIG-3774) Piggybank Over UDF get wrong result

2014-02-20 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907765#comment-13907765
 ] 

Alan Gates commented on PIG-3774:
-

+1.  

> Piggybank Over UDF get wrong result
> ---
>
> Key: PIG-3774
> URL: https://issues.apache.org/jira/browse/PIG-3774
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12.1, 0.13.0
>
> Attachments: PIG-3774-1.patch
>
>






[jira] [Updated] (PIG-3774) Piggybank Over UDF get wrong result

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3774:


Attachment: PIG-3774-1.patch

> Piggybank Over UDF get wrong result
> ---
>
> Key: PIG-3774
> URL: https://issues.apache.org/jira/browse/PIG-3774
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12.1, 0.13.0
>
> Attachments: PIG-3774-1.patch
>
>






[jira] [Updated] (PIG-3774) Piggybank Over UDF get wrong result

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3774:


Status: Patch Available  (was: Open)

> Piggybank Over UDF get wrong result
> ---
>
> Key: PIG-3774
> URL: https://issues.apache.org/jira/browse/PIG-3774
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12.1, 0.13.0
>
> Attachments: PIG-3774-1.patch
>
>






[jira] [Created] (PIG-3774) Piggybank Over UDF get wrong result

2014-02-20 Thread Daniel Dai (JIRA)
Daniel Dai created PIG-3774:
---

 Summary: Piggybank Over UDF get wrong result
 Key: PIG-3774
 URL: https://issues.apache.org/jira/browse/PIG-3774
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12.1, 0.13.0








[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache

2014-02-20 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907473#comment-13907473
 ] 

Aniket Mokashi commented on PIG-2672:
-

Thanks [~brocknoland]! It looks like this leak existed even before this patch; 
see
https://github.com/apache/pig/blob/branch-0.12/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java#L1524
Let me open another JIRA to fix it.

> Optimize the use of DistributedCache
> 
>
> Key: PIG-2672
> URL: https://issues.apache.org/jira/browse/PIG-2672
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Aniket Mokashi
> Fix For: 0.13.0
>
> Attachments: PIG-2672-10.patch, PIG-2672-5.patch, PIG-2672-7.patch, 
> PIG-2672.patch
>
>
> Pig currently copies jar files to a temporary location in HDFS and then adds 
> them to DistributedCache for each job launched. This is inefficient in terms 
> of:
>   * Space - The jars are distributed to task trackers for every job, taking 
>     up a lot of local temporary space on the tasktrackers.
>   * Performance - The jar distribution impacts the job launch time.





[jira] [Commented] (PIG-2672) Optimize the use of DistributedCache

2014-02-20 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907447#comment-13907447
 ] 

Brock Noland commented on PIG-2672:
---

FYI, in HIVE-860 a reviewer asked me whether the following code (copied from 
this patch) closes the stream:

{noformat}
String checksum = DigestUtils.shaHex(url.openStream());
{noformat}

It doesn't look like it does, according to the commons-codec source. Therefore 
I think Pig has a file descriptor leak.
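
One way to avoid this kind of leak, as a minimal sketch: open the stream in a try-with-resources block so that it is closed whether or not hashing succeeds. The sketch uses the JDK's MessageDigest in place of commons-codec's DigestUtils so that it runs without extra dependencies; the class and method names (StreamChecksum, sha1Hex, checksum) are hypothetical and not taken from the Pig patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class StreamChecksum {
    /** Returns the SHA-1 of everything readable from in, as lowercase hex. Does not close in. */
    static String sha1Hex(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Hashes the resource behind url and always closes the stream it opened. */
    static String checksum(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return sha1Hex(in);
        } // in.close() runs here, even if sha1Hex throws
    }

    public static void main(String[] args) throws IOException {
        // In-memory demonstration of the same pattern.
        try (InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8))) {
            System.out.println(sha1Hex(in));
        }
    }
}
```

With commons-codec on the classpath, the same pattern should apply to the existing call: open the stream in the try header and pass that variable to DigestUtils.shaHex instead of passing url.openStream() directly.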

> Optimize the use of DistributedCache
> 
>
> Key: PIG-2672
> URL: https://issues.apache.org/jira/browse/PIG-2672
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Aniket Mokashi
> Fix For: 0.13.0
>
> Attachments: PIG-2672-10.patch, PIG-2672-5.patch, PIG-2672-7.patch, 
> PIG-2672.patch
>
>
> Pig currently copies jar files to a temporary location in HDFS and then adds 
> them to DistributedCache for each job launched. This is inefficient in terms 
> of:
>   * Space - The jars are distributed to task trackers for every job, taking 
>     up a lot of local temporary space on the tasktrackers.
>   * Performance - The jar distribution impacts the job launch time.





[jira] [Updated] (PIG-3675) Documentation for AccumuloStorage

2014-02-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3675:


  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Josh!

> Documentation for AccumuloStorage
> -
>
> Key: PIG-3675
> URL: https://issues.apache.org/jira/browse/PIG-3675
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Reporter: Daniel Dai
>Assignee: Josh Elser
> Fix For: 0.13.0
>
> Attachments: 
> 0001-PIG-3675-Initial-documentation-for-AccumuloStorage.patch, 
> 0001-PIG-3675-Initial-documentation-for-AccumuloStorage.patch.2
>
>






[jira] [Updated] (PIG-3773) Grunt ERROR 2017 with RANK and two output paths

2014-02-20 Thread Ville Weijo (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ville Weijo updated PIG-3773:
-

Attachment: pig_1392907156669.log

Stack trace

> Grunt ERROR 2017 with RANK and two output paths
> ---
>
> Key: PIG-3773
> URL: https://issues.apache.org/jira/browse/PIG-3773
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.11.1
>Reporter: Ville Weijo
> Attachments: pig_1392907156669.log
>
>
> Execution of Pig script
> {code}
> A = LOAD 'input.txt';
> B = RANK A;
> STORE B INTO 'output1.txt';
> STORE A INTO 'output2.txt';
> {code}
> crashes with 
> {code}
> [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error 
> creating job configuration.
> {code}
> If "STORE A INTO 'output2.txt'" is removed, the script works fine. The 
> content of 'input.txt' does not seem to matter, except that it cannot be 
> empty (an empty file apparently triggers bug 
> [PIG-3726|https://issues.apache.org/jira/browse/PIG-3726]).





[jira] [Created] (PIG-3773) Grunt ERROR 2017 with RANK and two output paths

2014-02-20 Thread Ville Weijo (JIRA)
Ville Weijo created PIG-3773:


 Summary: Grunt ERROR 2017 with RANK and two output paths
 Key: PIG-3773
 URL: https://issues.apache.org/jira/browse/PIG-3773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1, 0.12.0
Reporter: Ville Weijo


Execution of Pig script
{code}
A = LOAD 'input.txt';
B = RANK A;
STORE B INTO 'output1.txt';
STORE A INTO 'output2.txt';
{code}
crashes with 
{code}
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error 
creating job configuration.
{code}
If "STORE A INTO 'output2.txt'" is removed, the script works fine. The content 
of 'input.txt' does not seem to matter, except that it cannot be empty (an 
empty file apparently triggers bug 
[PIG-3726|https://issues.apache.org/jira/browse/PIG-3726]).



