[jira] [Commented] (PIG-3021) Split results missing records when there is null values in the column comparison

2017-04-18 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973449#comment-15973449
 ] 

Cheolsoo Park commented on PIG-3021:


[~daijy], no I don't remember. Sorry for the late reply.

> Split results missing records when there is null values in the column 
> comparison
> 
>
> Key: PIG-3021
> URL: https://issues.apache.org/jira/browse/PIG-3021
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Chang Luo
>Assignee: Nian Ji
> Attachments: PIG-3021-2.patch, PIG-3021-3.patch, PIG-3021-4.patch, 
> PIG-3021.patch
>
>
> Suppose a(x, y)
> split a into b if x==y, c otherwise;
> One would expect the union of b and c to be a.  However, if x or y is null, 
> the record won't appear in either b or c.
> To work around this, I have to change it to the following:
> split a into b if x is not null and y is not null and x==y, c otherwise;
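> A minimal end-to-end sketch of the behavior (relation and field names are illustrative):
> {code}
> a = LOAD 'input' AS (x:int, y:int);
> -- With plain x==y, a record where x or y is null evaluates the condition to null,
> -- so the record is dropped from both branches.
> SPLIT a INTO b IF (x IS NOT NULL AND y IS NOT NULL AND x == y), c OTHERWISE;
> -- b holds rows where both sides are non-null and equal; c holds everything else.
> {code}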



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-4940) Predicate push-down filtering unary expressions can be pushed.

2016-06-28 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353741#comment-15353741
 ] 

Cheolsoo Park commented on PIG-4940:


Sorry, but I'll let someone who is still active in the community review your 
patch...

> Predicate push-down filtering unary expressions can be pushed.
> --
>
> Key: PIG-4940
> URL: https://issues.apache.org/jira/browse/PIG-4940
> Project: Pig
>  Issue Type: Bug
>Reporter: Ryan Blue
>
> While testing predicate push-down, I ran into the following error:
> {code:title=Pig Exception}
> ERROR 0: Unsupported conversion of LogicalExpression to Expression: Map
> at 
> org.apache.pig.newplan.FilterExtractor.getExpression(FilterExtractor.java:389)
> at 
> org.apache.pig.newplan.FilterExtractor.getExpression(FilterExtractor.java:401)
> at 
> org.apache.pig.newplan.FilterExtractor.getExpression(FilterExtractor.java:378)
> at 
> org.apache.pig.newplan.FilterExtractor.getExpression(FilterExtractor.java:401)
> at 
> org.apache.pig.newplan.FilterExtractor.getExpression(FilterExtractor.java:380)
> at 
> org.apache.pig.newplan.FilterExtractor.visit(FilterExtractor.java:109)
> at 
> org.apache.pig.newplan.PredicatePushDownFilterExtractor.visit(PredicatePushDownFilterExtractor.java:70)
> at 
> org.apache.pig.newplan.logical.rules.PredicatePushdownOptimizer$PredicatePushDownTransformer.transform(PredicatePushdownOptimizer.java:146)
> at 
> org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:110)
> ... 19 more
> {code}
> The problem is that the code is trying to push a map access operation, which 
> isn't supported. The cause appears to be the logic in 
> {{checkPushDown(UnaryExpression)}} that separates expressions that can be 
> pushed from expressions that must be run by Pig. This function assumes that 
> any expression under {{IsNullExpression}} or {{NotExpression}} can be pushed 
> and adds the unary node's child expression to the pushdown expression without 
> calling {{checkPushDown}} on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4850) Registered jars do not use submit replication

2016-03-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4850:
---
   Resolution: Fixed
Fix Version/s: 0.16.0
   Status: Resolved  (was: Patch Available)

Committed into trunk.

> Registered jars do not use submit replication
> -
>
> Key: PIG-4850
> URL: https://issues.apache.org/jira/browse/PIG-4850
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 0.16.0
>
> Attachments: PIG-4850.1.patch
>
>
> PIG-4074 added support for mapred.submit.replication, which sets the 
> replication factor for files added to the distributed cache. The purpose is 
> to avoid a huge number of task attempts downloading the same file in HDFS at 
> once during localization and slowing down because of contention over a few 
> replicas. The replication factor for files was set correctly, but registered 
> jars are added to HDFS through a different code path and weren't using the 
> submit replication factor. This causes localization time for jobs to increase 
> by as much as 10 minutes (at which point the tasks are killed).
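> A rough sketch of the scenario (the jar name and replication value are illustrative, not from this issue):
> {code}
> set mapred.submit.replication '20';  -- replication factor for files shipped via the distributed cache
> REGISTER 'my-udfs.jar';              -- before this fix, registered jars ignored the setting above
> {code}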



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4850) Registered jars do not use submit replication

2016-03-25 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212561#comment-15212561
 ] 

Cheolsoo Park commented on PIG-4850:


Oh so PIG-4074 missed a code path for registered jars. LGTM.

Thanks for fixing this! I'll commit it later today.

+1.

> Registered jars do not use submit replication
> -
>
> Key: PIG-4850
> URL: https://issues.apache.org/jira/browse/PIG-4850
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Attachments: PIG-4850.1.patch
>
>
> PIG-4074 added support for mapred.submit.replication, which sets the 
> replication factor for files added to the distributed cache. The purpose is 
> to avoid a huge number of task attempts downloading the same file in HDFS at 
> once during localization and slowing down because of contention over a few 
> replicas. The replication factor for files was set correctly, but registered 
> jars are added to HDFS through a different code path and weren't using the 
> submit replication factor. This causes localization time for jobs to increase 
> by as much as 10 minutes (at which point the tasks are killed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4696) Empty map returned by a streaming_python udf wrongly contains a null key

2015-10-08 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4696:
---
   Resolution: Fixed
Fix Version/s: 0.16.0
   Status: Resolved  (was: Patch Available)

Thanks Rohini for the review. Committed to trunk.

> Empty map returned by a streaming_python udf wrongly contains a null key
> 
>
> Key: PIG-4696
> URL: https://issues.apache.org/jira/browse/PIG-4696
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.16.0
>
> Attachments: PIG-4696.1.patch
>
>
> To reproduce, please run the following query-
> {code}
> b = FOREACH a GENERATE (map[])udfs.empty_dict();
> DUMP b;
> {code}
> where empty_dict() is a Python udf-
> {code}
> @outputSchema("map_out: []")
> def empty_dict():
>     return {}
> {code}
> This returns {{([])}} in jython while {{(\[null#\])}} in streaming_python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4696) Empty map returned by a streaming_python udf wrongly contains a null key

2015-10-07 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4696:
---
Attachment: PIG-4696.1.patch

> Empty map returned by a streaming_python udf wrongly contains a null key
> 
>
> Key: PIG-4696
> URL: https://issues.apache.org/jira/browse/PIG-4696
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4696.1.patch
>
>
> To reproduce, please run the following query-
> {code}
> b = FOREACH a GENERATE (map[])udfs.empty_dict();
> DUMP b;
> {code}
> where empty_dict() is a Python udf-
> {code}
> @outputSchema("map_out: []")
> def empty_dict():
>     return {}
> {code}
> This returns {{([])}} in jython while {{(\[null#\])}} in streaming_python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4696) Empty map returned by a streaming_python udf wrongly contains a null key

2015-10-07 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4696:
---
Status: Patch Available  (was: Open)

> Empty map returned by a streaming_python udf wrongly contains a null key
> 
>
> Key: PIG-4696
> URL: https://issues.apache.org/jira/browse/PIG-4696
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4696.1.patch
>
>
> To reproduce, please run the following query-
> {code}
> b = FOREACH a GENERATE (map[])udfs.empty_dict();
> DUMP b;
> {code}
> where empty_dict() is a Python udf-
> {code}
> @outputSchema("map_out: []")
> def empty_dict():
>     return {}
> {code}
> This returns {{([])}} in jython while {{(\[null#\])}} in streaming_python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4696) Empty map returned by a streaming_python udf wrongly contains a null key

2015-10-07 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4696:
--

 Summary: Empty map returned by a streaming_python udf wrongly 
contains a null key
 Key: PIG-4696
 URL: https://issues.apache.org/jira/browse/PIG-4696
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.15.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor


To reproduce, please run the following query-
{code}
b = FOREACH a GENERATE (map[])udfs.empty_dict();
DUMP b;
{code}
where empty_dict() is a Python udf-
{code}
@outputSchema("map_out: []")
def empty_dict():
    return {}
{code}
This returns {{([])}} in jython while {{(\[null#\])}} in streaming_python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4066) An optimization for ROLLUP operation in Pig

2015-05-21 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555615#comment-14555615
 ] 

Cheolsoo Park commented on PIG-4066:


[~daijy], sorry for the trouble and thanks for the clean-up.

> An optimization for ROLLUP operation in Pig
> ---
>
> Key: PIG-4066
> URL: https://issues.apache.org/jira/browse/PIG-4066
> Project: Pig
>  Issue Type: Improvement
>Reporter: Quang-Nhat HOANG-XUAN
>Assignee: Quang-Nhat HOANG-XUAN
>  Labels: hybrid-irg, optimization, rollup
> Fix For: 0.15.0
>
> Attachments: Current Rollup vs Our Rollup.jpg, PIG-4066-revert.patch, 
> PIG-4066.2.patch, PIG-4066.3.patch, PIG-4066.4.patch, PIG-4066.5.patch, 
> PIG-4066.patch, TechnicalNotes.2.pdf, TechnicalNotes.pdf, UserGuide.pdf
>
>
> This patch aims at addressing the current limitation of the ROLLUP operator 
> in PIG: most of the work is done in the Map phase of the underlying MapReduce 
> job to generate all possible intermediate keys that the reducer use to 
> aggregate and produce the ROLLUP output. Based on our previous work: 
> “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of 
> MapReduce ROLLUP aggregates” 
> (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we 
> show that the design space for a ROLLUP implementation allows for a different 
> approach (in-reducer grouping, IRG), in which less work is done in the Map 
> phase and the grouping is done in the Reduce phase. This patch presents the 
> most efficient implementation we designed (Hybrid IRG), which allows defining 
> a parameter to balance between parallelism (in the reducers) and 
> communication cost.
> This patch contains the following features:
> 1. The new ROLLUP approach: IRG, Hybrid IRG.
> 2. The PIVOT clause in CUBE operators.
> 3. Test cases.
> The new syntax to use our ROLLUP approach:
> alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { 
> CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}...]
> If there are multiple ROLLUP operators in one CUBE clause, the last ROLLUP 
> operator will be executed with our approach (IRG, Hybrid IRG) while the 
> remaining ROLLUP operators ahead of it will be executed with the default approach.
> We have already run some experiments comparing our ROLLUP 
> implementation with the current ROLLUP. More information can be found here: 
> http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
> The patch can be reviewed here: https://reviews.apache.org/r/23804/
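> A concrete instance of the syntax above (field names and the pivot value are 
> illustrative; the {{cube}} bag name follows Pig's existing CUBE/ROLLUP operator):
> {code}
> sales  = LOAD 'sales' AS (region:chararray, state:chararray, city:chararray, amount:long);
> rolled = CUBE sales BY ROLLUP(region, state, city) PIVOT 1;
> totals = FOREACH rolled GENERATE FLATTEN(group), SUM(cube.amount) AS total;
> {code}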



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3294) Allow Pig use Hive UDFs

2015-05-18 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548184#comment-14548184
 ] 

Cheolsoo Park commented on PIG-3294:


[~daijy], thank you for the great work! I am interested in deploying this 
feature in production. What I don't fully understand is its dependency on Hive. 
So my questions are-
# What Hive jars do I need in the classpath to use Hive UDFs in Pig (if any)?
# What does HIVE-9767 do? Do I need to backport it to my Hive release? (Looks 
like HIVE-9766 is included in the patch.)

> Allow Pig use Hive UDFs
> ---
>
> Key: PIG-3294
> URL: https://issues.apache.org/jira/browse/PIG-3294
> Project: Pig
>  Issue Type: New Feature
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>  Labels: gsoc2013, java
> Fix For: 0.15.0
>
> Attachments: PIG-3294-1.patch, PIG-3294-2.patch, PIG-3294-3.patch, 
> PIG-3294-4.patch, PIG-3294-5.patch, PIG-3294-before-refactory.patch
>
>
> It would be nice if Pig provided some interoperability with Hive. We can wrap 
> Hive UDFs in Pig so that Hive UDFs can be used in Pig.
> This is a candidate project for Google Summer of Code 2013. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013
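> A rough sketch of how a wrapped Hive UDF is invoked from Pig (the {{HiveUDF}} builtin 
> name and the example function are my assumptions based on the attached patches; see the 
> patch for the authoritative API):
> {code}
> DEFINE hive_upper HiveUDF('upper');
> a = LOAD 'input' AS (s:chararray);
> b = FOREACH a GENERATE hive_upper(s);
> {code}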



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4511) Add columns to prune from PluckTuple

2015-04-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park resolved PIG-4511.

Resolution: Fixed

Committed to branch-0.15 and trunk. Thank you [~jbabcock]!

> Add columns to prune from PluckTuple
> 
>
> Key: PIG-4511
> URL: https://issues.apache.org/jira/browse/PIG-4511
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Joseph Babcock
>Assignee: Joseph Babcock
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: pluckTuple.patch
>
>
> Currently pluckTuple returns all columns in a relation that match a prefix 
> predicate. This patch allows a boolean flag 'false' to specify that columns 
> NOT matching this prefix should be retained.
> Example:
> 
> a = load 'a' as (x:int,y:chararray,z:long);
> b = load 'b' as (x:int,y:chararray,z:long);
> c = join a by x, b by x;
> Define pluck PluckTuple('a::','false');
> -- returns b
> 
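> A slightly fuller usage sketch (the FLATTEN step is implied by the example above; 
> the 'false' flag is the new argument this patch adds):
> {code}
> a = load 'a' as (x:int,y:chararray,z:long);
> b = load 'b' as (x:int,y:chararray,z:long);
> c = join a by x, b by x;
> define pluck PluckTuple('a::','false');
> d = foreach c generate FLATTEN(pluck(*));
> -- d keeps only the columns that do NOT start with the 'a::' prefix, i.e. the columns from b
> {code}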



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4511) Add columns to prune from PluckTuple

2015-04-27 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514310#comment-14514310
 ] 

Cheolsoo Park commented on PIG-4511:


+1. I'll commit the patch soon.

> Add columns to prune from PluckTuple
> 
>
> Key: PIG-4511
> URL: https://issues.apache.org/jira/browse/PIG-4511
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Joseph Babcock
>Assignee: Joseph Babcock
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: pluckTuple.patch
>
>
> Currently pluckTuple returns all columns in a relation that match a prefix 
> predicate. This patch allows a boolean flag 'false' to specify that columns 
> NOT matching this prefix should be retained.
> Example:
> 
> a = load 'a' as (x:int,y:chararray,z:long);
> b = load 'b' as (x:int,y:chararray,z:long);
> c = join a by x, b by x;
> Define pluck PluckTuple('a::','false');
> -- returns b
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4511) Add columns to prune from PluckTuple

2015-04-24 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511777#comment-14511777
 ] 

Cheolsoo Park commented on PIG-4511:


[~daijy], actually I think we shouldn't commit this patch as is.

Three problems:

1) It contains unwanted changes such as-
{code}
+ * 
+ * Additional arguments to this udf are columns to exclude from the relation 
matching this prefix (assuming this column is the end of the alias: e.g., if 
choose to exclude column y then exclude a::b::y using PluckTuple('a::','y'))
  *
  * Example:
- *
- * 1) Prefix
  * a = load 'a' as (x, y);
  * b = load 'b' as (x, y);
  * c = join a by x, b by x;
@@ -47,29 +47,28 @@ import com.google.common.collect.Lists;
  * c: {a::x: bytearray,a::y: bytearray,b::x: bytearray,b::y: bytearray}
  * describe d;
  * d: {plucked::a::x: bytearray,plucked::a::y: bytearray}
- *
- * 2) Regex
- * a = load 'a' as (x, y);
- * b = load 'b' as (x, y);
- * c = join a by x, b by x;
- * DEFINE pluck PluckTuple('.*::y');
- * d = foreach c generate FLATTEN(pluck(*));
- * describe c;
- * c: {a::x: bytearray,a::y: bytearray,b::x: bytearray,b::y: bytearray}
- * describe d;
- * d: {plucked::a::y: bytearray,plucked::a::y: bytearray}
  */
 public class PluckTuple extends EvalFunc {
 private static final TupleFactory mTupleFactory = 
TupleFactory.getInstance();
-private static Pattern pattern;
+private static Pattern prefixPattern;
{code}
What happened is that he generated his patch based on my internal release 
branch where I committed {{PIG-4401-2.patch}} while {{PIG-4401-3.patch}} was 
committed to Apache trunk.

2) His patch is missing the update to docs.

3) Won't the following change have an impact on Tez local mode?
{code}
-pigServer = new PigServer(Util.getLocalTestMode());
+pigServer = new PigServer(ExecType.LOCAL);
{code}

Actually, I was communicating with Joseph (we work for the same employer) to 
update his patch. If [~jbabcock] is busy, I can put up a new patch that 
addresses these issues.

> Add columns to prune from PluckTuple
> 
>
> Key: PIG-4511
> URL: https://issues.apache.org/jira/browse/PIG-4511
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Joseph Babcock
>Assignee: Joseph Babcock
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: pluckTuple.patch
>
>
> Currently pluckTuple returns all columns in a relation that match a prefix 
> predicate. This patch allows a variable argument list of column names 
> following the predicate to remove from the alias. 
> Example:
> a = load 'a' as (x:int,y:chararray,z:long);
> b = load 'b' as (x:int,y:chararray,z:long);
> c = join a by x, b by x;
> Define pluck PluckTuple('a::','x','z');
> -- returns y from a only



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4511) Add columns to prune from PluckTuple

2015-04-17 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4511:
---
Affects Version/s: 0.14.0
Fix Version/s: (was: 0.14.0)
   0.15.0

> Add columns to prune from PluckTuple
> 
>
> Key: PIG-4511
> URL: https://issues.apache.org/jira/browse/PIG-4511
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.14.0
>Reporter: Joseph Babcock
>Assignee: Joseph Babcock
>Priority: Minor
> Fix For: 0.15.0
>
>
> Currently pluckTuple returns all columns in a relation that match a prefix 
> predicate. This patch allows a variable argument list of column names 
> following the predicate to remove from the alias. 
> Example:
> a = load 'a' as (x:int,y:chararray,z:long);
> b = load 'b' as (x:int,y:chararray,z:long);
> c = join a by x, b by x;
> Define pluck PluckTuple('a::','x','z');
> -- returns y from a only



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PIG-4511) Add columns to prune from PluckTuple

2015-04-17 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reassigned PIG-4511:
--

Assignee: Joseph Babcock

Assigning to Joseph.

> Add columns to prune from PluckTuple
> 
>
> Key: PIG-4511
> URL: https://issues.apache.org/jira/browse/PIG-4511
> Project: Pig
>  Issue Type: Improvement
>Reporter: Joseph Babcock
>Assignee: Joseph Babcock
>Priority: Minor
> Fix For: 0.14.0
>
>
> Currently pluckTuple returns all columns in a relation that match a prefix 
> predicate. This patch allows a variable argument list of column names 
> following the predicate to remove from the alias. 
> Example:
> a = load 'a' as (x:int,y:chararray,z:long);
> b = load 'b' as (x:int,y:chararray,z:long);
> c = join a by x, b by x;
> Define pluck PluckTuple('a::','x','z');
> -- returns y from a only



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4409) fs.defaultFS is overwritten in JobConf by replicated join at runtime

2015-02-04 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4409:
---
   Resolution: Fixed
Fix Version/s: 0.14.1
   Status: Resolved  (was: Patch Available)

Thank you Daniel for the quick review. Committed to 0.14 and trunk.

> fs.defaultFS is overwritten in JobConf by replicated join at runtime
> 
>
> Key: PIG-4409
> URL: https://issues.apache.org/jira/browse/PIG-4409
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.14.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 0.14.1, 0.15.0
>
> Attachments: PIG-4409-1.patch
>
>
> This is a regression of PIG-4257.
> Pig accidentally overwrites {{fs.defaultFS}} in JobConf during the replicated 
> join at runtime. This can cause various side effects because udfs and 
> store/load funcs might depend on the value of {{fs.defaultFS}} at runtime.
> Here is an example. I have a store func that does a 2-phase commit to S3. Each 
> reducer writes its output to local disk first and copies it to the final 
> destination on S3 during the task commit phase. Once it's done copying, the 
> reducer writes a commit log to an HDFS location. During the job commit phase, 
> the AM reads all the commit logs and updates the Hive metastore accordingly.
> This store func stopped working in 0.14 when there is a replicated join in the 
> reduce phase. It is because {{fs.defaultFS}} is overwritten from HDFS to the 
> local FS by the replicated join at runtime.
> The root cause is that PIG-4257 changed 
> {{ConfigurationUtil.getLocalFSProperties()}} to return a reference to JobConf 
> instead of a copy object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4409) fs.defaultFS is overwritten in JobConf by replicated join at runtime

2015-02-04 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4409:
---
Attachment: PIG-4409-1.patch

Uploading a patch that fixes the issue.

> fs.defaultFS is overwritten in JobConf by replicated join at runtime
> 
>
> Key: PIG-4409
> URL: https://issues.apache.org/jira/browse/PIG-4409
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.14.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 0.15.0
>
> Attachments: PIG-4409-1.patch
>
>
> This is a regression of PIG-4257.
> Pig accidentally overwrites {{fs.defaultFS}} in JobConf during the replicated 
> join at runtime. This can cause various side effects because udfs and 
> store/load funcs might depend on the value of {{fs.defaultFS}} at runtime.
> Here is an example. I have a store func that does a 2-phase commit to S3. Each 
> reducer writes its output to local disk first and copies it to the final 
> destination on S3 during the task commit phase. Once it's done copying, the 
> reducer writes a commit log to an HDFS location. During the job commit phase, 
> the AM reads all the commit logs and updates the Hive metastore accordingly.
> This store func stopped working in 0.14 when there is a replicated join in the 
> reduce phase. It is because {{fs.defaultFS}} is overwritten from HDFS to the 
> local FS by the replicated join at runtime.
> The root cause is that PIG-4257 changed 
> {{ConfigurationUtil.getLocalFSProperties()}} to return a reference to JobConf 
> instead of a copy object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4409) fs.defaultFS is overwritten in JobConf by replicated join at runtime

2015-02-04 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4409:
---
Status: Patch Available  (was: Open)

> fs.defaultFS is overwritten in JobConf by replicated join at runtime
> 
>
> Key: PIG-4409
> URL: https://issues.apache.org/jira/browse/PIG-4409
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.14.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 0.15.0
>
> Attachments: PIG-4409-1.patch
>
>
> This is a regression of PIG-4257.
> Pig accidentally overwrites {{fs.defaultFS}} in JobConf during the replicated 
> join at runtime. This can cause various side effects because udfs and 
> store/load funcs might depend on the value of {{fs.defaultFS}} at runtime.
> Here is an example. I have a store func that does a 2-phase commit to S3. Each 
> reducer writes its output to local disk first and copies it to the final 
> destination on S3 during the task commit phase. Once it's done copying, the 
> reducer writes a commit log to an HDFS location. During the job commit phase, 
> the AM reads all the commit logs and updates the Hive metastore accordingly.
> This store func stopped working in 0.14 when there is a replicated join in the 
> reduce phase. It is because {{fs.defaultFS}} is overwritten from HDFS to the 
> local FS by the replicated join at runtime.
> The root cause is that PIG-4257 changed 
> {{ConfigurationUtil.getLocalFSProperties()}} to return a reference to JobConf 
> instead of a copy object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4409) fs.defaultFS is overwritten in JobConf by replicated join at runtime

2015-02-04 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4409:
--

 Summary: fs.defaultFS is overwritten in JobConf by replicated join 
at runtime
 Key: PIG-4409
 URL: https://issues.apache.org/jira/browse/PIG-4409
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.14.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Critical
 Fix For: 0.15.0


This is a regression of PIG-4257.

Pig accidentally overwrites {{fs.defaultFS}} in JobConf during the replicated 
join at runtime. This can cause various side effects because udfs and 
store/load funcs might depend on the value of {{fs.defaultFS}} at runtime.

Here is an example. I have a store func that does a 2-phase commit to S3. Each 
reducer writes its output to local disk first and copies it to the final 
destination on S3 during the task commit phase. Once it's done copying, the 
reducer writes a commit log to an HDFS location. During the job commit phase, 
the AM reads all the commit logs and updates the Hive metastore accordingly.

This store func stopped working in 0.14 when there is a replicated join in the 
reduce phase. It is because {{fs.defaultFS}} is overwritten from HDFS to the 
local FS by the replicated join at runtime.

The root cause is that PIG-4257 changed 
{{ConfigurationUtil.getLocalFSProperties()}} to return a reference to JobConf 
instead of a copy object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks Daniel for the review!

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: PIG-4401-1.patch, PIG-4401-2.patch, PIG-4401-3.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.
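> A minimal sketch of the regex matching this adds (mirroring the doc example included 
> in the patch; the pattern is illustrative):
> {code}
> a = load 'a' as (x, y);
> b = load 'b' as (x, y);
> c = join a by x, b by x;
> DEFINE pluck PluckTuple('.*::y');
> d = foreach c generate FLATTEN(pluck(*));
> -- d keeps only the fields whose full alias matches the pattern, e.g. a::y and b::y
> {code}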



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4402) JavaScript UDF example in the doc is broken

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4402:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Daniel for the review.

> JavaScript UDF example in the doc is broken
> ---
>
> Key: PIG-4402
> URL: https://issues.apache.org/jira/browse/PIG-4402
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: PIG-4402-1.patch
>
>
> The following example in the [JS udf 
> doc|http://pig.apache.org/docs/r0.14.0/udf.html#js-udfs] throws an error, 
> which is embarrassing-
> {code}
> complex.outputSchema = "word:chararray,num:long";
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Affects Version/s: (was: 0.15.0)
Fix Version/s: 0.15.0

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: PIG-4401-1.patch, PIG-4401-2.patch, PIG-4401-3.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Attachment: PIG-4401-3.patch

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4401-1.patch, PIG-4401-2.patch, PIG-4401-3.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Attachment: (was: PIG-4401-3.patch)

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4401-1.patch, PIG-4401-2.patch, PIG-4401-3.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Attachment: PIG-4401-3.patch

Sure. I incorporated your comments in the new patch. Thanks!

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4401-1.patch, PIG-4401-2.patch, PIG-4401-3.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4402) JavaScript UDF example in the doc is broken

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4402:
---
Attachment: PIG-4402-1.patch

> JavaScript UDF example in the doc is broken
> ---
>
> Key: PIG-4402
> URL: https://issues.apache.org/jira/browse/PIG-4402
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: PIG-4402-1.patch
>
>
> The following example in the [JS udf 
> doc|http://pig.apache.org/docs/r0.14.0/udf.html#js-udfs] throws an error, 
> which is embarrassing-
> {code}
> complex.outputSchema = "word:chararray,num:long";
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4402) JavaScript UDF example in the doc is broken

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4402:
---
Status: Patch Available  (was: Open)

> JavaScript UDF example in the doc is broken
> ---
>
> Key: PIG-4402
> URL: https://issues.apache.org/jira/browse/PIG-4402
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.15.0
>
> Attachments: PIG-4402-1.patch
>
>
> The following example in the [JS udf 
> doc|http://pig.apache.org/docs/r0.14.0/udf.html#js-udfs] throws an error, 
> which is embarrassing-
> {code}
> complex.outputSchema = "word:chararray,num:long";
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4402) JavaScript UDF example in the doc is broken

2015-01-29 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4402:
--

 Summary: JavaScript UDF example in the doc is broken
 Key: PIG-4402
 URL: https://issues.apache.org/jira/browse/PIG-4402
 Project: Pig
  Issue Type: Bug
  Components: documentation
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
 Fix For: 0.15.0


The following example in the [JS udf 
doc|http://pig.apache.org/docs/r0.14.0/udf.html#js-udfs] throws an error, which 
is embarrassing-
{code}
complex.outputSchema = "word:chararray,num:long";
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Status: Patch Available  (was: Open)

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4401-1.patch, PIG-4401-2.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Attachment: PIG-4401-2.patch

Updated the doc in a new patch.

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4401-1.patch, PIG-4401-2.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4401:
---
Attachment: PIG-4401-1.patch

Attaching a patch. It also updates TestPluckTuple with a test case.

> Add pattern matching to PluckTuple
> --
>
> Key: PIG-4401
> URL: https://issues.apache.org/jira/browse/PIG-4401
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Affects Versions: 0.15.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Attachments: PIG-4401-1.patch
>
>
> PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
> Currently, the udf filters out fields only with exact match, but it would be 
> useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4401) Add pattern matching to PluckTuple

2015-01-29 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4401:
--

 Summary: Add pattern matching to PluckTuple
 Key: PIG-4401
 URL: https://issues.apache.org/jira/browse/PIG-4401
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs
Affects Versions: 0.15.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor


PluckTuple is useful when cleaning up long prefixes in a lengthy Pig script. 
Currently, the udf filters out fields only with exact match, but it would be 
useful if it could filter based on regex/wildcard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4375) ObjectCache should use ProcessorContext.getObjectRegistry()

2015-01-22 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287943#comment-14287943
 ] 

Cheolsoo Park commented on PIG-4375:


[~rohini], I haven't tested join + group by. But I tested another job that was 
failing with OOM previously, and it passes now. So +1.

> ObjectCache should use ProcessorContext.getObjectRegistry()
> ---
>
> Key: PIG-4375
> URL: https://issues.apache.org/jira/browse/PIG-4375
> Project: Pig
>  Issue Type: Sub-task
>Affects Versions: 0.14.0
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.14.1, 0.15.0
>
> Attachments: PIG-4375-1.patch
>
>
>   We have been instantiating our own ObjectRegistryImpl, which is wrong and will 
> leak memory. It was originally a copy of Hive code, and I think a lot 
> has changed in Tez since then that we have missed. We also need to expose 
> the ObjectRegistry to users so that they can cache their own stuff for 
> IndexedLoadFunc implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4066) An optimization for ROLLUP operation in Pig

2014-12-12 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4066:
---
   Resolution: Fixed
Fix Version/s: 0.15.0
   Status: Resolved  (was: Patch Available)

Committed to trunk.

> An optimization for ROLLUP operation in Pig
> ---
>
> Key: PIG-4066
> URL: https://issues.apache.org/jira/browse/PIG-4066
> Project: Pig
>  Issue Type: Improvement
>Reporter: Quang-Nhat HOANG-XUAN
>Assignee: Quang-Nhat HOANG-XUAN
>  Labels: hybrid-irg, optimization, rollup
> Fix For: 0.15.0
>
> Attachments: Current Rollup vs Our Rollup.jpg, PIG-4066.2.patch, 
> PIG-4066.3.patch, PIG-4066.4.patch, PIG-4066.5.patch, PIG-4066.patch, 
> TechnicalNotes.2.pdf, TechnicalNotes.pdf, UserGuide.pdf
>
>
> This patch aims at addressing the current limitation of the ROLLUP operator 
> in PIG: most of the work is done in the Map phase of the underlying MapReduce 
> job to generate all possible intermediate keys that the reducers use to 
> aggregate and produce the ROLLUP output. Based on our previous work: 
> “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of 
> MapReduce ROLLUP aggregates” 
> (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we 
> show that the design space for a ROLLUP implementation allows for a different 
> approach (in-reducer grouping, IRG), in which less work is done in the Map 
> phase and the grouping is done in the Reduce phase. This patch presents the 
> most efficient implementation we designed (Hybrid IRG), which allows defining 
> a parameter to balance between parallelism (in the reducers) and 
> communication cost.
> This patch contains the following features:
> 1. The new ROLLUP approach: IRG, Hybrid IRG.
> 2. The PIVOT clause in CUBE operators.
> 3. Test cases.
> The new syntax to use our ROLLUP approach:
> alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { 
> CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}...]
> If there are multiple ROLLUP operators in one CUBE clause, the last ROLLUP 
> operator will be executed with our approach (IRG, Hybrid IRG) while the 
> remaining ROLLUP operators ahead of it will be executed with the default approach.
> We have already run some experiments comparing our ROLLUP 
> implementation with the current ROLLUP. More information can be found here: 
> http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
> The patch can be reviewed here: https://reviews.apache.org/r/23804/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4066) An optimization for ROLLUP operation in Pig

2014-12-12 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244477#comment-14244477
 ] 

Cheolsoo Park commented on PIG-4066:


+1.

I will commit this patch today. This optimization is disabled by default and 
only applicable to MR, so it shouldn't break anything. Nevertheless, I ran full 
unit tests and e2e tests, and both were clean.

[~hxquangnhat], we should document this. Do you mind opening another jira to 
add documentation? I think 
[optimization-rules|http://pig.apache.org/docs/r0.13.0/perf.html#optimization-rules]
 is the best place to put it.

> An optimization for ROLLUP operation in Pig
> ---
>
> Key: PIG-4066
> URL: https://issues.apache.org/jira/browse/PIG-4066
> Project: Pig
>  Issue Type: Improvement
>Reporter: Quang-Nhat HOANG-XUAN
>Assignee: Quang-Nhat HOANG-XUAN
>  Labels: hybrid-irg, optimization, rollup
> Attachments: Current Rollup vs Our Rollup.jpg, PIG-4066.2.patch, 
> PIG-4066.3.patch, PIG-4066.4.patch, PIG-4066.5.patch, PIG-4066.patch, 
> TechnicalNotes.2.pdf, TechnicalNotes.pdf, UserGuide.pdf
>
>
> This patch aims at addressing the current limitation of the ROLLUP operator 
> in PIG: most of the work is done in the Map phase of the underlying MapReduce 
> job to generate all possible intermediate keys that the reducers use to 
> aggregate and produce the ROLLUP output. Based on our previous work: 
> “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of 
> MapReduce ROLLUP aggregates” 
> (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we 
> show that the design space for a ROLLUP implementation allows for a different 
> approach (in-reducer grouping, IRG), in which less work is done in the Map 
> phase and the grouping is done in the Reduce phase. This patch presents the 
> most efficient implementation we designed (Hybrid IRG), which allows defining 
> a parameter to balance between parallelism (in the reducers) and 
> communication cost.
> This patch contains the following features:
> 1. The new ROLLUP approach: IRG, Hybrid IRG.
> 2. The PIVOT clause in CUBE operators.
> 3. Test cases.
> The new syntax to use our ROLLUP approach:
> alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { 
> CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}...]
> If there are multiple ROLLUP operators in one CUBE clause, the last ROLLUP 
> operator will be executed with our approach (IRG, Hybrid IRG) while the 
> remaining ROLLUP operators ahead of it will be executed with the default approach.
> We have already run some experiments comparing our ROLLUP 
> implementation with the current ROLLUP. More information can be found here: 
> http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
> The patch can be reviewed here: https://reviews.apache.org/r/23804/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4351) TestPigRunner.simpleTest2 fail on trunk

2014-12-09 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239788#comment-14239788
 ] 

Cheolsoo Park commented on PIG-4351:


+1

> TestPigRunner.simpleTest2 fail on trunk
> ---
>
> Key: PIG-4351
> URL: https://issues.apache.org/jira/browse/PIG-4351
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.15.0
>
> Attachments: PIG-4351-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4343) Tez auto parallelism fails at query compile time

2014-11-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4343:
---
Issue Type: Sub-task  (was: Bug)
Parent: PIG-3446

> Tez auto parallelism fails at query compile time
> 
>
> Key: PIG-4343
> URL: https://issues.apache.org/jira/browse/PIG-4343
> Project: Pig
>  Issue Type: Sub-task
>Affects Versions: 0.14.0
>Reporter: Cheolsoo Park
>
> I was running some legacy MR jobs in Tez mode to do perf benchmarks. But when 
> {{pig.tez.auto.parallelism}} is enabled (by default), Pig fails with the 
> following error-
> {code}
> org.apache.pig.impl.plan.VisitorException: ERROR 0: java.io.IOException: 
> Cannot estimate parallelism for scope-892, effective parallelism for 
> predecessor scope-892 is -1
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter.visitTezOp(ParallelismSetter.java:189)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:232)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:49)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:70)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.processLoadAndParallelism(TezLauncher.java:429)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:143)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at org.apache.pig.LipstickPigServer.launchPlan(LipstickPigServer.java:151)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at com.netflix.lipstick.Main.run(Main.java:496)
> at com.netflix.lipstick.Main.main(Main.java:171)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: java.io.IOException: Cannot estimate parallelism for scope-892, 
> effective parallelism for predecessor scope-892 is -1
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.TezOperDependencyParallelismEstimator.estimateParallelism(TezOperDependencyParallelismEstimator.java:116)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter.visitTezOp(ParallelismSetter.java:134)
> ... 24 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4343) Tez auto parallelism fails at query compile time

2014-11-24 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223693#comment-14223693
 ] 

Cheolsoo Park commented on PIG-4343:


My workaround is to disable {{pig.tez.auto.parallelism}}, but I think it would 
be nicer if it were automatically disabled, instead of failing, when parallelism 
cannot be estimated.
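For reference, the workaround amounts to a one-liner at the top of the script:
{code}
set pig.tez.auto.parallelism false;
{code}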

> Tez auto parallelism fails at query compile time
> 
>
> Key: PIG-4343
> URL: https://issues.apache.org/jira/browse/PIG-4343
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Cheolsoo Park
>
> I was running some legacy MR jobs in Tez mode to do perf benchmarks. But when 
> {{pig.tez.auto.parallelism}} is enabled (by default), Pig fails with the 
> following error-
> {code}
> org.apache.pig.impl.plan.VisitorException: ERROR 0: java.io.IOException: 
> Cannot estimate parallelism for scope-892, effective parallelism for 
> predecessor scope-892 is -1
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter.visitTezOp(ParallelismSetter.java:189)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:232)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:49)
> at 
> org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:70)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.processLoadAndParallelism(TezLauncher.java:429)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:143)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at org.apache.pig.LipstickPigServer.launchPlan(LipstickPigServer.java:151)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at com.netflix.lipstick.Main.run(Main.java:496)
> at com.netflix.lipstick.Main.main(Main.java:171)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: java.io.IOException: Cannot estimate parallelism for scope-892, 
> effective parallelism for predecessor scope-892 is -1
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.TezOperDependencyParallelismEstimator.estimateParallelism(TezOperDependencyParallelismEstimator.java:116)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter.visitTezOp(ParallelismSetter.java:134)
> ... 24 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4343) Tez auto parallelism fails at query compile time

2014-11-24 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4343:
--

 Summary: Tez auto parallelism fails at query compile time
 Key: PIG-4343
 URL: https://issues.apache.org/jira/browse/PIG-4343
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Cheolsoo Park


I was running some legacy MR jobs in Tez mode to do perf benchmarks. But when 
{{pig.tez.auto.parallelism}} is enabled (by default), Pig fails with the 
following error-
{code}
org.apache.pig.impl.plan.VisitorException: ERROR 0: java.io.IOException: Cannot 
estimate parallelism for scope-892, effective parallelism for predecessor 
scope-892 is -1
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter.visitTezOp(ParallelismSetter.java:189)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:232)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:49)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:70)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.processLoadAndParallelism(TezLauncher.java:429)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:143)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
at org.apache.pig.LipstickPigServer.launchPlan(LipstickPigServer.java:151)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
at org.apache.pig.PigServer.execute(PigServer.java:1364)
at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at com.netflix.lipstick.Main.run(Main.java:496)
at com.netflix.lipstick.Main.main(Main.java:171)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.io.IOException: Cannot estimate parallelism for scope-892, 
effective parallelism for predecessor scope-892 is -1
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.TezOperDependencyParallelismEstimator.estimateParallelism(TezOperDependencyParallelismEstimator.java:116)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter.visitTezOp(ParallelismSetter.java:134)
... 24 more
{code}
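
To make the failure mode concrete, here is a toy model of dependency-based parallelism 
estimation. It is an illustrative assumption only, not Pig's actual 
{{TezOperDependencyParallelismEstimator}}: when the estimate is derived from predecessor 
parallelism, a predecessor that reports -1 (unknown) makes the estimate impossible, which 
is what the error above reports.
{code:title=ParallelismEstimateDemo.java}
import java.util.List;

public class ParallelismEstimateDemo {
    // Toy estimator: derive parallelism from the predecessors' parallelism.
    static int estimate(List<Integer> predecessorParallelism) {
        int total = 0;
        for (int p : predecessorParallelism) {
            if (p < 0) {
                // Mirrors "effective parallelism for predecessor ... is -1"
                throw new IllegalStateException("cannot estimate: predecessor parallelism is " + p);
            }
            total += p;
        }
        return Math.max(1, total);
    }

    public static void main(String[] args) {
        System.out.println(estimate(List.of(4, 8)));   // prints 12
        System.out.println(estimate(List.of(4, -1)));  // throws, as in the stack trace above
    }
}
{code}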



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4329) Fetch optimization should be disabled when limit is not pushed up

2014-11-13 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4329:
---
Resolution: Fixed
  Assignee: Lorand Bendig  (was: Cheolsoo Park)
Status: Resolved  (was: Patch Available)

Committed to trunk.

> Fetch optimization should be disabled when limit is not pushed up
> -
>
> Key: PIG-4329
> URL: https://issues.apache.org/jira/browse/PIG-4329
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Lorand Bendig
> Fix For: 0.15.0
>
> Attachments: PIG-4329-1.patch, PIG-4329-2.patch
>
>
> Although PIG-4135 disables fetch optimization when there is no limit in the 
> plan, that doesn't solve the problem completely. In fact, fetch optimization 
> should still be disabled if the limit is not pushed up. Consider the following 
> query-
> {code}
> random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
> using DseStorage();
> random_lists = filter random_lists by entity_section=='random';
> random_lists = limit random_lists 10;
> dump random_lists;
> {code}
> Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
> scans the full table. In this case, fetch optimization makes the job 
> extremely slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4329) Fetch optimization should be disabled when limit is not pushed up

2014-11-13 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209996#comment-14209996
 ] 

Cheolsoo Park commented on PIG-4329:


+1. I'll commit it shortly. Thank you Lorand!

> Fetch optimization should be disabled when limit is not pushed up
> -
>
> Key: PIG-4329
> URL: https://issues.apache.org/jira/browse/PIG-4329
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4329-1.patch, PIG-4329-2.patch
>
>
> Although PIG-4135 disables fetch optimization when there is no limit in the 
> plan, that doesn't solve the problem completely. In fact, fetch optimization 
> should still be disabled if the limit is not pushed up. Consider the following 
> query-
> {code}
> random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
> using DseStorage();
> random_lists = filter random_lists by entity_section=='random';
> random_lists = limit random_lists 10;
> dump random_lists;
> {code}
> Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
> scans the full table. In this case, fetch optimization makes the job 
> extremely slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4329) Fetch optimization should be disabled when limit is not pushed up

2014-11-12 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4329:
---
Status: Patch Available  (was: Open)

> Fetch optimization should be disabled when limit is not pushed up
> -
>
> Key: PIG-4329
> URL: https://issues.apache.org/jira/browse/PIG-4329
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4329-1.patch
>
>
> Although PIG-4135 disables fetch optimization when there is no limit in the 
> plan, that doesn't solve the problem completely. In fact, fetch optimization 
> should still be disabled if the limit is not pushed up. Consider the following 
> query-
> {code}
> random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
> using DseStorage();
> random_lists = filter random_lists by entity_section=='random';
> random_lists = limit random_lists 10;
> dump random_lists;
> {code}
> Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
> scans the full table. In this case, fetch optimization makes the job 
> extremely slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4329) Fetch optimization should be disabled when limit is not pushed up

2014-11-12 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4329:
---
Attachment: PIG-4329-1.patch

Uploading a patch that disables fetch optimization when the limit is not pushed up.

> Fetch optimization should be disabled when limit is not pushed up
> -
>
> Key: PIG-4329
> URL: https://issues.apache.org/jira/browse/PIG-4329
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4329-1.patch
>
>
> Although PIG-4135 disables fetch optimization when there is no limit in the 
> plan, that doesn't solve the problem completely. In fact, fetch optimization 
> should still be disabled if the limit is not pushed up. Consider the following 
> query-
> {code}
> random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
> using DseStorage();
> random_lists = filter random_lists by entity_section=='random';
> random_lists = limit random_lists 10;
> dump random_lists;
> {code}
> Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
> scans the full table. In this case, fetch optimization makes the job 
> extremely slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4329) Fetch optimization should be disabled when limit is not pushed up

2014-11-12 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4329:
---
Description: 
Although PIG-4135 disables fetch optimization when there is no limit in the 
plan, that doesn't solve the problem completely. In fact, fetch optimization 
should still be disabled if the limit is not pushed up. Consider the following 
query-
{code}
random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
using DseStorage();
random_lists = filter random_lists by entity_section=='random';
random_lists = limit random_lists 10;
dump random_lists;
{code}
Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
scans the full table. In this case, fetch optimization makes the job extremely 
slow.

  was:
Although PIG-4135 disables fetch optimization when there is no limit in the 
plan, that doesn't solve the problem completely. In fact, fetch optimization 
should still be disabled if the limit is not pushed up. Consider the following 
query-
{code}
random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
using DseStorage();
random_lists = filter random_lists by entity_section=='random');
random_lists = limit random_lists 10;
dump random_lists;
{code}
Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
scans the full table. In this case, fetch optimization makes the job extremely 
slow.


> Fetch optimization should be disabled when limit is not pushed up
> -
>
> Key: PIG-4329
> URL: https://issues.apache.org/jira/browse/PIG-4329
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
>
> Although PIG-4135 disables fetch optimization when there is no limit in the 
> plan, that doesn't solve the problem completely. In fact, fetch optimization 
> should still be disabled if the limit is not pushed up. Consider the following 
> query-
> {code}
> random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
> using DseStorage();
> random_lists = filter random_lists by entity_section=='random';
> random_lists = limit random_lists 10;
> dump random_lists;
> {code}
> Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
> scans the full table. In this case, fetch optimization makes the job 
> extremely slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4329) Fetch optimization should be disabled when limit is not pushed up

2014-11-12 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4329:
--

 Summary: Fetch optimization should be disabled when limit is not 
pushed up
 Key: PIG-4329
 URL: https://issues.apache.org/jira/browse/PIG-4329
 Project: Pig
  Issue Type: Bug
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.15.0


Although PIG-4135 disables fetch optimization when there is no limit in the 
plan, that doesn't solve the problem completely. In fact, fetch optimization 
should still be disabled if the limit is not pushed up. Consider the following 
query-
{code}
random_lists = load 'prodhive.schakraborty.search_server_denorm_impressions' 
using DseStorage();
random_lists = filter random_lists by entity_section=='random');
random_lists = limit random_lists 10;
dump random_lists;
{code}
Because the {{filter by}} blocks limit from being pushed up, POLoad actually 
scans the full table. In this case, fetch optimization makes the job extremely 
slow.
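
To see why the limit cannot simply be hoisted above the filter, here is a small, 
self-contained Java-streams illustration (generic code, not Pig's optimizer; the 
predicate and numbers are made up): applying the limit before the filter returns a 
different answer, so the optimizer must keep the limit below the filter and the loader 
ends up scanning everything.
{code:title=LimitPushupDemo.java}
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LimitPushupDemo {
    public static void main(String[] args) {
        List<Integer> rows = IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());

        // filter-then-limit: semantically what the script asks for,
        // but the "scan" may have to read far more than 3 rows
        List<Integer> filterThenLimit = rows.stream()
                .filter(x -> x % 2 == 0)          // stands in for entity_section == 'random'
                .limit(3)
                .collect(Collectors.toList());    // [2, 4, 6]

        // limit-then-filter: cheap, but a different (wrong) answer,
        // which is why the limit cannot be pushed above the filter
        List<Integer> limitThenFilter = rows.stream()
                .limit(3)
                .filter(x -> x % 2 == 0)
                .collect(Collectors.toList());    // [2]

        System.out.println(filterThenLimit + " vs " + limitThenFilter);
    }
}
{code}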



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3346) New property that controls the number of combined splits

2014-11-12 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208363#comment-14208363
 ] 

Cheolsoo Park commented on PIG-3346:


[~rohini], thank you for your suggestion. I just tried to set 
{{mapreduce.input.fileinputformat.split.maxsize}}, but that didn't help with S3 
files. A few mapper tasks still load too many small files. My patch actually 
limits the # of combined splits and reports it as a counter, which is quite 
helpful for me when debugging slow mappers.
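
For illustration, here is a rough sketch of the idea behind the patch: greedy split 
combining that is capped by both total bytes ({{pig.maxCombinedSplitSize}}) and a 
maximum number of splits per combined split (the proposed {{pig.maxCombinedSplitNum}}). 
It is an assumption-level sketch, not Pig's actual split-combination code.
{code:title=SplitCombiner.java}
import java.util.ArrayList;
import java.util.List;

public class SplitCombiner {
    // Greedily pack split sizes into groups, closing a group when either
    // the byte cap or the per-group split count cap would be exceeded.
    static List<List<Long>> combine(List<Long> splitSizes, long maxBytes, int maxNum) {
        List<List<Long>> combined = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : splitSizes) {
            boolean tooBig = currentBytes + size > maxBytes;
            boolean tooMany = current.size() >= maxNum;
            if (!current.isEmpty() && (tooBig || tooMany)) {
                combined.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            combined.add(current);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Six tiny files: a pure size cap would put them all in one split,
        // while a count cap of 2 spreads them over three mappers.
        List<Long> sizes = List.of(10L, 10L, 10L, 10L, 10L, 10L);
        System.out.println(combine(sizes, 100L, 2));
    }
}
{code}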

> New property that controls the number of combined splits
> 
>
> Key: PIG-3346
> URL: https://issues.apache.org/jira/browse/PIG-3346
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-3346-2.patch, PIG-3346-3.patch, PIG-3346.patch
>
>
> Currently, the size of combined splits can be configured by the 
> {{pig.maxCombinedSplitSize}} property.
> Although this works fine most of the time, it can lead to an undesired situation 
> where a single mapper ends up loading a lot of combined splits. This is 
> particularly bad if Pig loads them from S3.
> So it would be useful if the max number of combined splits could be configured 
> via a property, something like {{pig.maxCombinedSplitNum}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3346) New property that controls the number of combined splits

2014-11-11 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3346:
---
Fix Version/s: 0.15.0

> New property that controls the number of combined splits
> 
>
> Key: PIG-3346
> URL: https://issues.apache.org/jira/browse/PIG-3346
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-3346-2.patch, PIG-3346-3.patch, PIG-3346.patch
>
>
> Currently, the size of combined splits can be configured by the 
> {{pig.maxCombinedSplitSize}} property.
> Although this works fine most of the time, it can lead to an undesired situation 
> where a single mapper ends up loading a lot of combined splits. This is 
> particularly bad if Pig loads them from S3.
> So it would be useful if the max number of combined splits could be configured 
> via a property, something like {{pig.maxCombinedSplitNum}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (PIG-3346) New property that controls the number of combined splits

2014-11-11 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reopened PIG-3346:


Reopening this issue as I ran into it again.

I rebased my patch to trunk and added a new counter called 
{{CombinedInputSplits}} that tracks the # of input splits per task. I also 
updated the RB-
https://reviews.apache.org/r/11718/

> New property that controls the number of combined splits
> 
>
> Key: PIG-3346
> URL: https://issues.apache.org/jira/browse/PIG-3346
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Attachments: PIG-3346-2.patch, PIG-3346-3.patch, PIG-3346.patch
>
>
> Currently, the size of combined splits can be configured by the 
> {{pig.maxCombinedSplitSize}} property.
> Although this works fine most of the time, it can lead to an undesired situation 
> where a single mapper ends up loading a lot of combined splits. This is 
> particularly bad if Pig loads them from S3.
> So it would be useful if the max number of combined splits could be configured 
> via a property, something like {{pig.maxCombinedSplitNum}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3346) New property that controls the number of combined splits

2014-11-11 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3346:
---
Attachment: PIG-3346-3.patch

> New property that controls the number of combined splits
> 
>
> Key: PIG-3346
> URL: https://issues.apache.org/jira/browse/PIG-3346
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Attachments: PIG-3346-2.patch, PIG-3346-3.patch, PIG-3346.patch
>
>
> Currently, the size of combined splits can be configured by the 
> {{pig.maxCombinedSplitSize}} property.
> Although this works fine most of the time, it can lead to an undesired situation 
> where a single mapper ends up loading a lot of combined splits. This is 
> particularly bad if Pig loads them from S3.
> So it would be useful if the max number of combined splits could be configured 
> via a property, something like {{pig.maxCombinedSplitNum}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-07 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4298:
---
   Resolution: Fixed
Fix Version/s: (was: 0.15.0)
   0.14.0
   Status: Resolved  (was: Patch Available)

Committed to 0.14 and trunk.

> Descending order-by is broken in some cases when key is bytearrays 
> ---
>
> Key: PIG-4298
> URL: https://issues.apache.org/jira/browse/PIG-4298
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4298-1.patch, PIG-4298-2.patch, repo.tar.gz
>
>
> Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen] )-
> {code}
> REGISTER pigpen.jar;
> load4254 = LOAD 'input.clj'
> USING PigStorage('\n')
> AS (value:chararray);
> DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
> [pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
> [(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
> (pigpen.runtime/map->bind clojure.edn/read-string) 
> (pigpen.runtime/key-selector->bind clojure.core/identity) 
> (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
> :native-key-frozen-val))])');
> generate4263 = FOREACH load4254 GENERATE
> FLATTEN(udf4265(value));
> generate4257 = FOREACH generate4263 GENERATE
> $0 AS key,
> $1 AS value;
> order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by 
> DESC
> dump order4258;
> {code}
> This script returns the same result for both ASC and DESC orders.
> The problem is as follows-
> # {{PigBytesRawComparator}} calls 
> {{BinInterSedesTupleRawComparator.compare()}}.
> # {{BinInterSedesTupleRawComparator}} applies descending order.
> # {{PigBytesRawComparator}} applies descending order again to what 
> {{BinInterSedesTupleRawComparator}} returns.
> Therefore, descending order is never applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-06 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201262#comment-14201262
 ] 

Cheolsoo Park commented on PIG-4298:


Thanks Daniel! The new patch LGTM.

> Descending order-by is broken in some cases when key is bytearrays 
> ---
>
> Key: PIG-4298
> URL: https://issues.apache.org/jira/browse/PIG-4298
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4298-1.patch, PIG-4298-2.patch, repo.tar.gz
>
>
> Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen] )-
> {code}
> REGISTER pigpen.jar;
> load4254 = LOAD 'input.clj'
> USING PigStorage('\n')
> AS (value:chararray);
> DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
> [pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
> [(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
> (pigpen.runtime/map->bind clojure.edn/read-string) 
> (pigpen.runtime/key-selector->bind clojure.core/identity) 
> (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
> :native-key-frozen-val))])');
> generate4263 = FOREACH load4254 GENERATE
> FLATTEN(udf4265(value));
> generate4257 = FOREACH generate4263 GENERATE
> $0 AS key,
> $1 AS value;
> order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by 
> DESC
> dump order4258;
> {code}
> This script returns the same result for both ASC and DESC orders.
> The problem is as follows-
> # {{PigBytesRawComparator}} calls 
> {{BinInterSedesTupleRawComparator.compare()}}.
> # {{BinInterSedesTupleRawComparator}} applies descending order.
> # {{PigBytesRawComparator}} applies descending order again to what 
> {{BinInterSedesTupleRawComparator}} returns.
> Therefore, descending order is never applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-06 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201163#comment-14201163
 ] 

Cheolsoo Park commented on PIG-4298:


[~daijy], I think it goes to a different code path if you make that change. So 
it doesn't hit the bug.

> Descending order-by is broken in some cases when key is bytearrays 
> ---
>
> Key: PIG-4298
> URL: https://issues.apache.org/jira/browse/PIG-4298
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4298-1.patch, repo.tar.gz
>
>
> Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen] )-
> {code}
> REGISTER pigpen.jar;
> load4254 = LOAD 'input.clj'
> USING PigStorage('\n')
> AS (value:chararray);
> DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
> [pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
> [(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
> (pigpen.runtime/map->bind clojure.edn/read-string) 
> (pigpen.runtime/key-selector->bind clojure.core/identity) 
> (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
> :native-key-frozen-val))])');
> generate4263 = FOREACH load4254 GENERATE
> FLATTEN(udf4265(value));
> generate4257 = FOREACH generate4263 GENERATE
> $0 AS key,
> $1 AS value;
> order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by 
> DESC
> dump order4258;
> {code}
> This script returns the same result for both ASC and DESC orders.
> The problem is as follows-
> # {{PigBytesRawComparator}} calls 
> {{BinInterSedesTupleRawComparator.compare()}}.
> # {{BinInterSedesTupleRawComparator}} applies descending order.
> # {{PigBytesRawComparator}} applies descending order again to what 
> {{BinInterSedesTupleRawComparator}} returns.
> Therefore, descending order is never applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-05 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4298:
---
Attachment: repo.tar.gz

Uploading the repro script in case someone wants to reproduce the error.

> Descending order-by is broken in some cases when key is bytearrays 
> ---
>
> Key: PIG-4298
> URL: https://issues.apache.org/jira/browse/PIG-4298
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4298-1.patch, repo.tar.gz
>
>
> Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen] )-
> {code}
> REGISTER pigpen.jar;
> load4254 = LOAD 'input.clj'
> USING PigStorage('\n')
> AS (value:chararray);
> DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
> [pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
> [(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
> (pigpen.runtime/map->bind clojure.edn/read-string) 
> (pigpen.runtime/key-selector->bind clojure.core/identity) 
> (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
> :native-key-frozen-val))])');
> generate4263 = FOREACH load4254 GENERATE
> FLATTEN(udf4265(value));
> generate4257 = FOREACH generate4263 GENERATE
> $0 AS key,
> $1 AS value;
> order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by 
> DESC
> dump order4258;
> {code}
> This script returns the same result for both ASC and DESC orders.
> The problem is as follows-
> # {{PigBytesRawComparator}} calls 
> {{BinInterSedesTupleRawComparator.compare()}}.
> # {{BinInterSedesTupleRawComparator}} applies descending order.
> # {{PigBytesRawComparator}} applies descending order again to what 
> {{BinInterSedesTupleRawComparator}} returns.
> Therefore, descending order is never applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-05 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4298:
---
Description: 
Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen] )-
{code}
REGISTER pigpen.jar;

load4254 = LOAD 'input.clj'
USING PigStorage('\n')
AS (value:chararray);

DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
[pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
[(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
(pigpen.runtime/map->bind clojure.edn/read-string) 
(pigpen.runtime/key-selector->bind clojure.core/identity) 
(pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
:native-key-frozen-val))])');

generate4263 = FOREACH load4254 GENERATE
FLATTEN(udf4265(value));
generate4257 = FOREACH generate4263 GENERATE
$0 AS key,
$1 AS value;

order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by DESC
dump order4258;
{code}
This script returns the same result for both ASC and DESC orders.

The problem is as follows-
# {{PigBytesRawComparator}} calls {{BinInterSedesTupleRawComparator.compare()}}.
# {{BinInterSedesTupleRawComparator}} applies descending order.
# {{PigBytesRawComparator}} applies descending order again to what 
{{BinInterSedesTupleRawComparator}} returns.

Therefore, descending order is never applied.

  was:
Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen])-
{code}
REGISTER pigpen.jar;

load4254 = LOAD 'input.clj'
USING PigStorage('\n')
AS (value:chararray);

DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
[pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
[(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
(pigpen.runtime/map->bind clojure.edn/read-string) 
(pigpen.runtime/key-selector->bind clojure.core/identity) 
(pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
:native-key-frozen-val))])');

generate4263 = FOREACH load4254 GENERATE
FLATTEN(udf4265(value));
generate4257 = FOREACH generate4263 GENERATE
$0 AS key,
$1 AS value;

order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by DESC
dump order4258;
{code}
This script returns the same result for both ASC and DESC orders.

The problem is as follows-
# {{PigBytesRawComparator}} calls {{BinInterSedesTupleRawComparator.compare()}}.
# {{BinInterSedesTupleRawComparator}} applies descending order.
# {{PigBytesRawComparator}} applies descending order again to what 
{{BinInterSedesTupleRawComparator}} returns.

Therefore, descending order is never applied.


> Descending order-by is broken in some cases when key is bytearrays 
> ---
>
> Key: PIG-4298
> URL: https://issues.apache.org/jira/browse/PIG-4298
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4298-1.patch
>
>
> Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen] )-
> {code}
> REGISTER pigpen.jar;
> load4254 = LOAD 'input.clj'
> USING PigStorage('\n')
> AS (value:chararray);
> DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
> [pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
> [(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
> (pigpen.runtime/map->bind clojure.edn/read-string) 
> (pigpen.runtime/key-selector->bind clojure.core/identity) 
> (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
> :native-key-frozen-val))])');
> generate4263 = FOREACH load4254 GENERATE
> FLATTEN(udf4265(value));
> generate4257 = FOREACH generate4263 GENERATE
> $0 AS key,
> $1 AS value;
> order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by 
> DESC
> dump order4258;
> {code}
> This script returns the same result for both ASC and DESC orders.
> The problem is as follows-
> # {{PigBytesRawComparator}} calls 
> {{BinInterSedesTupleRawComparator.compare()}}.
> # {{BinInterSedesTupleRawComparator}} applies descending order.
> # {{PigBytesRawComparator}} applies descending order again to what 
> {{BinInterSedesTupleRawComparator}} returns.
> Therefore, descending order is never applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-05 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4298:
---
Status: Patch Available  (was: Open)

> Descending order-by is broken in some cases when key is bytearrays 
> ---
>
> Key: PIG-4298
> URL: https://issues.apache.org/jira/browse/PIG-4298
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4298-1.patch
>
>
> Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen])-
> {code}
> REGISTER pigpen.jar;
> load4254 = LOAD 'input.clj'
> USING PigStorage('\n')
> AS (value:chararray);
> DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
> [pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
> [(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
> (pigpen.runtime/map->bind clojure.edn/read-string) 
> (pigpen.runtime/key-selector->bind clojure.core/identity) 
> (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
> :native-key-frozen-val))])');
> generate4263 = FOREACH load4254 GENERATE
> FLATTEN(udf4265(value));
> generate4257 = FOREACH generate4263 GENERATE
> $0 AS key,
> $1 AS value;
> order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by 
> DESC
> dump order4258;
> {code}
> This script returns the same result for both ASC and DESC orders.
> The problem is as follows-
> # {{PigBytesRawComparator}} calls 
> {{BinInterSedesTupleRawComparator.compare()}}.
> # {{BinInterSedesTupleRawComparator}} applies descending order.
> # {{PigBytesRawComparator}} applies descending order again to what 
> {{BinInterSedesTupleRawComparator}} returns.
> Therefore, descending order is never applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-05 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4298:
---
Attachment: PIG-4298-1.patch

Uploading a patch.

> Descending order-by is broken in some cases when key is bytearrays 
> ---
>
> Key: PIG-4298
> URL: https://issues.apache.org/jira/browse/PIG-4298
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4298-1.patch
>
>
> Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen])-
> {code}
> REGISTER pigpen.jar;
> load4254 = LOAD 'input.clj'
> USING PigStorage('\n')
> AS (value:chararray);
> DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
> [pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
> [(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
> (pigpen.runtime/map->bind clojure.edn/read-string) 
> (pigpen.runtime/key-selector->bind clojure.core/identity) 
> (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
> :native-key-frozen-val))])');
> generate4263 = FOREACH load4254 GENERATE
> FLATTEN(udf4265(value));
> generate4257 = FOREACH generate4263 GENERATE
> $0 AS key,
> $1 AS value;
> order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by 
> DESC
> dump order4258;
> {code}
> This script returns the same result for both ASC and DESC orders.
> The problem is as follows-
> # {{PigBytesRawComparator}} calls 
> {{BinInterSedesTupleRawComparator.compare()}}.
> # {{BinInterSedesTupleRawComparator}} applies descending order.
> # {{PigBytesRawComparator}} applies descending order again to what 
> {{BinInterSedesTupleRawComparator}} returns.
> Therefore, descending order is never applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4298) Descending order-by is broken in some cases when key is bytearrays

2014-11-05 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4298:
--

 Summary: Descending order-by is broken in some cases when key is 
bytearrays 
 Key: PIG-4298
 URL: https://issues.apache.org/jira/browse/PIG-4298
 Project: Pig
  Issue Type: Bug
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.15.0


Here is a repro script (using [PigPen|https://github.com/Netflix/PigPen])-
{code}
REGISTER pigpen.jar;

load4254 = LOAD 'input.clj'
USING PigStorage('\n')
AS (value:chararray);

DEFINE udf4265 pigpen.PigPenFnDataBag('(clojure.core/require (quote 
[pigpen.runtime]) (quote [clojure.edn]))','(pigpen.runtime/exec 
[(pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) 
(pigpen.runtime/map->bind clojure.edn/read-string) 
(pigpen.runtime/key-selector->bind clojure.core/identity) 
(pigpen.runtime/process->bind (pigpen.runtime/post-process :pig 
:native-key-frozen-val))])');

generate4263 = FOREACH load4254 GENERATE
FLATTEN(udf4265(value));
generate4257 = FOREACH generate4263 GENERATE
$0 AS key,
$1 AS value;

order4258 = ORDER generate4257 BY key DESC; <-- sort order isn't changed by DESC
dump order4258;
{code}
This script returns the same result for both ASC and DESC orders.

The problem is as follows-
# {{PigBytesRawComparator}} calls {{BinInterSedesTupleRawComparator.compare()}}.
# {{BinInterSedesTupleRawComparator}} applies descending order.
# {{PigBytesRawComparator}} applies descending order again to what 
{{BinInterSedesTupleRawComparator}} returns.

Therefore, descending order is never applied.
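
A tiny, self-contained illustration of the mechanism with plain JDK comparators (not 
Pig's actual comparator classes): if both the outer and the inner comparator flip the 
sign for DESC, the flips cancel and the sort stays ascending.
{code:title=DoubleReverseDemo.java}
import java.util.Arrays;
import java.util.Comparator;

public class DoubleReverseDemo {
    public static void main(String[] args) {
        Comparator<Integer> asc = Integer::compare;
        // inner comparator already applies the descending flip
        Comparator<Integer> inner = (a, b) -> -asc.compare(a, b);
        // outer comparator mistakenly applies the descending flip again
        Comparator<Integer> outer = (a, b) -> -inner.compare(a, b);

        Integer[] keys = {3, 1, 2};
        Arrays.sort(keys, outer);
        System.out.println(Arrays.toString(keys)); // [1, 2, 3] -- still ascending
    }
}
{code}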



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4241) Auto local mode mistakenly converts large jobs to local mode when using with Hive tables

2014-10-27 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4241:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thank you Daniel for reviewing. Committed to 0.14 and trunk.

> Auto local mode mistakenly converts large jobs to local mode when using with 
> Hive tables
> 
>
> Key: PIG-4241
> URL: https://issues.apache.org/jira/browse/PIG-4241
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4241-1.patch, PIG-4241-2.patch, PIG-4241-3.patch
>
>
> The current implementation of auto local mode has two severe problems-
> # It assumes file-based inputs, and it always converts jobs with 
> non-file-based inputs into local mode unless the 
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is 
> particularly problematic when using Pig with Hive tables and custom 
> LoadFuncs that do not implement the LoadMetadata interface.
> # It lists all the files to compute the total size. The algorithm is like 
> this. First, compute the total size. Second, compare it against the 
> configured max bytes. This is very time-consuming when a Pig job loads a large 
> number of files. It will list all the files only to compute the total size. 
> Instead, we should stop computing the sum of input sizes as soon as it 
> reaches the max bytes-
> {code:title=JobControlCompiler.java}
> long totalInputFileSize = 
> InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS 
> BAD!
> long inputByteMax = 
> conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
> log.info("Size of input: " + totalInputFileSize +" bytes. Small job 
> threshold: " + inputByteMax );
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
> return false;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4247) S3 properties are not picked up from core-site.xml in local mode

2014-10-26 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4247:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Daniel for reviewing!

> S3 properties are not picked up from core-site.xml in local mode
> 
>
> Key: PIG-4247
> URL: https://issues.apache.org/jira/browse/PIG-4247
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4247-1.patch
>
>
> Even in local mode, {{fs.s3}} properties need to be set if the job accesses 
> S3 files (e.g., registering jars on S3, loading input files on S3, etc.). In 
> particular, {{fs.s3.awsSecretAccessKey}} and {{fs.s3.awsAccessKey}} are 
> usually set in {{core-site.xml}}, but since local mode doesn't load 
> {{core-site.xml}}, these properties have to be set again in pig.properties. 
> This adds operational overhead because whenever AWS keys are rotated, 
> {{pig.properties}} also needs to be updated. So it would be nice if {{fs.s3}} 
> properties could be picked up even in local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4241) Auto local mode mistakenly converts large jobs to local mode when using with Hive tables

2014-10-26 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4241:
---
Attachment: PIG-4241-3.patch

{quote}
Does all non-file-based job convert to auto local mode?
{quote}
Not all, but many. The issue is that not all LoadFuncs implement 
{{getSizeInBytes()}}, and even if they do, this method works on a best-effort 
basis. When it is not implemented or returns null, {{getTotalInputFileSize()}} 
returns 0 for Hive tables because they're misinterpreted as HDFS file paths. 
Auto local mode then thinks the size of the input path is 0 bytes and thus that 
the job is runnable in local mode. As a result, big jobs run in local mode, 
filling up the local disk on gateways (Genie nodes).

I updated the patch, adding comments about the {{max}} parameter.

> Auto local mode mistakenly converts large jobs to local mode when using with 
> Hive tables
> 
>
> Key: PIG-4241
> URL: https://issues.apache.org/jira/browse/PIG-4241
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4241-1.patch, PIG-4241-2.patch, PIG-4241-3.patch
>
>
> The current implementation of auto local mode has two severe problems-
> # It assumes file-based inputs, and it always converts jobs with 
> non-file-based inputs into local mode unless the 
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is 
> particularly problematic when using Pig with Hive tables and custom 
> LoadFuncs that do not implement the LoadMetadata interface.
> # It lists all the files to compute the total size. The algorithm is like 
> this. First, compute the total size. Second, compare it against the 
> configured max bytes. This is very time-consuming when a Pig job loads a large 
> number of files. It will list all the files only to compute the total size. 
> Instead, we should stop computing the sum of input sizes as soon as it 
> reaches the max bytes-
> {code:title=JobControlCompiler.java}
> long totalInputFileSize = 
> InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS 
> BAD!
> long inputByteMax = 
> conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
> log.info("Size of input: " + totalInputFileSize +" bytes. Small job 
> threshold: " + inputByteMax );
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
> return false;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-2836) Namespace in Pig macros collides with Pig scripts

2014-10-26 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184558#comment-14184558
 ] 

Cheolsoo Park commented on PIG-2836:


[~eyal], did you try 0.13? I believe it's fixed by PIG-3581.

> Namespace in Pig macros collides with Pig scripts
> -
>
> Key: PIG-2836
> URL: https://issues.apache.org/jira/browse/PIG-2836
> Project: Pig
>  Issue Type: Bug
>  Components: grunt, parser
>Affects Versions: 0.9.2, 0.10.0, 0.11, 0.10.1
>Reporter: Russell Jurney
>Assignee: Alan Gates
>Priority: Critical
>  Labels: bacon, confit, goto, hash, macros, pig, sad
>
> Relation names in macros collide with relation names in the calling pig 
> script. This is my most common source of errors and it makes writing macros 
> hard. Suggest that the macro processor create a unique namespace for all 
> relations in a macro other than $in and $out. Prepend something to each 
> relation name or somehow create a unique per-macro namespace.
> This may conflict with some uses of macros where relation names are passed 
> through passively, but this is always avoidable by supplying parameters and 
> feels GOTO f*cked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4241) Auto local mode mistakenly converts large jobs to local mode when using with Hive tables

2014-10-23 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4241:
---
Attachment: PIG-4241-2.patch

Fixed {{TestInputSizeReducerEstimator}} in a new patch.

> Auto local mode mistakenly converts large jobs to local mode when using with 
> Hive tables
> 
>
> Key: PIG-4241
> URL: https://issues.apache.org/jira/browse/PIG-4241
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4241-1.patch, PIG-4241-2.patch
>
>
> The current implementation of auto local mode has two severe problems-
> # It assumes file-based inputs, and it always converts jobs with 
> non-file-based inputs into local mode unless the 
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is 
> particularly problematic when using Pig with Hive tables and custom 
> LoadFuncs that do not implement the LoadMetadata interface.
> # It lists all the files to compute the total size. The algorithm is like 
> this. First, compute the total size. Second, compare it against the 
> configured max bytes. This is very time-consuming when a Pig job loads a large 
> number of files. It will list all the files only to compute the total size. 
> Instead, we should stop computing the sum of input sizes as soon as it 
> reaches the max bytes-
> {code:title=JobControlCompiler.java}
> long totalInputFileSize = 
> InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS 
> BAD!
> long inputByteMax = 
> conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
> log.info("Size of input: " + totalInputFileSize +" bytes. Small job 
> threshold: " + inputByteMax );
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
> return false;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4247) S3 properties are not picked up from core-site.xml in local mode

2014-10-23 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4247:
---
Status: Patch Available  (was: Open)

> S3 properties are not picked up from core-site.xml in local mode
> 
>
> Key: PIG-4247
> URL: https://issues.apache.org/jira/browse/PIG-4247
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4247-1.patch
>
>
> Even in local mode, {{fs.s3}} properties need to be set if the job accesses 
> S3 files (e.g., registering jars on S3, loading input files on S3, etc.). In 
> particular, {{fs.s3.awsSecretAccessKey}} and {{fs.s3.awsAccessKey}} are 
> usually set in {{core-site.xml}}, but since local mode doesn't load 
> {{core-site.xml}}, these properties have to be set again in pig.properties. 
> This adds operational overhead because whenever AWS keys are rotated, 
> {{pig.properties}} also needs to be updated. So it would be nice if {{fs.s3}} 
> properties could be picked up even in local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4247) S3 properties are not picked up from core-site.xml in local mode

2014-10-23 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4247:
---
Attachment: PIG-4247-1.patch

Uploading a patch that loads S3-related properties from {{core-site.xml}} 
regardless of whether it's local mode or not.

> S3 properties are not picked up from core-site.xml in local mode
> 
>
> Key: PIG-4247
> URL: https://issues.apache.org/jira/browse/PIG-4247
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4247-1.patch
>
>
> Even in local mode, {{fs.s3}} properties need to be set if the job accesses 
> S3 files (e.g., registering jars on S3, loading input files on S3, etc.). In 
> particular, {{fs.s3.awsSecretAccessKey}} and {{fs.s3.awsAccessKey}} are 
> usually set in {{core-site.xml}}, but since local mode doesn't load 
> {{core-site.xml}}, these properties have to be set again in pig.properties. 
> This adds operational overhead because whenever AWS keys are rotated, 
> {{pig.properties}} also needs to be updated. So it would be nice if {{fs.s3}} 
> properties could be picked up even in local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4247) S3 properties are not picked up from core-site.xml in local mode

2014-10-23 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4247:
--

 Summary: S3 properties are not picked up from core-site.xml in 
local mode
 Key: PIG-4247
 URL: https://issues.apache.org/jira/browse/PIG-4247
 Project: Pig
  Issue Type: Bug
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.15.0


Even in local mode, {{fs.s3}} properties need to be set if the job accesses S3 
files (e.g., registering jars on S3, loading input files on S3, etc.). In 
particular, {{fs.s3.awsSecretAccessKey}} and {{fs.s3.awsAccessKey}} are usually 
set in {{core-site.xml}}, but since local mode doesn't load {{core-site.xml}}, 
these properties have to be set again in pig.properties. This adds operational 
overhead because whenever AWS keys are rotated, {{pig.properties}} also needs to 
be updated. So it would be nice if {{fs.s3}} properties could be picked up even 
in local mode.
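
As a minimal sketch of the idea (standard Hadoop {{Configuration}} API only, not the 
actual Pig change; the config path below is an assumption, use whatever your 
HADOOP_CONF_DIR points to), explicitly adding {{core-site.xml}} as a resource makes the 
{{fs.s3}} keys visible even when the job never talks to the cluster:
{code:title=LoadS3PropsLocally.java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class LoadS3PropsLocally {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pull in core-site.xml explicitly so the fs.s3 credentials are
        // available even in local mode.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        // Sanity check: the secret key should now be resolvable.
        System.out.println("fs.s3.awsSecretAccessKey present: "
                + (conf.get("fs.s3.awsSecretAccessKey") != null));
    }
}
{code}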



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4241) Auto local mode mistakenly converts large jobs to local mode when using with Hive tables

2014-10-17 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4241:
---
Attachment: PIG-4241-1.patch

The attached patch includes the following changes-
# When a Hive table name is interpreted as an HDFS path, {{globStatus()}} returns 
null. In this case, {{InputSizeReducerEstimator.getTotalInputFileSize()}} now 
returns -1. Therefore, big jobs with Hive table input no longer get converted 
to local mode.
# A max parameter is added to 
{{InputSizeReducerEstimator.getTotalInputFileSize()}}. Now when it computes the 
total input size recursively, it exits as soon as it reaches the max. This helps 
avoid listing all the files just to determine whether the job can be converted 
to local mode (see the sketch below).
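
A simplified sketch of both behaviors using only the public Hadoop FileSystem API. It is 
illustrative only: it is not the actual {{InputSizeReducerEstimator}} code and, unlike 
the real method, it does not recurse into directories.
{code:title=EarlyExitSizeCheck.java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EarlyExitSizeCheck {
    /**
     * Total size of the inputs matching 'input', stopping early once 'maxBytes'
     * is exceeded, or -1 if the size cannot be determined (e.g. a Hive table
     * name that does not resolve to any HDFS files).
     */
    static long totalInputSize(Configuration conf, Path input, long maxBytes) throws Exception {
        FileSystem fs = input.getFileSystem(conf);
        FileStatus[] statuses = fs.globStatus(input);
        if (statuses == null) {
            return -1;                // unknown size: never convert to local mode
        }
        long total = 0;
        for (FileStatus status : statuses) {
            total += status.getLen();
            if (total > maxBytes) {
                return total;         // threshold reached: stop listing more files
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        long inputByteMax = 100 * 1000 * 1000L;
        long size = totalInputSize(new Configuration(), new Path(args[0]), inputByteMax);
        System.out.println(size >= 0 && size <= inputByteMax
                ? "small input: eligible for local mode"
                : "not eligible for local mode");
    }
}
{code}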

> Auto local mode mistakenly converts large jobs to local mode when using with 
> Hive tables
> 
>
> Key: PIG-4241
> URL: https://issues.apache.org/jira/browse/PIG-4241
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4241-1.patch
>
>
> The current implementation of auto local mode has two severe problems-
> # It assumes file-based inputs, and it always converts jobs with 
> non-file-based inputs into local mode unless the 
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is 
> particularly problematic when using Pig with Hive tables and custom 
> LoadFuncs that do not implement the LoadMetadata interface.
> # It lists all the files to compute the total size. The algorithm is like 
> this. First, compute the total size. Second, compare it against the 
> configured max bytes. This is very time-consuming when a Pig job loads a large 
> number of files. It will list all the files only to compute the total size. 
> Instead, we should stop computing the sum of input sizes as soon as it 
> reaches the max bytes-
> {code:title=JobControlCompiler.java}
> long totalInputFileSize = 
> InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS 
> BAD!
> long inputByteMax = 
> conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
> log.info("Size of input: " + totalInputFileSize +" bytes. Small job 
> threshold: " + inputByteMax );
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
> return false;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4241) Auto local mode mistakenly converts large jobs to local mode when using with Hive tables

2014-10-17 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4241:
---
Status: Patch Available  (was: Open)

> Auto local mode mistakenly converts large jobs to local mode when using with 
> Hive tables
> 
>
> Key: PIG-4241
> URL: https://issues.apache.org/jira/browse/PIG-4241
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4241-1.patch
>
>
> The current implementation of auto local mode has two severe problems-
> # It assumes file-based inputs, and it always converts jobs with 
> non-file-based inputs into local mode unless the 
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is 
> particularly problematic when using Pig with Hive tables and custom 
> LoadFuncs that do not implement the LoadMetadata interface.
> # It lists all the files to compute the total size. The algorithm is like 
> this. First, compute the total size. Second, compare it against the 
> configured max bytes. This is very time-consuming when a Pig job loads a large 
> number of files. It will list all the files only to compute the total size. 
> Instead, we should stop computing the sum of input sizes as soon as it 
> reaches the max bytes-
> {code:title=JobControlCompiler.java}
> long totalInputFileSize = 
> InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS 
> BAD!
> long inputByteMax = 
> conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
> log.info("Size of input: " + totalInputFileSize +" bytes. Small job 
> threshold: " + inputByteMax );
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
> return false;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4241) Auto local mode mistakenly converts large jobs to local mode when using with Hive tables

2014-10-16 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4241:
--

 Summary: Auto local mode mistakenly converts large jobs to local 
mode when using with Hive tables
 Key: PIG-4241
 URL: https://issues.apache.org/jira/browse/PIG-4241
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.15.0


The current implementation of auto local mode has two severe problems-
# It assumes file-based inputs, and it always converts jobs with non-file-based 
inputs into local mode unless the 
{{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is 
particularly problematic when using Pig with Hive tables and custom LoadFuncs 
that do not implement the LoadMetadata interface.
# It lists all the files to compute the total size. The algorithm is like this. 
First, compute the total size. Second, compare it against the configured max 
bytes. This is very time-consuming when a Pig job loads a large number of files. 
It will list all the files only to compute the total size. Instead, we should 
stop computing the sum of input sizes as soon as it reaches the max bytes-
{code:title=JobControlCompiler.java}
long totalInputFileSize = InputSizeReducerEstimator.getTotalInputFileSize(conf, 
lds, job); // THIS IS BAD!
long inputByteMax = 
conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
log.info("Size of input: " + totalInputFileSize +" bytes. Small job threshold: 
" + inputByteMax );
if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
return false;
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173238#comment-14173238
 ] 

Cheolsoo Park commented on PIG-4227:


Yes, it works! I haven't verified the broken tests, but if they all pass, 
please go ahead and commit it.

Thank you Daniel!

> Streaming Python UDF handles bag outputs incorrectly
> 
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4227-1.patch, PIG-4227-2.patch
>
>
> I have a UDF that generates different outputs when run as a Jython UDF and as 
> a streaming Python UDF.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because a bag must contain tuples, not chararrays; i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172704#comment-14172704
 ] 

Cheolsoo Park commented on PIG-4227:


Yes, you're right.

> Streaming Python UDF handles bag outputs incorrectly
> 
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4227-1.patch
>
>
> I have a UDF that generates different outputs when run as a Jython UDF and as 
> a streaming Python UDF.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because a bag must contain tuples, not chararrays; i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172658#comment-14172658
 ] 

Cheolsoo Park commented on PIG-4227:


{quote}
Otherwise we break python udf which do insert tuples.
{quote}
True, but I rarely see udfs that insert tuples because in Jython you never had 
to do that. Since I deployed streaming udfs in prod, the few users who have 
inserted tuples did so only because they had to. I have now deployed my patch to 
prod and haven't heard any complaints.

But I do agree that if a udf returns a list of tuples, there will be an extra 
layer of tuples. That's a valid corner case, indeed.
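For illustration, here is a minimal sketch of the two return styles being 
discussed; the function and field names are made up for the example:
{code:title=sketch of the two return styles (illustrative names)}
def udf_returns_raw_values(ids):
    # Jython-style body: return a list of plain values and let Pig wrap
    # each one in a tuple when building the output bag.
    return [i for i in ids]

def udf_returns_tuples(ids):
    # Body that already returns Python tuples; per the corner case above,
    # wrapping these again would add an extra tuple layer around each record.
    return [tuple([i]) for i in ids]
{code}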

> Streaming Python UDF handles bag outputs incorrectly
> 
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when running as jython and 
> streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because bag must contain tuples not chararrays. i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172620#comment-14172620
 ] 

Cheolsoo Park commented on PIG-4227:


[~daijy], sorry for breaking unit tests.
{quote}
I don't totally understand the issue in the description, is that because jython 
adds tuple inside a list automatically but python does not?
{quote}
You're right that a Jython udf usually doesn't return a list of Python tuples 
but just a list of Python objects. In that case, Pig converts it to a bag of 
tuples automatically by wrapping each object in a tuple. However, the Python 
streaming udf serializes it as a bag of non-tuples, and the objects are never 
wrapped in tuples. The problem is that outputSchema is defined as something like 
{{bag:\{tuple\:( chararray )\}}}, so the deserialization code skips bytes for 
tuple delimiters that do not exist. That results in truncating 3 characters at 
the beginning and the end.

So the root cause is that Jython and Python streaming handle a Python list of 
non-tuples differently, which makes it impossible to run the same udf in both 
modes. With my patch, I can run the same udf in both modes and get the same 
result. For example, here is the diff in one of our udfs before and after my 
patch; it should clarify the difference-
{code}
34c34
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
44c44
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
49c49
< output.append(items[i]['id'])
---
> output.append(tuple([items[i]['id']]))
84c84
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
96c96
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
101c101
< output.append(items[i]['id'])
---
> output.append(tuple([items[i]['id']]))
105c105
< return [-1]
---
> return [tuple([-1])]
{code}
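In other words, the only change on the udf side is to wrap each bag element in 
a Python tuple before returning it. A minimal sketch (illustrative names, not 
the actual prod udf):
{code:title=sketch: wrapping bag elements in tuples (illustrative names)}
def collect_ids(recos):
    # Before the change, bare values were appended: output.append(reco['id']).
    # After the change, each value is wrapped in a tuple so the serialized bag
    # contains tuples, matching an output schema like {(id:chararray)}.
    output = []
    for reco in recos:
        output.append(tuple([reco['id']]))
    return output
{code}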

> Streaming Python UDF handles bag outputs incorrectly
> 
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when running as jython and 
> streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because bag must contain tuples not chararrays. i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PIG-4231) Make rank work with Spark

2014-10-14 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reassigned PIG-4231:
--

Assignee: Carlos Balduz

> Make rank work with Spark
> -
>
> Key: PIG-4231
> URL: https://issues.apache.org/jira/browse/PIG-4231
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Carlos Balduz
>Assignee: Carlos Balduz
>  Labels: spork
>
> Rank does not work with Spark since PORank and POCounter have  not been 
> implemented yet.
> Pig Stack Trace
> ---
> ERROR 0: java.lang.IllegalArgumentException: Spork unsupported 
> PhysicalOperator: (Name: DATA: POCounter[tuple] - scope-146 Operator Key: 
> scope-146)
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> java.lang.IllegalArgumentException: Spork unsupported PhysicalOperator: 
> (Name: DATA: POCounter[tuple] - scope-146 Operator Key: scope-146)
>   at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:285)
>   at org.apache.pig.PigServer.launchPlan(PigServer.java:1378)
>   at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1363)
>   at org.apache.pig.PigServer.execute(PigServer.java:1352)
>   at org.apache.pig.PigServer.executeBatch(PigServer.java:403)
>   at org.apache.pig.PigServer.executeBatch(PigServer.java:386)
>   at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:170)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:233)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:204)
>   at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
>   at org.apache.pig.Main.run(Main.java:482)
>   at org.apache.pig.Main.main(Main.java:164)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-11 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4227:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Daniel for reviewing the patch!

> Streaming Python UDF handles bag outputs incorrectly
> 
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when running as jython and 
> streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because bag must contain tuples not chararrays. i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4184) UDF backward compatibility issue after POStatus.STATUS_NULL refactory

2014-10-10 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167506#comment-14167506
 ] 

Cheolsoo Park commented on PIG-4184:


+1.

I ran into this problem while using another UDF (Datafu 
[TransposeTupleToBag|http://datafu.incubator.apache.org/docs/datafu/1.1.0/datafu/pig/util/TransposeTupleToBag.html]).
 The output differs between Pig 0.12 and 0.13 when passing null tuples.

> UDF backward compatibility issue after POStatus.STATUS_NULL refactory
> -
>
> Key: PIG-4184
> URL: https://issues.apache.org/jira/browse/PIG-4184
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4184-1.patch
>
>
> This is the same issue we discussed in PIG-3739 and PIG-3679. However, our 
> previous fix does not solve the issue, in fact, it make things worse and it 
> is totally my fault.
> Consider the following UDF and script:
> {code}
> public class IntToBool extends EvalFunc<Boolean> {
> @Override
> public Boolean exec(Tuple input) throws IOException {
> if (input == null || input.size() == 0)
> return null;
> Integer val = (Integer)input.get(0);
> return (val == null || val == 0) ? false : true;
> }
> }
> {code}
> {code}
> a = load '1.txt' as (i0:int, i1:int);
> b = foreach a generate IntToBool(i0);
> store b into 'output';
> {code}
> 1.txt
> {code}
> 1
> 2   3
> {code}
> With Pig 0.12, we get:
> {code}
> (false)
> (true)
> {code}
> With Pig 0.13/0.14, we get:
> {code}
> ()
> (true)
> {code}
> The reason is in 0.12, Pig pass first row as a tuple with a null item to 
> IntToBool, with 0.13/0.14, Pig swallow the first row, which is not right. And 
> this wrong behavior is brought by PIG-3739 and PIG-3679.
> Before that (but after POStatus.STATUS_NULL refactory PIG-3568), we do have a 
> behavior change which makes e2e test StreamingPythonUDFs_10 fail with NPE. 
> However, I think this is an inconsistent behavior of 0.12. Consider the 
> following scripts:
> {code}
> a = load '1.txt' as (name:chararray, age:int, gpa:double);
> b = foreach a generate ROUND((gpa>3.0?gpa+1:gpa));
> store b into 'output';
> {code}
> {code}
> a = load '1.txt' as (name:chararray, age:int, gpa:double);
> b = foreach a generate ROUND(gpa);
> store b into 'output';
> {code}
> If gpa field is null, script 1 skip the row and script 2 fail with NPE, which 
> does not make sense. So my thinking is:
> 1. Pig 0.12 is wrong and POStatus.STATUS_NULL refactory fix this behavior (we 
> don't need related fix in PIG-3739/PIG-3679)
> 2. ROUND (and some other UDF) is wrong anyway, we shall fix it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4225) Allow users to specify Python executable for Pig streaming

2014-10-09 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park resolved PIG-4225.

Resolution: Duplicate

This is already fixed by PIG-4124.

> Allow users to specify Python executable for Pig streaming
> --
>
> Key: PIG-4225
> URL: https://issues.apache.org/jira/browse/PIG-4225
> Project: Pig
>  Issue Type: Improvement
>  Components: internal-udfs
>Affects Versions: 0.12.0, 0.12.1
>Reporter: Mike Sukmanowsky
>
> The [current 
> PythonScriptEngine|https://github.com/apache/pig/blob/release-0.12.0/src/org/apache/pig/scripting/streaming/python/PythonScriptEngine.java#L69]
>  uses whatever python is currently on the path in order to execute scripts.
> Python users are accustomed to creating virtual environments (virtualenvs) 
> where associated requirements are installed without needing to worry about 
> "global" installs via, for example, sudo pip install .
> Is it possible to have the Python executable specified either via the 
> {{DEFINE}} command syntax or, in a hadoop job configuration variable? Perhaps 
> {{pig.pythonstreaming.pythonpath}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-09 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4227:
---
Attachment: PIG-4227-1.patch

Attaching a patch.

> Streaming Python UDF handles bag outputs incorrectly
> 
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when running as jython and 
> streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because bag must contain tuples not chararrays. i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-09 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4227:
---
Status: Patch Available  (was: Open)

> Streaming Python UDF handles bag outputs incorrectly
> 
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when running as jython and 
> streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because bag must contain tuples not chararrays. i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

2014-10-09 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4227:
--

 Summary: Streaming Python UDF handles bag outputs incorrectly
 Key: PIG-4227
 URL: https://issues.apache.org/jira/browse/PIG-4227
 Project: Pig
  Issue Type: Bug
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.15.0


I have a udf that generates different outputs when running as jython and 
streaming python.
{code:title=jython}
{([[BBC Worldwide]])}
{code} 
{code:title=streaming python}
{(BC Worldwid)}
{code}
The problem is that streaming python encodes a bag output incorrectly. For this 
particular example, it serializes the output string as follows-
{code}
|{_[[BBC Worldwide]]|}_
{code}
where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' and 
'\}' => '|\}\_'.

But this is wrong because a bag must contain tuples, not chararrays. I.e., the 
correct encoding is as follows-
{code}
|{_|(_[[BBC Worldwide]]|)_|}_
{code}
where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.

This results in truncated outputs.
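
To make the delimiter wrapping concrete, here is a small sketch that reproduces 
the two encodings above; the wrapper characters come from the example, and the 
function names are made up:
{code:title=sketch of the two encodings (illustrative only)}
def encode_bag_of_chararrays(value):
    # Incorrect encoding: only the bag delimiters '{' and '}' are wrapped
    # with '|' and '_', so the bag element is a bare chararray.
    return '|{_' + value + '|}_'

def encode_bag_of_tuples(value):
    # Correct encoding: the tuple delimiters '(' and ')' are wrapped as well,
    # so the bag contains a tuple.
    return '|{_' + '|(_' + value + '|)_' + '|}_'

print(encode_bag_of_chararrays('[[BBC Worldwide]]'))  # |{_[[BBC Worldwide]]|}_
print(encode_bag_of_tuples('[[BBC Worldwide]]'))      # |{_|(_[[BBC Worldwide]]|)_|}_
{code}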



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4183) Fix classpath error when using pig command with Spark

2014-09-17 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4183:
---
Issue Type: Sub-task  (was: Bug)
Parent: PIG-4059

> Fix classpath error when using pig command with Spark
> -
>
> Key: PIG-4183
> URL: https://issues.apache.org/jira/browse/PIG-4183
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: fix-pig-4183.patch
>
>
> You can reproduce the bug in following steps
> 1.Build spark -0.9 env and build hadoop1 env:
> 2.Compile code: ant jar
> 3.Export $PIG_CLASSPATH
> echo $PIG_CLASSPATH
> /home/zly/prj/oss/pig/build/ivy/lib/Pig/*:/home/zly/prj/oss/hadoop-1.2.1/conf
> 4.Run: cd $PIG_HOME/bin; ./pig –x spark id.spark.pig 
> 5.Error message found in pig log
>   549 ERROR 2998: Unhandled internal error. 
> org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
> 550 
> 551 java.lang.NoSuchMethodError: 
> org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
> 552 at 
> org.eclipse.jetty.util.log.JettyAwareLogger.log(JettyAwareLogger.java:607)
> 553 at 
> org.eclipse.jetty.util.log.JettyAwareLogger.warn(JettyAwareLogger.java:431)
> 554 at org.eclipse.jetty.util.log.Slf4jLog.warn(Slf4jLog.java:69)
> 555 at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.setFailed(AbstractLifeCycle.java:204)
> 556 at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:74)
> 557 at org.apache.spark.HttpServer.start(HttpServer.scala:65)
> 558 at 
> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:130)
> 559 at 
> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:101)
> 560 at 
> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcast.scala:70)
>561 at 
> org.apache.spark.broadcast.BroadcastManager.initialize(Broadcast.scala:81)
> 562 at 
> org.apache.spark.broadcast.BroadcastManager.(Broadcast.scala:68)
> 563 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:175)
> 564 at 
> org.apache.spark.SparkContext.(SparkContext.scala:139)
> 565 at 
> org.apache.spark.SparkContext.(SparkContext.scala:100)
> 566 at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:81)
> 567 at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.startSparkIfNeeded(SparkLauncher.java:202)
> 568 at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:114)
> 569 at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:279)
> 570 at org.apache.pig.PigServer.launchPlan(PigServer.java:1378)
> 571 at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1363)
> 572 at org.apache.pig.PigServer.execute(PigServer.java:1352)
> 573 at org.apache.pig.PigServer.executeBatch(PigServer.java:403)
> 574 at org.apache.pig.PigServer.executeBatch(PigServer.java:386)
> 575 at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:170)
> 576 at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:233)
> 577 at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:204)
> 578 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> 579 at org.apache.pig.Main.run(Main.java:611)
> 580 at org.apache.pig.Main.main(Main.java:164)
> 581 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 582 at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 583 at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 584 at java.lang.reflect.Method.invoke(Method.java:606)
> 585 at org.apache.hadoop.util.RunJar.main(RunJar.java:160)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4128) New logical optimizer rule: ConstantCalculator

2014-09-16 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136539#comment-14136539
 ] 

Cheolsoo Park commented on PIG-4128:


[~daijy], I want to say thank you for this patch. One of the common mistakes 
that I see is that the partition column is of long type and the user writes a 
filter expression comparing it against a constant of a different type. The 
partition filter used to not be pushed down in this case because Pig inserts a 
cast expression into the filter expression, and this was always confusing to 
users. But with your patch, this filter expression just works.

> New logical optimizer rule: ConstantCalculator
> --
>
> Key: PIG-4128
> URL: https://issues.apache.org/jira/browse/PIG-4128
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4128-1.patch, PIG-4128-2.patch, PIG-4128-3.patch
>
>
> Pig used to have a LogicExpressionSimplifier to simplify expression which 
> also calculates constant expression. The optimizer rule is buggy and we 
> disable it by default in PIG-2316.
> However, we do need this feature especially in partition/predicate push down, 
> since both does not deal with complex constant expression, we'd like to 
> replace the expression with constant before the actual push down. Yes, user 
> may manually do the calculation and rewrite the query, but even rewrite is 
> sometimes not possible. Consider the case user want to push a datetime 
> predicate, user have to write a ToDate udf since Pig does not have datetime 
> constant.
> In this Jira, I provide a new rule: ConstantCalculator, which is much simpler 
> and much less error prone, to replace LogicExpressionSimplifier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4171) Streaming UDF fails when direct fetch optimization is enabled

2014-09-16 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4171:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Daniel for reviewing!

> Streaming UDF fails when direct fetch optimization is enabled
> -
>
> Key: PIG-4171
> URL: https://issues.apache.org/jira/browse/PIG-4171
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.14.0
>
> Attachments: PIG-4171-1.patch
>
>
> To reproduce the error, register any udf as {{streaming_python}} and run it 
> in direct fetch mode.
> It fails with the following error in my environment-
> {code}
> sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8])
>   File "/mnt/pig_tmp/prodpig/controller4894777320356829424.py", line 77, in 
> main
> self.output_stream = open(output_stream_path, 'a')
> IOError: [Errno 13] Permission denied: 
> '/mnt/var/lib/hadoop/tmp/udfOutput/sanitize.out'
> {code}
> The problem is that Streaming UDF tries to write out a log, but the user 
> doesn't have write permission to the default location ({{hadoop.tmp.dir}}).
> In fact, Streaming UDF handles local mode properly by using 
> {{pig.udf.scripting.log.dir}} instead of {{hadoop.log.dir}} or 
> {{hadoop.tmp.dir}}. We should do the same for direct fetch mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PIG-4131) Pig - kerberos error

2014-09-16 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reassigned PIG-4131:
--

Assignee: liyunzhang_intel

> Pig - kerberos error
> 
>
> Key: PIG-4131
> URL: https://issues.apache.org/jira/browse/PIG-4131
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0
> Environment: centos 6.5, cdh5.1.1
>Reporter: narayana b
>Assignee: liyunzhang_intel
> Attachments: PIG_3507_1.patch
>
>
> Hi,
> I have running hadoop cluster
> The following steps are done and found an error.
> Is this core bug or something else?
> 1) I initialized my kerberos client( as Im using kerberos)
> kinit cloudera/cloudera-cdh05.narayana.local@NARAYANA.LOCAL
> 2) pig -x local
> 3) A = load '/etc/passwd' using PigStorage(':');
> 4) B = foreach A generate $0 as id;
> 5) dump B;
>  
> Error
> --
> 2014-08-18 00:04:29,074 [main] INFO  
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: 
> UNKNOWN
> 2014-08-18 00:04:29,149 [main] INFO  
> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - 
> {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, 
> GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, 
> LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, 
> PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, 
> StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
> 2014-08-18 00:04:29,201 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 6000: Output Location Validation Failed for: 
> 'file:/tmp/temp805769141/tmp-904818707 More info to follow:
> Can't get Master Kerberos principal for use as renewer
> Details at logfile: /usr/bin/pig_1408300456319.log
> grunt> quit
> [root@cloudera-cdh05 bin]# tail -100f /usr/bin/pig_1408300456319.log
>  
> Pig Stack Trace
> 
> ERROR 6000: Output Location Validation Failed for: 
> 'file:/tmp/temp805769141/tmp-
> 904818707 More info to follow:
> Can't get Master Kerberos principal for use as renewer
> org.apache.pig.impl.logicalLayer.FrontendException
> : ERROR 1066: Unable to open iterator for alias B
> at org.apache.pig.PigServer.openIterator(PigServer.java:880)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
> at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
> at org.apache.pig.Main.run(Main.java:541)
> at org.apache.pig.Main.main(Main.java:156)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias 
> B
> at org.apache.pig.PigServer.storeEx(PigServer.java:982)
> at org.apache.pig.PigServer.store(PigServer.java:942)
> at org.apache.pig.PigServer.openIterator(PigServer.java:855)
> ... 12 more
> Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000: Output 
> Location Validation Failed for: 'file:/tmp/temp805769141/tmp-904818707 More 
> info to follow:
> Can't get Master Kerberos principal for use as renewer
> at 
> org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:95)
> at 
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at 
> org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> at 
> org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:303)
> at org.apache.pig.PigServer.compilePp(PigServer.java:1380)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1305

[jira] [Assigned] (PIG-4168) Initial implementation of unit tests for Pig on Spark

2014-09-16 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reassigned PIG-4168:
--

Assignee: liyunzhang_intel

Assigning to Liyun.

> Initial implementation of unit tests for Pig on Spark
> -
>
> Key: PIG-4168
> URL: https://issues.apache.org/jira/browse/PIG-4168
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Praveen Rachabattuni
>Assignee: liyunzhang_intel
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4170) Multiquery with different type of key gives wrong result

2014-09-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134972#comment-14134972
 ] 

Cheolsoo Park commented on PIG-4170:


+1

> Multiquery with different type of key gives wrong result
> 
>
> Key: PIG-4170
> URL: https://issues.apache.org/jira/browse/PIG-4170
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.13.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.14.0
>
> Attachments: PIG-4170-0.patch, PIG-4170-1.patch
>
>
> The following script produce wrong result:
> {code}
> A = load '1.txt' as (i:int, s:chararray);
> B = group A by i;
> C = group A by s;
> store B into 'ooo1';
> store C into 'ooo2';
> {code}
> 1.txt:
> {code}
> 1   h
> 1   a
> {code}
> Expected: 
> {code}
> ooo1:
> 1   {(1,a),(1,h)}
> ooo2:
> a   {(1,a)}
> h   {(1,h)}
> {code}
> Actual:
> {code}
> ooo1:
> 1   {((1),a),((1),h)}
> ooo2:
> a   {(1,(a))}
> h   {(1,(h))}
> {code}
> This happens after PIG-3591.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4171) Streaming UDF fails when direct fetch optimization is enabled

2014-09-15 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4171:
---
Status: Patch Available  (was: Open)

> Streaming UDF fails when direct fetch optimization is enabled
> -
>
> Key: PIG-4171
> URL: https://issues.apache.org/jira/browse/PIG-4171
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.14.0
>
> Attachments: PIG-4171-1.patch
>
>
> To reproduce the error, register any udf as {{streaming_python}} and run it 
> in direct fetch mode.
> It fails with the following error in my environment-
> {code}
> sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8])
>   File "/mnt/pig_tmp/prodpig/controller4894777320356829424.py", line 77, in 
> main
> self.output_stream = open(output_stream_path, 'a')
> IOError: [Errno 13] Permission denied: 
> '/mnt/var/lib/hadoop/tmp/udfOutput/sanitize.out'
> {code}
> The problem is that Streaming UDF tries to write out a log, but the user 
> doesn't have write permission to the default location ({{hadoop.tmp.dir}}).
> In fact, Streaming UDF handles local mode properly by using 
> {{pig.udf.scripting.log.dir}} instead of {{hadoop.log.dir}} or 
> {{hadoop.tmp.dir}}. We should do the same for direct fetch mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4171) Streaming UDF fails when direct fetch optimization is enabled

2014-09-15 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4171:
---
Attachment: PIG-4171-1.patch

Uploading a patch.

> Streaming UDF fails when direct fetch optimization is enabled
> -
>
> Key: PIG-4171
> URL: https://issues.apache.org/jira/browse/PIG-4171
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.14.0
>
> Attachments: PIG-4171-1.patch
>
>
> To reproduce the error, register any udf as {{streaming_python}} and run it 
> in direct fetch mode.
> It fails with the following error in my environment-
> {code}
> sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8])
>   File "/mnt/pig_tmp/prodpig/controller4894777320356829424.py", line 77, in 
> main
> self.output_stream = open(output_stream_path, 'a')
> IOError: [Errno 13] Permission denied: 
> '/mnt/var/lib/hadoop/tmp/udfOutput/sanitize.out'
> {code}
> The problem is that Streaming UDF tries to write out a log, but the user 
> doesn't have write permission to the default location ({{hadoop.tmp.dir}}).
> In fact, Streaming UDF handles local mode properly by using 
> {{pig.udf.scripting.log.dir}} instead of {{hadoop.log.dir}} or 
> {{hadoop.tmp.dir}}. We should do the same for direct fetch mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4171) Streaming UDF fails when direct fetch optimization is enabled

2014-09-15 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4171:
--

 Summary: Streaming UDF fails when direct fetch optimization is 
enabled
 Key: PIG-4171
 URL: https://issues.apache.org/jira/browse/PIG-4171
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
 Fix For: 0.14.0


To reproduce the error, register any udf as {{streaming_python}} and run it in 
direct fetch mode.

It fails with the following error in my environment-
{code}
sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8])
  File "/mnt/pig_tmp/prodpig/controller4894777320356829424.py", line 77, in main
self.output_stream = open(output_stream_path, 'a')
IOError: [Errno 13] Permission denied: 
'/mnt/var/lib/hadoop/tmp/udfOutput/sanitize.out'
{code}
The problem is that Streaming UDF tries to write out a log, but the user 
doesn't have write permission to the default location ({{hadoop.tmp.dir}}).

In fact, Streaming UDF handles local mode properly by using 
{{pig.udf.scripting.log.dir}} instead of {{hadoop.log.dir}} or 
{{hadoop.tmp.dir}}. We should do the same for direct fetch mode.
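
Roughly, the fix amounts to resolving the log directory from the 
user-configurable property before falling back to the Hadoop defaults. A minimal 
sketch; the property name is from this description, but the plumbing (how the 
controller receives the configuration) is hypothetical:
{code:title=sketch: resolving the UDF log directory (hypothetical plumbing)}
import os

def resolve_udf_log_dir(conf):
    # Prefer the user-configurable location (as local mode already does)
    # over hadoop.log.dir / hadoop.tmp.dir, which may not be writable.
    for key in ('pig.udf.scripting.log.dir', 'hadoop.log.dir', 'hadoop.tmp.dir'):
        path = conf.get(key)
        if path and os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return os.getcwd()  # last resort for this sketch

# Example with a dict standing in for whatever configuration the controller gets.
log_dir = resolve_udf_log_dir({'pig.udf.scripting.log.dir': '/tmp'})
output_stream_path = os.path.join(log_dir, 'sanitize.out')
{code}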








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4169) NPE in ConstantCalculator

2014-09-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134315#comment-14134315
 ] 

Cheolsoo Park commented on PIG-4169:


[~daijy], thank you for catching it! Let's revert {{PIG-4169-1.patch}} and 
commit your patch. +1

> NPE in ConstantCalculator
> -
>
> Key: PIG-4169
> URL: https://issues.apache.org/jira/browse/PIG-4169
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4169-1.patch, PIG-4169-2.patch
>
>
> To reproduce the issue, run the following query-
> {code}
> a = LOAD 'foo' AS (x:int);
> b = FOREACH a GENERATE TOTUPLE((chararray)null);
> DUMP b;
> {code}
> As can be seen, it is calling TOTUPLE with null. This causes a front-end 
> exception with the following stack trace-
> {code}
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:154)
>   at 
> org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
>   at 
> org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:112)
>   at 
> org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visitAll(AllExpressionVisitor.java:72)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:131)
>   at 
> org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
>   at 
> org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:87)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer.transform(ConstantCalculator.java:181)
>   at 
> org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:110)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> Exception while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:360)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:151)
>   ... 30 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:284)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:383)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:355)
>   ... 31 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4104) Accumulator UDF throws OOM in Tez

2014-09-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134035#comment-14134035
 ] 

Cheolsoo Park commented on PIG-4104:


+1. My test passes. Thank you Rohini!

> Accumulator UDF throws OOM in Tez
> -
>
> Key: PIG-4104
> URL: https://issues.apache.org/jira/browse/PIG-4104
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Cheolsoo Park
>Assignee: Rohini Palaniswamy
> Fix For: 0.14.0
>
> Attachments: PIG-4104-2.patch
>
>
> This is somewhat expected since we copy lots of object in POShuffleLoadTez 
> for accumulator UDF. With large data, it consistently fails with OOM. We need 
> to re-implement it.
> Here is an example stack trace-
> {code}
> 2014-08-02 02:59:15,801 ERROR [TezChild] 
> org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting 
> now
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
> at java.lang.StringCoding.decode(StringCoding.java:193)
> at java.lang.String.(String.java:416)
> at 
> org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareBinInterSedesDatum(BinInterSedes.java:964)
> at 
> org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareBinSedesTuple(BinInterSedes.java:770)
> at 
> org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:728)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleSortComparator.compare(PigTupleSortComparator.java:100)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.lessThan(TezMerger.java:539)
> at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:144)
> at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:108)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.adjustPriorityQueue(TezMerger.java:486)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.next(TezMerger.java:503)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.readNextKey(ValuesIterator.java:179)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.access$300(ValuesIterator.java:45)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator$1$1.next(ValuesIterator.java:138)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.POShuffleTezLoad.getNextTuple(POShuffleTezLoad.java:176)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.POStoreTez.getNextTuple(POStoreTez.java:113)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.PigProcessor.runPipeline(PigProcessor.java:313)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.PigProcessor.run(PigProcessor.java:196)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
> at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
> at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4104) Accumulator UDF throws OOM in Tez

2014-09-14 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133450#comment-14133450
 ] 

Cheolsoo Park commented on PIG-4104:


[~rohini], yes, sure. I am running my job now.

Your patch is smart. Moving the iteration from POShuffleTezLoad to 
AccumulativeTupleBuffer should work. I don't know why I couldn't think of this!

> Accumulator UDF throws OOM in Tez
> -
>
> Key: PIG-4104
> URL: https://issues.apache.org/jira/browse/PIG-4104
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Cheolsoo Park
>Assignee: Rohini Palaniswamy
> Fix For: 0.14.0
>
> Attachments: PIG-4104-2.patch
>
>
> This is somewhat expected since we copy lots of object in POShuffleLoadTez 
> for accumulator UDF. With large data, it consistently fails with OOM. We need 
> to re-implement it.
> Here is an example stack trace-
> {code}
> 2014-08-02 02:59:15,801 ERROR [TezChild] 
> org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting 
> now
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
> at java.lang.StringCoding.decode(StringCoding.java:193)
> at java.lang.String.(String.java:416)
> at 
> org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareBinInterSedesDatum(BinInterSedes.java:964)
> at 
> org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareBinSedesTuple(BinInterSedes.java:770)
> at 
> org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:728)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleSortComparator.compare(PigTupleSortComparator.java:100)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.lessThan(TezMerger.java:539)
> at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:144)
> at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:108)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.adjustPriorityQueue(TezMerger.java:486)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.next(TezMerger.java:503)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.readNextKey(ValuesIterator.java:179)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.access$300(ValuesIterator.java:45)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator$1$1.next(ValuesIterator.java:138)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.POShuffleTezLoad.getNextTuple(POShuffleTezLoad.java:176)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:301)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.POStoreTez.getNextTuple(POStoreTez.java:113)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.PigProcessor.runPipeline(PigProcessor.java:313)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.PigProcessor.run(PigProcessor.java:196)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
> at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
> at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4169) NPE in ConstantCalculator

2014-09-12 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4169:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks Daniel for reviewing!

> NPE in ConstantCalculator
> -
>
> Key: PIG-4169
> URL: https://issues.apache.org/jira/browse/PIG-4169
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4169-1.patch
>
>
> To reproduce the issue, run the following query-
> {code}
> a = LOAD 'foo' AS (x:int);
> b = FOREACH a GENERATE TOTUPLE((chararray)null);
> DUMP b;
> {code}
> As can be seen, it is calling TOTUPLE with null. This causes a front-end 
> exception with the following stack trace-
> {code}
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:154)
>   at 
> org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
>   at 
> org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:112)
>   at 
> org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visitAll(AllExpressionVisitor.java:72)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:131)
>   at 
> org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
>   at 
> org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:87)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer.transform(ConstantCalculator.java:181)
>   at 
> org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:110)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> Exception while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:360)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:151)
>   ... 30 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:284)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:383)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:355)
>   ... 31 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4169) NPE in ConstantCalculator

2014-09-12 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4169:
---
Status: Patch Available  (was: Open)

> NPE in ConstantCalculator
> -
>
> Key: PIG-4169
> URL: https://issues.apache.org/jira/browse/PIG-4169
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4169-1.patch
>
>
> To reproduce the issue, run the following query-
> {code}
> a = LOAD 'foo' AS (x:int);
> b = FOREACH a GENERATE TOTUPLE((chararray)null);
> DUMP b;
> {code}
> As can be seen, it is calling TOTUPLE with null. This causes a front-end 
> exception with the following stack trace-
> {code}
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:154)
>   at 
> org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
>   at 
> org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:112)
>   at 
> org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visitAll(AllExpressionVisitor.java:72)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:131)
>   at 
> org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
>   at 
> org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:87)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer.transform(ConstantCalculator.java:181)
>   at 
> org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:110)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> Exception while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:360)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:151)
>   ... 30 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:284)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:383)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:355)
>   ... 31 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4169) NPE in ConstantCalculator

2014-09-12 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4169:
---
Attachment: PIG-4169-1.patch

Uploading a patch that fixes the NPE.

> NPE in ConstantCalculator
> -
>
> Key: PIG-4169
> URL: https://issues.apache.org/jira/browse/PIG-4169
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4169-1.patch
>
>
> To reproduce the issue, run the following query-
> {code}
> a = LOAD 'foo' AS (x:int);
> b = FOREACH a GENERATE TOTUPLE((chararray)null);
> DUMP b;
> {code}
> As can be seen, it is calling TOTUPLE with null. This causes a front-end 
> exception with the following stack trace-
> {code}
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:154)
>   at 
> org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
>   at 
> org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:112)
>   at 
> org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visitAll(AllExpressionVisitor.java:72)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:131)
>   at 
> org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at 
> org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
>   at 
> org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:87)
>   at 
> org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>   at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer.transform(ConstantCalculator.java:181)
>   at 
> org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:110)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> Exception while executing [POUserFunc (Name: 
> POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
> scope-13) children: null at []]: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:360)
>   at 
> org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:151)
>   ... 30 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:284)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:383)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:355)
>   ... 31 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4169) NPE in ConstantCalculator

2014-09-12 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-4169:
--

 Summary: NPE in ConstantCalculator
 Key: PIG-4169
 URL: https://issues.apache.org/jira/browse/PIG-4169
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.14.0


To reproduce the issue, run the following query-
{code}
a = LOAD 'foo' AS (x:int);
b = FOREACH a GENERATE TOTUPLE((chararray)null);
DUMP b;
{code}
As can be seen, it is calling TOTUPLE with null. This causes a front-end 
exception with the following stack trace-
{code}
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while 
executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] 
- scope-13 Operator Key: scope-13) children: null at []]: 
java.lang.NullPointerException
at 
org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:154)
at 
org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
at 
org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:112)
at 
org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at 
org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visitAll(AllExpressionVisitor.java:72)
at 
org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:131)
at 
org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
at 
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at 
org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:124)
at 
org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:87)
at 
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at 
org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer.transform(ConstantCalculator.java:181)
at 
org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:110)
... 16 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
Exception while executing [POUserFunc (Name: 
POUserFunc(org.apache.pig.builtin.TOTUPLE)[tuple] - scope-13 Operator Key: 
scope-13) children: null at []]: java.lang.NullPointerException
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:360)
at 
org.apache.pig.newplan.logical.rules.ConstantCalculator$ConstantCalculatorTransformer$ConstantCalculatorExpressionVisitor.execute(ConstantCalculator.java:151)
... 30 more
Caused by: java.lang.NullPointerException
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:284)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:383)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:355)
... 31 more
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4159) TestGroupConstParallelTez and TestJobSubmissionTez should be excluded in Hadoop 20 unit tests

2014-09-08 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4159:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Daniel for reviewing.

> TestGroupConstParallelTez and TestJobSubmissionTez should be excluded in 
> Hadoop 20 unit tests
> -
>
> Key: PIG-4159
> URL: https://issues.apache.org/jira/browse/PIG-4159
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4159-1.patch
>
>
> These tests are Tez-specific, so should not run in Hadoop 20.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4148) Tez order-by is often skewed because FindQuantiles UDF is called with small number

2014-09-08 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125768#comment-14125768
 ] 

Cheolsoo Park commented on PIG-4148:


Another interesting observation is that all the samples received by the sample 
aggregate vertex have row number {{376}}. This is suspicious because the row 
number in POReservoirSample varies from 1 to 52M.
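
To illustrate why a constant row number is a red flag, the standalone sketch below (plain Java, not Pig code) implements reservoir sampling of the kind POReservoirSample appears to use: with a 100-slot reservoir over roughly 52.5M rows, the row numbers of the retained samples end up spread across the whole input rather than clustering on a single value.
{code}
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Standalone illustration (not Pig code): reservoir sampling ("Algorithm R").
// With a 100-slot reservoir over ~52.5M rows, the retained samples carry row
// numbers spread across the whole input, so every received sample having the
// same row number (376) suggests the reservoir contents are not what is shipped.
public class ReservoirSketch {
    public static void main(String[] args) {
        final int k = 100;                  // samples kept per task
        final long totalRows = 52_548_775L; // row count from this job
        long[] reservoir = new long[k];     // store only the row numbers, for illustration
        for (long row = 0; row < totalRows; row++) {
            if (row < k) {
                reservoir[(int) row] = row;
            } else {
                long j = ThreadLocalRandom.current().nextLong(row + 1); // uniform in [0, row]
                if (j < k) {
                    reservoir[(int) j] = row; // replace a kept sample with the current row
                }
            }
        }
        Arrays.sort(reservoir);
        System.out.println("retained row numbers span " + reservoir[0] + " .. " + reservoir[k - 1]);
    }
}
{code}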

> Tez order-by is often skewed because FindQuantiles UDF is called with small 
> number
> --
>
> Key: PIG-4148
> URL: https://issues.apache.org/jira/browse/PIG-4148
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: generate_sample.py, metric_retention.explain, 
> popackage.log, samples_logs.tar.gz
>
>
> In Tez, FindQuantiles UDF is called with a smaller number of samples than MR 
> resulting in skew in range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since 
> each task samples 100 records, the total sample should be 30K. But 
> FindQuantiles UDF is called with only 300 samples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4148) Tez order-by is often skewed because FindQuantiles UDF is called with small number

2014-09-08 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4148:
---
Attachment: metric_retention.explain

I am also attaching the explain output of my job. To summarize, it has 
{{group-by => order-by => store}}, and the {{order-by}} is consistently skewed. 
For the reason described above, the FindQuantiles UDF builds a biased quantiles 
list, and two tasks are given roughly 70% of the total records-
{code}
Total rows: 52548775
Partitions: 0~299
Partition 292: 11552505 (22%)
Partition 299: 25000602 (47%)
{code}
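
To make the skew mechanism concrete, here is a rough standalone sketch (not FindQuantiles or Pig's actual range partitioner; the helper name is made up) of quantile-based range partitioning: each key lands in the partition whose cut-point interval contains it, so if the sampled cut points are biased toward one end of the key space, the partitions at the other end absorb most of the data.
{code}
import java.util.Arrays;

// Rough illustration only, not FindQuantiles or Pig's actual partitioner.
// Quantile cut points come from the sample; if the sample is biased, the cut
// points bunch up and the last partitions receive most of the keys.
public class RangePartitionSketch {
    // Hypothetical helper: index of the cut-point interval a key falls into.
    static int partitionOf(long key, long[] cutPoints) {
        int idx = Arrays.binarySearch(cutPoints, key);
        return idx >= 0 ? idx : -idx - 1; // keys beyond the last cut point go to the last partition
    }

    public static void main(String[] args) {
        long[] biasedCutPoints = {10, 20, 30};               // 3 cut points -> 4 partitions
        long[] keys = {5, 15, 25, 1_000, 50_000, 1_000_000}; // most keys exceed every cut point
        for (long key : keys) {
            System.out.println("key " + key + " -> partition " + partitionOf(key, biasedCutPoints));
        }
    }
}
{code}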

> Tez order-by is often skewed because FindQuantiles UDF is called with small 
> number
> --
>
> Key: PIG-4148
> URL: https://issues.apache.org/jira/browse/PIG-4148
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: generate_sample.py, metric_retention.explain, 
> popackage.log, samples_logs.tar.gz
>
>
> In Tez, FindQuantiles UDF is called with a smaller number of samples than MR 
> resulting in skew in range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since 
> each task samples 100 records, the total sample should be 30K. But 
> FindQuantiles UDF is called with only 300 samples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4148) Tez order-by is often skewed because FindQuantiles UDF is called with small number

2014-09-08 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4148:
---
Attachment: popackage.log
            generate_sample.py
            samples_logs.tar.gz

[~daijy], [~rohini], I am attaching logs from my job.
# samples_logs.tar.gz contains the log lines produced by the following logging 
in the 300 sampling tasks-
{code}
if (rand < numSamples) {
    log.info("XXX: rand: " + rand + " rowProcessed: " + rowProcessed
            + " sample: " + res.result);
    samples[rand] = res;
}
{code}
# generate_sample.py is a Python script that mimics POReservoirSample and 
builds the final samples bag. You can run it with the following command after 
untarring samples_logs.tar.gz-
{code}
python ./generate_sample.py
{code}
This will show that the total size of the samples bag is 30,000.
# Finally, popackage.log contains the output of the following logging from the 
POPackage (POShuffleTezLoad) of the sample aggregate vertex.
{code}
for (Object val : vals) {
    NullableTuple nTup = (NullableTuple) val;
    int index = nTup.getIndex();
    Tuple tup = pkgr.getValueTuple(keyWritable, nTup, index);
    if (pkgr.getKeyType() == DataType.BYTEARRAY) {
        LOG.info("XXX samples in POPackage: " + tup);
    }
    bag.add(tup);
}
{code}
It shows there are only 300 sample records.
{code}
$ wc -l popackage.log
300 popackage.log
{code}

Based on these observations, I think the samples are not being sent by the sampler vertex.
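
For completeness, the arithmetic behind that conclusion, as a trivial standalone sketch (the numbers are the ones quoted in this thread, and "one tuple per sampler task" is an inference, not something directly logged):
{code}
// Back-of-the-envelope check: 300 sampler tasks x 100 retained rows each should
// deliver 30,000 sample tuples to the aggregate vertex, yet popackage.log shows
// only 300, which is consistent with one tuple per sampler task reaching FindQuantiles.
public class SampleCountCheck {
    public static void main(String[] args) {
        int samplerTasks = 300;
        int samplesPerTask = 100;
        int expected = samplerTasks * samplesPerTask;   // 30,000
        int observed = 300;                             // wc -l popackage.log
        System.out.println("expected=" + expected
                + ", observed=" + observed
                + ", observed per task=" + (observed / samplerTasks));
    }
}
{code}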

> Tez order-by is often skewed because FindQuantiles UDF is called with small 
> number
> --
>
> Key: PIG-4148
> URL: https://issues.apache.org/jira/browse/PIG-4148
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: generate_sample.py, popackage.log, samples_logs.tar.gz
>
>
> In Tez, FindQuantiles UDF is called with a smaller number of samples than MR 
> resulting in skew in range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since 
> each task samples 100 records, the total sample should be 30K. But 
> FindQuantiles UDF is called with only 300 samples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PIG-2904) Scripting UDFs should allow DEFINE statements to pass parameters to the UDF's constructor

2014-09-06 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reassigned PIG-2904:
--

Assignee: (was: Cheolsoo Park)

> Scripting UDFs should allow DEFINE statements to pass parameters to the UDF's 
> constructor
> -
>
> Key: PIG-2904
> URL: https://issues.apache.org/jira/browse/PIG-2904
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julien Le Dem
> Attachments: PIG-2904.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


  1   2   3   4   5   6   7   8   9   10   >