[jira] Updated: (PIG-1587) Cloning utility functions for new logical plan

2010-09-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1587:


Description: 
We sometimes need to copy a logical operator/plan when writing an optimization 
rule. Currently, copying an operator/plan is awkward. We need to write some 
utilities to facilitate this process. Swati contributed PIG-1510, but we feel 
it still cannot address most use cases. I propose to add some more utilities 
to the new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, 
uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema and 
uid-related fields)
* Set the plan to newPlan
* If the operator has an inner plan/expression plan, copy the whole inner plan 
with the same keepUid flag (in particular, LOInnerLoad will copy its inner 
project with the same keepUid flag)
* If keepUid is true, further copy the uid-related fields (LOUnion.uidMapping, 
LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)
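The copy contract above can be sketched with a toy operator. This is a hypothetical illustration of the proposed semantics; ToyOperator and its fields are made up for this sketch and are not Pig's actual classes.

```java
// Hypothetical sketch of the proposed copy(newPlan, keepUid) contract.
// ToyOperator and its fields are illustrative, not Pig's actual classes.
class ToyOperator {
    String name;     // ordinary state: shallow-copied as-is
    Object plan;     // owning plan: replaced by newPlan, never copied
    Long uidSchema;  // stand-in for uid-related state, kept only if keepUid

    ToyOperator(String name, Object plan, Long uidSchema) {
        this.name = name;
        this.plan = plan;
        this.uidSchema = uidSchema;
    }

    // Shallow copy attached to newPlan; uid state survives only when keepUid.
    ToyOperator copy(Object newPlan, boolean keepUid) {
        return new ToyOperator(name, newPlan, keepUid ? uidSchema : null);
    }
}
```

The point of the flag is that a rule can decide per call-site whether the clone should keep the uid bookkeeping or be assigned fresh uids later.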

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, 
boolean keepUid);
LogicalExpressionPlan copyAbove(LogicalExpression leave, 
LogicalRelationalOperator attachedRelationalOp, boolean keepUid);
LogicalExpressionPlan copyBelow(LogicalExpression root, 
LogicalRelationalOperator attachedRelationalOp, boolean keepUid);
{code}
* Create a new logical expression plan and copy the expression operators along 
with their connections, using the same keepUid flag
* Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp 
parameter

{code}
Pair<List<Operator>, List<Operator>> merge(LogicalExpressionPlan plan, 
LogicalRelationalOperator attachedRelationalOp);
{code}
* Merge plan into the current logical expression plan as an independent tree
* attachedRelationalOp is the destination operator the new logical expression 
plan is attached to
* return the sources/sinks of this independent tree
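The merge semantics can be sketched on a toy string-keyed plan (an illustration of splicing one DAG into another as a disconnected tree and reporting its sources/sinks, analogous to the proposed Pair<List<Operator>, List<Operator>> return; ToyPlan is not Pig's actual OperatorPlan):

```java
// Toy sketch of the proposed merge: splice another plan's nodes and edges
// into this plan as an independent (disconnected) tree, and report that
// tree's sources and sinks. Illustrative only.
import java.util.*;

class ToyPlan {
    final Map<String, List<String>> succ = new LinkedHashMap<>();

    void addNode(String n) { succ.computeIfAbsent(n, k -> new ArrayList<>()); }

    void connect(String from, String to) {
        addNode(from);
        addNode(to);
        succ.get(from).add(to);
    }

    // Copy other's nodes/edges in; return [sources, sinks] of the merged tree.
    List<List<String>> merge(ToyPlan other) {
        Set<String> hasPred = new HashSet<>();
        for (Map.Entry<String, List<String>> e : other.succ.entrySet()) {
            addNode(e.getKey());
            for (String to : e.getValue()) {
                connect(e.getKey(), to);
                hasPred.add(to);
            }
        }
        List<String> sources = new ArrayList<>();
        List<String> sinks = new ArrayList<>();
        for (String n : other.succ.keySet()) {
            if (!hasPred.contains(n)) sources.add(n);      // no predecessor
            if (other.succ.get(n).isEmpty()) sinks.add(n); // no successor
        }
        return Arrays.asList(sources, sinks);
    }
}
```

Returning the sources and sinks lets the caller wire the spliced tree into the surrounding plan afterwards.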


LogicalPlan.java
{code}
LogicalPlan copy(LOForEach foreach, boolean keepUid);
LogicalPlan copyAbove(LogicalRelationalOperator leave, LOForEach foreach, 
boolean keepUid);
LogicalPlan copyBelow(LogicalRelationalOperator root, LOForEach foreach, 
boolean keepUid);
{code}
* Main use case is to copy the inner plan of a ForEach
* Create a new logical plan and copy the relational operators along with their 
connections
* Copy all expression plans inside the relational operators, setting plan and 
attachedRelationalOp properly
* If the plan is a ForEach inner plan, the foreach param is the destination 
ForEach operator; otherwise, pass null
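The copyBelow variant can be sketched as a reachability walk: only the subtree hanging below a given root is cloned into the new plan. This toy version works on string graphs for illustration; Pig's real implementation would work on OperatorPlan.

```java
// Toy sketch of the copyBelow idea: collect the operators reachable from a
// given root, i.e. the subtree that copyBelow would clone into the new plan.
import java.util.*;

class SubPlanCopier {
    // Return the nodes reachable from root (root included).
    static Set<String> below(Map<String, List<String>> succ, String root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(root);
        while (!work.isEmpty()) {
            String n = work.pop();
            if (!seen.add(n)) continue;  // already visited
            for (String s : succ.getOrDefault(n, Collections.emptyList())) {
                work.push(s);
            }
        }
        return seen;
    }
}
```

copyAbove would be the mirror image, walking predecessors instead of successors.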

{code}
Pair<List<Operator>, List<Operator>> merge(LogicalPlan plan, LOForEach foreach);
{code}
* Merge plan into the current logical plan as an independent tree
* foreach is the destination LOForEach if the destination plan is a ForEach 
inner plan; otherwise, pass null
* return the sources/sinks of this independent tree


  was:
We sometimes need to copy a logical operator/plan when writing an optimization 
rule. Currently, copying an operator/plan is awkward. We need to write some 
utilities to facilitate this process. Swati contributed PIG-1510, but we feel 
it still cannot address most use cases. I propose to add some more utilities 
to the new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, 
uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema and 
uid-related fields)
* Set the plan to newPlan
* If the operator has an inner plan/expression plan, copy the whole inner plan 
with the same keepUid flag (in particular, LOInnerLoad will copy its inner 
project with the same keepUid flag)
* If keepUid is true, further copy the uid-related fields (LOUnion.uidMapping, 
LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, 
boolean keepUid);
{code}
* Copy the expression operators along with their connections, using the same 
keepUid flag
* Set all ProjectExpression.attachedRelationalOp to the attachedRelationalOp 
parameter

{code}
List<Operator> merge(LogicalExpressionPlan plan);
{code}
* Merge plan into the current logical expression plan as an independent tree
* return the sources of this independent tree


LogicalPlan.java
{code}
LogicalPlan copy(boolean keepUid);
{code}
* Main use 

[jira] Updated: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6

2010-09-01 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1508:


Fix Version/s: 0.8.0

 Make 'docs' target (forrest) work with Java 1.6
 ---

 Key: PIG-1508
 URL: https://issues.apache.org/jira/browse/PIG-1508
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Carl Steinbach
Assignee: Carl Steinbach
 Fix For: 0.8.0

 Attachments: PIG-1508.patch.txt


 FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with 
 Java 1.6.
 The same ticket also suggests a workaround: disabling sitemap and stylesheet 
 validation by setting the forrest.validate.sitemap and 
 forrest.validate.stylesheets properties to false.
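A sketch of where that workaround would typically live, assuming a standard Forrest project layout (the property names are the ones cited above; the file location is an assumption):

```properties
# FOR-984 workaround: skip the validation that fails under Java 1.6
# (assumed to go in the project's forrest.properties)
forrest.validate.sitemap=false
forrest.validate.stylesheets=false
```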

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1580) new syntax for native mapreduce operator

2010-09-01 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair resolved PIG-1580.


Resolution: Won't Fix

In the case of the 'hadoop jar' command, the files to ship to the distributed 
cache are specified using the -files command line option. Since typical users 
would be moving an existing map-reduce job that they were running with 'hadoop 
jar', it is easier for them to copy the existing command line options than to 
use the SHIP/CACHE clauses in the proposed syntax.

If we don't have the SHIP/CACHE clauses in the mapreduce operator, there is 
very little similarity between the streaming and mapreduce operators. It would 
be better to use LOAD/STORE instead of INPUT/OUTPUT in the mapreduce syntax, 
as they specify the load/store functions and not the streaming 
deserializer/serializer.

So I think it is better to go back to the old syntax. Resolving this jira as 
won't-fix.


 new syntax for native mapreduce operator
 

 Key: PIG-1580
 URL: https://issues.apache.org/jira/browse/PIG-1580
 Project: Pig
  Issue Type: Task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 mapreduce operator (PIG-506) and stream operator have some similarities. It 
 makes sense to use a similar syntax for both.
 Alan has proposed the following syntax for the mapreduce operator, and that 
 we move the stream operator to a similar syntax in a future release.
 MAPREDUCE id jar
  INPUT  'path' USING LoadFunc  
 OUTPUT  'path' USING StoreFunc
 [SHIP 'path' [, 'path' ...]]
 [CACHE 'dfs_path#dfs_file' [, 'dfs_path#dfs_file' ...]]




[jira] Created: (PIG-1589) add test cases for mapreduce operator which use distributed cache

2010-09-01 Thread Thejas M Nair (JIRA)
add test cases for mapreduce operator which use distributed cache
-

 Key: PIG-1589
 URL: https://issues.apache.org/jira/browse/PIG-1589
 Project: Pig
  Issue Type: Task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


'-files filename' can be specified in the parameters for the mapreduce 
operator to send files to the distributed cache. Need to add test cases for 
that.





[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1543:


Status: Patch Available  (was: Open)

 IsEmpty returns the wrong value after using LIMIT
 -

 Key: PIG-1543
 URL: https://issues.apache.org/jira/browse/PIG-1543
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Hu
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1543-1.patch


 1. Two input files:
 1a: limit_empty.input_a
 1
 1
 1
 1b: limit_empty.input_b
 2
 2
 2. The pig script: limit_empty.pig
 -- A contains only 1's, B contains only 2's
 A = load 'limit_empty.input_a' as (a1:int);
 B = load 'limit_empty.input_b' as (b1:int);
 C = COGROUP A by a1, B by b1;
 D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
 COUNT(B);
 store D into 'limit_empty.output/d';
 -- After the script is done, we see the right results:
 -- {(1),(1),(1)}   {}  1   0   3   0
 -- {} {(2),(2)}  0   1   0   2
 C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
 D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
 0:1), COUNT(Alim), COUNT(Blim);
 store D1 into 'limit_empty.output/d1';
 -- After the script is done, we see the unexpected results:
 -- {(1)}   {}1   1   1   0
 -- {}  {(2)} 1   1   0   1
 dump D;
 dump D1;
 3. Run the script and redirect the stdout (2 dumps) to a file. There are two 
 issues:
 The major one:
 IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
 IsEmpty() returns the correct value in limit_empty.output/d/*.
 The difference is that one has had LIMIT applied before using IsEmpty().
 The minor one:
 The redirected output only contains the first dump:
 ({(1),(1),(1)},{},1,0,3L,0L)
 ({},{(2),(2)},0,1,0L,2L)
 We expect two more lines like:
 ({(1)},{},1,1,1L,0L)
 ({},{(2)},1,1,0L,1L)
 Besides, there is an error that says:
 [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
 java.lang.ClassCastException: java.lang.Integer cannot be cast to 
 org.apache.pig.data.Tuple




[jira] Created: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-01 Thread Ashutosh Chauhan (JIRA)
Use POMergeJoin for Left Outer Join when join using 'merge'
---

 Key: PIG-1590
 URL: https://issues.apache.org/jira/browse/PIG-1590
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Priority: Minor


C = join A by $0 left, B by $0 using 'merge';

will result in a map-side sort-merge join. Internally, it will translate to 
POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few restrictions 
on its loaders (A and B in this case), which is cumbersome. Currently, only 
Zebra is known to satisfy all those requirements. It will be better to use 
POMergeJoin in this case, since it has far fewer requirements on its loader. 
Importantly, it works with PigStorage. Plus, POMergeJoin will be faster than 
POMergeCogroup + FE-Flatten.




[jira] Commented: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905207#action_12905207
 ] 

Ashutosh Chauhan commented on PIG-1590:
---

It will entail changes in POMergeJoin and LogToPhyTranslationVisitor.

 Use POMergeJoin for Left Outer Join when join using 'merge'
 ---

 Key: PIG-1590
 URL: https://issues.apache.org/jira/browse/PIG-1590
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Priority: Minor

 C = join A by $0 left, B by $0 using 'merge';
 will result in a map-side sort-merge join. Internally, it will translate to 
 POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few 
 restrictions on its loaders (A and B in this case), which is cumbersome. 
 Currently, only Zebra is known to satisfy all those requirements. It will be 
 better to use POMergeJoin in this case, since it has far fewer requirements 
 on its loader. Importantly, it works with PigStorage. Plus, POMergeJoin will 
 be faster than POMergeCogroup + FE-Flatten.




[jira] Created: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)
ORDER BY distribution is uneven when record size is correlated with order key
-

 Key: PIG-1592
 URL: https://issues.apache.org/jira/browse/PIG-1592
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
 Fix For: 0.9.0


The partitioner contributed in PIG-545 distributes the order key space between 
partitions so that each partition gets approximately the same number of keys, 
even when the keys have a non-uniform distribution over the key space.

Unfortunately this still allows for severe partition imbalance when record size 
is correlated with the order key. By way of motivating example, consider this 
script which attempts to produce a list of genuses based on how many species 
each genus contains:

{code}
set default_parallel 60;
critters = load 'biodata' as (genus, species);
genus_counts = foreach (group critters by genus) generate group as genus, 
COUNT(critters) as num_species, critters;
ordered_genuses = order genus_counts by num_species desc;
store ordered_genuses
{code}

The higher the value of num_species, the more species tuples will be contained 
in the critters bag, and the wider the row. This can cause a severe processing 
imbalance, as the partition processing the records with the highest values of 
num_species will have the same number of *records* as the partition 
processing the lowest, but it will have far more actual *bytes* to work 
on.




[jira] Commented: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905220#action_12905220
 ] 

Dmitriy V. Ryaboy commented on PIG-1592:


One proposal is to simply change the default weighted range partitioner to take 
into account the record size. If record size is uniform, or uniformly 
distributed, or non-uniformly distributed but independent of the order key, 
this change shouldn't materially affect the distributions created for data sets 
not covered by this issue.

 ORDER BY distribution is uneven when record size is correlated with order key
 -

 Key: PIG-1592
 URL: https://issues.apache.org/jira/browse/PIG-1592
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
 Fix For: 0.9.0


 The partitioner contributed in PIG-545 distributes the order key space 
 between partitions so that each partition gets approximately the same number 
 of keys, even when the keys have a non-uniform distribution over the key 
 space.
 Unfortunately this still allows for severe partition imbalance when record 
 size is correlated with the order key. By way of motivating example, consider 
 this script which attempts to produce a list of genuses based on how many 
 species each genus contains:
 {code}
 set default_parallel 60;
 critters = load 'biodata' as (genus, species);
 genus_counts = foreach (group critters by genus) generate group as genus, 
 COUNT(critters) as num_species, critters;
 ordered_genuses = order genus_counts by num_species desc;
 store ordered_genuses
 {code}
 The higher the value of num_species, the more species tuples will be 
 contained in the critters bag, and the wider the row. This can cause a severe 
 processing imbalance, as the partition processing the records with the 
 highest values of num_species will have the same number of *records* as the 
 partition processing the lowest, but it will have far more actual 
 *bytes* to work on.




[jira] Commented: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

2010-09-01 Thread Laukik Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905224#action_12905224
 ] 

Laukik Chitnis commented on PIG-1588:
-

PIG-1586 is about investigating whether the shell script is messing up the 
parameter values. This jira is about the format of the parameter value itself. 
Even when we pass the parameter value through a pig param file, we need to 
escape $0, $1, etc. as \\$0, \\$1, which was not the case in earlier versions 
of Pig.
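One way such breakage can arise is Java's regex replacement syntax, where $0, $1, ... in a replacement string are group references. This is only an illustration of the failure mode, an assumption about the cause rather than a claim about Pig's actual substitution code; the $INPUT template and method names are made up for the demo.

```java
// Demo: Java's Matcher treats $0, $1, ... in a replacement string as group
// references, so a raw parameter value containing them gets rewritten unless
// it is quoted with Matcher.quoteReplacement.
import java.util.regex.Matcher;

class DollarEscapeDemo {
    // naive substitution: a "$0" inside value is read as "the whole match"
    static String naive(String template, String value) {
        return template.replaceAll("\\$INPUT", value);
    }

    // safe substitution: quoteReplacement neutralizes $ and \ in the value
    static String quoted(String template, String value) {
        return template.replaceAll("\\$INPUT", Matcher.quoteReplacement(value));
    }
}
```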

 Parameter pre-processing of values containing pig positional variables ($0, 
 $1 etc)
 ---

 Key: PIG-1588
 URL: https://issues.apache.org/jira/browse/PIG-1588
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Laukik Chitnis
 Fix For: 0.7.0


 Pig 0.7 requires the positional variables to be escaped by a \\ when passed 
 as part of a parameter value (either through a cmd line param or through a 
 param_file), which was not the case in Pig 0.6. Assuming that this was not an 
 intended breakage of backward compatibility (could not find it in the release 
 notes), this would be a bug.
 For example, we need to pass
 INPUT=CountWords(\\$0,\\$1,\\$2)
 instead of simply
 INPUT=CountWords($0,$1,$2)




[jira] Updated: (PIG-1594) NullPointerException in new logical planner

2010-09-01 Thread Andrew Hitchcock (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Hitchcock updated PIG-1594:
--

Description: 
I've been testing the trunk version of Pig on Elastic MapReduce against our log 
processing sample application(1). When I try to run the query it throws a 
NullPointerException and suggests I disable the new logical plan. Disabling it 
works and the script succeeds. Here is the query I'm trying to run:

{{register file:/home/hadoop/lib/pig/piggybank.jar
  DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
  RAW_LOGS = LOAD '$INPUT' USING TextLoader as (line:chararray);
  LOGS_BASE = foreach RAW_LOGS generate FLATTEN(EXTRACT(line, '^(\\S+) (\\S+) 
(\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" 
"([^"]*)"')) as (remoteAddr:chararray, remoteLogname:chararray, user:chararray, 
time:chararray, request:chararray, status:int, bytes_string:chararray, 
referrer:chararray, browser:chararray);
  REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
  FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer 
matches '.*google.*';
  SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, 
'.*[&\\?]q=([^&]+).*')) as terms:chararray;
  SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL;
  SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE $0, 
COUNT($1) as num;
  SEARCH_TERMS_COUNT_SORTED = LIMIT(ORDER SEARCH_TERMS_COUNT BY num DESC) 50;
  STORE SEARCH_TERMS_COUNT_SORTED into '$OUTPUT';}}

And here is the stack trace that results:

{{ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.

org.apache.pig.backend.executionengine.ExecException: ERROR 2042: Error in new 
logical plan. Try -Dpig.usenewlogicalplan=false.
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:285)
at org.apache.pig.PigServer.compilePp(PigServer.java:1301)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1154)
at org.apache.pig.PigServer.execute(PigServer.java:1148)
at org.apache.pig.PigServer.access$100(PigServer.java:123)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1464)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:350)
at org.apache.pig.PigServer.executeBatch(PigServer.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:111)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:491)
at org.apache.pig.Main.main(Main.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NullPointerException
at org.apache.pig.EvalFunc.getSchemaName(EvalFunc.java:76)
at 
org.apache.pig.piggybank.impl.ErrorCatchingBase.outputSchema(ErrorCatchingBase.java:76)
at 
org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:111)
at 
org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:175)
at 
org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
at 
org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:55)
at 
org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:69)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at 
org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:87)
at 
org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:149)
at 
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74)
at 
org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:76)
at 
org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
at 
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:247)
... 18 more
}}




1. 

[jira] Updated: (PIG-1594) NullPointerException in new logical planner

2010-09-01 Thread Andrew Hitchcock (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Hitchcock updated PIG-1594:
--

Description: 
I've been testing the trunk version of Pig on Elastic MapReduce against our log 
processing sample application(1). When I try to run the query it throws a 
NullPointerException and suggests I disable the new logical plan. Disabling it 
works and the script succeeds. Here is the query I'm trying to run:

{code}
register file:/home/hadoop/lib/pig/piggybank.jar
  DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
  RAW_LOGS = LOAD '$INPUT' USING TextLoader as (line:chararray);
  LOGS_BASE = foreach RAW_LOGS generate FLATTEN(EXTRACT(line, '^(\\S+) (\\S+) 
(\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" 
"([^"]*)"')) as (remoteAddr:chararray, remoteLogname:chararray, user:chararray, 
time:chararray, request:chararray, status:int, bytes_string:chararray, 
referrer:chararray, browser:chararray);
  REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
  FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer 
matches '.*google.*';
  SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, 
'.*[&\\?]q=([^&]+).*')) as terms:chararray;
  SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL;
  SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE $0, 
COUNT($1) as num;
  SEARCH_TERMS_COUNT_SORTED = LIMIT(ORDER SEARCH_TERMS_COUNT BY num DESC) 50;
  STORE SEARCH_TERMS_COUNT_SORTED into '$OUTPUT';
{code}

And here is the stack trace that results:

{code}
ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.

org.apache.pig.backend.executionengine.ExecException: ERROR 2042: Error in new 
logical plan. Try -Dpig.usenewlogicalplan=false.
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:285)
at org.apache.pig.PigServer.compilePp(PigServer.java:1301)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1154)
at org.apache.pig.PigServer.execute(PigServer.java:1148)
at org.apache.pig.PigServer.access$100(PigServer.java:123)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1464)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:350)
at org.apache.pig.PigServer.executeBatch(PigServer.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:111)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:491)
at org.apache.pig.Main.main(Main.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NullPointerException
at org.apache.pig.EvalFunc.getSchemaName(EvalFunc.java:76)
at 
org.apache.pig.piggybank.impl.ErrorCatchingBase.outputSchema(ErrorCatchingBase.java:76)
at 
org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:111)
at 
org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:175)
at 
org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
at 
org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:55)
at 
org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:69)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at 
org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:87)
at 
org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:149)
at 
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74)
at 
org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:76)
at 
org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:71)
at 
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:247)
... 18 more

{code}




1. 

[jira] Updated: (PIG-1594) NullPointerException in new logical planner

2010-09-01 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1594:


 Assignee: Daniel Dai
Fix Version/s: 0.8.0

 NullPointerException in new logical planner
 ---

 Key: PIG-1594
 URL: https://issues.apache.org/jira/browse/PIG-1594
 Project: Pig
  Issue Type: Bug
Reporter: Andrew Hitchcock
Assignee: Daniel Dai
 Fix For: 0.8.0


 I've been testing the trunk version of Pig on Elastic MapReduce against our 
 log processing sample application(1). When I try to run the query it throws a 
 NullPointerException and suggests I disable the new logical plan. Disabling 
 it works and the script succeeds. Here is the query I'm trying to run:
 {code}
 register file:/home/hadoop/lib/pig/piggybank.jar
   DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
   RAW_LOGS = LOAD '$INPUT' USING TextLoader as (line:chararray);
   LOGS_BASE = foreach RAW_LOGS generate FLATTEN(EXTRACT(line, '^(\\S+) (\\S+) 
 (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" 
 "([^"]*)"')) as (remoteAddr:chararray, remoteLogname:chararray, 
 user:chararray, time:chararray, request:chararray, status:int, 
 bytes_string:chararray, referrer:chararray, browser:chararray);
   REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
   FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer 
 matches '.*google.*';
   SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, 
 '.*[&\\?]q=([^&]+).*')) as terms:chararray;
   SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL;
   SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE 
 $0, COUNT($1) as num;
   SEARCH_TERMS_COUNT_SORTED = LIMIT(ORDER SEARCH_TERMS_COUNT BY num DESC) 50;
   STORE SEARCH_TERMS_COUNT_SORTED into '$OUTPUT';
 {code}
 And here is the stack trace that results:
 {code}
 ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
  org.apache.pig.backend.executionengine.ExecException: ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
  at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:285)
  at org.apache.pig.PigServer.compilePp(PigServer.java:1301)
  at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1154)
  at org.apache.pig.PigServer.execute(PigServer.java:1148)
  at org.apache.pig.PigServer.access$100(PigServer.java:123)
  at org.apache.pig.PigServer$Graph.execute(PigServer.java:1464)
  at org.apache.pig.PigServer.executeBatchEx(PigServer.java:350)
  at org.apache.pig.PigServer.executeBatch(PigServer.java:324)
  at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:111)
  at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
  at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
  at org.apache.pig.Main.run(Main.java:491)
  at org.apache.pig.Main.main(Main.java:107)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
  Caused by: java.lang.NullPointerException
  at org.apache.pig.EvalFunc.getSchemaName(EvalFunc.java:76)
  at org.apache.pig.piggybank.impl.ErrorCatchingBase.outputSchema(ErrorCatchingBase.java:76)
  at org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:111)
  at org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:175)
  at org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
  at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:55)
  at org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:69)
  at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
  at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:87)
  at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:149)
  at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:74)
  at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:76)
  at ...
  {code}

[jira] Commented: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-09-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905293#action_12905293
 ] 

Daniel Dai commented on PIG-1572:
-

Patch looks good. One minor doubt: once we migrate to the new logical plan, 
UserFuncExpression already has the necessary cast inserted, so it seems we do not need 
to change the new logical plan's UserFuncExpression.getFieldSchema(). Am I right?

 change default datatype when relations are used as scalar to bytearray
 --

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1572.1.patch, PIG-1572.2.patch


 When relations are cast to scalars, the current default type is chararray. 
 This is inconsistent with the behavior in the rest of Pig Latin.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-09-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

 piggybank unit test TestLookupInFiles is broken
 ---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1583-1.patch


 Error message:
 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from attempt_20100831093139211_0001_m_00_3: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles [LookupInFiles : Cannot open file one]
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.io.IOException: LookupInFiles : Cannot open file one
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
 ... 10 more
 Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one does not exist
 at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
 at org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
 at org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
 ... 13 more




[jira] Updated: (PIG-1199) help includes obsolete options

2010-09-01 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1199:


Release Note: 
Help now takes a 'properties' keyword to show all Java properties supported by Pig:

The following properties are supported:
Logging:
verbose=true|false; default is false. This property is the same as the -v switch.
brief=true|false; default is false. This property is the same as the -b switch.
debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.
...
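
A usage sketch based on the release note above (assuming the standard bin/pig launcher accepts the keyword this way):
{code}
pig -h properties
{code}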

 help includes obsolete options
 --

 Key: PIG-1199
 URL: https://issues.apache.org/jira/browse/PIG-1199
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1199.patch, PIG-1199_2.patch


 This is confusing to users.




[jira] Commented: (PIG-1585) Add new properties to help and documentation

2010-09-01 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905323#action_12905323
 ] 

Olga Natkovich commented on PIG-1585:
-

Since this is just a minor cosmetic patch, I plan to commit the 
changes to both the branch and the trunk without tests or review.

 Add new properties to help and documentation
 

 Key: PIG-1585
 URL: https://issues.apache.org/jira/browse/PIG-1585
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1585.patch


 New properties:
 Compression:
 pig.tmpfilecompression - default is false; determines whether temporary files should be compressed. If true, pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig only accepts gz and lzo as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. 
 Combining small files:
 pig.noSplitCombination - disables combining multiple small files up to the block size
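
 As a concrete illustration, these settings could be supplied in a properties file (e.g. conf/pig.properties; the values shown are only an example, not defaults):
 {code}
 # Enable compression of Pig's temporary (intermediate) files.
 pig.tmpfilecompression=true
 # Codec for temporary files; gz and lzo are the accepted values.
 pig.tmpfilecompression.codec=gz
 {code}
 They could equally be passed on the command line, e.g. -Dpig.tmpfilecompression=true.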




[jira] Updated: (PIG-1585) Add new properties to help and documentation

2010-09-01 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1585:


Attachment: PIG-1585.patch

 Add new properties to help and documentation
 

 Key: PIG-1585
 URL: https://issues.apache.org/jira/browse/PIG-1585
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1585.patch


 New properties:
 Compression:
 pig.tmpfilecompression - default is false; determines whether temporary files should be compressed. If true, pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig only accepts gz and lzo as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. 
 Combining small files:
 pig.noSplitCombination - disables combining multiple small files up to the block size




[jira] Commented: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-09-01 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905324#action_12905324
 ] 

Thejas M Nair commented on PIG-1572:


Yes, the changes to UserFuncExpression.getFieldSchema() are no longer required 
because the cast is inserted to the appropriate type. But while thinking about that, 
I believe I have found an issue with the handling of non-PigStorage load functions.
Since this patch addresses a bunch of issues, I will commit it and create a new 
jira to address that, and also look at the utility of this change to 
UserFuncExpression.getFieldSchema().



 change default datatype when relations are used as scalar to bytearray
 --

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1572.1.patch, PIG-1572.2.patch


 When relations are cast to scalars, the current default type is chararray. 
 This is inconsistent with the behavior in the rest of Pig Latin.




[jira] Resolved: (PIG-1585) Add new properties to help and documentation

2010-09-01 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1585.
-

Resolution: Fixed

Patch committed to both trunk and 0.8 branch. I also added 
LogicalExpressionSimplifier to the help.

 Add new properties to help and documentation
 

 Key: PIG-1585
 URL: https://issues.apache.org/jira/browse/PIG-1585
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1585.patch


 New properties:
 Compression:
 pig.tmpfilecompression - default is false; determines whether temporary files should be compressed. If true, pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig only accepts gz and lzo as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. 
 Combining small files:
 pig.noSplitCombination - disables combining multiple small files up to the block size




[jira] Updated: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-09-01 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1572:
---

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to trunk.


 change default datatype when relations are used as scalar to bytearray
 --

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1572.1.patch, PIG-1572.2.patch


 When relations are cast to scalars, the current default type is chararray. 
 This is inconsistent with the behavior in the rest of Pig Latin.




[jira] Commented: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-09-01 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905332#action_12905332
 ] 

Thejas M Nair commented on PIG-1572:


Patch committed to 0.8 branch as well.

 change default datatype when relations are used as scalar to bytearray
 --

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1572.1.patch, PIG-1572.2.patch


 When relations are cast to scalars, the current default type is chararray. 
 This is inconsistent with the behavior in the rest of Pig Latin.




[jira] Updated: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-09-01 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1572:
---

Release Note: 
This changes the release note in PIG-1434, specifically the part: "Also, please, note that 
when the schema can't be inferred chararray rather than bytearray is used."

The datatype of bytearray is now used when the schema can't be inferred.



 change default datatype when relations are used as scalar to bytearray
 --

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1572.1.patch, PIG-1572.2.patch


 When relations are cast to scalars, the current default type is chararray. 
 This is inconsistent with the behavior in the rest of Pig Latin.




[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-09-01 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1434:
---

Release Note: 
PIG-1434 adds functionality that allows casting the elements of a single-tuple 
relation into a scalar value. The primary use case for this is using the values of 
global aggregates in follow-up computations. For instance:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total;
D = foreach A generate userid, clicks/(double)C.total;
dump D;

This example computes the % of clicks belonging to a particular user. Note that 
if the SUM is not given a name, a position can be used as well 
(userid, clicks/(double)C.$0). Also note that if an explicit cast is not used, an 
implicit cast will be inserted according to regular Pig rules. Also, please 
note that when the schema can't be inferred, bytearray is used.

The relation can be used in any place where an expression of the type would 
make sense. This includes FOREACH, FILTER, and SPLIT.

A multi-field tuple can also be used:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total, COUNT(A) as cnt;
D = FILTER A by clicks > C.total/3;
E = foreach D generate userid, clicks/(double)C.total, cnt;
dump E;

If a relation contains more than a single tuple, a runtime error is generated: 
"Scalar has more than one row in the output".



  was:
PIG-1434 adds functionality that allows casting the elements of a single-tuple 
relation into a scalar value. The primary use case for this is using the values of 
global aggregates in follow-up computations. For instance:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total;
D = foreach A generate userid, clicks/(double)C.total;
dump D;

This example computes the % of clicks belonging to a particular user. Note that 
if the SUM is not given a name, a position can be used as well 
(userid, clicks/(double)C.$0). Also note that if an explicit cast is not used, an 
implicit cast will be inserted according to regular Pig rules. Also, please 
note that when the schema can't be inferred, chararray rather than bytearray is 
used.

The relation can be used in any place where an expression of the type would 
make sense. This includes FOREACH, FILTER, and SPLIT.

A multi-field tuple can also be used:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total, COUNT(A) as cnt;
D = FILTER A by clicks > C.total/3;
E = foreach D generate userid, clicks/(double)C.total, cnt;
dump E;

If a relation contains more than a single tuple, a runtime error is generated: 
"Scalar has more than one row in the output".




Changed the release note to incorporate the change of default datatype to 
bytearray in PIG-1572

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, 
 ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations containing a single value, or an error will be reported.
 (2) Name resolution is needed, since relation X might have a field named C, in 
 which case that field takes precedence.
 (3) Y will look for the C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into a scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) Convert the cast to the UDF
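
 In Pig Latin terms, the store-plus-UDF rewrite in (1) and (2) would look roughly like this; the UDF name ReadScalarUDF and the temp path are illustrative placeholders, not the actual implementation:
 {code}
 -- Y = foreach X generate $1/(long) C;    -- original script, conceptually becomes:
 STORE C INTO '/tmp/scalar_of_C';          -- (1) store C
 Y = foreach X generate $1/(long) ReadScalarUDF('/tmp/scalar_of_C');  -- (2) cast replaced by a UDF call
 {code}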




[jira] Commented: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-09-01 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905346#action_12905346
 ] 

Thejas M Nair commented on PIG-1572:


bq. Yes, the changes to UserFuncExpression.getFieldSchema() are no longer 
required because the cast is inserted to the appropriate type. But while thinking 
about that, I believe I have found an issue with the handling of non-PigStorage 
load functions. Since this patch addresses a bunch of issues, I will commit it and 
create a new jira to address that, and also look at the utility of this change to 
UserFuncExpression.getFieldSchema().

Created PIG-1595 to address the issue.

 change default datatype when relations are used as scalar to bytearray
 --

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1572.1.patch, PIG-1572.2.patch


 When relations are cast to scalars, the current default type is chararray. 
 This is inconsistent with the behavior in the rest of Pig Latin.




[jira] Updated: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-01 Thread George P. Stathis (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George P. Stathis updated PIG-1596:
---

Description: 
I'm not a committer, but I'd like to suggest the attached patch to handle 
loading hbase rows containing null cell values (since hbase is all about 
sparsely populated data rows). As it stands, a DataByteArray can be created with 
a null mData if a cell has no value, which causes NPEs when simply attempting to 
load a row containing the null cell in question.

PS: the attached patch also contains a slight change to the bin/pig executable 
to point to the build/pig\-\*\-SNAPSHOT.jar and not the build/pig\-\*\-dev.jar 
(the latter no longer seems to exist). If you prefer a separate patch for this, 
I'll be happy to submit it.

  was:
I'm not a committer, but I'd like to suggest the attached patch to handle 
loading hbase rows containing null cell values (since hbase is all about 
sparsely populated data rows). As it stands, a DataByteArray can be created with 
a null mData if a cell has no value, which causes NPEs when simply attempting to 
load a row containing the null cell in question.

PS: the attached patch also contains a slight change to the bin/pig executable 
to point to the build/pig-*-SNAPSHOT.jar and not the build/pig-*-dev.jar (the 
latter no longer seems to exist). If you prefer a separate patch for this, I'll 
be happy to submit it.


 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig\-\*\-SNAPSHOT.jar and not the 
 build/pig\-\*\-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.




[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905388#action_12905388
 ] 

Jeff Zhang commented on PIG-1596:
-

George, thanks for your suggestion. I believe you are using the latest 
HBaseStorage in trunk. What you pointed out is really a problem, and I have 
another solution for it: if the cell is null, we put an empty byte array in the 
DataByteArray. I think it should be the LoadFunc's responsibility to handle 
null cells.




 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig\-\*\-SNAPSHOT.jar and not the 
 build/pig\-\*\-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.




[jira] Updated: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-01 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1596:


Attachment: PIG_1596.patch

Attaching a patch (modifies HBaseStorage and adds a new test case).

 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch, PIG_1596.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig\-\*\-SNAPSHOT.jar and not the 
 build/pig\-\*\-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.




[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905404#action_12905404
 ] 

Dmitriy V. Ryaboy commented on PIG-1596:


Jeff,
I think it's clearer if you insert null into the tuple, not an empty 
DataByteArray (and assertNull in the test).

George, the SNAPSHOT thing is a real bug; thanks for catching that. This 
happened when Pig was made available through Maven in PIG-1334.

I'll create a separate ticket for that.

 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch, PIG_1596.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig\-\*\-SNAPSHOT.jar and not the 
 build/pig\-\*\-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.




[jira] Created: (PIG-1597) Development snapshot jar no longer picked up by bin/pig

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)
Development snapshot jar no longer picked up by bin/pig
---

 Key: PIG-1597
 URL: https://issues.apache.org/jira/browse/PIG-1597
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0


As George Stathis pointed out in PIG-1596, bin/pig no longer picks up 
development pig jars. This appears to have been introduced in PIG-1334, when the 
jar was renamed from -dev- to -SNAPSHOT-.
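
If so, the fix is presumably a small change to the jar glob in bin/pig; a sketch of the relevant fragment (variable names illustrative, not the actual patch):
{code}
# Before: PIG_JAR=$PIG_HOME/build/pig-*-dev.jar   (no longer matches anything)
# After: pick up the Maven-style snapshot name instead.
for jar in "$PIG_HOME"/build/pig-*-SNAPSHOT.jar; do
    PIG_JAR=$jar
done
{code}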




[jira] Updated: (PIG-1597) Development snapshot jar no longer picked up by bin/pig

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1597:
---

Status: Patch Available  (was: Open)

 Development snapshot jar no longer picked up by bin/pig
 ---

 Key: PIG-1597
 URL: https://issues.apache.org/jira/browse/PIG-1597
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1597.patch


 As George Stathis pointed out in PIG-1596, bin/pig no longer picks up 
 development pig jars. This appears to have been introduced in PIG-1334, when 
 the jar was renamed from -dev- to -SNAPSHOT-.




[jira] Updated: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-01 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1596:


Attachment: PIG_1596_2.patch

Dmitriy, you are right. I updated the patch according to your suggestion.



 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch, PIG_1596.patch, 
 PIG_1596_2.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig\-\*\-SNAPSHOT.jar and not the 
 build/pig\-\*\-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.
