[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774926#action_12774926
 ] 

Hadoop QA commented on PIG-1038:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12424332/PIG-1038-2.patch
  against trunk revision 833549.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 205 javac compiler warnings (more 
than the trunk's current 199 warnings).

-1 findbugs.  The patch appears to introduce 1 new Findbugs warning.

-1 release audit.  The applied patch generated 319 release audit warnings 
(more than the trunk's current 317 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/console

This message is automatically generated.

> Optimize nested distinct/sort to use secondary key
> --
>
> Key: PIG-1038
> URL: https://issues.apache.org/jira/browse/PIG-1038
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.4.0
>Reporter: Olga Natkovich
>Assignee: Daniel Dai
> Fix For: 0.6.0
>
> Attachments: PIG-1038-1.patch, PIG-1038-2.patch
>
>
> If a nested foreach plan contains sort/distinct, it is possible to use Hadoop 
> secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
> query. 
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
> D = order A by $1;
> generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
> D = A.$1;
> E = distinct D;
> generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct 
> D" to a special version of distinct, which does not do the sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Accumulator Interface for UDFs

2009-11-09 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775054#action_12775054
 ] 

Ying He commented on PIG-979:
-

Without the patch from PIG-1038, this patch won't compile, so all tests would fail.

> Accumulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.
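
For context, the interface being proposed has roughly the shape sketched below; it matches what eventually shipped as org.apache.pig.Accumulator, but treat the exact signatures here as illustrative rather than authoritative:

{code}
import java.io.IOException;
import org.apache.pig.data.Tuple;

public interface Accumulator<T> {
    // Called repeatedly, each time with a tuple whose bag field holds the
    // next batch of records, instead of once with the entire bag.
    void accumulate(Tuple batch) throws IOException;

    // Called after the last batch to obtain the final result.
    T getValue();

    // Reset internal state so the instance can be reused for the next key.
    void cleanup();
}
{code}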

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra' TableInputFormat

2009-11-09 Thread Chao Wang (JIRA)
[Zebra] to support record(row)-based file split in Zebra' TableInputFormat
--

 Key: PIG-1077
 URL: https://issues.apache.org/jira/browse/PIG-1077
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0


TFile currently supports split by record sequence number (see Jira 
HADOOP-6218). We want to utilize this to provide record(row)-based input split 
support in Zebra.
One prominent benefit is that, in cases where we have very large data files, we 
can create much more fine-grained input splits than before, when we could only 
create one big split for one big file.

In more detail, the new row-based getSplits() works by default (user does not 
specify no. of splits to be generated) as follows: 
1) Select the biggest column group in terms of data size, split all of its 
TFiles according to hdfs block size (64 MB or 128 MB) and get a list of 
physical byte offsets as the output per TFile. For example, let us assume for 
the 1st TFile we get offset1, offset2, ..., offset10; 
2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a 
key-value pair near a byte offset. For the example above, say we get 
recordNum1, recordNum2, ..., recordNum10; 
3) Stitch [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, 
recordNum10], [recordNum10+1, lastRecordNum] splits of all column groups, 
respectively to form 11 record-based input splits for the 1st TFile. 
4) For each input split, we need to create a TFile scanner through: 
TFile.createScannerByRecordNum(long beginRecNum, long endRecNum). 

Note: conversion from byte offset to record number will be done by each mapper, 
rather than being done at the job initialization phase. This is due to 
performance concern since the conversion incurs some TFile reading overhead.
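
The offset-to-record-number stitching in steps 1-3 can be sketched as follows; the TFile calls are as named in the description (per HADOOP-6218), while the reader setup and the range holder are hypothetical glue:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.file.tfile.TFile;

public class RowSplitSketch {

    /** [begin, end] record-number range for one row-based split. */
    public static class RecordRange {
        public final long begin, end;
        public RecordRange(long begin, long end) {
            this.begin = begin;
            this.end = end;
        }
    }

    public static List<RecordRange> toRecordRanges(TFile.Reader reader,
            long[] blockOffsets, long lastRecordNum) throws IOException {
        // Step 2: map each physical byte offset (one per HDFS-block-sized
        // chunk of the biggest column group's TFile) to the record number
        // of a key-value pair near that offset.
        long[] recordNums = new long[blockOffsets.length];
        for (int i = 0; i < blockOffsets.length; i++) {
            recordNums[i] = reader.getRecordNumNear(blockOffsets[i]);
        }
        // Step 3: stitch consecutive record numbers into N+1 ranges; each
        // range later becomes one input split across all column groups,
        // read via TFile.createScannerByRecordNum(begin, end).
        List<RecordRange> ranges = new ArrayList<RecordRange>();
        long begin = 0;
        for (long recNum : recordNums) {
            ranges.add(new RecordRange(begin, recNum));
            begin = recNum + 1;
        }
        ranges.add(new RecordRange(begin, lastRecordNum));
        return ranges;
    }
}
{code}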

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1073) LogicalPlanCloner can't clone plan containing LOJoin

2009-11-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775109#action_12775109
 ] 

Ashutosh Chauhan commented on PIG-1073:
---

The current patch only partially fixes the problem. It seems we have bigger 
problems in the way visiting is done on query plans currently. I am working on 
fixing those.

> LogicalPlanCloner can't clone plan containing LOJoin
> 
>
> Key: PIG-1073
> URL: https://issues.apache.org/jira/browse/PIG-1073
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1073.patch
>
>
> Add following testcase in LogicalPlanBuilder.java
> public void testLogicalPlanCloner() throws CloneNotSupportedException{
> LogicalPlan lp = buildPlan("C = join ( load 'A') by $0, (load 'B') by 
> $0;");
> LogicalPlanCloner cloner = new LogicalPlanCloner(lp);
> cloner.getClonedPlan();
> }
> and this fails with the following stacktrace:
> java.lang.NullPointerException
> at 
> org.apache.pig.impl.logicalLayer.LOVisitor.visit(LOVisitor.java:171)
> at 
> org.apache.pig.impl.logicalLayer.PlanSetter.visit(PlanSetter.java:63)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:213)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.getClonedPlan(LogicalPlanCloneHelper.java:73)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloner.getClonedPlan(LogicalPlanCloner.java:46)
> at 
> org.apache.pig.test.TestLogicalPlanBuilder.testLogicalPlanCloneHelper(TestLogicalPlanBuilder.java:2110)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[VOTE] Branch for Pig 0.6.0 release

2009-11-09 Thread Olga Natkovich
Hi,

I would like to propose that we branch for the Pig 0.6.0 release with the
intent to have a release before the end of the year. We have done a lot of work
since branching for Pig 0.5.0 that we would like to share with users.
This includes changing how bags are spilled onto disk (PIG-975,
PIG-1037), skewed and fragment-replicated outer join plus many other
performance improvements and bug fixes.

Please vote by Thursday.

Thanks,

Olga



Re: [VOTE] Branch for Pig 0.6.0 release

2009-11-09 Thread Alan Gates
+1.  In addition to the new features we've added, our change to use
Hadoop's LineRecordReader brought Pig to parity with Hadoop in the
PigMix tests, about a 30% average performance improvement.  This
should be huge for our users.

Alan.

On Nov 9, 2009, at 12:26 PM, Olga Natkovich wrote:

Hi,

I would like to propose that we branch for the Pig 0.6.0 release with the
intent to have a release before the end of the year. We have done a lot of work
since branching for Pig 0.5.0 that we would like to share with users.
This includes changing how bags are spilled onto disk (PIG-975,
PIG-1037), skewed and fragment-replicated outer join plus many other
performance improvements and bug fixes.

Please vote by Thursday.

Thanks,

Olga


[jira] Commented: (PIG-979) Accumulator Interface for UDFs

2009-11-09 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775158#action_12775158
 ] 

Alan Gates commented on PIG-979:


A test should be added to check that accumulator UDFs work properly when mixed 
with non-accumulator UDFs.

Why is the optimization not applied in the case that inner is set on POPackage? 
 It seems the accumulator interface should still work in this case.

Some comments on what AccumulatorOptimizer.check() is and what it allows would 
be helpful.

The code contains tabs in some spots instead of 4 spaces.

The cases in which the accumulator interface can be used have been greatly 
extended by adding support for unary and binary operators.  But this comes 
at a cost.  Every binary and unary comparison now has to make the accumChild 
call.  99% of the time this will be false.  It's not clear to me how often 
users will do things like:

{code}
foreach C generate accumfunc1(A) + accumfunc2(A) OR
foreach C generate (accumfunc1(A) > 100 ? 0 : 1)
{code}

which is the only time I can see where this additional functionality is useful, 
since we don't currently allow these functions in filters.  It's possible that 
JIT along with branch prediction will remove this extra cost, since the branch 
will always be one way or another for a given query.  But I'd like to see this 
tested.  It would be interesting to compare a query with heavy use of binary 
operators (but no accumulator UDFs) with and without this change.
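
A crude way to probe that hypothesis outside of Pig is a standalone timing loop in which the guard always takes the same branch; everything below, class and field names included, is an illustrative sketch rather than Pig code, and is no substitute for the PigMix-style comparison suggested above:

{code}
public class BranchCostProbe {
    // Stand-in for the accumChild check: a field that is always null here,
    // so the guard branch is perfectly predictable, as it would be for a
    // query with no accumulator UDFs.
    private static Object[] accumChild = null;

    static long withGuard(int[] data) {
        long sum = 0;
        for (int v : data) {
            if (accumChild != null) {
                // accumulate path: never taken in this probe
            }
            sum += v;
        }
        return sum;
    }

    static long withoutGuard(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[10000000];
        java.util.Arrays.fill(data, 1);
        // Warm up so the JIT compiles and profiles both loops first.
        for (int i = 0; i < 5; i++) {
            withGuard(data);
            withoutGuard(data);
        }
        long t0 = System.nanoTime();
        long a = withGuard(data);
        long t1 = System.nanoTime();
        long b = withoutGuard(data);
        long t2 = System.nanoTime();
        System.out.println("guarded: " + (t1 - t0) / 1000000 + " ms, plain: "
                + (t2 - t1) / 1000000 + " ms (sums " + a + "/" + b + ")");
    }
}
{code}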

I don't understand why you need the new interface AccumulativeTupleBuffer and 
class AccumulativeBag.  Why can't the block of tuples read off of the iterator 
just be put in a regular bag and then passed to the UDFs?

In all the sum implementations of accumulate you calculate the sum of the block 
of tuples twice.  It should be done once and cached.

In COUNT.accumulate, rather than making intermediateCount a Long and then 
forcing the creation of a new Long each time you add one, you should instead 
keep it as a long and depend on boxing to convert it to Long when you return it 
in getValue.  Same in COUNT_STAR.accumulate.
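
The boxing point in miniature (field and method names assumed for illustration; this is not the COUNT source):

{code}
public class CountAccumulatorSketch {
    // Anti-pattern: a boxed Long forces allocation of a new Long per add.
    private Long intermediateCount = 0L;
    void accumulateBoxed(long batchCount) {
        intermediateCount = intermediateCount + batchCount; // new Long each time
    }

    // Preferred: keep a primitive and box exactly once, on return.
    private long count = 0;
    void accumulate(long batchCount) {
        count += batchCount;
    }
    Long getValue() {
        return count; // auto-boxed here, once
    }
}
{code}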





> Accumulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

2009-11-09 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-1077:
---

Summary: [Zebra] to support record(row)-based file split in Zebra's 
TableInputFormat  (was: [Zebra] to support record(row)-based file split in 
Zebra' TableInputFormat)

> [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
> ---
>
> Key: PIG-1077
> URL: https://issues.apache.org/jira/browse/PIG-1077
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.4.0
>Reporter: Chao Wang
>Assignee: Chao Wang
> Fix For: 0.6.0
>
>
> TFile currently supports split by record sequence number (see Jira 
> HADOOP-6218). We want to utilize this to provide record(row)-based input 
> split support in Zebra.
> One prominent benefit is that, in cases where we have very large data files, 
> we can create much more fine-grained input splits than before, when we could 
> only create one big split for one big file.
> In more detail, the new row-based getSplits() works by default (user does not 
> specify no. of splits to be generated) as follows: 
> 1) Select the biggest column group in terms of data size, split all of its 
> TFiles according to hdfs block size (64 MB or 128 MB) and get a list of 
> physical byte offsets as the output per TFile. For example, let us assume for 
> the 1st TFile we get offset1, offset2, ..., offset10; 
> 2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a 
> key-value pair near a byte offset. For the example above, say we get 
> recordNum1, recordNum2, ..., recordNum10; 
> 3) Stitch [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, 
> recordNum10], [recordNum10+1, lastRecordNum] splits of all column groups, 
> respectively to form 11 record-based input splits for the 1st TFile. 
> 4) For each input split, we need to create a TFile scanner through: 
> TFile.createScannerByRecordNum(long beginRecNum, long endRecNum). 
> Note: conversion from byte offset to record number will be done by each 
> mapper, rather than being done at the job initialization phase. This is due 
> to performance concern since the conversion incurs some TFile reading 
> overhead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Accumulator Interface for UDFs

2009-11-09 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775184#action_12775184
 ] 

Ying He commented on PIG-979:
-

Alan, thanks for the feedback.

1. A test case already exists that tests mixing accumulator UDFs with regular 
UDFs; it is in testAccumBasic().

2. The optimization can't be applied when inner is set on POPackage: if an 
inner is set, POPackage checks whether the bag for that input is null and, if 
it is, returns null. That check can only be made once all the tuples have been 
retrieved and put into a bag.

3 & 4: will fix those.

5. Needs performance testing.

6. The reducer gets results from POPackage and passes them to the root of the 
reduce plan, which is POForEach, to process. From POForEach's perspective, it 
gets a tuple with bags in it from POPackage. POForEach then retrieves tuples 
off the iterator and passes them to the UDFs in multiple cycles. Because only 
POPackage knows how to read tuples out of the iterator and put them into the 
proper bags, AccumulativeTupleBuffer and AccumulativeBag are created to 
communicate between POPackage and POForEach. Every time POForEach calls 
getNextBatch() on AccumulativeTupleBuffer, it in effect calls an inner class of 
POPackage to retrieve tuples out of the iterator.

POPackage cannot be the one to read the tuples in batches, because it is only 
called once from the reducer. I also thought of changing the reducer to call 
POPackage multiple times to process each batch of data, but then it becomes 
tricky to maintain correct operator state, and all operators in the reduce 
plan would have to support partial data, which is not necessary. 
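
Reduced to a sketch, the handshake described in point 6 looks roughly like the following; the two class names come from the patch, but every signature and body here is a guess for illustration, not code copied from it:

{code}
// POPackage owns the reduce-side iterator; POForEach drives batching through
// a buffer abstraction so neither has to know the other's internals.
interface TupleBufferSketch<T> {          // stand-in for AccumulativeTupleBuffer
    boolean hasNextBatch();
    void nextBatch();                     // POPackage side: pull one batch off the iterator
    Iterable<T> getBatch();               // POForEach side: view of the current batch
}

class ForEachDriverSketch {
    // POForEach-style loop: feed the UDF one batch per cycle and ask for the
    // final value only after the last batch (a count stands in for a UDF).
    static <T> long runCount(TupleBufferSketch<T> buffer) {
        long count = 0;
        while (buffer.hasNextBatch()) {
            buffer.nextBatch();
            for (T ignored : buffer.getBatch()) {
                count++;                  // stand-in for udf.accumulate(batch)
            }
        }
        return count;                     // stand-in for udf.getValue()
    }
}
{code}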

> Accumulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
> Attachments: PIG-979.patch
>
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1069) [zebra] Order Preserving Sorted Table Union

2009-11-09 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1069:


   Resolution: Fixed
Fix Version/s: 0.6.0
   Status: Resolved  (was: Patch Available)

Patch checked in.

> [zebra] Order Preserving Sorted Table Union
> ---
>
> Key: PIG-1069
> URL: https://issues.apache.org/jira/browse/PIG-1069
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.6.0
>
> Attachments: OrderPreservingSortedTableUnion_svn.patch
>
>
> The output schema will adopt a "schema union" semantics, namely, if an output 
> column only appears in one component table, the result rows will have the 
> values of the column if the rows are from that component table and null 
> otherwise; on the other hand, if an output column appears in multiple 
> component tables, the types of the column in all the component tables must be 
> identical. Otherwise, an exception will be thrown. The result rows will have 
> the values of the column if the rows are from the component tables that have 
> the column themselves, or null if otherwise. 
> The order preserving sort-unioned results could be further indexed by the 
> component tables if the projection contains column(s) named "source_table". 
> If so specified, the component table index will be output at the position(s) 
> as specified in the projection list. If the underlying table is not a union 
> of sorted tables, use of the special column name in projection will cause an 
> exception thrown. 
> If an attempt is made to create a table with a column named "source_table", an 
> exception will be thrown, as the name is reserved by zebra for this virtual 
> column. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1056) table can not be loaded after store

2009-11-09 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775208#action_12775208
 ] 

Richard Ding commented on PIG-1056:
---

The current Zebra loader doesn't work well in Pig batch mode: at compile time 
in batch mode, the file to be loaded may not exist yet (such as an intermediate 
file).

> table can not be loaded after store
> ---
>
> Key: PIG-1056
> URL: https://issues.apache.org/jira/browse/PIG-1056
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
>
> Pig Stack Trace
> ---
> ERROR 1018: Problem determining schema during load
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
> parsing. Problem determining schema during load
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1023)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:397)
> Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem 
> determining schema during load
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
> ... 8 more
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: 
> Problem determining schema during load
> at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
> ... 10 more
> Caused by: java.io.IOException: No table specified for input
> at 
> org.apache.hadoop.zebra.pig.TableLoader.checkConf(TableLoader.java:238)
> at 
> org.apache.hadoop.zebra.pig.TableLoader.determineSchema(TableLoader.java:258)
> at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
> ... 11 more
> 
> ~ 
> 
> script:
> register /grid/0/dev/hadoopqa/hadoop/lib/zebra.jar;
> A = load 'filter.txt' as (name:chararray, age:int);
> B = filter A by age < 20;
> --dump B;
> store B into 'filter1' using 
> org.apache.hadoop.zebra.pig.TableStorer('[name];[age]');
> rec1 = load 'B' using org.apache.hadoop.zebra.pig.TableLoader();
> dump rec1;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1078) [zebra] merge join with empty table failed

2009-11-09 Thread Jing Huang (JIRA)
[zebra] merge join with empty table failed
--

 Key: PIG-1078
 URL: https://issues.apache.org/jira/browse/PIG-1078
 Project: Pig
  Issue Type: Bug
Reporter: Jing Huang


Got an ArrayIndexOutOfBoundsException. 

Here is the pig script:
register /grid/0/dev/hadoopqa/jars/zebra.jar;
--a1 = load '1.txt' as (a:int, 
b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);

--a2 = load 'empty.txt' as (a:int, 
b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
--dump a1;

--a1order = order a1 by a;
--a2order = order a2 by a;


--store a1order into 'a1' using 
org.apache.hadoop.zebra.pig.TableStorer('[a,b,c];[d,e,f,r1,m1]');
--store a2order into 'empty' using 
org.apache.hadoop.zebra.pig.TableStorer('[a,b,c];[d,e,f,r1,m1]');

rec1 = load 'a1' using org.apache.hadoop.zebra.pig.TableLoader();
rec2 = load 'empty' using org.apache.hadoop.zebra.pig.TableLoader();
joina = join rec1 by a, rec2 by a using "merge" ;
dump joina;

==
please note that table "a1" and "empty" are created correctly. 

Here is the stack trace:
Backend error message
-
java.lang.ArrayIndexOutOfBoundsException: 0
at 
org.apache.hadoop.zebra.mapred.TableInputFormat.getTableRecordReader(TableInputFormat.java:478)
at org.apache.hadoop.zebra.pig.TableLoader.bindTo(TableLoader.java:166)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:400)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:181)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

Pig Stack Trace
---
ERROR 6015: During execution, encountered a Hadoop error.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open 
iterator for alias joina
at org.apache.pig.PigServer.openIterator(PigServer.java:481)
at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:386)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: 
During execution, encountered a Hadoop error.
at 
org.apache.hadoop.zebra.mapred.TableInputFormat.getTableRecordReader(TableInputFormat.java:478)
at org.apache.hadoop.zebra.pig.TableLoader.bindTo(TableLoader.java:166)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:400)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:181)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
... 10 more

 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-09 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775214#action_12775214
 ] 

Pradeep Kamath commented on PIG-1038:
-

Review comments:
In JobControlCompiler:
==
The OutputValueGroupingComparator is a RawComparator - 
how are we ensuring that compare(WritableComparable a, WritableComparable b) is 
called?
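
For context on that question, the usual pattern is that the byte-level compare() deserializes both keys and delegates to the object-level override, which is what WritableComparator itself does; the sketch below shows that generic pattern and is not the patch's comparator:

{code}
import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public abstract class DelegatingRawComparator extends WritableComparator {

    private final DataInputBuffer buffer = new DataInputBuffer();

    protected DelegatingRawComparator(Class<? extends WritableComparable> keyClass) {
        super(keyClass, true); // true => allocate key instances for deserialization
    }

    // The shuffle invokes this byte-level method; deserialize both keys and
    // fall through to the object-level compare, so the subclass override of
    // compare(WritableComparable, WritableComparable) is what actually runs.
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            WritableComparable key1 = newKey();
            WritableComparable key2 = newKey();
            buffer.reset(b1, s1, l1);
            key1.readFields(buffer);
            buffer.reset(b2, s2, l2);
            key2.readFields(buffer);
            return compare(key1, key2);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public abstract int compare(WritableComparable a, WritableComparable b);
}
{code}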

{code}
622 if ((wa.getIndex() & PigNullableWritable.mqFlag) != 0) { // this is a multi-query index
{code}
Why do we only compare on index if this is true? The if-else in this block does 
not consider the case where both indices are the same - is that by design?

{code}
653 } else if (wa.isNull() && wb.isNull()) {  
{code}
In this block the case where both indices are the same is not considered - is 
that by design?

The change in src/org/apache/pig/impl/io/PigNullableWritable.java seems 
unrelated to the patch

In SecondaryKeyOptimizer.java:
==
{code}
154 else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) {
155     List preds = mr.mapPlan.getPredecessors(mapLeaf);
156     for (PhysicalOperator pred:preds)
157     {
158         if (pred instanceof POLocalRearrange)
159         {
160             SortKeyInfo sortKeyInfo = getSortKeyInfo((POLocalRearrange)pred);
161             sortKeyInfos.add(sortKeyInfo);
162         }
163     }
{code}

If mapLeaf is a POSplit, the POSplit may have POLocalRearrange as the leaf (in 
multi-query optimized queries) - should we be handling those?
Also, getSortKeyInfo() can return null - so in all places where getSortKeyInfo() 
is called, the return value should be checked for null
{code}
98  List columns = new ArrayList();
99  columns.add(rearrange.getIndex()&PigNullableWritable.idxSpace);
100 columnChainInfo.insert(false, columns, DataType.TUPLE)
{code}
Why does the column chain start with the index of LocalRearrange and why is the 
type tuple? 

{code}
102 PhysicalOperator node = plan.getRoots().get(0);
103 while (node!=null)
104 {
105     if (node instanceof POProject) {
106         POProject project = (POProject)node;
107 
108         columnChainInfo.insert(project.isStar(), project.getColumns(), project.getResultType());
109 
110         if (plan.getPredecessors(node)==null)
111             node = null;
If node is initially the root, wouldn't plan.getPredecessors(node) always be 
null?

{code}
175 List reduceRoots = mr.reducePlan.getRoots();
176 if (reduceRoots.size() != 1) {
177     log.debug("Expected reduce to have single leaf");
178     return;
179 }
{code}
Did you mean to say "Expected reduce to have single root" ?

{code}
209 // Removed POSort, if the predecessor require a databag, we need to add a PORelationToExprProject
{code}
Should the above comment read "if the successor requires ..." 

{code}
247 throw new VisitorException("Sort on columns from different inputs.");
{code}
Should this exception follow Error Handling guidelines to include errorcode, 
and error source?

{code}
253 } else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit){
254     List preds = mr.mapPlan.getPredecessors(mapLeaf);
255     for (PhysicalOperator pred:preds) {
256         POLocalRearrange rearrange = (POLocalRearrange)pred;
257         rearrange.setUseSecondaryKey(true);
258         if (rearrange.getIndex()==indexOfRearrangeToChange)
259             setSecondaryPlan(mr.mapPlan, rearrange, secondarySortKeyInfo);
260     }
261 }
{code}
If mapLeaf is a POSplit, the POSplit may have POLocalRearrange as the leaf (in 
multi-query optimized queries) - should we be handling those?
Also, in the if statement on line 258, what if the condition evaluates to false 
- should we throw an Exception like earlier in the same method?

{code}
274 for (int i=1;i< ... [snippet garbled in the mail archive]
... sortKeyInfos, SortKeyInfo secondarySortKeyInfo) {
{code}
It seems like the secondarySortKeyInfo passed in the constructor call is always 
null - is that argument needed in the constructor?

431 throw new VisitorException("POForEach has more than 1 input plans");
Should this exception follow Error Handling guidelines to include errorcode, 
and error source?

A test case should be added for the case of a non-Project group-by key, like 
group by $0 + $1 - I did not follow the code path for this case - we should 
ensure this works with a nested sort in the foreach.

In ColumnChainInfo.java:
=

A comme... [remainder of this comment truncated in the archive]

1st Hadoop India User Group meet

2009-11-09 Thread Sanjay Sharma
We are planning to hold the first Hadoop India user group meetup on 28th 
November 2009 in Noida.

We would be talking about our experiences with Apache 
Hadoop/Hbase/Hive/PIG/Nutch/etc.

The agenda would be:
- Introductions
- Sharing experiences on Hadoop and related technologies
- Establishing agenda for the next few meetings
- Information exchange: tips, tricks, problems and open discussion
- Possible speaker TBD (invitations open!!)  {we do have something to share on 
"Hadoop for newbie" & "Hadoop Advanced Tuning"}

My company (Impetus) would be providing the meeting room and we should be able 
to accommodate around 40-60 friendly people. Coffee, Tea, and some snacks will 
be provided.

Please join the linked-in Hadoop India User Group 
(http://www.linkedin.com/groups?home=&gid=2258445&trk=anet_ug_hm) OR Yahoo 
group (http://tech.groups.yahoo.com/group/hadoopind/) and confirm your 
attendance.

Regards,
Sanjay Sharma

Follow our updates on www.twitter.com/impetuscalling.

* Impetus Technologies is exhibiting its capabilities in Mobile and Wireless at 
the GSMA Mobile Asia Congress, Hong Kong from November 16-18, 2009. Visit 
http://www.impetus.com/mlabs/GSMA_events.html for details.

NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


[jira] Updated: (PIG-1073) LogicalPlanCloner can't clone plan containing LOJoin

2009-11-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1073:
--

Attachment: (was: pig-1073.patch)

> LogicalPlanCloner can't clone plan containing LOJoin
> 
>
> Key: PIG-1073
> URL: https://issues.apache.org/jira/browse/PIG-1073
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>
> Add following testcase in LogicalPlanBuilder.java
> public void testLogicalPlanCloner() throws CloneNotSupportedException{
> LogicalPlan lp = buildPlan("C = join ( load 'A') by $0, (load 'B') by 
> $0;");
> LogicalPlanCloner cloner = new LogicalPlanCloner(lp);
> cloner.getClonedPlan();
> }
> and this fails with the following stacktrace:
> java.lang.NullPointerException
> at 
> org.apache.pig.impl.logicalLayer.LOVisitor.visit(LOVisitor.java:171)
> at 
> org.apache.pig.impl.logicalLayer.PlanSetter.visit(PlanSetter.java:63)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:213)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.getClonedPlan(LogicalPlanCloneHelper.java:73)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloner.getClonedPlan(LogicalPlanCloner.java:46)
> at 
> org.apache.pig.test.TestLogicalPlanBuilder.testLogicalPlanCloneHelper(TestLogicalPlanBuilder.java:2110)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1073) LogicalPlanCloner can't clone plan containing LOJoin

2009-11-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1073:
--

Attachment: pig-1073-1.patch

updated patch

> LogicalPlanCloner can't clone plan containing LOJoin
> 
>
> Key: PIG-1073
> URL: https://issues.apache.org/jira/browse/PIG-1073
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Attachments: pig-1073-1.patch
>
>
> Add following testcase in LogicalPlanBuilder.java
> public void testLogicalPlanCloner() throws CloneNotSupportedException{
> LogicalPlan lp = buildPlan("C = join ( load 'A') by $0, (load 'B') by 
> $0;");
> LogicalPlanCloner cloner = new LogicalPlanCloner(lp);
> cloner.getClonedPlan();
> }
> and this fails with the following stacktrace:
> java.lang.NullPointerException
> at 
> org.apache.pig.impl.logicalLayer.LOVisitor.visit(LOVisitor.java:171)
> at 
> org.apache.pig.impl.logicalLayer.PlanSetter.visit(PlanSetter.java:63)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:213)
> at org.apache.pig.impl.logicalLayer.LOJoin.visit(LOJoin.java:45)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloneHelper.getClonedPlan(LogicalPlanCloneHelper.java:73)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanCloner.getClonedPlan(LogicalPlanCloner.java:46)
> at 
> org.apache.pig.test.TestLogicalPlanBuilder.testLogicalPlanCloneHelper(TestLogicalPlanBuilder.java:2110)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1078) [zebra] merge join with empty table failed

2009-11-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775342#action_12775342
 ] 

Ashutosh Chauhan commented on PIG-1078:
---

This seems to be related to Zebra. Jing, do you think it has to do with the 
merge join implementation of Pig?

> [zebra] merge join with empty table failed
> --
>
> Key: PIG-1078
> URL: https://issues.apache.org/jira/browse/PIG-1078
> Project: Pig
>  Issue Type: Bug
>Reporter: Jing Huang
>
> Got an ArrayIndexOutOfBoundsException. 
> Here is the pig script:
> register /grid/0/dev/hadoopqa/jars/zebra.jar;
> --a1 = load '1.txt' as (a:int, 
> b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
> --a2 = load 'empty.txt' as (a:int, 
> b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
> --dump a1;
> --a1order = order a1 by a;
> --a2order = order a2 by a;
> --store a1order into 'a1' using 
> org.apache.hadoop.zebra.pig.TableStorer('[a,b,c];[d,e,f,r1,m1]');
> --store a2order into 'empty' using 
> org.apache.hadoop.zebra.pig.TableStorer('[a,b,c];[d,e,f,r1,m1]');
> rec1 = load 'a1' using org.apache.hadoop.zebra.pig.TableLoader();
> rec2 = load 'empty' using org.apache.hadoop.zebra.pig.TableLoader();
> joina = join rec1 by a, rec2 by a using "merge" ;
> dump joina;
> ==
> please note that table "a1" and "empty" are created correctly. 
> Here is the stack trace:
> Backend error message
> -
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.hadoop.zebra.mapred.TableInputFormat.getTableRecordReader(TableInputFormat.java:478)
> at 
> org.apache.hadoop.zebra.pig.TableLoader.bindTo(TableLoader.java:166)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:400)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:181)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:159)
> Pig Stack Trace
> ---
> ERROR 6015: During execution, encountered a Hadoop error.
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
> open iterator for alias joina
> at org.apache.pig.PigServer.openIterator(PigServer.java:481)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:386)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: 
> During execution, encountered a Hadoop error.
> at 
> org.apache.hadoop.zebra.mapred.TableInputFormat.getTableRecordReader(TableInputFormat.java:478)
> at org.apache.hadoop.zebra.pig.TableLoader.bindTo(TableLoader.java:166)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:400)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:181)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:247)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:238)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
> ... 10 more
> 
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.