[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883424#action_12883424
 ] 

Hadoop QA commented on PIG-1389:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448259/PIG-1389_1.patch
  against trunk revision 958666.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/console

This message is automatically generated.

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of 
> multiquery) but also multiple inputs (in the case of join or cogroup). In 
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
> given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
> need to add new counters for jobs with multiple inputs.
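
A minimal sketch of the idea (the counter group and names below are made up for
illustration; the patch's actual names may differ): each record read from a
given input increments a counter keyed by that input, using Hadoop's generic
counter API.

{code}
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: one counter per input file, incremented per record.
public class PerInputCountingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String inputName;

    @Override
    protected void setup(Context context) {
        // Assumption: the source file of this split is exposed via the
        // configuration (as the old API's "map.input.file" was).
        inputName = context.getConfiguration().get("map.input.file", "unknown-input");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter("PigInputRecords", inputName).increment(1L);
        context.write(new Text(inputName), value);
    }
}
{code}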

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

2010-06-28 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883382#action_12883382
 ] 

Jeff Zhang commented on PIG-1473:
-

This sounds like the lazy deserialization in Hive. Great!

> Avoid serialization/deserialization costs for PigStorage data - Use custom 
> Map and Bag implementation
> -
>
> Key: PIG-1473
> URL: https://issues.apache.org/jira/browse/PIG-1473
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
>
> Cost of serialization/deserialization (sedes) can be very high and avoiding 
> it will improve performance.
> Avoid sedes when possible by implementing approach #3 proposed in 
> http://wiki.apache.org/pig/AvoidingSedes .
> The load function uses subclasses of Map and DataBag that hold the serialized 
> copy.  The load function delays deserialization of map and bag types until a 
> member function of java.util.Map or DataBag is called. 
> Example of a query where this will help:
> {CODE}
> l = LOAD 'file1' AS (a : int, b : map [ ]);
> f = FOREACH l GENERATE udf1(a), b;
> fil = FILTER f BY $0 > 5;
> dump fil; -- Deserialization of column b can be delayed until here using this 
> approach.
> {CODE}
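
A minimal sketch of the lazy-map idea described above (class and parser names
are hypothetical; Pig's actual implementation may differ): hold the loader's
serialized bytes and parse them only when a java.util.Map method is first
called. Only two overrides are shown; a real implementation would guard every
accessor the same way.

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of approach #3: a Map that keeps the serialized bytes
// from the loader and defers parsing until the first real access.
public class LazySedesMap extends HashMap<String, Object> {
    private byte[] serialized;          // bytes as read by the load function
    private boolean materialized = false;

    public LazySedesMap(byte[] serialized) {
        this.serialized = serialized;
    }

    private void materialize() {
        if (!materialized) {
            super.putAll(parseMapBytes(serialized)); // loader-specific parse
            serialized = null;
            materialized = true;
        }
    }

    @Override
    public Object get(Object key) {
        materialize();                   // deserialize on first access
        return super.get(key);
    }

    @Override
    public int size() {
        materialize();
        return super.size();
    }

    // Placeholder for the loader-specific deserializer.
    private static Map<String, Object> parseMapBytes(byte[] bytes) {
        return new HashMap<String, Object>();
    }
}
{code}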

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1337:


Fix Version/s: (was: 0.8.0)

> Need a way to pass distributed cache configuration information to hadoop 
> backend in Pig's LoadFunc
> --
>
> Key: PIG-1337
> URL: https://issues.apache.org/jira/browse/PIG-1337
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Chao Wang
>
> The Zebra storage layer needs to use the distributed cache to reduce namenode 
> load during job runs.
> To do this, Zebra needs to set up distributed cache related configuration 
> information in TableLoader (which extends Pig's LoadFunc).
> It is doing this within getSchema(conf). The problem is that the conf object 
> here is not the one that is serialized to the map/reduce backend. As such, 
> the distributed cache is not set up properly.
> To work around this problem, Pig's LoadFunc needs to provide a way to set up 
> distributed cache information in a conf object that is the one used by the 
> map/reduce backend.
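
A sketch of the kind of hook being asked for, assuming a LoadFunc that is
handed the actual Job being submitted (as Pig 0.7's LoadFunc.setLocation is):
cache files registered on that Job's configuration would then reach the
backend.

{code}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.LoadFunc;

// Hypothetical sketch: setLocation receives the Job whose configuration is
// shipped to the map/reduce backend, so cache setup done here sticks.
public abstract class CacheAwareLoader extends LoadFunc {
    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Register a side file (example path) in the distributed cache so
        // tasks read a local copy instead of repeatedly hitting the namenode.
        DistributedCache.addCacheFile(
                URI.create("/user/zebra/meta/side-file#side-file"),
                job.getConfiguration());
    }
}
{code}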

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1411:


Fix Version/s: (was: 0.8.0)
  Description: 
Due to its column group structure, Zebra can create extra files for the 
namenode to track, which means the namenode uses more memory for Zebra-related 
files.

The goal is to reduce the number of files/blocks.

The idea, among various options, is to use HAR (Hadoop Archive). Hadoop 
Archive reduces the block and file count by copying data from small files 
(1M, 2M, ...) into an HDFS block of larger size, thus reducing the total 
number of blocks and files.


 

  was:

Due to its column group structure, Zebra can create extra files for the 
namenode to track, which means the namenode uses more memory for Zebra-related 
files.

The goal is to reduce the number of files/blocks.

The idea, among various options, is to use HAR (Hadoop Archive). Hadoop 
Archive reduces the block and file count by copying data from small files 
(1M, 2M, ...) into an HDFS block of larger size, thus reducing the total 
number of blocks and files.


 


> [Zebra] Can Zebra use HAR to reduce file/block count for namenode
> -
>
> Key: PIG-1411
> URL: https://issues.apache.org/jira/browse/PIG-1411
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
>Priority: Minor
>
> Due to its column group structure, Zebra can create extra files for the 
> namenode to track, which means the namenode uses more memory for 
> Zebra-related files.
> The goal is to reduce the number of files/blocks.
> The idea, among various options, is to use HAR (Hadoop Archive). Hadoop 
> Archive reduces the block and file count by copying data from small files 
> (1M, 2M, ...) into an HDFS block of larger size, thus reducing the total 
> number of blocks and files.
>  
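
For reference, once such an archive has been built with the standard "hadoop
archive" tool, it can be read back through the har:// filesystem; a sketch
(paths below are examples only):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: listing a HAR of packed column-group files. One
// archive replaces many small-file entries on the namenode.
public class HarListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example path only; the archive would be created beforehand with
        // the "hadoop archive" command-line tool.
        Path har = new Path("har:///user/zebra/archived/zebra-cg.har");
        FileSystem fs = har.getFileSystem(conf);
        for (FileStatus st : fs.listStatus(har)) {
            System.out.println(st.getPath() + " " + st.getLen());
        }
    }
}
{code}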

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1355:


Fix Version/s: (was: 0.8.0)
  Description: 
Applications may not always want to write a record to a table. Zebra should 
allow applications to do the same.

The Zebra Multiple Outputs interface allows users to stream data to different 
tables by inspecting the data Tuple. 

https://issues.apache.org/jira/browse/PIG-

So:

If ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that 
record and thus will not write it to any table.

However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) 
will write every record to a table.

  was:

Applications may not always want to write a record to a table. Zebra should 
allow applications to do the same.

The Zebra Multiple Outputs interface allows users to stream data to different 
tables by inspecting the data Tuple. 

https://issues.apache.org/jira/browse/PIG-

So:

If ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that 
record and thus will not write it to any table.

However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) 
will write every record to a table.


> [Zebra]  Zebra Multiple Outputs should enable application to skip records
> -
>
> Key: PIG-1355
> URL: https://issues.apache.org/jira/browse/PIG-1355
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Gaurav Jain
>Assignee: Gaurav Jain
>Priority: Minor
>
> Applications may not always want to write a record to a table. Zebra should 
> allow applications to do the same.
> The Zebra Multiple Outputs interface allows users to stream data to different 
> tables by inspecting the data Tuple. 
> https://issues.apache.org/jira/browse/PIG-
> So:
> If ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that 
> record and thus will not write it to any table.
> However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) 
> will write every record to a table.
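
A sketch of the proposed contract (hypothetical class and method shape; the
real ZebraOutputPartition API in org.apache.hadoop.zebra may differ):

{code}
import org.apache.pig.data.Tuple;

// Hypothetical sketch: a partitioner that routes rows to one of two tables
// and returns -1 to tell Zebra Multiple Outputs to skip the record.
public class SkippingPartitioner /* would extend ZebraOutputPartition */ {
    public int getOutputPartition(Tuple row) {
        Object first;
        try {
            first = row.size() > 0 ? row.get(0) : null;
        } catch (Exception e) {
            first = null;
        }
        if (first == null) {
            return -1;                        // proposed: -1 means "skip"
        }
        return Math.abs(first.hashCode()) % 2; // route to table 0 or 1
    }
}
{code}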

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1426) Change the size of Tuple from Int to VInt when Serialize Tuple

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1426:


Fix Version/s: (was: 0.8.0)

> Change the size of Tuple from Int to VInt when Serialize Tuple
> --
>
> Key: PIG-1426
> URL: https://issues.apache.org/jira/browse/PIG-1426
> Project: Pig
>  Issue Type: Improvement
>  Components: data
>Affects Versions: 0.8.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: PIG_1426.patch
>
>
> Most of the time, the size of a tuple is not very large, and one byte is 
> enough to store it. So I suggest using VInt instead of Int for the size of 
> the tuple when doing serialization. Because the key type of map output is 
> Tuple, this can reduce the amount of data transferred from mapper to 
> reducer. 
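
The suggestion maps directly onto Hadoop's variable-length integer encoding; a
minimal sketch (helper names are illustrative):

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

// Sketch: write the tuple size as a VInt instead of a fixed 4-byte int.
// For the common small tuple this costs 1 byte instead of 4.
public final class TupleSizeSedes {
    private TupleSizeSedes() {}

    public static void writeSize(DataOutput out, int size) throws IOException {
        WritableUtils.writeVInt(out, size);
    }

    public static int readSize(DataInput in) throws IOException {
        return WritableUtils.readVInt(in);
    }
}
{code}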

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1137:


Fix Version/s: (was: 0.8.0)

> [zebra] get* methods of Zebra Map/Reduce APIs need improvements
> ---
>
> Key: PIG-1137
> URL: https://issues.apache.org/jira/browse/PIG-1137
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>
> Currently the set* methods take external Zebra objects, namely objects of 
> ZebraStorageHint, ZebraSchema, ZebraSortInfo or ZebraProjection. 
> Correspondingly, the get* methods should return such objects instead of 
> String or Zebra internal objects like Schema.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1120) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1120:


Fix Version/s: (was: 0.8.0)

> [zebra] should support  using org.apache.hadoop.zebra.pig.TableStorer() if 
> user does not want to specify storage hint
> -
>
> Key: PIG-1120
> URL: https://issues.apache.org/jira/browse/PIG-1120
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Jing Huang
>
> If the user doesn't want to specify a storage hint, the current Zebra 
> implementation only supports using org.apache.hadoop.zebra.pig.TableStorer('') 
> with an empty string in TableStorer('').
> We should support the format org.apache.hadoop.zebra.pig.TableStorer(), as we 
> do with org.apache.hadoop.zebra.pig.TableLoader().
> sample pig script:
> register /grid/0/dev/hadoopqa/jars/zebra.jar;
> a = load '1.txt' as (a:int, 
> b:float,c:long,d:double,e:chararray,f:bytearray,r1:(f1:chararray,f2:chararray),m1:map[]);
> b = load '2.txt' as (a:int, 
> b:float,c:long,d:double,e:chararray,f:bytearray,r1:(f1:chararray,f2:chararray),m1:map[]);
> c = join a by a, b by a;
> d = foreach c generate a::a, a::b, b::c;
> describe d;
> dump d;
> store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer('');
> --this will fail
> --store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer( );

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1350) [Zebra] Zebra column names cannot have leading "_"

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1350:


Fix Version/s: (was: 0.8.0)

> [Zebra] Zebra column names cannot have leading "_"
> --
>
> Key: PIG-1350
> URL: https://issues.apache.org/jira/browse/PIG-1350
> Project: Pig
>  Issue Type: Improvement
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: pig-1350.patch, pig-1350.patch
>
>
> Disallowing '_' as the leading character of column names in a Zebra schema is 
> too restrictive; this restriction should be lifted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1139) [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check by a writer could be better encapsulated

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1139:


Fix Version/s: (was: 0.8.0)

> [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check 
> by a writer could be better encapsulated
> -
>
> Key: PIG-1139
> URL: https://issues.apache.org/jira/browse/PIG-1139
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Yan Zhou
>Priority: Minor
>
> Currently the user's ZebraSortInfo passed to the Map/Reduce writer, namely 
> BasicTableOutputFormat.setStorageInfo, is sanity checked by 
> SortInfo.parse(), although the sanity check could all be performed in that 
> method taking a ZebraSortInfo object.
> But the sanity check on the reader side is done entirely by the caller of 
> the TableInputFormat.requireSortedTable method; it would be better 
> encapsulated in a new SortInfo method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Avoiding serialization/de-serialization in pig

2010-06-28 Thread Russell Jurney
I don't fully understand the repercussions of this, but I like it.  We're
moving from our VoldemortStorage stuff to Avro and it would be great to pipe
Avro all the way through.

Russ

On Mon, Jun 28, 2010 at 5:51 PM, Dmitriy Ryaboy  wrote:

> For what it's worth, I saw very significant speed improvements (order of
> magnitude for wide tables with few projected columns) when I implemented (2)
> for our protocol-buffer-based loaders.
>
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?
>
> -D
>
> On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair  wrote:
>
> > I have created a wiki which puts together some ideas that can help in
> > improving performance by avoiding/delaying serialization/de-serialization.
> >
> > http://wiki.apache.org/pig/AvoidingSedes
> >
> > These are ideas that don't involve changes to optimizer. Most of them
> > involve changes in the load/store functions.
> >
> > Your feedback is welcome.
> >
> > Thanks,
> > Thejas
> >
> >
>


Re: Avoiding serialization/de-serialization in pig

2010-06-28 Thread Dmitriy Ryaboy
For what it's worth, I saw very significant speed improvements (order of
magnitude for wide tables with few projected columns) when I implemented (2)
for our protocol-buffer-based loaders.

I have a feeling that propagating schemas when known, and using them for
(de)serialization instead of reflecting every field, would also be a big
win.

Thoughts on just using Avro for the internal PigStorage?

-D

On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair  wrote:

> I have created a wiki which puts together some ideas that can help in
> improving performance by avoiding/delaying serialization/de-serialization.
>
> http://wiki.apache.org/pig/AvoidingSedes
>
> These are ideas that don't involve changes to optimizer. Most of them
> involve changes in the load/store functions.
>
> Your feedback is welcome.
>
> Thanks,
> Thejas
>
>


[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Status: Patch Available  (was: Open)

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of 
> multiquery) but also multiple inputs (in the case of join or cogroup). In 
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
> given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
> need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Status: Open  (was: Patch Available)

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of 
> multiquery) but also multiple inputs (in the case of join or cogroup). In 
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
> given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
> need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883367#action_12883367
 ] 

Gianmarco De Francisci Morales commented on PIG-1295:
-

I think it is.

> Binary comparator for secondary sort
> 
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Gianmarco De Francisci Morales
> Fix For: 0.8.0
>
> Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
> PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch
>
>
> When the Hadoop framework does the sorting, it will try to use a binary 
> version of the comparator if available. The benefit of a binary comparator is 
> that we do not need to instantiate the object before we compare. We saw a 
> ~30% speedup after we switched to binary comparators. Currently, Pig uses a 
> binary comparator in the following cases:
> 1. When the semantics of the order don't matter. For example, in distinct, we 
> need to do a sort in order to filter out duplicate values; however, we do not 
> care how the comparator sorts keys. Groupby also shares this characteristic. 
> In these cases, we rely on hadoop's default binary comparator.
> 2. The semantics of the order matter, but the key is of a simple type. In 
> this case, we have implementations for simple types, such as integer, long, 
> float, chararray, databytearray, string.
> However, if the key is a tuple and the sort semantics matter, we do not have 
> a binary comparator implementation. This especially matters when we switch to 
> secondary sort. In secondary sort, we convert the inner sort of a nested 
> foreach into the secondary key and rely on hadoop to sort on both the main 
> key and the secondary key. The sorting key becomes a two-item tuple. Since 
> the secondary key is the sorting key of the nested foreach, the sorting 
> semantics matter. It turns out we do not have a binary comparator once we use 
> secondary sort, and we see a significant slowdown.
> A binary comparator for tuples should be doable once we understand the binary 
> structure of the serialized tuple. We can focus on the most common use case 
> first, which is "group by" followed by a nested sort. In this case, we will 
> use secondary sort. The semantics of the first key do not matter but the 
> semantics of the secondary key do. We need to identify the boundary between 
> the main key and the secondary key in the binary tuple buffer without 
> instantiating the tuple itself. Then, if the first keys are equal, we use a 
> binary comparator to compare the secondary key. The secondary key can also be 
> a complex data type, but for the first step, we focus on a simple secondary 
> key, which is the most common use case.
> We mark this issue as a candidate project for the "Google Summer of Code 
> 2010" program. 
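
To make the idea concrete, here is a sketch of a raw comparator over a
serialized (main key, secondary key) pair, under the deliberately simplified
assumption that both keys are 4-byte big-endian ints at fixed offsets; the
real patch has to walk Pig's actual tuple wire format instead.

{code}
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical sketch: compare serialized two-field keys without
// instantiating tuples. Assumes a [4-byte main key][4-byte secondary key]
// big-endian layout, which is much simpler than Pig's real format.
public class TwoIntKeyRawComparator extends WritableComparator {
    protected TwoIntKeyRawComparator() {
        super(BytesWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int main1 = readInt(b1, s1);
        int main2 = readInt(b2, s2);
        if (main1 != main2) {
            // The semantics of the main (group) key do not matter;
            // any consistent order will do.
            return main1 < main2 ? -1 : 1;
        }
        // Main keys are equal: compare the secondary (nested-sort) key,
        // whose order does matter.
        int sec1 = readInt(b1, s1 + 4);
        int sec2 = readInt(b2, s2 + 4);
        return sec1 == sec2 ? 0 : (sec1 < sec2 ? -1 : 1);
    }
}
{code}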

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Attachment: PIG-1389_1.patch

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of 
> multiquery) but also multiple inputs (in the case of join or cogroup). In 
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
> given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
> need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


   Status: Patch Available  (was: Open)
Fix Version/s: 0.8.0

> Binary comparator for secondary sort
> 
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Gianmarco De Francisci Morales
> Fix For: 0.8.0
>
> Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
> PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch
>
>
> When the Hadoop framework does the sorting, it will try to use a binary 
> version of the comparator if available. The benefit of a binary comparator is 
> that we do not need to instantiate the object before we compare. We saw a 
> ~30% speedup after we switched to binary comparators. Currently, Pig uses a 
> binary comparator in the following cases:
> 1. When the semantics of the order don't matter. For example, in distinct, we 
> need to do a sort in order to filter out duplicate values; however, we do not 
> care how the comparator sorts keys. Groupby also shares this characteristic. 
> In these cases, we rely on hadoop's default binary comparator.
> 2. The semantics of the order matter, but the key is of a simple type. In 
> this case, we have implementations for simple types, such as integer, long, 
> float, chararray, databytearray, string.
> However, if the key is a tuple and the sort semantics matter, we do not have 
> a binary comparator implementation. This especially matters when we switch to 
> secondary sort. In secondary sort, we convert the inner sort of a nested 
> foreach into the secondary key and rely on hadoop to sort on both the main 
> key and the secondary key. The sorting key becomes a two-item tuple. Since 
> the secondary key is the sorting key of the nested foreach, the sorting 
> semantics matter. It turns out we do not have a binary comparator once we use 
> secondary sort, and we see a significant slowdown.
> A binary comparator for tuples should be doable once we understand the binary 
> structure of the serialized tuple. We can focus on the most common use case 
> first, which is "group by" followed by a nested sort. In this case, we will 
> use secondary sort. The semantics of the first key do not matter but the 
> semantics of the secondary key do. We need to identify the boundary between 
> the main key and the secondary key in the binary tuple buffer without 
> instantiating the tuple itself. Then, if the first keys are equal, we use a 
> binary comparator to compare the secondary key. The secondary key can also be 
> a complex data type, but for the first step, we focus on a simple secondary 
> key, which is the most common use case.
> We mark this issue as a candidate project for the "Google Summer of Code 
> 2010" program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1474) Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple

2010-06-28 Thread Thejas M Nair (JIRA)
Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple


 Key: PIG-1474
 URL: https://issues.apache.org/jira/browse/PIG-1474
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


Avoid sedes when possible for data loaded using PigStorage by implementing 
approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes .

The write() and readFields() functions of the tuple returned by TupleFactory 
are used to serialize data between Map and Reduce. By using a tuple that knows 
the serialization format of the loader, we avoid sedes at the Map/Reduce 
boundary and use the load function's serialized format between Map and Reduce. 
To use a new custom tuple for this purpose, a custom TupleFactory that returns 
tuples of this type has to be specified using the property 
"pig.data.tuple.factory.name".
This approach will work only for a set of load functions in the query that 
share the same serialization format for maps and bags. If this approach proves 
to be very useful, it will build a case for a more extensible approach.
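
A minimal sketch of the idea, assuming a DefaultTuple-like base class (the raw
byte field and length-prefixed framing are illustrative, not Pig's actual
format):

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.pig.data.DefaultTuple;

// Hypothetical sketch of approach #4: a tuple that serializes itself in the
// loader's own wire format, so the Map/Reduce boundary carries the loader's
// bytes instead of Pig's generic tuple encoding.
public class LoaderFormatTuple extends DefaultTuple {
    private byte[] raw;  // record bytes exactly as the loader produced them

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(raw.length);
        out.write(raw);              // ship the loader's serialized form as-is
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        raw = new byte[in.readInt()];
        in.readFully(raw);
        // Individual fields would be parsed out of 'raw' lazily, when
        // get(i) is first called (parsing omitted in this sketch).
    }
}
{code}

A matching TupleFactory returning LoaderFormatTuple instances would then be
named via the "pig.data.tuple.factory.name" property, as described above.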


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Avoiding serialization/de-serialization in pig

2010-06-28 Thread Thejas Nair
I have created a wiki which puts together some ideas that can help in
improving performance by avoiding/delaying serialization/de-serialization.

http://wiki.apache.org/pig/AvoidingSedes

These are ideas that don't involve changes to optimizer. Most of them
involve changes in the load/store functions.

Your feedback is welcome.

Thanks,
Thejas



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883361#action_12883361
 ] 

Daniel Dai commented on PIG-1295:
-

Thanks, is the patch ready for review?

> Binary comparator for secondary sort
> 
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Gianmarco De Francisci Morales
> Fix For: 0.8.0
>
> Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
> PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch
>
>
> When the Hadoop framework does the sorting, it will try to use a binary 
> version of the comparator if available. The benefit of a binary comparator is 
> that we do not need to instantiate the object before we compare. We saw a 
> ~30% speedup after we switched to binary comparators. Currently, Pig uses a 
> binary comparator in the following cases:
> 1. When the semantics of the order don't matter. For example, in distinct, we 
> need to do a sort in order to filter out duplicate values; however, we do not 
> care how the comparator sorts keys. Groupby also shares this characteristic. 
> In these cases, we rely on hadoop's default binary comparator.
> 2. The semantics of the order matter, but the key is of a simple type. In 
> this case, we have implementations for simple types, such as integer, long, 
> float, chararray, databytearray, string.
> However, if the key is a tuple and the sort semantics matter, we do not have 
> a binary comparator implementation. This especially matters when we switch to 
> secondary sort. In secondary sort, we convert the inner sort of a nested 
> foreach into the secondary key and rely on hadoop to sort on both the main 
> key and the secondary key. The sorting key becomes a two-item tuple. Since 
> the secondary key is the sorting key of the nested foreach, the sorting 
> semantics matter. It turns out we do not have a binary comparator once we use 
> secondary sort, and we see a significant slowdown.
> A binary comparator for tuples should be doable once we understand the binary 
> structure of the serialized tuple. We can focus on the most common use case 
> first, which is "group by" followed by a nested sort. In this case, we will 
> use secondary sort. The semantics of the first key do not matter but the 
> semantics of the secondary key do. We need to identify the boundary between 
> the main key and the secondary key in the binary tuple buffer without 
> instantiating the tuple itself. Then, if the first keys are equal, we use a 
> binary comparator to compare the secondary key. The secondary key can also be 
> a complex data type, but for the first step, we focus on a simple secondary 
> key, which is the most common use case.
> We mark this issue as a candidate project for the "Google Summer of Code 
> 2010" program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

2010-06-28 Thread Thejas M Nair (JIRA)
Avoid serialization/deserialization costs for PigStorage data - Use custom Map 
and Bag implementation
-

 Key: PIG-1473
 URL: https://issues.apache.org/jira/browse/PIG-1473
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
 Fix For: 0.8.0


Cost of serialization/deserialization (sedes) can be very high and avoiding it 
will improve performance.

Avoid sedes when possible by implementing approach #3 proposed in 
http://wiki.apache.org/pig/AvoidingSedes .

The load function uses subclasses of Map and DataBag that hold the serialized 
copy.  The load function delays deserialization of map and bag types until a 
member function of java.util.Map or DataBag is called. 

Example of a query where this will help:
{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- Deserialization of column b can be delayed until here using this 
approach.

{CODE}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

2010-06-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair reassigned PIG-1473:
--

Assignee: Thejas M Nair

> Avoid serialization/deserialization costs for PigStorage data - Use custom 
> Map and Bag implementation
> -
>
> Key: PIG-1473
> URL: https://issues.apache.org/jira/browse/PIG-1473
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
>
> Cost of serialization/deserialization (sedes) can be very high and avoiding 
> it will improve performance.
> Avoid sedes when possible by implementing approach #3 proposed in 
> http://wiki.apache.org/pig/AvoidingSedes .
> The load function uses subclasses of Map and DataBag that hold the serialized 
> copy.  The load function delays deserialization of map and bag types until a 
> member function of java.util.Map or DataBag is called. 
> Example of a query where this will help:
> {CODE}
> l = LOAD 'file1' AS (a : int, b : map [ ]);
> f = FOREACH l GENERATE udf1(a), b;
> fil = FILTER f BY $0 > 5;
> dump fil; -- Deserialization of column b can be delayed until here using this 
> approach.
> {CODE}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883354#action_12883354
 ] 

Richard Ding commented on PIG-1389:
---

It seems there is no good solution for Merge Join and Merge Cogroup in this 
case. So I'm going to treat them the same way as Replicated Join and not add 
counters for the side files.

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of 
> multiquery) but also multiple inputs (in the case of join or cogroup). In 
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
> given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
> need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-06-28 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883348#action_12883348
 ] 

Yan Zhou commented on PIG-1399:
---

Other expression optimizations include:

3. Erasure of a logically implied expression in AND
Example:
B = filter A by (a0 > 5 and a0 > 7);
=> B = filter A by a0 > 7;

4. Erasure of a logically implied expression in OR
Example:
B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15));
=> B = filter A by a0 > 5;

A comprehensive example of optimizations 2, 3 and 4 together:
B = filter A by NOT((a0 > 1 and a0 > 0) or (a1 < 3 and a0 > 5));
=> B = filter A by a0 <= 1;

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Gianmarco De Francisci Morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmarco De Francisci Morales updated PIG-1295:


Attachment: PIG-1295_0.6.patch

Ok, if the user does not use DefaultTuple we fall back to the default 
deserialization case.

I added handling of nested tuples via recursion and appropriate unit tests.

> Binary comparator for secondary sort
> 
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Gianmarco De Francisci Morales
> Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
> PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch
>
>
> When the Hadoop framework does the sorting, it will try to use a binary 
> version of the comparator if available. The benefit of a binary comparator is 
> that we do not need to instantiate the object before we compare. We saw a 
> ~30% speedup after we switched to binary comparators. Currently, Pig uses a 
> binary comparator in the following cases:
> 1. When the semantics of the order don't matter. For example, in distinct, we 
> need to do a sort in order to filter out duplicate values; however, we do not 
> care how the comparator sorts keys. Groupby also shares this characteristic. 
> In these cases, we rely on hadoop's default binary comparator.
> 2. The semantics of the order matter, but the key is of a simple type. In 
> this case, we have implementations for simple types, such as integer, long, 
> float, chararray, databytearray, string.
> However, if the key is a tuple and the sort semantics matter, we do not have 
> a binary comparator implementation. This especially matters when we switch to 
> secondary sort. In secondary sort, we convert the inner sort of a nested 
> foreach into the secondary key and rely on hadoop to sort on both the main 
> key and the secondary key. The sorting key becomes a two-item tuple. Since 
> the secondary key is the sorting key of the nested foreach, the sorting 
> semantics matter. It turns out we do not have a binary comparator once we use 
> secondary sort, and we see a significant slowdown.
> A binary comparator for tuples should be doable once we understand the binary 
> structure of the serialized tuple. We can focus on the most common use case 
> first, which is "group by" followed by a nested sort. In this case, we will 
> use secondary sort. The semantics of the first key do not matter but the 
> semantics of the secondary key do. We need to identify the boundary between 
> the main key and the secondary key in the binary tuple buffer without 
> instantiating the tuple itself. Then, if the first keys are equal, we use a 
> binary comparator to compare the secondary key. The secondary key can also be 
> a complex data type, but for the first step, we focus on a simple secondary 
> key, which is the most common use case.
> We mark this issue as a candidate project for the "Google Summer of Code 
> 2010" program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-06-28 Thread Thejas M Nair (JIRA)
Optimize serialization/deserialization between Map and Reduce and between MR 
jobs
-

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


In certain types of Pig queries most of the execution time is spent in 
serializing/deserializing (sedes) records between Map and Reduce and between 
MR jobs. 
For example, if PigMix queries are modified to specify types for all the 
fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
PigMix v1) that have records with bags and maps being transmitted across map 
or reduce boundaries run a lot longer (a runtime increase of a few times has 
been seen).

There are a few optimizations that have been shown to improve the performance 
of sedes in my tests:
1. Use a smaller number of bytes to store the length of the column. For 
example, if a bytearray is smaller than 255 bytes, a byte can be used to store 
the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
DataInput.readUTF.  This reduces the cost of serialization by more than half. 

Zebra and BinStorage are known to use the DefaultTuple sedes functionality. 
The serialization format that these loaders use cannot change, so after the 
optimization their format is going to be different from the format used at 
M/R boundaries.
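
A sketch of optimization 1 (the 0xFF escape value and layout below are
illustrative assumptions, not Pig's actual format); optimization 2 is simply
the standard java.io DataOutput.writeUTF/DataInput.readUTF pair.

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of optimization 1: spend 1 byte on a column length when it fits,
// falling back to a marker byte plus a 4-byte int for long values.
public final class CompactLength {
    private static final int ESCAPE = 0xFF;

    private CompactLength() {}

    public static void writeLength(DataOutput out, int len) throws IOException {
        if (len < ESCAPE) {
            out.writeByte(len);        // common case: lengths 0..254, 1 byte
        } else {
            out.writeByte(ESCAPE);     // rare case: marker + full int
            out.writeInt(len);
        }
    }

    public static int readLength(DataInput in) throws IOException {
        int first = in.readUnsignedByte();
        return first < ESCAPE ? first : in.readInt();
    }
}
{code}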



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1471) inline UDFs in scripting languages

2010-06-28 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883327#action_12883327
 ] 

Aniket Mokashi commented on PIG-1471:
-

The proposed syntax is
{code}
define hellopig using org.apache.pig.scripting.jython.JythonScriptEngine as 
'@outputSchema("x:{t:(word:chararray)}")\ndef helloworld():\n\treturn ('Hello, 
World')';
{code}

> inline UDFs in scripting languages
> --
>
> Key: PIG-1471
> URL: https://issues.apache.org/jira/browse/PIG-1471
> Project: Pig
>  Issue Type: New Feature
>Reporter: Aniket Mokashi
>Assignee: Aniket Mokashi
> Fix For: 0.8.0
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc. This frees users from needing to compile Java, generate a jar, 
> etc. It also opens Pig to programmers who prefer scripting languages over 
> Java. It should be possible to write these scripts inline as part of pig 
> scripts. This feature is an extension of 
> https://issues.apache.org/jira/browse/PIG-928

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1471) inline UDFs in scripting languages

2010-06-28 Thread Aniket Mokashi (JIRA)
inline UDFs in scripting languages
--

 Key: PIG-1471
 URL: https://issues.apache.org/jira/browse/PIG-1471
 Project: Pig
  Issue Type: New Feature
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
 Fix For: 0.8.0


It should be possible to write UDFs in scripting languages such as python, 
ruby, etc. This frees users from needing to compile Java, generate a jar, etc. 
It also opens Pig to programmers who prefer scripting languages over Java. It 
should be possible to write these scripts inline as part of pig scripts. This 
feature is an extension of https://issues.apache.org/jira/browse/PIG-928


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)

2010-06-28 Thread Randy Prager (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883302#action_12883302
 ] 

Randy Prager commented on PIG-1470:
---

Thanks.  We started testing with the G1 GC on our Hadoop cluster to avoid 
(which it seems to do) the exceptions

{noformat}
java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
{noformat}

which occur randomly on 6u18, 6u20 and the default GC.  We are going to try 
some other Java version + GC combinations ... do you have any insight into a 
stable mix of Java versions and GC?

> map/red jobs fail using G1 GC (Couldn't find heap)
> --
>
> Key: PIG-1470
> URL: https://issues.apache.org/jira/browse/PIG-1470
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
> Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 
> x86_64 x86_64 x86_64 GNU/Linux
> Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
> Hadoop: 0.20.1
>Reporter: Randy Prager
>
> Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails
> {noformat}
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops 
> -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC</value>
> </property>
> {noformat}
> Here is the hadoop map/red configuration that succeeds
> {noformat}
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops</value>
> </property>
> {noformat}
> Here is the exception from the pig script.
> {noformat}
> Backend error message
> -
> org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
> set up the load function.
> at 
> org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' 
> with arguments '[,]'
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
> at 
> org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
> ... 5 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
> ... 6 more
> Caused by: java.lang.RuntimeException: Couldn't find heap
> at 
> org.apache.pig.impl.util.SpillableMemoryManager.<init>(SpillableMemoryManager.java:95)
> at org.apache.pig.data.BagFactory.<init>(BagFactory.java:106)
> at 
> org.apache.pig.data.DefaultBagFactory.<init>(DefaultBagFactory.java:71)
> at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
> at 
> org.apache.pig.builtin.Utf8StorageConverter.<init>(Utf8StorageConverter.java:49)
> at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:69)
> at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:79)
> ... 11 more
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883247#action_12883247
 ] 

Ashutosh Chauhan commented on PIG-1389:
---

In the cases of Merge Join and Merge Cogroup there is a possibility of 
double-counting or under-counting the records from the side loaders, inherent 
in the design. So in those cases the reported numbers may confuse users.

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of 
> multiquery) but also multiple inputs (in the case of join or cogroup). In 
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
> given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
> need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1467) order by fail when set "fs.file.impl.disable.cache" to true

2010-06-28 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1467:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.7 branch.

> order by fail when set "fs.file.impl.disable.cache" to true
> ---
>
> Key: PIG-1467
> URL: https://issues.apache.org/jira/browse/PIG-1467
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0, 0.7.0
>
> Attachments: PIG-1467-1.patch, PIG-1467-2.patch
>
>
> Order by fails with the message: 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
> at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:551)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> This happens with the following hadoop settings:
> fs.file.impl.disable.cache=true
> fs.hdfs.impl.disable.cache=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1467) order by fail when set "fs.file.impl.disable.cache" to true

2010-06-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883230#action_12883230
 ] 

Richard Ding commented on PIG-1467:
---

+1

> order by fail when set "fs.file.impl.disable.cache" to true
> ---
>
> Key: PIG-1467
> URL: https://issues.apache.org/jira/browse/PIG-1467
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.7.0, 0.8.0
>
> Attachments: PIG-1467-1.patch, PIG-1467-2.patch
>
>
> Order by fails with the message: 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
> at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:551)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> This happens with the following hadoop settings:
> fs.file.impl.disable.cache=true
> fs.hdfs.impl.disable.cache=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Attachment: PIG-1389.patch

Synced with the latest trunk.

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of 
> multiquery) but also multiple inputs (in the case of join or cogroup). In 
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
> given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
> need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Bug in new logical optimizer framework?

2010-06-28 Thread Alan Gates


On Jun 28, 2010, at 12:36 AM, Swati Jain wrote:


> Thanks for the prompt reply. As you mentioned optimization is in its
> developing stage, does it mean the optimization framework is not complete or
> only the rules are in the developing stage? In addition to that, I would
> really appreciate it if you could give a rough idea of when the patch will
> be available and what functionality it will contain.

At this point we believe the framework is complete and rules are being
developed.  But the framework has never been used in user-testing situations
(alpha or beta testing), so there will be a whole round of bugs to fix once
that testing is done.

The current plan is to switch to this code as the actual optimizer with 0.8,
which we hope to release late this year (no promises).


Alan.


[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)

2010-06-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883205#action_12883205
 ] 

Ashutosh Chauhan commented on PIG-1470:
---

This is actually a bug in G1. 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6815790 Towards the bottom 
of the page there is a comment: 
{code}
Evaluation  The monitoring and management support for G1 is yet to be 
implemented
{code}

I think until it gets fixed in G1, we should recommend users not to use G1.

> map/red jobs fail using G1 GC (Couldn't find heap)
> --
>
> Key: PIG-1470
> URL: https://issues.apache.org/jira/browse/PIG-1470
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
> Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 
> x86_64 x86_64 x86_64 GNU/Linux
> Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
> Hadoop: 0.20.1
>Reporter: Randy Prager
>
> Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails
> {noformat}
> <property>
> <name>mapred.child.java.opts</name>
> <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops 
> -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC</value>
> </property>
> {noformat}
> Here is the hadoop map/red configuration that succeeds
> {noformat}
> <property>
> <name>mapred.child.java.opts</name>
> <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops</value>
> </property>
> {noformat}
> Here is the exception from the pig script.
> {noformat}
> Backend error message
> -
> org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
> set up the load function.
> at 
> org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' 
> with arguments '[,]'
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
> at 
> org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
> ... 5 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
> ... 6 more
> Caused by: java.lang.RuntimeException: Couldn't find heap
> at 
> org.apache.pig.impl.util.SpillableMemoryManager.<init>(SpillableMemoryManager.java:95)
> at org.apache.pig.data.BagFactory.<init>(BagFactory.java:106)
> at 
> org.apache.pig.data.DefaultBagFactory.<init>(DefaultBagFactory.java:71)
> at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
> at 
> org.apache.pig.builtin.Utf8StorageConverter.<init>(Utf8StorageConverter.java:49)
> at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:69)
> at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:79)
> ... 11 more
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)

2010-06-28 Thread Randy Prager (JIRA)
map/red jobs fail using G1 GC (Couldn't find heap)
--

 Key: PIG-1470
 URL: https://issues.apache.org/jira/browse/PIG-1470
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 
x86_64 x86_64 x86_64 GNU/Linux
Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Hadoop: 0.20.1

Reporter: Randy Prager


Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails

{noformat}
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops 
-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC</value>
</property>
{noformat}

Here is the hadoop map/red configuration that succeeds

{noformat}
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops</value>
</property>
{noformat}

Here is the exception from the pig script.

{noformat}
Backend error message
-
org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to set 
up the load function.
at 
org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' with 
arguments '[,]'
at 
org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
at 
org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
... 5 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
... 6 more
Caused by: java.lang.RuntimeException: Couldn't find heap
at 
org.apache.pig.impl.util.SpillableMemoryManager.<init>(SpillableMemoryManager.java:95)
at org.apache.pig.data.BagFactory.<init>(BagFactory.java:106)
at 
org.apache.pig.data.DefaultBagFactory.<init>(DefaultBagFactory.java:71)
at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
at 
org.apache.pig.builtin.Utf8StorageConverter.<init>(Utf8StorageConverter.java:49)
at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:69)
at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:79)
... 11 more
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: load files

2010-06-28 Thread Jeff Zhang
part-x is for the old hadoop mapred API, and part-m-x and
part-r-x are for the new hadoop mapred API.
You can use hadoop's globStatus("part-*") to handle both of these cases.
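
A minimal Java sketch of that approach (FileSystem and globStatus are standard Hadoop API; the output directory /user/pig/output is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartFileLister {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // "part-*" matches part-00000 (old mapred API) as well as
        // part-m-00000 and part-r-00000 (new mapreduce API) output files.
        FileStatus[] parts = fs.globStatus(new Path("/user/pig/output", "part-*"));
        for (FileStatus status : parts) {
            System.out.println(status.getPath());
        }
    }
}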



2010/6/28 Gang Luo :
> Thanks, Jeff.
> In pig, the file names look like this: part-m-x (for map results) or 
> part-r-x (for reduce results), which is different from the hadoop style 
> (part-x). So, can we control the name of each generated file? How?
>
> Thanks,
> -Gang
>
>
>
> - Original Message -
> From: Jeff Zhang 
> To: pig-dev@hadoop.apache.org
> Sent: 2010/6/27 (Sun) 9:22:30 PM
> Subject: Re: load files
>
> Hi Gang,
>
> The path specified in load can be either a file or a directory; besides, you
> can also leverage hadoop's globStatus.  The path specified in store is
> a directory.
>
>
>
> On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo  wrote:
>> Hi all,
>> when we specify the path of input to a load operator, is it a file or a 
>> directory? Similarly, when we use store-load to connect two MR operators, is 
>> the path specified in the store and load a directory?
>>
>> Thanks,
>> -Gang
>>
>>
>>
>>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
>
>



-- 
Best Regards

Jeff Zhang


Re: load files

2010-06-28 Thread Gang Luo
Thanks, Jeff.
In pig, the file names look like this: part-m-x (for map results) or 
part-r-x (for reduce results), which is different from the hadoop style 
(part-x). So, can we control the name of each generated file? How?

Thanks,
-Gang



- Original Message -
From: Jeff Zhang 
To: pig-dev@hadoop.apache.org
Sent: 2010/6/27 (Sun) 9:22:30 PM
Subject: Re: load files

Hi Gang,

The path specified in load can be either a file or a directory; besides, you
can also leverage hadoop's globStatus.  The path specified in store is
a directory.
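
(A minimal Pig Latin sketch of the above, with hypothetical paths: the LOAD path may name a single file, a directory, or a glob, while the STORE path names an output directory.)

A = LOAD '/user/pig/input' USING PigStorage(',') AS (x:int, y:int);
STORE A INTO '/user/pig/output';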



On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo  wrote:
> Hi all,
> when we specify the path of input to a load operator, is it a file or a 
> directory? Similarly, when we use store-load to connect two MR operators, is 
> the path specified in the store and load a directory?
>
> Thanks,
> -Gang
>
>
>
>
>



-- 
Best Regards

Jeff Zhang






Re: Bug in new logical optimizer framework?

2010-06-28 Thread Swati Jain
Thanks for the prompt reply. As you mentioned optimization is in its
developing stage, does it mean the optimization framework is not complete, or
only the rules are in the developing stage? In addition, I would really
appreciate it if you could give a rough idea of when the patch will be
available and what functionality it will contain.

Actually, I had attached seven files in my previous mail to reproduce the
bug, including the error log, but as you couldn't find them I am inlining all
the attachments:

*My patch:* (to enable the optimization)

Index: src/org/apache/pig/PigServer.java
===================================================================
--- src/org/apache/pig/PigServer.java   (revision 951297)
+++ src/org/apache/pig/PigServer.java   (working copy)
@@ -179,6 +179,11 @@
 
         aggregateWarning = "true".equalsIgnoreCase(pigContext.getProperties().getProperty("aggregate.warning"));
         isMultiQuery = "true".equalsIgnoreCase(pigContext.getProperties().getProperty("opt.multiquery","true"));
+        getPigContext().getProperties().setProperty("pig.usenewlogicalplan", "true");
+        log.info(
+            "-> pig.usenewlogicalplan set to " +
+            getPigContext().getProperties().getProperty("pig.usenewlogicalplan", "false") +
+            " in PigServer" );
 
         if (connect) {
             pigContext.connect();

*Script 1:*
A = load '/home/pig/exfile1' USING PigStorage(' ') as (x:int,y:int);
B = Group A by x;
dump B;

*Script 2:*
A = load '/home/pig/exfile1' USING PigStorage(',') as (a1:int,a2:int);
B = load '/home/pig/exfile1' USING PigStorage(',') as (b1:int,b2:int);
C = JOIN A by a1, B by b1;
dump C;

*exfile1:*
1,5

Please let me know if you have any further questions.

Thanks,
Swati


On Sun, Jun 27, 2010 at 9:32 PM, Daniel Dai  wrote:

> Swati,
> New logical plan is half way done so it is not surprising to see exceptions
> at current stage. We are actively developing it and will deliver patch
> shortly. Meanwhile, please attach the problematic scripts (I didn't see it
> in your mail) so we can make sure those exceptions are addressed.
>
> Thanks,
> Daniel
>
>
> From: Swati Jain
> Sent: Sunday, June 27, 2010 7:07 PM
> To: pig-dev@hadoop.apache.org
> Subject: Bug in new logical optimizer framework?
>
>
> Folks,
>
> Posting on the dev list since this is regarding the new logical plan
> optimization framework, which is not enabled yet. I was interested in playing
> around with the new optimization framework and trying to add some simple rules
> to it.
>
> I have attached two simple programs which do not work when the new logical
> optimization framework is enabled (they work when it is disabled). My
> changes to enable the new optimizer are pretty straightforward and the diff
> on branch-0.7 is attached (I just set the appropriate property to true). I
> have attached two very simple scripts both of which raise an exception (in
> local mode of execution) "java.io.IOException: Type mismatch in key from
> map: expected org.apache.pig.impl.io.NullableIntWritable, recieved
> org.apache.pig.impl.io.NullableBytesWritable" if there is at least 1 row to
> be output. The error goes away if I replace "DUMP" with "EXPLAIN"
> (presumably because the bug manifests during plan execution). It would be
> great if someone could throw some light on this issue or give pointers on
> workarounds or ways to fix this. I have not filed a JIRA for the above,
> please let me know if I should.
>
> Also, it would be great to get some guidance on the state of the new
> optimizer wrt testing (I do understand it is not GA ready since it is
> disabled by default) and whether it is too early to start playing around
> with adding new rules.
>
> Thanks
> Swati
>