[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883424#action_12883424 ] Hadoop QA commented on PIG-1389: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448259/PIG-1389_1.patch against trunk revision 958666. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/console This message is automatically generated. > Implement Pig counter to track number of rows for each input files > --- > > Key: PIG-1389 > URL: https://issues.apache.org/jira/browse/PIG-1389 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch > > > A MR job generated by Pig not only can have multiple outputs (in the case of > multiquery) but also can have multiple inputs (in the case of join or > cogroup). In both cases, the existing Hadoop counters (e.g. 
> MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number > of records in the given input or output. PIG-1299 addressed the case of > multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
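The per-input counters described above can be sketched without Hadoop: the essential change is keying record counts by input name instead of relying on the single MAP_INPUT_RECORDS counter. The counter group and naming scheme below are illustrative assumptions, not the names the PIG-1389 patch actually uses.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of per-input record counters for a multi-input job
// (join/cogroup), assuming each record carries the name of the input
// it came from. Counter names here are hypothetical.
public class InputRecordCounters {
    private final Map<String, Long> counters = new HashMap<>();

    // One counter per input file, under an assumed group name.
    public void countRecord(String inputName) {
        counters.merge("MultiInputCounters::records_read_" + inputName, 1L, Long::sum);
    }

    public long get(String inputName) {
        return counters.getOrDefault("MultiInputCounters::records_read_" + inputName, 0L);
    }

    public static void main(String[] args) {
        InputRecordCounters c = new InputRecordCounters();
        c.countRecord("file1");
        c.countRecord("file1");
        c.countRecord("file2");
        System.out.println(c.get("file1")); // 2
        System.out.println(c.get("file2")); // 1
    }
}
```

In a real MR job the same bookkeeping would go through Hadoop's counter API rather than a local map; the point is only that the counter name must encode which input the record belongs to.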
[jira] Commented: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883382#action_12883382 ] Jeff Zhang commented on PIG-1473: - This sounds like the lazy deserialization in Hive. Great! > Avoid serialization/deserialization costs for PigStorage data - Use custom > Map and Bag implementation > - > > Key: PIG-1473 > URL: https://issues.apache.org/jira/browse/PIG-1473 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > > Cost of serialization/deserialization (sedes) can be very high and avoiding > it will improve performance. > Avoid sedes when possible by implementing approach #3 proposed in > http://wiki.apache.org/pig/AvoidingSedes . > The load function uses subclasses of Map and DataBag which hold the serialized > copy. The LoadFunction delays deserialization of map and bag types until a > member function of java.util.Map or DataBag is called. > Example of a query where this will help: > {CODE} > l = LOAD 'file1' AS (a : int, b : map [ ]); > f = FOREACH l GENERATE udf1(a), b; > fil = FILTER f BY $0 > 5; > dump fil; -- Deserialization of column b can be delayed until here using this > approach. > {CODE} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
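The lazy-deserialization idea in PIG-1473 can be sketched with a Map subclass that keeps the loader's serialized form and parses it only when a java.util.Map method is first called. The serialized layout (a PigStorage-style `k#v,k#v` string) and the class name are assumptions for illustration, not the actual patch.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of a map that holds the loader's serialized bytes and delays
// parsing until a java.util.Map member function is actually invoked.
public class LazyDeserMap extends AbstractMap<String, String> {
    private final String serialized;    // e.g. "a#1,b#2" (assumed PigStorage-like form)
    private Map<String, String> parsed; // null until first access

    public LazyDeserMap(String serialized) { this.serialized = serialized; }

    // Deserialize on demand, exactly once.
    private Map<String, String> force() {
        if (parsed == null) {
            parsed = new HashMap<>();
            if (!serialized.isEmpty()) {
                for (String kv : serialized.split(",")) {
                    String[] p = kv.split("#", 2);
                    parsed.put(p[0], p[1]);
                }
            }
        }
        return parsed;
    }

    public boolean isParsed() { return parsed != null; }

    @Override
    public Set<Map.Entry<String, String>> entrySet() { return force().entrySet(); }
}
```

A pipeline that only passes the column through (as in the FOREACH/FILTER example above) never calls a Map method on it, so the parse never happens until the final dump.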
[jira] Updated: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
[ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1337: Fix Version/s: (was: 0.8.0) > Need a way to pass distributed cache configuration information to hadoop > backend in Pig's LoadFunc > -- > > Key: PIG-1337 > URL: https://issues.apache.org/jira/browse/PIG-1337 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.6.0 >Reporter: Chao Wang > > The Zebra storage layer needs to use distributed cache to reduce name node > load during job runs. > To do this, Zebra needs to set up distributed cache related configuration > information in TableLoader (which extends Pig's LoadFunc). > It is doing this within getSchema(conf). The problem is that the conf object > here is not the one that is being serialized to the map/reduce backend. As such, > the distributed cache is not set up properly. > To work around this problem, Pig's LoadFunc needs to provide a way to set up > distributed cache information in a conf object, where this conf object is the > one used by the map/reduce backend. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode
[ https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1411: Fix Version/s: (was: 0.8.0) Description: Due to its column group structure, Zebra can create extra files for the namenode to remember. That means the namenode takes more memory for Zebra-related files. The goal is to reduce the number of files/blocks. Among various options, the idea is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. was: Due to its column group structure, Zebra can create extra files for the namenode to remember. That means the namenode takes more memory for Zebra-related files. The goal is to reduce the number of files/blocks. Among various options, the idea is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. > [Zebra] Can Zebra use HAR to reduce file/block count for namenode > - > > Key: PIG-1411 > URL: https://issues.apache.org/jira/browse/PIG-1411 > Project: Pig > Issue Type: New Feature > Components: impl >Affects Versions: 0.8.0 >Reporter: Gaurav Jain >Assignee: Gaurav Jain >Priority: Minor > > Due to its column group structure, Zebra can create extra files for the namenode to > remember. That means the namenode takes more memory for Zebra-related files. > The goal is to reduce the number of files/blocks. > Among various options, the idea is to use HAR (Hadoop Archive). Hadoop > Archive reduces the block and file count by copying data from small files ( > 1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of > blocks and files. > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records
[ https://issues.apache.org/jira/browse/PIG-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1355: Fix Version/s: (was: 0.8.0) Description: Applications may not always want to write a record to a table. Zebra should allow applications to do so. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. was: Applications may not always want to write a record to a table. Zebra should allow applications to do so. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. > [Zebra] Zebra Multiple Outputs should enable application to skip records > - > > Key: PIG-1355 > URL: https://issues.apache.org/jira/browse/PIG-1355 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.8.0 >Reporter: Gaurav Jain >Assignee: Gaurav Jain >Priority: Minor > > Applications may not always want to write a record to a table. Zebra should > allow applications to do so. > The Zebra Multiple Outputs interface allows users to stream data to different > tables by inspecting the data Tuple. > https://issues.apache.org/jira/browse/PIG- > So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that > record and thus will not write it to any table. > However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) > will write every record to a table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
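The skip contract described above (a partitioner returning -1 means "write nowhere") can be sketched as follows. The interface only mirrors ZebraOutputPartition in spirit; the types, names, and routing logic are simplified assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of multiple-outputs routing where a partitioner result of -1
// drops the record instead of writing it to any table.
public class MultiOutputRouter {
    interface OutputPartitioner {
        // Return the index of the destination table, or -1 to skip the record.
        int getOutputPartition(String record);
    }

    // Route each record into one of `numTables` buckets; -1 skips it.
    static List<List<String>> route(List<String> records, int numTables,
                                    OutputPartitioner p) {
        List<List<String>> tables = new ArrayList<>();
        for (int i = 0; i < numTables; i++) tables.add(new ArrayList<>());
        for (String r : records) {
            int idx = p.getOutputPartition(r);
            if (idx >= 0) tables.get(idx).add(r); // negative index: record dropped
        }
        return tables;
    }

    public static void main(String[] args) {
        // Skip empty records, route the rest by first-character parity.
        OutputPartitioner p = r -> r.isEmpty() ? -1 : r.charAt(0) % 2;
        System.out.println(route(List.of("ant", "", "bee"), 2, p)); // [[bee], [ant]]
    }
}
```

A BasicTableOutputFormat-style writer, by contrast, would be the degenerate case where the partitioner never returns -1.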
[jira] Updated: (PIG-1426) Change the size of Tuple from Int to VInt when Serialize Tuple
[ https://issues.apache.org/jira/browse/PIG-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1426: Fix Version/s: (was: 0.8.0) > Change the size of Tuple from Int to VInt when Serialize Tuple > -- > > Key: PIG-1426 > URL: https://issues.apache.org/jira/browse/PIG-1426 > Project: Pig > Issue Type: Improvement > Components: data >Affects Versions: 0.8.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Attachments: PIG_1426.patch > > > Most of the time, the size of a tuple is not very large; one byte is enough to > store it. So I suggest using VInt instead of Int for the tuple size when > serializing. Because the key type of the map output is Tuple, this can reduce > the amount of data transferred from mapper to reducer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
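To see why a variable-length size field helps, here is a generic base-128 varint encoder: small values take one byte instead of the fixed four of a plain int. Note this is the classic LEB128-style varint for illustration, not necessarily the exact byte layout of Hadoop's VInt.

```java
import java.io.ByteArrayOutputStream;

// Generic base-128 varint: 7 payload bits per byte, high bit set on all
// bytes except the last. Tuple sizes are usually small, so one byte
// suffices where a fixed int would always cost four.
public class VarInt {
    static byte[] encode(int v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7F) != 0) {          // more than 7 bits remain
            out.write((v & 0x7F) | 0x80);   // emit low 7 bits + continuation flag
            v >>>= 7;
        }
        out.write(v);                        // final byte, continuation flag clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(VarInt.encode(5).length);       // 1 byte vs 4 for a fixed int
        System.out.println(VarInt.encode(300).length);     // 2 bytes
        System.out.println(VarInt.encode(1 << 28).length); // 5 bytes worst case
    }
}
```

Since the map-output key is serialized once per record, shaving three bytes off every tuple's size prefix adds up across the shuffle.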
[jira] Updated: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements
[ https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1137: Fix Version/s: (was: 0.8.0) > [zebra] get* methods of Zebra Map/Reduce APIs need improvements > --- > > Key: PIG-1137 > URL: https://issues.apache.org/jira/browse/PIG-1137 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Assignee: Yan Zhou > > Currently the set* methods take external Zebra objects, namely objects of > ZebraStorageHint, ZebraSchema, ZebraSortInfo or ZebraProjection. > Correspondingly, the get* methods should return such objects instead of > String or Zebra internal objects like Schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1120) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint
[ https://issues.apache.org/jira/browse/PIG-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1120: Fix Version/s: (was: 0.8.0) > [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if > user does not want to specify storage hint > - > > Key: PIG-1120 > URL: https://issues.apache.org/jira/browse/PIG-1120 > Project: Pig > Issue Type: Bug >Affects Versions: 0.6.0 >Reporter: Jing Huang > > If the user doesn't want to specify a storage hint, the current zebra implementation > only supports passing an empty string, as in org.apache.hadoop.zebra.pig.TableStorer(''). > We should also support the form > org.apache.hadoop.zebra.pig.TableStorer() as we do for > org.apache.hadoop.zebra.pig.TableLoader(). > sample pig script: > register /grid/0/dev/hadoopqa/jars/zebra.jar; > a = load '1.txt' as (a:int, > b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]); > b = load '2.txt' as (a:int, > b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]); > c = join a by a, b by a; > d = foreach c generate a::a, a::b, b::c; > describe d; > dump d; > store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer(''); > --this will fail > --store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer( ); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1350) [Zebra] Zebra column names cannot have leading "_"
[ https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1350: Fix Version/s: (was: 0.8.0) > [Zebra] Zebra column names cannot have leading "_" > -- > > Key: PIG-1350 > URL: https://issues.apache.org/jira/browse/PIG-1350 > Project: Pig > Issue Type: Improvement >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: pig-1350.patch, pig-1350.patch > > > Disallowing '_' as the leading character of column names in a Zebra schema is too > restrictive; this restriction should be lifted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1139) [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check by a writer could be better encapsulated
[ https://issues.apache.org/jira/browse/PIG-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1139: Fix Version/s: (was: 0.8.0) > [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check > by a writer could be better encapsulated > - > > Key: PIG-1139 > URL: https://issues.apache.org/jira/browse/PIG-1139 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.6.0 >Reporter: Yan Zhou >Priority: Minor > > Currently the user's ZebraSortInfo, passed in by Map/Reduce's writer (namely > BasicTableOutputFormat.setStorageInfo), is sanity-checked by > SortInfo.parse(), although the sanity check could be performed entirely in a > method taking a ZebraSortInfo object. > But the sanity check on the reader side is done entirely by the caller of the > TableInputFormat.requireSortedTable method; it would be better > encapsulated in a new SortInfo method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Avoiding serialization/de-serialization in pig
I don't fully understand the repercussions of this, but I like it. We're moving from our VoldemortStorage stuff to Avro and it would be great to pipe Avro all the way through. Russ On Mon, Jun 28, 2010 at 5:51 PM, Dmitriy Ryaboy wrote: > For what it's worth, I saw very significant speed improvements (order of > magnitude for wide tables with few projected columns) when I implemented > (2) > for our protocol buffer-based loaders. > > I have a feeling that propagating schemas when known, and using them for > (de)serialization instead of reflecting every field, would also be a big > win. > > Thoughts on just using Avro for the internal PigStorage? > > -D > > On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair wrote: > > > I have created a wiki which puts together some ideas that can help in > > improving performance by avoiding/delaying serialization/de-serialization. > > > > http://wiki.apache.org/pig/AvoidingSedes > > > > These are ideas that don't involve changes to the optimizer. Most of them > > involve changes in the load/store functions. > > > > Your feedback is welcome. > > > > Thanks, > > Thejas > > > > >
Re: Avoiding serialization/de-serialization in pig
For what it's worth, I saw very significant speed improvements (order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol buffer-based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? -D On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair wrote: > I have created a wiki which puts together some ideas that can help in > improving performance by avoiding/delaying serialization/de-serialization. > > http://wiki.apache.org/pig/AvoidingSedes > > These are ideas that don't involve changes to the optimizer. Most of them > involve changes in the load/store functions. > > Your feedback is welcome. > > Thanks, > Thejas > >
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Status: Patch Available (was: Open) > Implement Pig counter to track number of rows for each input files > --- > > Key: PIG-1389 > URL: https://issues.apache.org/jira/browse/PIG-1389 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch > > > A MR job generated by Pig not only can have multiple outputs (in the case of > multiquery) but also can have multiple inputs (in the case of join or > cogroup). In both cases, the existing Hadoop counters (e.g. > MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number > of records in the given input or output. PIG-1299 addressed the case of > multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Status: Open (was: Patch Available) > Implement Pig counter to track number of rows for each input files > --- > > Key: PIG-1389 > URL: https://issues.apache.org/jira/browse/PIG-1389 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1389.patch, PIG-1389.patch > > > A MR job generated by Pig not only can have multiple outputs (in the case of > multiquery) but also can have multiple inputs (in the case of join or > cogroup). In both cases, the existing Hadoop counters (e.g. > MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number > of records in the given input or output. PIG-1299 addressed the case of > multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883367#action_12883367 ] Gianmarco De Francisci Morales commented on PIG-1295: - I think it is > Binary comparator for secondary sort > > > Key: PIG-1295 > URL: https://issues.apache.org/jira/browse/PIG-1295 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Gianmarco De Francisci Morales > Fix For: 0.8.0 > > Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, > PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch > > > When the hadoop framework does the sorting, it will try to use a binary version of > the comparator if available. The benefit of a binary comparator is that we do not need > to instantiate the object before we compare. We see a ~30% speedup after we > switch to a binary comparator. Currently, Pig uses a binary comparator in the > following cases: > 1. When the semantics of order don't matter. For example, in distinct, we need > to do a sort in order to filter out duplicate values; however, we do not care > how the comparator sorts keys. Groupby also shares this characteristic. In this case, we > rely on hadoop's default binary comparator. > 2. The semantics of order matter, but the key is of a simple type. In this case, we > have implementations for simple types, such as integer, long, float, > chararray, databytearray, string. > However, if the key is a tuple and the sort semantics matter, we do not have > a binary comparator implementation. This especially matters when we switch to > using secondary sort. In secondary sort, we convert the inner sort of the nested > foreach into the secondary key and rely on hadoop to sort on both the main key > and the secondary key. The sorting key will become a two-item tuple. Since the > secondary key is the sorting key of the nested foreach, the sorting semantics > matter. 
It turns out we do not have a binary comparator once we use secondary > sort, and we see a significant slowdown. > A binary comparator for tuples should be doable once we understand the binary > structure of the serialized tuple. We can focus on the most common use cases > first, which is "group by" followed by a nested sort. In this case, we will > use secondary sort. The semantics of the first key do not matter but the semantics > of the secondary key do. We need to identify the boundary between the main key and the > secondary key in the binary tuple buffer without instantiating the tuple itself. > Then, if the first keys are equal, we use a binary comparator to compare the secondary > keys. The secondary key can also be a complex data type, but for the first step, > we focus on simple secondary keys, which is the most common use case. > We mark this issue as a candidate project for the "Google summer of code 2010" > program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
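The "find the key boundary, compare raw bytes" idea can be sketched if we pick a serialization whose byte order agrees with numeric order. Here a (main, secondary) pair of non-negative ints is written big-endian, so an unsigned byte-wise comparison sorts correctly without instantiating any tuple. Pig's real serialized tuple layout is richer than this; the code only illustrates the technique.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of a raw-bytes comparator for a (mainKey, secondaryKey) pair.
// Two big-endian non-negative ints: byte order matches numeric order,
// so no object is instantiated to compare, and the main/secondary key
// boundary is a fixed offset (4).
public class RawKeyComparator {
    static byte[] serialize(int main, int secondary) {
        return ByteBuffer.allocate(8).putInt(main).putInt(secondary).array();
    }

    static int compare(byte[] a, byte[] b) {
        // First 4 bytes = main key; remaining 4 = secondary key.
        int c = Arrays.compareUnsigned(a, 0, 4, b, 0, 4);
        return c != 0 ? c : Arrays.compareUnsigned(a, 4, 8, b, 4, 8);
    }

    public static void main(String[] args) {
        byte[] a = serialize(1, 300), b = serialize(1, 2);
        System.out.println(compare(a, b) > 0); // true: same main key, 300 > 2
    }
}
```

Unsigned comparison matters: a signed byte-wise compare would mis-order values whose low-order bytes exceed 0x7F. Handling negative keys, floats, or nested types needs an order-preserving encoding per type, which is exactly the hard part of the real implementation.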
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Attachment: PIG-1389_1.patch > Implement Pig counter to track number of rows for each input files > --- > > Key: PIG-1389 > URL: https://issues.apache.org/jira/browse/PIG-1389 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch > > > A MR job generated by Pig not only can have multiple outputs (in the case of > multiquery) but also can have multiple inputs (in the case of join or > cogroup). In both cases, the existing Hadoop counters (e.g. > MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number > of records in the given input or output. PIG-1299 addressed the case of > multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1295: Status: Patch Available (was: Open) Fix Version/s: 0.8.0 > Binary comparator for secondary sort > > > Key: PIG-1295 > URL: https://issues.apache.org/jira/browse/PIG-1295 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Gianmarco De Francisci Morales > Fix For: 0.8.0 > > Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, > PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch > > > When the hadoop framework does the sorting, it will try to use a binary version of > the comparator if available. The benefit of a binary comparator is that we do not need > to instantiate the object before we compare. We see a ~30% speedup after we > switch to a binary comparator. Currently, Pig uses a binary comparator in the > following cases: > 1. When the semantics of order don't matter. For example, in distinct, we need > to do a sort in order to filter out duplicate values; however, we do not care > how the comparator sorts keys. Groupby also shares this characteristic. In this case, we > rely on hadoop's default binary comparator. > 2. The semantics of order matter, but the key is of a simple type. In this case, we > have implementations for simple types, such as integer, long, float, > chararray, databytearray, string. > However, if the key is a tuple and the sort semantics matter, we do not have > a binary comparator implementation. This especially matters when we switch to > using secondary sort. In secondary sort, we convert the inner sort of the nested > foreach into the secondary key and rely on hadoop to sort on both the main key > and the secondary key. The sorting key will become a two-item tuple. Since the > secondary key is the sorting key of the nested foreach, the sorting semantics > matter. 
It turns out we do not have a binary comparator once we use secondary > sort, and we see a significant slowdown. > A binary comparator for tuples should be doable once we understand the binary > structure of the serialized tuple. We can focus on the most common use cases > first, which is "group by" followed by a nested sort. In this case, we will > use secondary sort. The semantics of the first key do not matter but the semantics > of the secondary key do. We need to identify the boundary between the main key and the > secondary key in the binary tuple buffer without instantiating the tuple itself. > Then, if the first keys are equal, we use a binary comparator to compare the secondary > keys. The secondary key can also be a complex data type, but for the first step, > we focus on simple secondary keys, which is the most common use case. > We mark this issue as a candidate project for the "Google summer of code 2010" > program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1474) Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple Key: PIG-1474 URL: https://issues.apache.org/jira/browse/PIG-1474 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Avoid sedes when possible for data loaded using PigStorage by implementing approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes . The write() and readFields() functions of the tuple returned by TupleFactory are used to serialize data between Map and Reduce. By using a tuple that knows the serialization format of the loader, we avoid sedes at the Map/Reduce boundary and use the load function's serialized format between Map and Reduce. To use a new custom tuple for this purpose, a custom TupleFactory that returns tuples of this type has to be specified using the property "pig.data.tuple.factory.name". This approach will work only for a set of load functions in the query that share the same serialization format for maps and bags. If this approach proves to be very useful, it will build a case for a more extensible approach. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
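A sketch of the custom-tuple idea: write()/readFields() ship the loader's raw line instead of re-encoding field by field, so the shuffle carries the load function's own serialized format. The tab-delimited layout and the class name are assumptions for illustration, not the PIG-1474 implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of a tuple that serializes between Map and Reduce using the
// loader's delimited-text format (assumed tab-separated here) instead
// of per-field Writable sedes.
public class LoaderFormatTuple {
    private String line; // the loader's raw serialized form, kept verbatim

    public LoaderFormatTuple(String line) { this.line = line; }
    public LoaderFormatTuple() {}

    // Fields are materialized only when asked for.
    public String get(int field) { return line.split("\t", -1)[field]; }

    // write()/readFields() just ship the raw line: zero re-encoding cost.
    public void write(DataOutput out) throws IOException { out.writeUTF(line); }
    public void readFields(DataInput in) throws IOException { line = in.readUTF(); }

    public static void main(String[] args) throws IOException {
        LoaderFormatTuple t = new LoaderFormatTuple("7\thello\t3.14");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        t.write(new DataOutputStream(buf));
        LoaderFormatTuple back = new LoaderFormatTuple();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.get(1)); // hello
    }
}
```

This is also why the approach only works when every load function in the query shares one serialization format: the shuffle bytes carry no per-loader tag saying how to decode them.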
Avoiding serialization/de-serialization in pig
I have created a wiki which puts together some ideas that can help in improving performance by avoiding/delaying serialization/de-serialization. http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883361#action_12883361 ] Daniel Dai commented on PIG-1295: - Thanks, is the patch ready for review? > Binary comparator for secondary sort > > > Key: PIG-1295 > URL: https://issues.apache.org/jira/browse/PIG-1295 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Gianmarco De Francisci Morales > Fix For: 0.8.0 > > Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, > PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch > > > When the hadoop framework does the sorting, it will try to use a binary version of > the comparator if available. The benefit of a binary comparator is that we do not need > to instantiate the object before we compare. We see a ~30% speedup after we > switch to a binary comparator. Currently, Pig uses a binary comparator in the > following cases: > 1. When the semantics of order don't matter. For example, in distinct, we need > to do a sort in order to filter out duplicate values; however, we do not care > how the comparator sorts keys. Groupby also shares this characteristic. In this case, we > rely on hadoop's default binary comparator. > 2. The semantics of order matter, but the key is of a simple type. In this case, we > have implementations for simple types, such as integer, long, float, > chararray, databytearray, string. > However, if the key is a tuple and the sort semantics matter, we do not have > a binary comparator implementation. This especially matters when we switch to > using secondary sort. In secondary sort, we convert the inner sort of the nested > foreach into the secondary key and rely on hadoop to sort on both the main key > and the secondary key. The sorting key will become a two-item tuple. Since the > secondary key is the sorting key of the nested foreach, the sorting semantics > matter. 
It turns out we do not have a binary comparator once we use secondary > sort, and we see a significant slowdown. > A binary comparator for tuples should be doable once we understand the binary > structure of the serialized tuple. We can focus on the most common use cases > first, which is "group by" followed by a nested sort. In this case, we will > use secondary sort. The semantics of the first key do not matter but the semantics > of the secondary key do. We need to identify the boundary between the main key and the > secondary key in the binary tuple buffer without instantiating the tuple itself. > Then, if the first keys are equal, we use a binary comparator to compare the secondary > keys. The secondary key can also be a complex data type, but for the first step, > we focus on simple secondary keys, which is the most common use case. > We mark this issue as a candidate project for the "Google summer of code 2010" > program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation - Key: PIG-1473 URL: https://issues.apache.org/jira/browse/PIG-1473 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Fix For: 0.8.0 Cost of serialization/deserialization (sedes) can be very high and avoiding it will improve performance. Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes . The load function uses subclasses of Map and DataBag which hold the serialized copy. The LoadFunction delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called. Example of a query where this will help: {CODE} l = LOAD 'file1' AS (a : int, b : map [ ]); f = FOREACH l GENERATE udf1(a), b; fil = FILTER f BY $0 > 5; dump fil; -- Deserialization of column b can be delayed until here using this approach. {CODE} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair reassigned PIG-1473: -- Assignee: Thejas M Nair > Avoid serialization/deserialization costs for PigStorage data - Use custom > Map and Bag implementation > - > > Key: PIG-1473 > URL: https://issues.apache.org/jira/browse/PIG-1473 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > > Cost of serialization/deserialization (sedes) can be very high and avoiding > it will improve performance. > Avoid sedes when possible by implementing approach #3 proposed in > http://wiki.apache.org/pig/AvoidingSedes . > The load function uses subclass of Map and DataBag which holds the serialized > copy. LoadFunction delays deserialization of map and bag types until a > member function of java.util.Map or DataBag is called. > Example of query where this will help - > {CODE} > l = LOAD 'file1' AS (a : int, b : map [ ]); > f = FOREACH l GENERATE udf1(a), b; > fil = FILTER f BY $0 > 5; > dump fil; -- Serialization of column b can be delayed until here using this > approach . > {CODE} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883354#action_12883354 ] Richard Ding commented on PIG-1389: --- It seems there is no good solution for Merge Join and Merge Cogroup in this case. So I'm going to treat them the same way as Replicated Join and not add counters for all side files. > Implement Pig counter to track number of rows for each input files > --- > > Key: PIG-1389 > URL: https://issues.apache.org/jira/browse/PIG-1389 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1389.patch, PIG-1389.patch > > > A MR job generated by Pig not only can have multiple outputs (in the case of > multiquery) but also can have multiple inputs (in the case of join or > cogroup). In both cases, the existing Hadoop counters (e.g. > MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number > of records in the given input or output. PIG-1299 addressed the case of > multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883348#action_12883348 ] Yan Zhou commented on PIG-1399: --- Other expression optimizations include: 3. Erasure of logically implied expressions in AND Example: B = filter A by (a0 > 5 and a0 > 7); => B = filter A by a0 > 7; 4. Erasure of logically implied expressions in OR Example: B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15)); => B = filter A by a0 > 5; A comprehensive example of optimizations 2, 3 and 4 is: B = filter A by NOT((a0 > 1 and a0 > 0) or (a1 < 3 and a0 > 5)); => B = filter A by a0 <= 1; > Logical Optimizer: Expression optimizor rule > > > Key: PIG-1399 > URL: https://issues.apache.org/jira/browse/PIG-1399 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > > We can optimize expressions in several ways: > 1. Constant pre-calculation > Example: > B = filter A by a0 > 5+7; > => B = filter A by a0 > 12; > 2. Boolean expression optimization > Example: > B = filter A by not (not(a0>5) or a>10); > => B = filter A by a0>5 and a<=10; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
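Rules 1 and 3 above can be illustrated with a toy expression tree and simplifier. These classes are invented for illustration and are not Pig's optimizer API:

```java
// Toy expression tree showing constant pre-calculation (rule 1) and erasure
// of an implied conjunct (rule 3). Not Pig's actual logical-plan classes.
abstract class Expr {}
class Const extends Expr { final int v; Const(int v) { this.v = v; } }
class Col extends Expr { final String name; Col(String n) { name = n; } }
class Add extends Expr { final Expr l, r; Add(Expr l, Expr r) { this.l = l; this.r = r; } }
class Gt  extends Expr { final Expr l, r; Gt(Expr l, Expr r)  { this.l = l; this.r = r; } } // l > r
class And extends Expr { final Expr l, r; And(Expr l, Expr r) { this.l = l; this.r = r; } }

class Simplifier {
    static Expr simplify(Expr e) {
        if (e instanceof Add) {
            Add a = (Add) e;
            Expr l = simplify(a.l), r = simplify(a.r);
            if (l instanceof Const && r instanceof Const)
                return new Const(((Const) l).v + ((Const) r).v);   // 5+7 => 12
            return new Add(l, r);
        }
        if (e instanceof Gt) {
            Gt g = (Gt) e;
            return new Gt(simplify(g.l), simplify(g.r));
        }
        if (e instanceof And) {
            And a = (And) e;
            Expr l = simplify(a.l), r = simplify(a.r);
            // (col > x) AND (col > y) on the same column => keep the stronger bound
            if (l instanceof Gt && r instanceof Gt) {
                Gt gl = (Gt) l, gr = (Gt) r;
                if (gl.l instanceof Col && gr.l instanceof Col
                        && ((Col) gl.l).name.equals(((Col) gr.l).name)
                        && gl.r instanceof Const && gr.r instanceof Const)
                    return ((Const) gl.r).v >= ((Const) gr.r).v ? gl : gr;
            }
            return new And(l, r);
        }
        return e;
    }
}
```

With this sketch, `a0 > 5+7` folds to `a0 > 12`, and `a0 > 5 and a0 > 7` collapses to `a0 > 7`, matching examples 1 and 3.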
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gianmarco De Francisci Morales updated PIG-1295: Attachment: PIG-1295_0.6.patch Ok, if the user does not use DefaultTuple we fall back to the default deserialization case. I added handling of nested tuples via recursion and appropriate unit tests. > Binary comparator for secondary sort > > > Key: PIG-1295 > URL: https://issues.apache.org/jira/browse/PIG-1295 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Gianmarco De Francisci Morales > Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, > PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch > > > When the hadoop framework does the sorting, it will try to use a binary version of > the comparator if available. The benefit of a binary comparator is that we do not need > to instantiate the object before we compare. We saw a ~30% speedup after we > switched to binary comparators. Currently, Pig uses a binary comparator in the > following cases: > 1. When the semantics of order don't matter. For example, in distinct, we need > to do a sort in order to filter out duplicate values; however, we do not care > how the comparator sorts keys. Group by also shares this characteristic. In this case, we > rely on hadoop's default binary comparator. > 2. The semantics of order matter, but the key is of a simple type. In this case, we > have implementations for simple types, such as integer, long, float, > chararray, databytearray, string. > However, if the key is a tuple and the sort semantics matter, we do not have > a binary comparator implementation. This especially matters when we switch to > secondary sort. In secondary sort, we convert the inner sort of a nested > foreach into the secondary key and rely on hadoop to sort on both the main key > and the secondary key. The sorting key becomes a two-item tuple. 
Since the > secondary key is the sorting key of the nested foreach, the sorting semantics > matter. It turns out we do not have a binary comparator once we use secondary > sort, and we see a significant slowdown. > A binary comparator for tuples should be doable once we understand the binary > structure of the serialized tuple. We can focus on the most common use case > first, which is "group by" followed by a nested sort. In this case, we will > use secondary sort. The semantics of the first key do not matter but the semantics > of the secondary key do. We need to identify the boundary between the main key and the > secondary key in the binary tuple buffer without instantiating the tuple itself. > Then, if the first keys are equal, we use a binary comparator to compare the secondary > keys. The secondary key can also be a complex data type, but for the first step, > we focus on simple secondary keys, which are the most common use case. > We mark this issue as a candidate project for the "Google Summer of Code 2010" > program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
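The "compare without instantiating" idea can be shown with a toy layout. This sketch assumes each serialized record is simply a 4-byte big-endian main key followed by a 4-byte int secondary key; Pig's real tuple serialization is more involved (type bytes, variable-length fields):

```java
import java.nio.ByteBuffer;

// Illustrative only: assume each serialized "tuple" is an 8-byte buffer laid
// out as [4-byte big-endian main key][4-byte big-endian int secondary key].
// The comparator reads directly from the raw bytes and never builds a Tuple.
final class RawSecondaryComparator {
    static int compare(byte[] b1, byte[] b2) {
        ByteBuffer t1 = ByteBuffer.wrap(b1), t2 = ByteBuffer.wrap(b2);
        int main = Integer.compare(t1.getInt(0), t2.getInt(0));
        if (main != 0) return main;                    // main keys differ: done
        // Main keys equal: compare the secondary key in place, still without
        // instantiating any tuple object.
        return Integer.compare(t1.getInt(4), t2.getInt(4));
    }

    // Helper to build a serialized record in this toy layout.
    static byte[] encode(int mainKey, int secondaryKey) {
        return ByteBuffer.allocate(8).putInt(mainKey).putInt(secondaryKey).array();
    }
}
```

The hard part the issue describes is exactly the step this sketch assumes away: finding the main/secondary boundary in Pig's variable-length serialized format.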
[jira] Created: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
Optimize serialization/deserialization between Map and Reduce and between MR jobs - Key: PIG-1472 URL: https://issues.apache.org/jira/browse/PIG-1472 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 In certain types of pig queries most of the execution time is spent in serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen). There are a few optimizations that have been shown to improve the performance of sedes in my tests - 1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a byte can be used to store the length instead of the integer that is currently used. 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than half. Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
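Optimization #1 can be sketched as follows. This is a hypothetical encoding for illustration, not necessarily the exact format the patch uses:

```java
import java.nio.ByteBuffer;

// Sketch of optimization #1: spend one byte on the length header when the
// payload is short, falling back to a marker byte plus a 4-byte int for long
// payloads. (Hypothetical format, shown only to illustrate the saving.)
final class CompactLength {
    static final int LONG_MARKER = 0xFF;   // lengths >= 255 use marker + 4-byte int

    static byte[] write(byte[] payload) {
        if (payload.length < LONG_MARKER) {
            ByteBuffer buf = ByteBuffer.allocate(1 + payload.length);
            buf.put((byte) payload.length);        // 1-byte header
            buf.put(payload);
            return buf.array();
        }
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + payload.length);
        buf.put((byte) LONG_MARKER);               // marker, then the full int length
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }
}
```

For the short bytearrays typical of PigStorage fields, this saves 3 of the 4 header bytes per column, which adds up across billions of records.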
[jira] Commented: (PIG-1471) inline UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883327#action_12883327 ] Aniket Mokashi commented on PIG-1471: - The proposed syntax is {code} define hellopig using org.apache.pig.scripting.jython.JythonScriptEngine as '@outputSchema("x:{t:(word:chararray)}")\ndef helloworld():\n\treturn ('Hello, World')'; {code} > inline UDFs in scripting languages > -- > > Key: PIG-1471 > URL: https://issues.apache.org/jira/browse/PIG-1471 > Project: Pig > Issue Type: New Feature >Reporter: Aniket Mokashi >Assignee: Aniket Mokashi > Fix For: 0.8.0 > > > It should be possible to write UDFs in scripting languages such as python, > ruby, etc. This frees users from needing to compile Java, generate a jar, > etc. It also opens Pig to programmers who prefer scripting languages over > Java. It should be possible to write these scripts inline as part of pig > scripts. This feature is an extension of > https://issues.apache.org/jira/browse/PIG-928 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1471) inline UDFs in scripting languages
inline UDFs in scripting languages -- Key: PIG-1471 URL: https://issues.apache.org/jira/browse/PIG-1471 Project: Pig Issue Type: New Feature Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.8.0 It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. It should be possible to write these scripts inline as part of pig scripts. This feature is an extension of https://issues.apache.org/jira/browse/PIG-928 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
[ https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883302#action_12883302 ] Randy Prager commented on PIG-1470: --- thanks. we started testing w/ G1 GC on our hadoop cluster to avoid (which it seems to do) the exceptions {noformat} java.io.IOException: Task process exit with nonzero status of 134. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) {noformat} which occur randomly on 6u18,6u20 and the default GC. We are going to try some other Java version + GC combinations ... do you have any insight into a stable mix of Java versions and GC? > map/red jobs fail using G1 GC (Couldn't find heap) > -- > > Key: PIG-1470 > URL: https://issues.apache.org/jira/browse/PIG-1470 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 > Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 > x86_64 x86_64 x86_64 GNU/Linux > Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07) > Hadoop: 0.20.1 >Reporter: Randy Prager > > Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails > {noformat} > > mapred.child.java.opts > -Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops > -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC > > {noformat} > Here is the hadoop map/red configuration that succeeds > {noformat} > > mapred.child.java.opts > -Xmx300m -XX:+DoEscapeAnalysis > -XX:+UseCompressedOops > > {noformat} > Here is the exception from the pig script. > {noformat} > Backend error message > - > org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to > set up the load function. 
> at > org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' > with arguments '[,]' > at > org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519) > at > org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85) > ... 5 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) > at java.lang.reflect.Constructor.newInstance(Constructor.java:513) > at > org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487) > ... 6 more > Caused by: java.lang.RuntimeException: Couldn't find heap > at > org.apache.pig.impl.util.SpillableMemoryManager.(SpillableMemoryManager.java:95) > at org.apache.pig.data.BagFactory.(BagFactory.java:106) > at > org.apache.pig.data.DefaultBagFactory.(DefaultBagFactory.java:71) > at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76) > at > org.apache.pig.builtin.Utf8StorageConverter.(Utf8StorageConverter.java:49) > at org.apache.pig.builtin.PigStorage.(PigStorage.java:69) > at org.apache.pig.builtin.PigStorage.(PigStorage.java:79) > ... 11 more > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883247#action_12883247 ] Ashutosh Chauhan commented on PIG-1389: --- In cases of Merge Join and Merge Cogroup there is a possibility of double-counting and under-counting the records from the side loaders inherently due to design. So, in those cases reported numbers may confuse users. > Implement Pig counter to track number of rows for each input files > --- > > Key: PIG-1389 > URL: https://issues.apache.org/jira/browse/PIG-1389 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1389.patch, PIG-1389.patch > > > A MR job generated by Pig not only can have multiple outputs (in the case of > multiquery) but also can have multiple inputs (in the case of join or > cogroup). In both cases, the existing Hadoop counters (e.g. > MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number > of records in the given input or output. PIG-1299 addressed the case of > multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1467) order by fail when set "fs.file.impl.disable.cache" to true
[ https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1467: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.7 branch. > order by fail when set "fs.file.impl.disable.cache" to true > --- > > Key: PIG-1467 > URL: https://issues.apache.org/jira/browse/PIG-1467 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0, 0.7.0 > > Attachments: PIG-1467-1.patch, PIG-1467-2.patch > > > Order by fail with the message: > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135) > at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) > at > org.apache.hadoop.mapred.MapTask$NewOutputCollector.(MapTask.java:551) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) > at org.apache.hadoop.mapred.Child$4.run(Child.java:217) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) > at org.apache.hadoop.mapred.Child.main(Child.java:211) > This happens with the following hadoop settings: > fs.file.impl.disable.cache=true > fs.hdfs.impl.disable.cache=true -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1467) order by fail when set "fs.file.impl.disable.cache" to true
[ https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883230#action_12883230 ] Richard Ding commented on PIG-1467: --- +1 > order by fail when set "fs.file.impl.disable.cache" to true > --- > > Key: PIG-1467 > URL: https://issues.apache.org/jira/browse/PIG-1467 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.7.0, 0.8.0 > > Attachments: PIG-1467-1.patch, PIG-1467-2.patch > > > Order by fail with the message: > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135) > at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) > at > org.apache.hadoop.mapred.MapTask$NewOutputCollector.(MapTask.java:551) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) > at org.apache.hadoop.mapred.Child$4.run(Child.java:217) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062) > at org.apache.hadoop.mapred.Child.main(Child.java:211) > This happens with the following hadoop settings: > fs.file.impl.disable.cache=true > fs.hdfs.impl.disable.cache=true -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Attachment: PIG-1389.patch sync with the latest trunk. > Implement Pig counter to track number of rows for each input files > --- > > Key: PIG-1389 > URL: https://issues.apache.org/jira/browse/PIG-1389 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1389.patch, PIG-1389.patch > > > A MR job generated by Pig not only can have multiple outputs (in the case of > multiquery) but also can have multiple inputs (in the case of join or > cogroup). In both cases, the existing Hadoop counters (e.g. > MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number > of records in the given input or output. PIG-1299 addressed the case of > multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Bug in new logical optimizer framework?
On Jun 28, 2010, at 12:36 AM, Swati Jain wrote: Thanks for the prompt reply. As you mentioned optimization is in its developing stage, does it mean the optimization framework is not complete, or only that the rules are still in development? In addition to that, I would really appreciate it if you could give a rough idea of when the patch will be available and what functionality it will contain? At this point we believe the framework is complete and rules are being developed. But the framework has never been used in user testing situations (alpha or beta testing) so there will be a whole round of bugs to fix once that testing is done. The current plan is to switch to this code as the actual optimizer with 0.8, which we hope to release late this year (no promises). Alan.
[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
[ https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883205#action_12883205 ] Ashutosh Chauhan commented on PIG-1470: --- This is actually a bug in G1. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6815790 Towards the bottom of page there is a comment: {code} Evaluation The monitoring and management support for G1 is yet to be implemented {code} I think until it gets fixed in G1, we should recommend users not to use G1. > map/red jobs fail using G1 GC (Couldn't find heap) > -- > > Key: PIG-1470 > URL: https://issues.apache.org/jira/browse/PIG-1470 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 > Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 > x86_64 x86_64 x86_64 GNU/Linux > Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07) > Hadoop: 0.20.1 >Reporter: Randy Prager > > Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails > {noformat} > > mapred.child.java.opts > -Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops > -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC > > {noformat} > Here is the hadoop map/red configuration that succeeds > {noformat} > > mapred.child.java.opts > -Xmx300m -XX:+DoEscapeAnalysis > -XX:+UseCompressedOops > > {noformat} > Here is the exception from the pig script. > {noformat} > Backend error message > - > org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to > set up the load function. 
> at > org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) > at org.apache.hadoop.mapred.Child.main(Child.java:170) > Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' > with arguments '[,]' > at > org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519) > at > org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85) > ... 5 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) > at java.lang.reflect.Constructor.newInstance(Constructor.java:513) > at > org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487) > ... 6 more > Caused by: java.lang.RuntimeException: Couldn't find heap > at > org.apache.pig.impl.util.SpillableMemoryManager.(SpillableMemoryManager.java:95) > at org.apache.pig.data.BagFactory.(BagFactory.java:106) > at > org.apache.pig.data.DefaultBagFactory.(DefaultBagFactory.java:71) > at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76) > at > org.apache.pig.builtin.Utf8StorageConverter.(Utf8StorageConverter.java:49) > at org.apache.pig.builtin.PigStorage.(PigStorage.java:69) > at org.apache.pig.builtin.PigStorage.(PigStorage.java:79) > ... 11 more > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
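For reference, the failing check is essentially a scan of the JVM's memory pools. The following is a rough equivalent of what SpillableMemoryManager looks for, simplified and not the exact Pig code; under G1 on the affected JDKs no heap pool with usage-threshold support was reported, hence "Couldn't find heap":

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

// Simplified version of the startup check: find a heap memory pool that
// supports a usage threshold, so the manager can register for low-memory
// notifications and trigger bag spilling.
final class HeapPoolProbe {
    static MemoryPoolMXBean findHeapPool() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported())
                return pool;
        }
        return null;  // this is the case that surfaces as "Couldn't find heap"
    }
}
```

On a JVM whose collector exposes monitoring support (the default collectors, and G1 on later JDKs), this probe finds a pool and the error does not occur.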
[jira] Created: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
map/red jobs fail using G1 GC (Couldn't find heap) -- Key: PIG-1470 URL: https://issues.apache.org/jira/browse/PIG-1470 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07) Hadoop: 0.20.1 Reporter: Randy Prager Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails {noformat} mapred.child.java.opts -Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC {noformat} Here is the hadoop map/red configuration that succeeds {noformat} mapred.child.java.opts -Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops {noformat} Here is the exception from the pig script. {noformat} Backend error message - org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to set up the load function. at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[,]' at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519) at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85) ... 
5 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487) ... 6 more Caused by: java.lang.RuntimeException: Couldn't find heap at org.apache.pig.impl.util.SpillableMemoryManager.(SpillableMemoryManager.java:95) at org.apache.pig.data.BagFactory.(BagFactory.java:106) at org.apache.pig.data.DefaultBagFactory.(DefaultBagFactory.java:71) at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76) at org.apache.pig.builtin.Utf8StorageConverter.(Utf8StorageConverter.java:49) at org.apache.pig.builtin.PigStorage.(PigStorage.java:69) at org.apache.pig.builtin.PigStorage.(PigStorage.java:79) ... 11 more {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: load files
part-x is for the old hadoop mapred API, and part-m-x and part-r-x are for the new hadoop mapred API. You can use hadoop's globStatus("part-*") to handle both of these cases. 2010/6/28 Gang Luo : > Thanks, Jeff. > In pig, the file names look like this: part-m-x (for map results) or > part-r-x (for reduce results), which are different from the hadoop style > (part-x). So, can we control the name of each generated file? How? > > Thanks, > -Gang > > > > - Original Message > From: Jeff Zhang > To: pig-dev@hadoop.apache.org > Sent: 2010/6/27 (Sun) 9:22:30 PM > Subject: Re: load files > > Hi Gang, > > The path specified in load can be either a file or a directory; besides, you > can also leverage hadoop's globStatus. The path specified in store is > a directory. > > > > On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo wrote: >> Hi all, >> when we specify the path of input to a load operator, is it a file or a >> directory? Similarly, when we use store-load to connect two MR operators, is >> the path specified in the store and load a directory? >> >> Thanks, >> -Gang >> >> >> >> >> > > > > -- > Best Regards > > Jeff Zhang > > > > > -- Best Regards Jeff Zhang
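The same "part-*" glob can be tried locally with a java.nio PathMatcher. This is an illustration only; on a cluster you would pass the pattern to Hadoop's FileSystem.globStatus instead:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// The single glob "part-*" covers the old-API output names (part-xxxxx) as
// well as the new-API names (part-m-xxxxx / part-r-xxxxx), so consumers do
// not need to know which API produced the files.
final class PartFileGlob {
    private static final PathMatcher MATCHER =
            FileSystems.getDefault().getPathMatcher("glob:part-*");

    static boolean isPartFile(String name) {
        return MATCHER.matches(Paths.get(name));
    }
}
```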
Re: load files
Thanks, Jeff. In pig, the file names look like this: part-m-x (for map results) or part-r-x (for reduce results), which are different from the hadoop style (part-x). So, can we control the name of each generated file? How? Thanks, -Gang - Original Message From: Jeff Zhang To: pig-dev@hadoop.apache.org Sent: 2010/6/27 (Sun) 9:22:30 PM Subject: Re: load files Hi Gang, The path specified in load can be either a file or a directory; besides, you can also leverage hadoop's globStatus. The path specified in store is a directory. On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo wrote: > Hi all, > when we specify the path of input to a load operator, is it a file or a > directory? Similarly, when we use store-load to connect two MR operators, is > the path specified in the store and load a directory? > > Thanks, > -Gang > > > > > -- Best Regards Jeff Zhang
Re: Bug in new logical optimizer framework?
Thanks for the prompt reply. As you mentioned optimization is in its developing stage, does it mean the optimization framework is not complete, or only that the rules are still in development? In addition to that, I would really appreciate it if you could give a rough idea of when the patch will be available and what functionality it will contain? Actually, I had attached seven files in my previous mail to reproduce the bug, including the error log, but as you couldn't find them I am inlining all the attachments:

*My patch:* (to enable the optimization)

Index: src/org/apache/pig/PigServer.java
===
--- src/org/apache/pig/PigServer.java (revision 951297)
+++ src/org/apache/pig/PigServer.java (working copy)
@@ -179,6 +179,11 @@
         aggregateWarning = "true".equalsIgnoreCase(pigContext.getProperties().getProperty("aggregate.warning"));
         isMultiQuery = "true".equalsIgnoreCase(pigContext.getProperties().getProperty("opt.multiquery","true"));
+        getPigContext().getProperties().setProperty("pig.usenewlogicalplan", "true");
+        log.info(
+            "-> pig.usenewlogicalplan set to "
+            + getPigContext().getProperties().getProperty("pig.usenewlogicalplan", "false")
+            + " in PigServer" );
         if (connect) {
             pigContext.connect();

*Script 1:*
A = load '/home/pig/exfile1' USING PigStorage(' ') as (x:int,y:int);
B = Group A by x;
dump B;

*Script 2:*
A = load '/home/pig/exfile1' USING PigStorage(',') as (a1:int,a2:int);
B = load '/home/pig/exfile1' USING PigStorage(',') as (b1:int,b2:int);
C = JOIN A by a1, B by b1;
dump C;

*exfile1:*
1,5

Please let me know if you have any further questions. Thanks, Swati On Sun, Jun 27, 2010 at 9:32 PM, Daniel Dai wrote: > Swati, > The new logical plan is half way done so it is not surprising to see exceptions > at the current stage. We are actively developing it and will deliver a patch > shortly. Meanwhile, please attach the problematic scripts (I didn't see them > in your mail) so we can make sure those exceptions are addressed. 
> > Thanks, > Daniel > > > From: Swati Jain > Sent: Sunday, June 27, 2010 7:07 PM > To: pig-dev@hadoop.apache.org > Subject: Bug in new logical optimizer framework? > > > Folks, > > Posting on the dev since this is regarding the new logical plan > optimization framework which is not enabled yet. I was interested in playing > around with the new optimization framework and trying to add some simple rules > to it. > > I have attached two simple programs which do not work when the new logical > optimization framework is enabled (they work when it is disabled). My > changes to enable the new optimizer are pretty straightforward and the diff > on branch-0.7 is attached (I just set the appropriate property to true). I > have attached two very simple scripts both of which raise an exception (in > local mode of execution) "java.io.IOException: Type mismatch in key from > map: expected org.apache.pig.impl.io.NullableIntWritable, recieved > org.apache.pig.impl.io.NullableBytesWritable" if there is at least 1 row to > be output. The error goes away if I replace "DUMP" with "EXPLAIN" > (presumably because the bug manifests during plan execution). It would be > great if someone could throw some light on this issue or give pointers on > workarounds or ways to fix this. I have not filed a JIRA for the above, > please let me know if I should. > > Also, it would be great to get some guidance on the state of the new > optimizer wrt testing (I do understand it is not GA ready since it is > disabled by default) and whether it is too early to start playing around > with adding new rules. > > Thanks > Swati >