[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882380#action_12882380 ]

Arun C Murthy commented on PIG-1389:
------------------------------------

Can we not just increment the standard MR counters rather than inventing new ones?

> Implement Pig counter to track number of rows for each input files
> -------------------------------------------------------------------
>
>                 Key: PIG-1389
>                 URL: https://issues.apache.org/jira/browse/PIG-1389
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>             Fix For: 0.8.0
>
>         Attachments: PIG-1389.patch
>
> A MR job generated by Pig not only can have multiple outputs (in the case of multiquery) but also can have multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number of records in the given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882381#action_12882381 ]

Arun C Murthy commented on PIG-1389:
------------------------------------

How many new counters are we really adding here? I only see the counter-groups. I'm afraid this will be too many new counters...
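Neither patch is reproduced in the thread, so as a rough illustration of what "one counter per input" implies for counter volume: the sketch below uses a hypothetical group and naming scheme (not the ones in PIG-1389.patch), with a plain map standing in for Hadoop's Counters. Note that the number of counters in the group grows with the number of inputs to the job, which is exactly the concern raised above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of per-input record counters. The group and counter
// names here are hypothetical, not the ones in PIG-1389.patch; a HashMap
// stands in for context.getCounter(group, name).increment(1).
public class MultiInputCounters {
    static final String GROUP = "MultiInputCounters";

    private final Map<String, Long> counters = new HashMap<>();

    // Called once per record read from the named input (e.g. a join side).
    public void incrInputRecords(String inputName) {
        counters.merge("Input records : " + inputName, 1L, Long::sum);
    }

    public long getInputRecords(String inputName) {
        return counters.getOrDefault("Input records : " + inputName, 0L);
    }
}
```

With two inputs A and B, this adds two counters to the group; with a ten-way cogroup, ten.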
Re: Consider cleaning up backend code
+1

Arun

On Apr 22, 2010, at 11:35 AM, Richard Ding wrote:

> Pig has an abstraction layer (interfaces and abstract classes) to support multiple execution engines. After PIG-1053, Hadoop is the only execution engine supported by Pig. I wonder if we should remove this layer of code, and make Hadoop THE execution engine for Pig. This will simplify the backend code a lot.
>
> Thanks,
> -Richard
Re: Consider cleaning up backend code
I read it as getting rid of concepts parallel to hadoop in src/org/apache/pig/backend/hadoop/datastorage. Is that true?

thanks,
Arun

On Apr 22, 2010, at 1:34 PM, Dmitriy Ryaboy wrote:

> I kind of dig the concept of being able to plug in a different backend, though I definitely think we should get rid of the dead localmode code. Can you give an example of how this will simplify the codebase? Is it more than just GenericClass foo = new SpecificClass(), and the associated extra files?
>
> -D
Re: Consider cleaning up backend code
On Apr 22, 2010, at 4:38 PM, Richard Ding wrote:

> Yes. The abstraction layer I was referring to is src/org/apache/pig/backend/executionengine and src/org/apache/pig/backend/datastorage.
>
> Thanks,
> -Richard

Thanks for the clarification. +1

Arun
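As a rough sketch of the indirection being discussed, answering Dmitriy's question: the class names below are illustrative, not the actual types under src/org/apache/pig/backend/executionengine. With a single implementation left, the layer reduces to exactly the `GenericClass foo = new SpecificClass()` pattern plus the extra interface files.

```java
// Illustrative sketch of the backend abstraction layer; these are made-up
// names, not the real Pig interfaces.
interface ExecutionEngine {
    String submit(String plan);
}

class HadoopExecutionEngine implements ExecutionEngine {
    public String submit(String plan) {
        return "hadoop:" + plan; // stand-in for launching MR jobs
    }
}

public class Backend {
    public static String run(String plan) {
        // The only remaining binding of interface to implementation.
        // Removing the layer collapses this to a direct call on the
        // Hadoop class, and the interface file disappears.
        ExecutionEngine engine = new HadoopExecutionEngine();
        return engine.submit(plan);
    }
}
```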
[jira] Created: (PIG-1280) Add a pig-script-id to the JobConf of all jobs run in a pig-script
Add a pig-script-id to the JobConf of all jobs run in a pig-script
------------------------------------------------------------------

                 Key: PIG-1280
                 URL: https://issues.apache.org/jira/browse/PIG-1280
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Arun C Murthy

It would be very useful for tools like gridmix if pig could add a 'pig-script-id' to all Map-Reduce jobs spawned by a single pig-script. Potentially we could use this to re-construct the DAG of jobs in gridmix and so on.
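A minimal sketch of the idea: generate one id per script run and stamp it on every spawned job's configuration. Here "pig.script.id" is a hypothetical property name, and java.util.Properties stands in for a Hadoop JobConf.

```java
import java.util.List;
import java.util.Properties;
import java.util.UUID;

// Sketch of stamping a shared script id onto every job spawned by one
// pig-script. "pig.script.id" is a made-up property name; Properties
// stands in for JobConf.
public class ScriptIdStamper {
    private final String scriptId = UUID.randomUUID().toString();

    public String getScriptId() {
        return scriptId;
    }

    // Stamp every job conf of the script's MR plan with the same id, so a
    // tool like gridmix can group the jobs back into one DAG afterwards.
    public void stamp(List<Properties> jobConfs) {
        for (Properties conf : jobConfs) {
            conf.setProperty("pig.script.id", scriptId);
        }
    }
}
```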
Re: Private variables are not eco-friendly
The current model forces people to 'convince' others to open up classes for inheritance at the precise point it is necessary. This is a model which has served, at least, Hadoop very well. So, I think we should not go make every member protected - rather we should open them up one at a time, as and when necessary.

Arun

On Feb 2, 2010, at 7:34 PM, Dmitriy Ryaboy wrote:

> Hi all,
> I keep running into problems trying to extend Pig due to variables being declared private. The latest time around it was in PigSlice -- one can't inherit it and do much meaningful overriding of methods because the input streams are private rather than protected, so I can't change how it gets created. I wound up having to copy+paste the class wholesale, which is unfortunate.
> I know the Slice/Slicer interfaces are going away, but as a general rule -- can we be mindful of folks trying to extend classes, and make inner members protected, rather than private or package?
> Thanks
> -Dmitriy
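A contrived illustration of the extensibility problem Dmitriy describes (the class and field names are made up, not PigSlice's actual members): with a protected stream field a subclass can substitute its own stream; were the field private, the whole class would have to be copied to change how the stream is created.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Contrived sketch; not PigSlice's real members.
class BaseSlice {
    // protected rather than private, so subclasses can replace the stream
    protected InputStream is = new ByteArrayInputStream(new byte[0]);

    public int available() throws Exception {
        return is.available();
    }
}

class DecompressingSlice extends BaseSlice {
    DecompressingSlice(byte[] data) {
        // Possible only because 'is' is protected. If it were private,
        // no override could change how the stream gets created, which is
        // exactly the copy+paste situation described above.
        this.is = new ByteArrayInputStream(data);
    }
}
```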
[jira] Commented: (PIG-1218) Use distributed cache to store samples
[ https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829305#action_12829305 ]

Arun C Murthy commented on PIG-1218:
------------------------------------

I'd also suggest we increase the replication factor for the sample-file in HDFS before adding it to the distributed-cache.

> Use distributed cache to store samples
> --------------------------------------
>
>                 Key: PIG-1218
>                 URL: https://issues.apache.org/jira/browse/PIG-1218
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
> Currently, in the case of skew join and order by, we use a sample that is just written to the dfs (not distributed cache) and, as a result, it gets opened and copied around more than necessary. This impacts query performance and also places unnecessary load on the name node.
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778666#action_12778666 ]

Arun C Murthy commented on PIG-1062:
------------------------------------

bq. It looks like ReduceContext has a getCounter() method. Am I missing a subtlety?

The counters you get from a {Map|Reduce}Context are specific only to that task. One would have to jump through a whole set of hoops (i.e. create a new JobClient, or the equivalent in the new context-object APIs) and query the JobTracker for rolled-up counters, and even then they aren't guaranteed to be completely accurate until job completion; thus I wouldn't recommend that we rely upon them.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch, PIG-1062.patch.3
>
> This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal. PigStorage and BinStorage are now working. SampleLoader and subclasses - RandomSampleLoader, PoissonSampleLoader - need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.
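The task-local vs. rolled-up distinction Arun describes can be sketched as follows; this is a toy model, not the Hadoop Counters API. Each task holds only its own tally, and only an aggregation across tasks (which the JobTracker performs as tasks complete) yields the job-wide count.

```java
import java.util.Map;

// Toy model of why a counter read inside one task is not the job-wide
// value: each task has its own tally, and the rolled-up count is the sum
// over all tasks, only final once the job completes.
public class CounterRollup {
    // What a task sees via its own context: only its local count.
    public static long taskLocal(Map<String, Long> taskCounters, String task) {
        return taskCounters.getOrDefault(task, 0L);
    }

    // What the JobTracker reports: the sum over tasks seen so far.
    public static long jobWide(Map<String, Long> taskCounters) {
        return taskCounters.values().stream().mapToLong(Long::longValue).sum();
    }
}
```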
Re: Proposal to create a branch for contrib project Zebra
On Aug 17, 2009, at 4:38 PM, Santhosh Srinivasan wrote:

> Is there any precedent for such proposals? I am not comfortable with extending committer access to contrib teams. I would suggest that Zebra be made a sub-project of Hadoop and have a life of its own.

There has been sufficient precedent for 'contrib committers' in Hadoop (e.g. Chukwa vis-a-vis the former 'Hadoop Core' sub-project), and it is normal within the Apache world to have committers with specific 'roles', e.g. specific contrib modules, QA, Release/Build etc. (http://hadoop.apache.org/common/credits.html - in fact, Giridharan Kesavan is an unlisted 'release' committer for Apache Hadoop).

I believe it's a desired, nay stated, goal for Zebra to graduate to a Hadoop sub-project eventually, based on which it was voted in as a contrib module by the Apache Pig project. Given these, I don't see any cause for concern here.

Arun

> Santhosh
>
> -----Original Message-----
> From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
> Sent: Monday, August 17, 2009 4:06 PM
> To: pig-dev@hadoop.apache.org
> Subject: Proposal to create a branch for contrib project Zebra
>
> Thanks to the PIG team, the first version of contrib project Zebra (PIG-833) is committed to PIG trunk. In short, Zebra is a table storage layer built for use in PIG and other Hadoop applications.
>
> While we are stabilizing the current version V1 in the trunk, we plan to add more new features to it. We would like to create an svn branch for the new features. We will be responsible for managing Zebra in PIG trunk and in the new branch. We will merge the branch when it is ready. We expect the changes to affect only the 'contrib/zebra' directory.
>
> As a regular contributor to Hadoop, I will be the initial committer for Zebra. As more patches are contributed by other Zebra developers, there might be more committers added through the normal Hadoop/Apache procedure.
>
> I would like to create a branch called 'zebra-v2' with approval from the PIG team.
>
> Thanks, Raghu.
Re: Proposal to create a branch for contrib project Zebra
> That leaves us with contrib committers. Can you point to earlier email threads that cover the topic of giving committer access to contrib projects? Specifically, what does it mean to award someone committer privileges to a contrib project, what are the access privileges that come with such rights, what are the dos/don'ts, etc.

Chukwa was a contrib module prior to its current avatar as a full-fledged sub-project. Its 'contrib committers' Ari Rabkin and Eric Yang became its first committers: http://markmail.org/message/75qvvcigi3qumifp

Unfortunately the email threads for voting in contrib committers are private to the Hadoop PMC, you'll just have to take my word for it. *smile* I did dig up some other examples for you:
http://www.gossamer-threads.com/lists/lucene/java-dev/81122
http://www.nabble.com/ANNOUNCE:-Welcome--as-Contrib-Committer-td21506295.html

Contrib committers have privileges to commit only to their 'module': pig/trunk/contrib/zebra in this case.

> Thirdly, are there instances of contrib committers creating branches?

Branches are a development tool... I don't see the problem with creating/using them.

Arun
[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
[ https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739048#action_12739048 ]

Arun C Murthy commented on PIG-901:
-----------------------------------

bq. This may require some design changes which we should address at some point for these kinds of tests.

Could you please track this with a new jira? Thanks!

> InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
> -------------------------------------------------------------------------------------
>
>                 Key: PIG-901
>                 URL: https://issues.apache.org/jira/browse/PIG-901
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.1
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.4.0
>         Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch, PIG-901-trunk.patch
>
> InputSplit (SliceWrapper) created by Pig is big in size due to the serialized PigContext. SliceWrapper only needs ExecType - so the entire PigContext should not be serialized; only the ExecType should be serialized.
[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
[ https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738740#action_12738740 ]

Arun C Murthy commented on PIG-901:
-----------------------------------

It would be nice to add a test case which (for now) checks to ensure that the size of a serialized 'slice' is less than 500KB or so...
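A self-contained sketch of such a size guard, using plain Java serialization with a stand-in class (the real test would serialize an actual SliceWrapper, and 500KB matches the threshold suggested above):

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of a size guard for a serialized split. Slice is a stand-in for
// SliceWrapper; after the fix it carries only what it needs (the ExecType),
// not the whole PigContext.
public class SliceSizeCheck {
    static class Slice implements Serializable {
        String execType = "mapreduce"; // the one field the split really needs
    }

    // Serialize the object to a byte array and report its size.
    public static int serializedSize(Serializable obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.size();
    }

    // The proposed regression check: serialized slice stays under a limit.
    public static boolean withinLimit(Serializable obj, int limitBytes) throws Exception {
        return serializedSize(obj) <= limitBytes;
    }
}
```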
[jira] Commented: (PIG-878) Pig is returning too many blocks in the InputSplit
[ https://issues.apache.org/jira/browse/PIG-878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733311#action_12733311 ]

Arun C Murthy commented on PIG-878:
-----------------------------------

bq. Should note also that I didn't add any tests because this was a fix for existing functionality, and frankly I'm not exactly sure how to test it.

We could check the #splits returned by the slicer to ensure it's equal to the replication factor of the input files?

> Pig is returning too many blocks in the InputSplit
> --------------------------------------------------
>
>                 Key: PIG-878
>                 URL: https://issues.apache.org/jira/browse/PIG-878
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: 0.4.0
>         Attachments: PIG-878.patch
>
> When SliceWrapper builds a slice, it currently returns the 3 locations for every block in the file it is slicing, instead of the 3 locations for the block covered by that slice. This means Pig's odds of having its maps placed on nodes local to the data go way down.
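The suggested check could be modeled like this: a toy slicer where each slice keeps only its own block's replica hosts, so the number of locations per slice equals the replication factor rather than 3 times the number of blocks. This is an illustrative model, not the real SliceWrapper API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the fixed behavior: one slice per block, each carrying only
// that block's replica hosts (typically 3), not every block's hosts.
public class BlockSlicer {
    // blockHosts: for each block of the file, the hosts holding its replicas.
    // Returns one location list per slice, i.e. per block.
    public static List<List<String>> slice(List<List<String>> blockHosts) {
        return new ArrayList<>(blockHosts);
    }
}
```

A test along the lines of Arun's suggestion would then assert that every slice reports exactly replication-factor locations.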
[jira] Created: (PIG-864) Record graph of execution of Map-Reduce jobs executed by a Pig script
Record graph of execution of Map-Reduce jobs executed by a Pig script
----------------------------------------------------------------------

                 Key: PIG-864
                 URL: https://issues.apache.org/jira/browse/PIG-864
             Project: Pig
          Issue Type: Improvement
            Reporter: Arun C Murthy

It would be useful for offline analysis if Pig were to record the entire graph of Map-Reduce jobs executed by a single Pig script. For starters, a simple 'parent jobid' for each MR job in the graph would be nice.
Re: [VOTE] Release Pig 0.3.0 (candidate 0)
On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote:

> Hi,
> I created a candidate build for the Pig 0.3.0 release. The main feature of this release is support for multiquery, which allows computation to be shared across multiple queries within the same script. We see significant performance improvements (up to an order of magnitude) as a result of this optimization.

+1

I downloaded the release, validated checksums and ran the unit tests successfully.

Arun
Re: [VOTE] Release Pig 0.1.1 (candidate 0)
+1. I downloaded the release, checked the signatures and checksums. All unit tests pass.

Arun

On Nov 25, 2008, at 3:58 PM, Olga Natkovich wrote:

> Hi,
> I have created a candidate build for Pig 0.1.1. This release is almost identical to Pig 0.1.0 with a couple of exceptions:
> (1) It is integrated with hadoop 18
> (2) It has one small bug fix (PIG-253)
> (3) Several UDFs were added to piggybank - pig's UDF repository
>
> The rat report is attached. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup.
>
> Please download, test, and try it out:
> http://people.apache.org/~olga/pig-0.1.1-candidate-0
>
> Should we release this? Vote closes on Wednesday, December 3rd.
>
> Olga