[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-24 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882380#action_12882380
 ] 

Arun C Murthy commented on PIG-1389:


Can we not just increment the standard MR counters rather than inventing new 
ones?

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or 
 cogroup). In both cases, the existing Hadoop counters (e.g. 
 MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number 
 of records in a given input or output.  PIG-1299 addressed the case of 
 multiple outputs.  We need to add new counters for jobs with multiple inputs.
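
The kind of per-input counter the issue asks for can be sketched without Hadoop dependencies; in a real job this map would be a Hadoop counter group keyed by input path or alias. The names below are hypothetical, not Pig's actual counter names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: one record counter per input, so a join of two inputs can report
// separate counts where MAP_INPUT_RECORDS would only report the sum.
public class PerInputCounters {
    private final Map<String, Long> counters = new LinkedHashMap<>();

    // Increment the record counter for one input (e.g. one side of a join).
    public void increment(String inputName) {
        counters.merge(inputName, 1L, Long::sum);
    }

    public long get(String inputName) {
        return counters.getOrDefault(inputName, 0L);
    }
}
```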

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-24 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882381#action_12882381
 ] 

Arun C Murthy commented on PIG-1389:


How many new counters are we really adding here? I only see the counter-groups. 
I'm afraid this will be too many new counters...




Re: Consider cleaning up backend code

2010-04-22 Thread Arun C Murthy

+1

Arun

On Apr 22, 2010, at 11:35 AM, Richard Ding wrote:


Pig has an abstraction layer (interfaces and abstract classes) to
support multiple execution engines. After PIG-1053, Hadoop is the only
execution engine supported by Pig. I wonder if we should remove this
layer of code, and make Hadoop THE execution engine for Pig. This will
greatly simplify the backend code.



Thanks,

-Richard







Re: Consider cleaning up backend code

2010-04-22 Thread Arun C Murthy
I read it as getting rid of concepts parallel to hadoop in
src/org/apache/pig/backend/hadoop/datastorage.


Is that true?

thanks,
Arun

On Apr 22, 2010, at 1:34 PM, Dmitriy Ryaboy wrote:

I kind of dig the concept of being able to plug in a different backend,
though I definitely think we should get rid of the dead localmode code.
Can you give an example of how this will simplify the codebase? Is it
more than just GenericClass foo = new SpecificClass(), and the
associated extra files?


-D











Re: Consider cleaning up backend code

2010-04-22 Thread Arun C Murthy


On Apr 22, 2010, at 4:38 PM, Richard Ding wrote:


Yes.

The abstraction layer I was referring to is
src/org/apache/pig/backend/executionengine and
src/org/apache/pig/backend/datastorage.



Thanks for the clarification. +1

Arun


Thanks,
-Richard

-Original Message-
From: Arun C Murthy [mailto:a...@yahoo-inc.com]
Sent: Thursday, April 22, 2010 4:14 PM
To: pig-dev@hadoop.apache.org
Subject: Re: Consider cleaning up backend code

I read it as getting rid of concepts parallel to hadoop in  src/org/
apache/pig/backend/hadoop/datastorage.

Is that true?

thanks,
Arun













[jira] Created: (PIG-1280) Add a pig-script-id to the JobConf of all jobs run in a pig-script

2010-03-05 Thread Arun C Murthy (JIRA)
Add a pig-script-id to the JobConf of all jobs run in a pig-script
--

 Key: PIG-1280
 URL: https://issues.apache.org/jira/browse/PIG-1280
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Arun C Murthy


It would be very useful for tools like gridmix if pig could add a 
'pig-script-id' to all Map-Reduce jobs spawned by a single pig-script. 
Potentially we could use this to re-construct the DAG of jobs in gridmix and so 
on.
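
One way to realize the proposal can be sketched with java.util.Properties standing in for Hadoop's JobConf. The key name pig.script.id and the helper names are assumptions; the issue does not fix them:

```java
import java.util.List;
import java.util.Properties;
import java.util.UUID;

// Sketch: mint one id per script run and stamp it into the conf of every
// MR job the script spawns, so a tool like gridmix can group the jobs.
public class ScriptIdTagger {
    static final String SCRIPT_ID_KEY = "pig.script.id"; // hypothetical key

    // One id per script run, shared by all jobs of that run.
    static String newScriptId() {
        return UUID.randomUUID().toString();
    }

    static void tag(List<Properties> jobConfs, String scriptId) {
        for (Properties conf : jobConfs)
            conf.setProperty(SCRIPT_ID_KEY, scriptId);
    }
}
```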




Re: Private variables are not eco-friendly

2010-02-03 Thread Arun C Murthy
The current model forces people to 'convince' others to open up
classes for inheritance at the precise point it is necessary. This is
a model which has served, at least, Hadoop very well.

So, I think we should not go make every member protected - rather we
should open them up one at a time, as and when necessary.

Arun

On Feb 2, 2010, at 7:34 PM, Dmitriy Ryaboy wrote:


Hi all,
I keep running into problems trying to extend Pig due to variables
being declared private. The latest time around it was in PigSlice --
one can't inherit it and do much meaningful overriding of methods
because the input streams are private rather than protected, so I
can't change how it gets created. I wound up having to copy+paste the
class wholesale, which is unfortunate. I know the Slice/Slicer
interfaces are going away, but as a general rule -- can we be mindful
of folks trying to extend classes, and make inner members protected
rather than private or package-private?

Thanks
-Dmitriy
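
The problem Dmitriy describes can be shown with a toy class pair; BaseSlice and CountingSlice are made-up names, not Pig's PigSlice. With a protected field and a protected creation hook, a subclass can change how the stream gets created; with private ones, it cannot:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

class BaseSlice {
    // protected rather than private, so subclasses can see and replace it
    protected InputStream is;

    // protected hook: subclasses can override how the stream is created
    protected InputStream createStream(byte[] data) {
        return new ByteArrayInputStream(data);
    }

    void init(byte[] data) {
        is = createStream(data);
    }
}

class CountingSlice extends BaseSlice {
    int opened = 0;

    @Override
    protected InputStream createStream(byte[] data) {
        opened++; // subclass can now observe/alter creation -- no copy+paste
        return super.createStream(data);
    }
}
```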




[jira] Commented: (PIG-1218) Use distributed cache to store samples

2010-02-03 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829305#action_12829305
 ] 

Arun C Murthy commented on PIG-1218:


I'd also suggest we increase the replication factor for the sample file in HDFS 
before adding it to the distributed cache.

 Use distributed cache to store samples
 --

 Key: PIG-1218
 URL: https://issues.apache.org/jira/browse/PIG-1218
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.7.0


 Currently, in the case of skew join and order by, we use a sample that is just 
 written to the dfs (not the distributed cache) and, as a result, it gets opened and 
 copied around more than necessary. This impacts query performance and also 
 places unnecessary load on the name node

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-16 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778666#action_12778666
 ] 

Arun C Murthy commented on PIG-1062:


bq. It looks like ReduceContext has a getCounter() method. Am I missing a 
subtlety?

The counters you get from a {Map|Reduce}Context are specific only to that 
task. One would have to jump through a whole set of hoops (i.e. create a 
new JobClient, or its equivalent in the new context-object APIs, and query the 
JobTracker for rolled-up counters), and even then they aren't guaranteed to be 
completely accurate until job completion; thus I wouldn't recommend that we 
rely upon them.
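
The distinction Arun draws above -- task-local counters versus job-level rolled-up counters -- can be sketched without Hadoop APIs; the class and method names here are illustrative only:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each task only ever sees its own increments; a job-level total exists
// only after aggregating over all completed tasks, which is why reading a
// counter from one task's context cannot give a reliable global count.
class TaskCounters {
    final Map<String, Long> local = new HashMap<>();

    void incr(String name) {
        local.merge(name, 1L, Long::sum);
    }
}

public class JobCounterRollup {
    // Job-level view: the sum of the task-local values for one counter name.
    static long rollup(List<TaskCounters> completedTasks, String name) {
        long total = 0;
        for (TaskCounters t : completedTasks)
            total += t.local.getOrDefault(name, 0L);
        return total;
    }
}
```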

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Attachments: PIG-1062.patch, PIG-1062.patch.3


 This is part of the effort to implement the new load/store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need to 
 be changed to work with the new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Proposal to create a branch for contrib project Zebra

2009-08-17 Thread Arun C Murthy


On Aug 17, 2009, at 4:38 PM, Santhosh Srinivasan wrote:


Is there any precedent for such proposals? I am not comfortable with
extending committer access to contrib teams. I would suggest that Zebra
be made a sub-project of Hadoop and have a life of its own.



There has been sufficient precedent for 'contrib committers' in
Hadoop (e.g. Chukwa vis-a-vis the former 'Hadoop Core' sub-project),
and it is normal within the Apache world for committers to have specific
'roles', e.g. specific contrib modules, QA, Release/Build, etc.
(http://hadoop.apache.org/common/credits.html - in fact, Giridharan
Kesavan is an unlisted 'release' committer for Apache Hadoop).

I believe it's a desired, nay stated, goal for Zebra to eventually
graduate to a Hadoop sub-project, based on which it was voted in as a
contrib module by the Apache Pig project.


Given these, I don't see any cause for concern here.

Arun


Santhosh

-Original Message-
From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
Sent: Monday, August 17, 2009 4:06 PM
To: pig-dev@hadoop.apache.org
Subject: Proposal to create a branch for contrib project Zebra


Thanks to the PIG team, the first version of the contrib project Zebra
(PIG-833) is committed to the PIG trunk.

In short, Zebra is a table storage layer built for use in PIG and other
Hadoop applications.

While we are stabilizing the current version V1 in the trunk, we plan to
add more new features to it. We would like to create an svn branch for the
new features. We will be responsible for managing Zebra in the PIG trunk
and in the new branch. We will merge the branch when it is ready. We
expect the changes to affect only the 'contrib/zebra' directory.

As a regular contributor to Hadoop, I will be the initial committer for
Zebra. As more patches are contributed by other Zebra developers, there
might be more committers added through the normal Hadoop/Apache procedure.

I would like to create a branch called 'zebra-v2' with approval from the
PIG team.

Thanks,
Raghu.




Re: Proposal to create a branch for contrib project Zebra

2009-08-17 Thread Arun C Murthy


That leaves us with contrib committers.

Can you point to earlier email threads that cover the topic of giving
committer access to contrib projects? Specifically, what does it mean to
award someone committer privileges to a contrib project, what are the
access privileges that come with such rights, what are the dos/don'ts,
etc.



Chukwa was a contrib module prior to its current avatar as a
full-fledged sub-project.

Its 'contrib committers' Ari Rabkin and Eric Yang became its first
committers: http://markmail.org/message/75qvvcigi3qumifp

Unfortunately the email threads for voting in contrib committers are
private to the Hadoop PMC; you'll just have to take my word for it.
*smile*

I did dig up some other examples for you:
http://www.gossamer-threads.com/lists/lucene/java-dev/81122
http://www.nabble.com/ANNOUNCE:-Welcome--as-Contrib-Committer-td21506295.html

Contrib committers have privileges to commit only to their 'module':
pig/trunk/contrib/zebra in this case.




Thirdly, are there instances of contrib committers creating branches?


Branches are a development tool... I don't see the problem with
creating/using them.


Arun



[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739048#action_12739048
 ] 

Arun C Murthy commented on PIG-901:
---

bq. This may require some design changes which we should address at some point 
for these kinds of tests.

Could you please track this with a new jira? Thanks!

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch, 
 PIG-901-trunk.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-03 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738740#action_12738740
 ] 

Arun C Murthy commented on PIG-901:
---

It would be nice to add a test case which (for now) checks to ensure that the 
size of a serialized 'slice' is less than 500KB or so...
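
The suggested regression test can be sketched with plain Java serialization; Pig's real splits use Hadoop's Writable serialization, so this only illustrates the size-budget idea, with made-up names:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch: serialize a split-like object into a byte buffer and check that
// its size stays under a budget (e.g. 500KB), so a fat serialized
// PigContext would fail the test instead of bloating every InputSplit.
public class SerializedSizeCheck {
    static int serializedSize(Serializable o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.size();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory I/O; should not happen
        }
    }

    static boolean withinBudget(Serializable o, int maxBytes) {
        return serializedSize(o) <= maxBytes;
    }
}
```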




[jira] Commented: (PIG-878) Pig is returning too many blocks in the InputSplit

2009-07-20 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733311#action_12733311
 ] 

Arun C Murthy commented on PIG-878:
---

bq. Should note also that I didn't add any tests because this was a fix for 
existing functionality, and frankly I'm not exactly sure how to test it. 

We could check the #splits returned by the slicer to ensure it's equal to the 
replication factor of the input files?

 Pig is returning too many blocks in the InputSplit
 --

 Key: PIG-878
 URL: https://issues.apache.org/jira/browse/PIG-878
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Critical
 Fix For: 0.4.0

 Attachments: PIG-878.patch


 When SliceWrapper builds a slice, it currently returns the 3 locations for 
 every block in the file it is slicing, instead of the 3 locations for the 
 block covered by that slice.  This means Pig's odds of having its maps placed 
 on nodes local to the data go way down.
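
The fix the issue describes -- report only the hosts of the block a slice actually covers -- can be sketched as a plain offset-to-block computation (names hypothetical; the real code works with HDFS block locations):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: blockHosts[i] holds the (typically 3) replica hosts of block i.
// A slice starting at sliceOffset should advertise only the hosts of the
// block containing that offset, not the hosts of every block in the file.
public class SliceLocations {
    static List<String> hostsForSlice(long sliceOffset, long blockSize,
                                      String[][] blockHosts) {
        int blockIndex = (int) (sliceOffset / blockSize);
        return Arrays.asList(blockHosts[blockIndex]);
    }
}
```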




[jira] Created: (PIG-864) Record graph of execution of Map-Reduce jobs executed by a Pig script

2009-06-25 Thread Arun C Murthy (JIRA)
Record graph of execution of Map-Reduce jobs executed by a Pig script
-

 Key: PIG-864
 URL: https://issues.apache.org/jira/browse/PIG-864
 Project: Pig
  Issue Type: Improvement
Reporter: Arun C Murthy


It would be useful for offline analysis if Pig were to record the entire graph 
of Map-Reduce jobs executed by a single Pig script.

For starters, a simple 'parent jobid' for each MR job in the graph would be nice.




Re: [VOTE] Release Pig 0.3.0 (candidate 0)

2009-06-22 Thread Arun C Murthy


On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote:


Hi,

I created a candidate build for the Pig 0.3.0 release. The main feature of
this release is support for multiquery, which allows sharing computation
across multiple queries within the same script. We see significant
performance improvements (up to an order of magnitude) as a result of
this optimization.



+1

I downloaded the release, validated checksums and ran the unit tests
successfully.


Arun



Re: [VOTE] Release Pig 0.1.1 (candidate 0)

2008-12-02 Thread Arun C Murthy

+1.

I downloaded the release and checked the signatures and checksums. All
unit tests pass.


Arun

On Nov 25, 2008, at 3:58 PM, Olga Natkovich wrote:


Hi,

I have created a candidate build for Pig 0.1.1. This release is
almost identical to Pig 0.1.0 with a couple of exceptions:


(1) It is integrated with hadoop 18
(2) It has one small bug fix (PIG-253)
(3) Several UDFs were added to piggybank - Pig's UDF repository

The rat report is attached.

Keys used to sign the release are available at
http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup.


Please download, test, and try it out:

http://people.apache.org/~olga/pig-0.1.1-candidate-0

Should we release this? Vote closes on Wednesday, December 3rd.

Olga