Re: nested group?

2011-10-12 Thread Thejas Nair
yes, it should be easy to implement. -Thejas On 10/12/11 5:26 PM, Dmitriy Ryaboy wrote: Hah, sorry Aniket! Hmm isn't nested group just a collected group with relaxed constraints? D On Wed, Oct 12, 2011 at 3:38 PM, Thejas Nair wrote: Aniket added nested-foreach, and Zhijie added nested-cros

Re: How to store each record in a seperate file

2011-10-12 Thread kiranprasad
Hi Ayon, I have just started working with Pig and am trying out different use cases. One of my use cases: there are 10 million records, and after grouping them by a field (say, location), I want all the records for a particular location in a separate file. I am presently working in local mode. Kira

Re: How to store each record in a seperate file

2011-10-12 Thread kiranprasad
I want to compare two files, A.txt and B.txt. cat A; (1,2,3) (4,2,1) (8,3,4) (8,3,4) (4,2,1) (8,3,4) (4,2,1) cat B.txt; 1 2 3 Now I want to compare each A.$0 with B.$0 (A.$0 == B.$0) and then write the result to a separate file. -Original Message- From: kiranprasad Sent: Thursday, October 13, 2011 10:49 AM To: u

Re: How to store each record in a seperate file

2011-10-12 Thread Ayon Sinha
Hi Kiranprasad, What is your use case? Are you sure you have picked the right tool for the job? Pig/Hadoop is meant for massive datasets, which means millions or billions of rows. In your case that would lead to millions or billions of files, which Hadoop doesn't like anyway. Now if your dataset is

Re: How to store each record in a seperate file

2011-10-12 Thread jacob.a.perk...@gmail.com
Refer to the MultiStorage storefunc in contrib/piggybank. --jacob @thedatachef Sent from my HTC Inspire™ 4G on AT&T - Reply message - From: "kiranprasad" To: Subject: How to store each record in a seperate file Date: Wed, Oct 12, 2011 11:35 pm Hi After grouping a data set, how do I

Re: How to store each record in a seperate file

2011-10-12 Thread kiranprasad
Thank you for the quick response, but how can I perform the below in local mode? -Original Message- From: Jonathan Coveney Sent: Thursday, October 13, 2011 10:28 AM To: user@pig.apache.org ; Ayon Sinha Subject: Re: How to store each record in a seperate file To Ayon's point, MultipleOutput

Re: How to store each record in a seperate file

2011-10-12 Thread Jonathan Coveney
To Ayon's point, MultipleOutputFormat can get the job done, but keep in mind that Hadoop deals better with larger files than smaller ones. Every file is allocated in blocks (64MB, 128MB, 256MB), so lots of small blocks are bad. 2011/10/12 Ayon Sinha > Besides the bigger question of Why would you

Re: How to store each record in a seperate file

2011-10-12 Thread Ayon Sinha
Besides the bigger question of why you would want to store each record in a separate file: I'm not sure how to do this in Pig, but it is definitely possible in Hadoop (and also streaming) via MultipleOutputFormat, where the name of the output file can be based on the base_dir and key and value. Yo

How to store each record in a seperate file

2011-10-12 Thread kiranprasad
Hi, after grouping a data set, how do I save each group in a separate file? Example: A = LOAD 'E:/data.txt' USING PigStorage(','); B = GROUP A BY $0; cat data.txt; (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3) After grouping: (1,{(1,2,3)}) (4,{(4,2,1),(4,3,3)}) (7,{(7,2,5)}) (8,{(8,3,4),(8,4,3)}) How
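Editor's sketch: the outcome asked for here (one output file per group key) can be illustrated outside Pig in plain Python. This is not the MultiStorage or MultipleOutputFormat approach suggested in the replies, only a small local illustration of the desired result. The function name `store_groups` and the `group_<key>.txt` file naming are assumptions made for this example.

```python
import csv
import os
import tempfile
from collections import defaultdict


def store_groups(records, out_dir, key_index=0):
    """Group records on one field and write each group to its own file.

    Returns a dict mapping each group key to the path of its file.
    The file naming scheme (group_<key>.txt) is hypothetical.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)

    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for key, rows in groups.items():
        path = os.path.join(out_dir, f"group_{key}.txt")
        with open(path, "w", newline="") as handle:
            # One comma-separated line per record in the group.
            csv.writer(handle, lineterminator="\n").writerows(rows)
        paths[key] = path
    return paths


# Sample rows taken from the question above.
data = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
out_dir = tempfile.mkdtemp()
paths = store_groups(data, out_dir)
```

With the sample rows this produces four files, one each for keys 1, 4, 7 and 8. As the replies point out, this pattern produces many small files when there are many keys, which Hadoop handles poorly.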

Re: jython udfs

2011-10-12 Thread Jonathan Coveney
Bags have to contain tuples. It's how the type is defined. 2011/10/12 Stan Rosenberg > Hi, > > I have three constant udfs in jython: > > @outputSchema("m:map[bag{tuple()}]") > def dummy1(): >return {"key":[("value1", "value2")]} > > @outputSchema("m:map[tuple()]") > def dummy2(): >return
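Editor's sketch: the rule stated in this reply (every element of a bag must be a tuple) can be checked in plain Python before a UDF is handed to Pig. The `outputSchema` function below is only a stand-in so the snippet runs outside Pig, and `is_valid_bag` is a hypothetical helper; neither is taken from the thread.

```python
def outputSchema(schema):
    """Stand-in for the decorator available to Jython UDF scripts in Pig.

    Assumption for this sketch: it only records the schema string on the
    function. Omit this definition when the script runs under Pig.
    """
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("m:map[bag{tuple()}]")
def dummy1():
    # The bag is a list, and every element of that list is a tuple.
    return {"key": [("value1", "value2")]}


def is_valid_bag(value):
    """Return True if value is a list whose items are all tuples."""
    return isinstance(value, list) and all(
        isinstance(item, tuple) for item in value
    )
```

Here `is_valid_bag(dummy1()["key"])` holds, while a list of bare strings such as `["value1", "value2"]` fails the check because its elements are not tuples.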

Re: Is there a way to set reducer number of pig besides using parallel keyword?

2011-10-12 Thread Dmitriy Ryaboy
Yeah, "group all" is a special case that always has parallelism of 1 (due to the semantics of grouping by all). On Wed, Oct 12, 2011 at 3:47 PM, Andrew Clegg wrote: > Something I was wondering the other day... If you do a "group > all" and then pass the result to a non-algebraic aggregate funct

Re: nested group?

2011-10-12 Thread Dmitriy Ryaboy
Hah, sorry Aniket! Hmm isn't nested group just a collected group with relaxed constraints? D On Wed, Oct 12, 2011 at 3:38 PM, Thejas Nair wrote: > Aniket added nested-foreach, and Zhijie added nested-cross. > We still need somebody to implement nested group :) > > -Thejas > > > > On 10/12/11 2

jython udfs

2011-10-12 Thread Stan Rosenberg
Hi, I have three constant udfs in jython: @outputSchema("m:map[bag{tuple()}]") def dummy1(): return {"key":[("value1", "value2")]} @outputSchema("m:map[tuple()]") def dummy2(): return {"key":("value1", "value2")} # doesn't work! @outputSchema("m:map[bag{}]") def dummy3(): return {"k

Re: Is there a way to set reducer number of pig besides using parallel keyword?

2011-10-12 Thread Andrew Clegg
Something I was wondering the other day... If you do a "group all" and then pass the result to a non-algebraic aggregate function, will that guarantee that all the records go to a single reducer? Or is it more subtle than that? On 12 October 2011 22:08, Norbert Burger wrote: > For a more detaile

Re: nested group?

2011-10-12 Thread Thejas Nair
Aniket added nested-foreach, and Zhijie added nested-cross. We still need somebody to implement nested group :) -Thejas On 10/12/11 2:11 PM, Dmitriy Ryaboy wrote: Hi guys, I know Gianmarco recently worked on the nested foreach -- any chance nested group got done at the same time? :) D

nested group?

2011-10-12 Thread Dmitriy Ryaboy
Hi guys, I know Gianmarco recently worked on the nested foreach -- any chance nested group got done at the same time? :) D

Re: Is there a way to set reducer number of pig besides using parallel keyword?

2011-10-12 Thread Norbert Burger
For a more detailed explanation, take a look also at http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features. In summary: * The PARALLEL keyword at the operator level overrides any other setting * SET default_parallel determines reducer count for all blocking operators (ones tha
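Editor's sketch: the precedence described in this thread can be written as a small Python function. It models only what the messages here state: "group all" always runs with parallelism 1, an operator-level PARALLEL overrides other settings, and default_parallel applies otherwise. The function and parameter names are hypothetical, and the fallback when neither setting is given is deliberately left unspecified.

```python
def resolve_reducers(operator_parallel=None, default_parallel=None,
                     group_all=False):
    """Return the reducer count implied by the rules quoted in this thread.

    Returns None when neither setting is given, because that case is not
    covered by the rules modelled here.
    """
    if group_all:
        # "group all" is a special case with parallelism 1.
        return 1
    if operator_parallel is not None:
        # PARALLEL on the operator overrides any other setting.
        return operator_parallel
    if default_parallel is not None:
        # SET default_parallel applies when the operator has no PARALLEL.
        return default_parallel
    return None
```

For the case in the original question, `resolve_reducers(operator_parallel=40, default_parallel=8)` yields 40, which matches the observation that the PARALLEL value in the script wins.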

Re: Is there a way to set reducer number of pig besides using parallel keyword?

2011-10-12 Thread Dmitriy Ryaboy
set default_parallel 8 -D On Wed, Oct 12, 2011 at 11:35 AM, Hui Qi wrote: > Hi, > I try to set a reducer number in the following way: > java -Dmapred.reduce.tasks=8 -cp pig.jar:$HADOOP_HOME/conf > org.apache.pig.Main ./L1.pig > > but it doesn't work, the reducers number remain the same the as 4

Is there a way to set reducer number of pig besides using parallel keyword?

2011-10-12 Thread Hui Qi
Hi, I tried to set the number of reducers in the following way: java -Dmapred.reduce.tasks=8 -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main ./L1.pig but it doesn't work; the number of reducers remains 40, which is the PARALLEL value in L1.pig (L1.pig is from PigMix). If I delete the parall

Re: calculate percentage

2011-10-12 Thread Marco Cadetg
That's great. Thanks so much! -Marco On Wed, Oct 12, 2011 at 3:36 PM, Norbert Burger wrote: > Adding FLATTEN to your "grouped-by-multiple-cols" relation > (iq_per_region_per_gender) will make it much easier to join and visualize. > Once your join keys are flat string literals ("gender"), then it

Re: calculate percentage

2011-10-12 Thread Norbert Burger
Adding FLATTEN to your "grouped-by-multiple-cols" relation (iq_per_region_per_gender) will make it much easier to join and visualize. Once your join keys are flat string literals ("gender"), then it's just a straightforward JOIN/FOREACH. Here's a fragment that seems to do what you need: A = LOAD

Re: classloader question

2011-10-12 Thread Norbert Burger
Take a look at the pig-withouthadoop target in the build.xml from your Pig release. Usage of the target is documented here (although for a different goal): http://thedatachef.blogspot.com/2011/01/apache-pig-08-with-cloudera-cdh3.html Essentially, the target allows you to build pig without hadoo

Re: calculate percentage

2011-10-12 Thread Marco Cadetg
Yes but I'm still not able to compute the percentage. I've joined the bags as below. A = LOAD '/data/marco/foo.csv' USING PigStorage(',') AS (name:chararray, region:chararray, gender:chararray, iq:int); iq_per_region_per_gender = GROUP A BY (region, gender); total_iq_per_gender = GROUP A BY (gend

Re: calculate percentage

2011-10-12 Thread Dmitriy Ryaboy
Sure, just join your total counts with your partials on gender. D On Tue, Oct 11, 2011 at 11:58 PM, Marco Cadetg wrote: > D'oh I just see that unfortunately my example was a bit over simplified. > The > total needs to be grouped by another field like below. > > A = LOAD 'student' USING PigStora
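Editor's sketch: the advice in this thread (compute partials per (region, gender), compute totals per gender, then join the two on gender) can be illustrated in plain Python. The row layout (name, region, gender, iq) follows the schema shown elsewhere in the thread. Summing iq rather than counting rows is an assumption, as are the function name and the sample rows.

```python
from collections import defaultdict


def iq_share_by_region(rows):
    """Return {(region, gender): percentage of that gender's total iq}.

    rows are (name, region, gender, iq) tuples. Assumes every gender has
    a non-zero total, otherwise the division below fails.
    """
    partial = defaultdict(int)  # (region, gender) -> summed iq
    total = defaultdict(int)    # gender -> summed iq
    for _name, region, gender, iq in rows:
        partial[(region, gender)] += iq
        total[gender] += iq

    # The "join on gender": look up each partial's total by its gender.
    return {
        (region, gender): 100.0 * value / total[gender]
        for (region, gender), value in partial.items()
    }


# Hypothetical sample rows, not taken from the thread.
rows = [
    ("a", "north", "f", 100),
    ("b", "south", "f", 300),
    ("c", "north", "m", 50),
    ("d", "north", "m", 150),
]
shares = iq_share_by_region(rows)
```

In Pig the same shape would be two GROUP statements followed by a JOIN on gender, as suggested in the replies; the dictionary lookup plays the role of that join here.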