yes, it should be easy to implement.
-Thejas
On 10/12/11 5:26 PM, Dmitriy Ryaboy wrote:
Hah, sorry Aniket!
Hmm isn't nested group just a collected group with relaxed constraints?
D
On Wed, Oct 12, 2011 at 3:38 PM, Thejas Nair wrote:
Aniket added nested-foreach, and Zhijie added nested-cros
Hi Ayon
I have just started working with Pig and am trying out different use cases.
One of my use cases has 10 million records: after grouping them
by a field (say location), I want all the records for a particular location
in a separate file.
I am presently working in local mode.
Kira
I want to compare 2 files,
A.txt and B.txt.
cat A.txt;
(1,2,3)
(4,2,1)
(8,3,4)
(8,3,4)
(4,2,1)
(8,3,4)
(4,2,1)
cat B.txt;
1
2
3
Now I want to check A.$0 == B.$0 for each record, then write the result to a
separate file.
-Original Message-
From: kiranprasad
Sent: Thursday, October 13, 2011 10:49 AM
To: u
Hi Kiranprasad,
What is your use case? Are you sure you have picked the right tool for the job?
Pig/Hadoop is meant for massive datasets, meaning millions or billions of
rows, which in your case would lead to millions or billions of files, which
Hadoop doesn't like anyway.
Now if your dataset is
Refer to the MultiStorage storefunc in contrib/piggybank.
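A minimal sketch of what that could look like, assuming piggybank.jar has been
built under contrib/piggybank/java and that the first field ($0) is the split
key. The jar path and the 'out' directory are assumptions, not something from
the original question.

```pig
-- Sketch only: the jar location and the output directory are assumptions.
REGISTER contrib/piggybank/java/piggybank.jar;

A = LOAD 'E:/data.txt' USING PigStorage(',');

-- MultiStorage(parentPath, splitFieldIndex): each row is written under
-- parentPath, in a subdirectory named after the row's value in field 0.
STORE A INTO 'out' USING org.apache.pig.piggybank.storage.MultiStorage('out', '0');
```

Note that no GROUP is needed here; MultiStorage splits the rows by the chosen
field at store time.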
--jacob
@thedatachef
Sent from my HTC Inspire™ 4G on AT&T
- Reply message -
From: "kiranprasad"
To:
Subject: How to store each record in a seperate file
Date: Wed, Oct 12, 2011 11:35 pm
Hi
After grouping a data set, how do I
Thank you for the quick response, but how can I perform the below in local mode?
-Original Message-
From: Jonathan Coveney
Sent: Thursday, October 13, 2011 10:28 AM
To: user@pig.apache.org ; Ayon Sinha
Subject: Re: How to store each record in a seperate file
To Ayon's point, MultipleOutput
To Ayon's point, MultipleOutputFormat can get the job done, but keep in mind
that Hadoop deals better with larger files than smaller ones. Every file is
allocated in blocks (64MB, 128MB, 256MB), so lots of small blocks are bad.
2011/10/12 Ayon Sinha
> Besides the bigger question of Why would you
Besides the bigger question of why you would want to store each record in a
separate file:
I'm not sure how to do this in Pig but it is definitely possible in Hadoop (and
also streaming) via MultipleOutputFormat where the name of the output file can
be based on the base_dir and key and value. Yo
Hi
After grouping a data set, how do I save each group in a separate file.
ex:
A = LOAD 'E:/data.txt' USING PigStorage(',');
B = GROUP A BY $0;
cat data.txt;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
After grouping
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
How
Bags have to contain tuples. It's how the type is defined.
2011/10/12 Stan Rosenberg
> Hi,
>
> I have three constant udfs in jython:
>
> @outputSchema("m:map[bag{tuple()}]")
> def dummy1():
>     return {"key":[("value1", "value2")]}
>
> @outputSchema("m:map[tuple()]")
> def dummy2():
>     return
Yeah, "group all" is a special case that always has parallelism of 1 (due to
the semantics of grouping by all).
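As a rough illustration (the input path and field names are made up for the
example, and this is a sketch rather than a tested script): every record ends
up under the single key 'all', so asking for more reducers on the GROUP does
not spread the work.

```pig
-- Sketch only: the input path and field names are hypothetical.
A = LOAD 'input' AS (k:chararray, v:int);

-- GROUP ... ALL puts every record under the single key 'all',
-- so one reducer receives the whole bag, whatever PARALLEL asks for.
B = GROUP A ALL PARALLEL 8;

-- COUNT is algebraic, so combiners can shrink the data before the reduce;
-- a non-algebraic function would see the full bag on that one reducer.
C = FOREACH B GENERATE COUNT(A);
```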
On Wed, Oct 12, 2011 at 3:47 PM, Andrew Clegg wrote:
> Something I was wondering the other day... If you do a "group
> all" and then pass the result to a non-algebraic aggregate funct
Hah, sorry Aniket!
Hmm isn't nested group just a collected group with relaxed constraints?
D
On Wed, Oct 12, 2011 at 3:38 PM, Thejas Nair wrote:
> Aniket added nested-foreach, and Zhijie added nested-cross.
> We still need somebody to implement nested group :)
>
> -Thejas
>
>
>
> On 10/12/11 2
Hi,
I have three constant udfs in jython:
@outputSchema("m:map[bag{tuple()}]")
def dummy1():
    return {"key":[("value1", "value2")]}
@outputSchema("m:map[tuple()]")
def dummy2():
    return {"key":("value1", "value2")}
# doesn't work!
@outputSchema("m:map[bag{}]")
def dummy3():
    return {"k
Something I was wondering the other day... If you do a "group
all" and then pass the result to a non-algebraic aggregate function,
will that guarantee that all the records go to a single reducer? Or is
it more subtle than that?
On 12 October 2011 22:08, Norbert Burger wrote:
> For a more detaile
Aniket added nested-foreach, and Zhijie added nested-cross.
We still need somebody to implement nested group :)
-Thejas
On 10/12/11 2:11 PM, Dmitriy Ryaboy wrote:
Hi guys, I know Gianmarco recently worked on the nested foreach -- any
chance nested group got done at the same time? :)
D
Hi guys, I know Gianmarco recently worked on the nested foreach -- any
chance nested group got done at the same time? :)
D
For a more detailed explanation, take a look also at
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features.
In summary:
* The PARALLEL keyword at the operator level overrides any other setting
* SET default_parallel determines reducer count for all blocking operators
(ones tha
set default_parallel 8
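For example, something along these lines (the relation and field names are
hypothetical, and reducer counts only apply in MapReduce mode, not local mode):

```pig
-- Sketch only: the input path and field names are hypothetical.
SET default_parallel 8;               -- script-wide default for reduce-side operators

A = LOAD 'input' AS (k:chararray, v:int);

B = GROUP A BY k;                     -- requests 8 reducers (the default above)
C = GROUP A BY v PARALLEL 4;          -- operator-level PARALLEL overrides the default
```

If the script already has PARALLEL on an operator, as L1.pig does, that value
wins for that operator, so the default only affects operators without one.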
-D
On Wed, Oct 12, 2011 at 11:35 AM, Hui Qi wrote:
> Hi,
> I try to set a reducer number in the following way:
> java -Dmapred.reduce.tasks=8 -cp pig.jar:$HADOOP_HOME/conf
> org.apache.pig.Main ./L1.pig
>
> but it doesn't work, the reducers number remain the same the as 4
Hi,
I am trying to set the number of reducers in the following way:
java -Dmapred.reduce.tasks=8 -cp pig.jar:$HADOOP_HOME/conf
org.apache.pig.Main ./L1.pig
but it doesn't work; the number of reducers remains at 40, which is
the PARALLEL value in L1.pig (L1.pig is from PigMix).
If I delete the parall
That's great. Thanks so much!
-Marco
On Wed, Oct 12, 2011 at 3:36 PM, Norbert Burger wrote:
> Adding FLATTEN to your "grouped-by-multiple-cols" relation
> (iq_per_region_per_gender) will make it much easier to join and visualize.
> Once your join keys are flat string literals ("gender"), then it
Adding FLATTEN to your "grouped-by-multiple-cols" relation
(iq_per_region_per_gender) will make it much easier to join and visualize.
Once your join keys are flat string literals ("gender"), then it's just a
straightforward JOIN/FOREACH.
Here's a fragment that seems to do what you need:
A = LOAD
Take a look at the pig-withouthadoop target in the build.xml from your pig
release. Usage of the target is documented here (although for a different
goal):
http://thedatachef.blogspot.com/2011/01/apache-pig-08-with-cloudera-cdh3.html
Essentially, the target allows you to build pig without hadoo
Yes but I'm still not able to compute the percentage. I've joined the bags
as below.
A = LOAD '/data/marco/foo.csv' USING PigStorage(',') AS (name:chararray,
region:chararray, gender:chararray, iq:int);
iq_per_region_per_gender = GROUP A BY (region, gender);
total_iq_per_gender = GROUP A BY (gend
Sure, just join your total counts with your partials on gender.
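Roughly like this, reusing the schema from Marco's script. The alias names and
the choice of COUNT as the aggregate are assumptions for the sketch; swap in
SUM(A.iq) or whatever the real measure is.

```pig
-- Sketch only: alias names and the use of COUNT are assumptions.
A = LOAD '/data/marco/foo.csv' USING PigStorage(',')
    AS (name:chararray, region:chararray, gender:chararray, iq:int);

-- Partial counts per (region, gender); FLATTEN makes the keys plain fields.
by_region_gender = GROUP A BY (region, gender);
partials = FOREACH by_region_gender GENERATE
    FLATTEN(group) AS (region, gender), COUNT(A) AS cnt;

-- Total counts per gender.
by_gender = GROUP A BY gender;
totals = FOREACH by_gender GENERATE group AS gender, COUNT(A) AS total;

-- Join partials to totals on gender, then compute the percentage.
joined = JOIN partials BY gender, totals BY gender;
percentages = FOREACH joined GENERATE
    partials::region AS region,
    partials::gender AS gender,
    100.0 * partials::cnt / totals::total AS pct;
```

Multiplying by 100.0 first keeps the division in floating point rather than
integer arithmetic.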
D
On Tue, Oct 11, 2011 at 11:58 PM, Marco Cadetg wrote:
> D'oh, I just saw that unfortunately my example was a bit oversimplified.
> The total needs to be grouped by another field like below.
>
> A = LOAD 'student' USING PigStora