Hi,
I have a small doubt about how Pig handles queries containing a join of more
than two tables.
Suppose we have three tables A, B, C, and the plan is "((AB)C)". We can join
A and B in one MapReduce job and then join the resulting table with C. My
doubt is whether the result of "AB" is stored to disk before the second join runs.
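For concreteness, here is a minimal Pig Latin sketch of that plan (the relation
and key names are made up). On the MapReduce backend, each JOIN statement
typically compiles into its own job, with the first join's result materialized
to a temporary HDFS location that the second job then reads:

  A = LOAD 'A' AS (k:int, av:chararray);
  B = LOAD 'B' AS (k:int, bv:chararray);
  C = LOAD 'C' AS (k:int, cv:chararray);
  AB  = JOIN A BY k, B BY k;        -- first MapReduce job
  ABC = JOIN AB BY A::k, C BY k;    -- second job, reading AB's intermediate output
  STORE ABC INTO 'out';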
Answers inlined:
On Feb 2, 2010, at 3:15 AM, Guy Jeffery wrote:
Hi,
Hope this gets to the right list...
I'm fairly new to Pig, been playing around with it for a couple of
days.
Essentially I'm doing a bit of work to evaluate Pig and its ability to
simplify the use of Hadoop - basically to allow users without a massive
Java background to run Hadoop jobs.
Hi Folks
This note is to let you know that we'll be kicking off the inaugural
Austin Hadoop User Group on March the 18th. At present, we have speakers
lined up from IBM and Rackspace, and the talks will cover a wide variety of
topics along with a few demos. This event will follow directly on the
heels of …
Out of curiosity, how many tuples exist in each group's data bag? I'd
imagine it's a highly variable number but what order of magnitude are you
dealing with? I think it would make more sense to implement this in Java
MapReduce using MultipleOutputs or MultipleOutputFormat, as these classes
were designed for writing a job's records to multiple output files.
Thanks for the clarification. We went down the path of using a UDF
inside the FOREACH after the GROUP as, yes, there are >5k unique
groups. We can't reduce the number of unique groups, as there is a
downstream application whose requirements we must meet.
To further the question, our current …
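For readers following the thread, the FOREACH-after-GROUP shape being
described is roughly the following; the UDF name and schema here are
hypothetical:

  REGISTER myudfs.jar;              -- jar holding the hypothetical UDF
  records = LOAD 'input' AS (grp:chararray, val:int);
  grouped = GROUP records BY grp;   -- one bag of tuples per unique group
  -- the UDF receives each group's bag and writes it out, e.g. one file per group
  out = FOREACH grouped GENERATE group, myudfs.StorePerGroup(records);
  STORE out INTO 'status';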
Thanks for the suggestion - we don't know the groups a priori, and there
are quite a few of them (~20k), so in our case this won't work.
On Jan 31, 2010, at 10:02 PM, Rekha Joshi wrote:
If it is Pig 0.3 or higher, you should be able to just use the STORE command
multiple times in the Pig script to store each relation to its own output location.
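As a sketch of that suggestion (relation names and filter values are made up,
and it assumes the groups are known up front, which is exactly the limitation
noted in the reply above):

  records = LOAD 'input' AS (grp:chararray, val:int);
  SPLIT records INTO ga IF grp == 'a', gb IF grp == 'b';
  STORE ga INTO 'out/a';   -- each STORE writes its own output location
  STORE gb INTO 'out/b';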
Jennie,
A Hadoop cluster has an enforced limit on the number of concurrent
streams that can be kept open at any time.
This limit is the number of concurrent threads a DataNode can run for
doing I/O, set by the cluster-level configuration parameter
dfs.datanode.max.xcievers (the misspelling is part of the parameter name).
So …
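For reference, that parameter is set in hdfs-site.xml on the DataNodes. A
minimal sketch, with an illustrative value only (the right number depends on
the cluster's workload):

  <!-- hdfs-site.xml: raise the DataNode's concurrent I/O thread cap -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

The DataNodes would need a restart for a new value to take effect.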