Is Intermediate data written to disk?

2010-02-02 Thread bharath v
Hi, I have a small doubt about how Pig handles queries containing a join of more than 2 tables. Suppose we have 3 tables A, B, C and the plan is "((AB)C)". We can join A and B in a MapReduce job and join the resultant table with "C". I have a doubt whether the result of "AB" is stored to disk befor

Re: Various pig questions

2010-02-02 Thread Alan Gates
Answers inlined: On Feb 2, 2010, at 3:15 AM, Guy Jeffery wrote: Hi, Hope this gets to the right list... I'm fairly new to Pig, been playing around with it for a couple of days. Essentially I'm doing a bit of work to evaluate Pig and its ability to simplify the use of Hadoop - basicall

First Official Austin Hadoop User Group - March 18th

2010-02-02 Thread Stephen Watt
Hi Folks, This note is to let you know that we'll be kicking off the inaugural Austin Hadoop User Group on March the 18th. At present, we have speakers lined up from IBM and Rackspace and will cover quite a wide variety of topics along with a few demos. This event will follow directly on the he

Re: storing to different files

2010-02-02 Thread zaki rahaman
Out of curiosity, how many tuples exist in each group's data bag? I'd imagine it's a highly variable number but what order of magnitude are you dealing with? I think it would make more sense to implement this in Java MapReduce using MultipleOutputs or MultipleOutputFormat as these classes were desi

Re: storing to different files

2010-02-02 Thread Jennie Cochran-Chinn
Thanks for the clarification. We went down the path of using a UDF inside the FOREACH after the GROUP as yes, there are >5k unique groups. We can't reduce the number of unique groups as there is a downstream application whose requirements we must meet. To further the question, our current

Re: storing to different files

2010-02-02 Thread Jennie Cochran-Chinn
Thanks for the suggestion - we don't know the groups a priori and there are quite a few of them (~20k) so in our case this won't work. On Jan 31, 2010, at 10:02 PM, Rekha Joshi wrote: If it is Pig 0.3 or higher you would be able to just use the STORE command multiple times in the Pig script to store

Various pig questions

2010-02-02 Thread Guy Jeffery
Hi, Hope this gets to the right list... I'm fairly new to Pig, been playing around with it for a couple of days. Essentially I'm doing a bit of work to evaluate Pig and its ability to simplify the use of Hadoop - basically to allow users without a massive Java background to run Hadoop jobs.

Re: storing to different files

2010-02-02 Thread Ankur C. Goel
Jennie, a Hadoop cluster has an enforced limit on the number of concurrent streams that can be kept open at any time. This limit is the number of concurrent threads that a Datanode can run for doing I/O, specified by the cluster level job config parameter - dfs.datanode.max.xcievers. So