Re: Merge sorting reduce output files

2012-02-28 Thread Robert Evans
Niels, I am not sure I can help with that unless I know better what "a special distribution" means. Unless you are doing a massive amount of processing in your reducer, having a partition that is only close to balancing the distribution is a big win over all of the other options that put the d

Re: Merge sorting reduce output files

2012-02-28 Thread Niels Basjes
Hi Robert, On Tue, Feb 28, 2012 at 21:41, Robert Evans wrote: > I would recommend that you do what TeraSort does and use a different > partitioner, to ensure that all keys within a given range will go to a > single reducer. If your partitioner is set up correctly then all you have > to do is

Re: Merge sorting reduce output files

2012-02-28 Thread Robert Evans
I would recommend that you do what TeraSort does and use a different partitioner, to ensure that all keys within a given range will go to a single reducer. If your partitioner is set up correctly then all you have to do is concatenate the files together, if you even need to do that. Look a
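
The range-partitioning idea described above can be sketched outside Hadoop in a few lines of Python. This is only an illustration, not TeraSort's actual code: the function name make_range_partitioner and the split points are hypothetical, and in a real job the split points would normally come from sampling the input keys rather than being hard-coded.

```python
from bisect import bisect_right


def make_range_partitioner(split_points):
    """Return a function mapping a key to a partition index.

    split_points must be sorted. With N split points there are N + 1
    partitions, and every key in partition i sorts before every key in
    partition i + 1.
    """
    def partition(key):
        # Count how many split points are <= key; that count is the partition.
        return bisect_right(split_points, key)
    return partition


# Hypothetical split points, chosen by hand for this example.
split_points = ["g", "n", "t"]
partition = make_range_partitioner(split_points)

keys = ["apple", "zebra", "mango", "hadoop", "tera", "g"]

# Each bucket stands in for the input of one reducer.
buckets = [[] for _ in range(len(split_points) + 1)]
for k in keys:
    buckets[partition(k)].append(k)

# Each "reducer" sorts only its own keys.
sorted_buckets = [sorted(b) for b in buckets]

# Because the partitions are ordered ranges, plain concatenation of the
# per-reducer outputs is already globally sorted.
concatenated = [k for b in sorted_buckets for k in b]
print(concatenated)
```

The point of the sketch is the last step: once keys are routed by range instead of by hash, no merge pass is needed, only concatenation in partition order.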

Re: Query Regarding design MR job for Billing

2012-02-28 Thread Marcos Ortiz
On 02/27/2012 11:33 PM, Stuti Awasthi wrote: Hi Marcos, Thanks for the pointers. I am also thinking along similar lines. I am doubtful on one point: I will be having separate data files for every interval. Let's take an example: if I have a 5-minute interval file which contains data for 2 hours and 10

Merge sorting reduce output files

2012-02-28 Thread Niels Basjes
Hi, We have a job that outputs a set of files that are several hundred MB of text each. Using the comparators and such we can produce output files that are each sorted by themselves. What we want is to have one giant output file (outside of the cluster) that is sorted. Now we see the following op
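
For reference, merging several individually sorted text files into one sorted file outside the cluster can be sketched as a streaming k-way merge. This is a minimal sketch, not something proposed in the thread: the function name merge_sorted_files is hypothetical, and it assumes every line ends with a newline and that the files are sorted in plain string order (a job using a custom comparator would need a matching key function).

```python
import heapq


def merge_sorted_files(input_paths, output_path):
    """Merge already-sorted text files into one sorted output file.

    Lines are streamed, so memory use depends on the number of input
    files rather than on their size.
    """
    files = [open(p, "r", encoding="utf-8") for p in input_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            # heapq.merge lazily yields the smallest remaining line
            # across all inputs, assuming each input is itself sorted.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
```

A call might look like merge_sorted_files(["part-00000", "part-00001"], "merged.txt"), where the file names are placeholders for the downloaded reducer outputs.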

Re: How can I list all jobs history?

2012-02-28 Thread Jie Li
OK, I'm not sure if there's a better way, but at least you can write a shell script to combine "job -history" and "job -list", like: foreach `hadoop job -list` hadoop job -history $i Jie On Tue, Feb 28, 2012 at 10:47 AM, Pedro Costa wrote: > "hadoop job -list" will only list the JobId Stat

Should splittable Gzip be a "core" hadoop feature?

2012-02-28 Thread Niels Basjes
Hi, Some time ago I had an idea and implemented it. Normally you can only run a single gzipped input file through a single mapper and thus only on a single CPU core. What I created makes it possible to process a Gzipped file in such a way that it can run on several mappers in parallel. I've put

Re: How can I list all jobs history?

2012-02-28 Thread Pedro Costa
"hadoop job -list" will only list the JobId State StartTime UserName Priority SchedulingInfo. The job history will list in detail the time spent on each phase of the job. The problem is that, if I have a list of jobs that completed, the job history only prints the details of the first j

Re: How can I list all jobs history?

2012-02-28 Thread Jie Li
Try "hadoop job -list" :) Jie On Tue, Feb 28, 2012 at 8:37 AM, Pedro Costa wrote: > Hi, > > In MapReduce the command bin/hadoop job -history only lists > the first job. How can I list the history of all jobs? > > -- > Best regards, > >

Re: Unit Testing for Map Reduce

2012-02-28 Thread Brock Noland
Hi, I am a committer on MRUnit. I'd love to help you use it. We have a user-list which you can subscribe to here: http://incubator.apache.org/mrunit/mail-lists.html Cheers, Brock On Tue, Feb 28, 2012 at 1:02 PM, Akhtar Muhammad Din wrote: > Yes, I have checked it before, there is only single