Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Dieter Plaetinck
What do you mean, clunky? IMHO this is quite an elegant, simple, working solution. Sure, this spawns multiple processes, but it beats any API over-complication, IMHO. Dieter On Wed, 18 May 2011 11:39:36 -0500 Patrick Angeles wrote: > kinda clunky but you could do this via shell: > > for $FILE in

Re: outputCollector vs. Localfile

2011-05-20 Thread Harsh J
Mark, On Fri, May 20, 2011 at 10:17 AM, Mark question wrote: > This is puzzling me ... > >  With a mapper producing output of size ~ 400 MB ... which one is supposed > to be faster? > >  1) output collector: which will write to local file then copy to HDFS since > I don't have reducers. A regula

Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
Hello everyone, I am using the wordcount application to test on my Hadoop cluster of 5 nodes. The file size is around 5 GB. It's taking around 2 min 40 sec for execution. But when I am checking the JobTracker web portal, I am seeing only one reducer running. Why so? How can I change the code so

Re: Why Only 1 Reducer is running ??

2011-05-20 Thread James Seigel Tynt
The job could be designed to use one reducer. On 2011-05-20, at 7:19 AM, praveenesh kumar wrote: > Hello everyone, > > I am using wordcount application to test on my hadoop cluster of 5 nodes. > The file size is around 5 GB. > Its taking around 2 min - 40 sec for execution. > But when I am check

Re: Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
I am using the wordcount example that comes along with Hadoop. How can I configure it to make it use multiple reducers? I guess multiple reducers will make it run faster... does it? On Fri, May 20, 2011 at 6:51 PM, James Seigel Tynt wrote: > The job could be designed to use one reducer > >
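If you can edit the driver, the per-job knob is Job.setNumReduceTasks(). Below is a hedged sketch of a WordCount-style driver using the 0.20-era new API; the mapper and reducer are written out here for self-containment rather than copied verbatim from the bundled example, and the choice of 12 is illustrative. With the stock example jar, passing -D mapred.reduce.tasks=12 before the input/output arguments usually works too, assuming the example honors generic options.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiReducerWordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(MultiReducerWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(12);  // request 12 reducers instead of the default single one
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With 12 reducers the output directory simply contains one part file per reducer; the word counts themselves are unchanged.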

Re: Why Only 1 Reducer is running ??

2011-05-20 Thread modemide
What does your mapred-site.xml file say? I've used wordcount and had close to 12 reducers running on a 6-datanode cluster on a 3 GB file. I have a configuration in there which says: mapred.reduce.tasks = 12 The reason I chose 12 was because it was recommended that I choose 2x number of tasktrack

Re: Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
Hi, My mapred-site.xml is pretty simple. It only sets mapred.job.tracker to ub13:54311 (the host and port that the MapReduce job tracker runs at; if "local", then jobs are run in-process as a single map and reduce task). Where should I put the settings that you are suggesting? On Fri, May 20, 2011 at 6:

RE: Why Only 1 Reducer is running ??

2011-05-20 Thread Evert Lammerts
Hi Praveenesh, * You can set the maximum number of reducers per node in your mapred-site.xml using mapred.tasktracker.reduce.tasks.maximum (default set to 2). * You can set the default number of reduce tasks with mapred.reduce.tasks (default set to 1 - this causes your single reducer). * Your jo
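For reference, a small sketch (assuming the 0.20-era property names Evert quotes) that just prints the values the loaded configuration ends up with. It is only a way to check what your *-site.xml files on the classpath actually provide; note that the per-node maximum is read by the tasktracker daemon, so it has to be set in each node's mapred-site.xml, not per job.

import org.apache.hadoop.conf.Configuration;

public class ShowReducerSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration(); // loads core/mapred *-site.xml from the classpath
    // Cluster-side cap: max simultaneous reduce tasks per tasktracker (default 2).
    int perNodeMax = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    // Job-side default: number of reduce tasks per job (default 1).
    int defaultReduces = conf.getInt("mapred.reduce.tasks", 1);
    System.out.println("mapred.tasktracker.reduce.tasks.maximum = " + perNodeMax);
    System.out.println("mapred.reduce.tasks = " + defaultReduces);
  }
}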

Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Brian Bockelman
On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote: > What do you mean clunky? > IMHO this is a quite elegant, simple, working solution. Try giving it to a user; watch them feed it a list of 10,000 files; watch the machine swap to death and the disks uselessly thrash. > Sure this spawns multi
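A middle ground between a shell loop (one process per file) and an unbounded fan-out is a single JVM with a capped thread pool. A hedged sketch using the FileSystem client API follows; the pool size and target directory are made-up values, and this is a sketch of the idea rather than a vetted tool.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BoundedParallelPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    ExecutorService pool = Executors.newFixedThreadPool(4); // cap concurrency at 4 copies
    for (String localFile : args) {
      final Path src = new Path(localFile);
      final Path dst = new Path("/user/me/incoming/" + src.getName()); // assumed target dir
      pool.submit(new Runnable() {
        public void run() {
          try {
            fs.copyFromLocalFile(src, dst); // one upload per pool thread, never 10,000 at once
          } catch (Exception e) {
            System.err.println("failed to copy " + src + ": " + e);
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}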

Configuring jvm metrics in hadoop-0.20.203.0

2011-05-20 Thread Matyas Markovics
Hi, I am trying to get JVM metrics from the new version of Hadoop. I have read the migration instructions and come up with the following content for hadoop-metrics2.properties: *.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink jvm.sink.file.p

Re: outputCollector vs. Localfile

2011-05-20 Thread Mark question
I thought it was, because of the FileBytesWritten counter. Thanks for the clarification. Mark On Fri, May 20, 2011 at 4:23 AM, Harsh J wrote: > Mark, > > On Fri, May 20, 2011 at 10:17 AM, Mark question > wrote: > > This is puzzling me ... > > > > With a mapper producing output of size ~ 400 MB ...
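For reference, a minimal sketch of the map-only setup being discussed (new-API style; the pass-through mapper stands in for whatever really produces the ~400 MB): with zero reducers the collected output goes through the job's OutputFormat straight to the output directory on HDFS, rather than into a local map-output spill for a shuffle.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

  // Stand-in for the real mapper that produces ~400 MB of output.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0); // no reducers: map output goes directly to the output dir on HDFS
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}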

What's the easiest way to count the number of key/value pairs in a directory?

2011-05-20 Thread W.P. McNeill
I've got a directory with a bunch of MapReduce data in it. I want to know how many key/value pairs it contains. I could write a mapper-only process that takes key/value pairs as input and updates a counter, but it seems like this utility should already exist. Does it, or do I have to roll my own? Bonus question,

Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Hi, can I use a Counter to give each record in all reducers a consecutive number? Currently I am using a single Reducer, but it is an anti-pattern. But I need to assign consecutive numbers to all output records in all reducers, and it does not matter how, as long as each gets its own number. If it

Re: What's the easiest way to count the number of key/value pairs in a directory?

2011-05-20 Thread Joey Echeverria
What format is the input data in? At first glance, I would run an identity mapper and use a NullOutputFormat so you don't get any data written. The built-in counters already count the number of key/value pairs read in by the mappers. -Joey On Fri, May 20, 2011 at 9:34 AM, W.P. McNeill wrote: >
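A hedged sketch of that suggestion (new-API style): a do-nothing mapper plus NullOutputFormat, after which the client reads the framework's map-input-records counter. The input format below is only a placeholder, since the actual file format isn't known at this point in the thread, and the counter group name is the 0.20-era one.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CountPairs {

  // Emits nothing; the framework still counts every record handed to map().
  public static class NoOpMapper extends Mapper<Object, Object, Object, Object> {
    @Override
    protected void map(Object key, Object value, Context context) {
      // intentionally empty
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "count pairs");
    job.setJarByClass(CountPairs.class);
    job.setMapperClass(NoOpMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(SequenceFileInputFormat.class); // placeholder: use whatever matches your data
    job.setOutputFormatClass(NullOutputFormat.class);       // write nothing at all
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
    long pairs = job.getCounters()
        .getGroup("org.apache.hadoop.mapred.Task$Counter")  // 0.20-era counter group name
        .findCounter("MAP_INPUT_RECORDS")
        .getValue();
    System.out.println("pairs: " + pairs);
  }
}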

Re: What's the easiest way to count the number of key/value pairs in a directory?

2011-05-20 Thread James Seigel
The cheapest way would be to check the counters as you write them in the first place and keep a running score. :) On 2011-05-20, at 10:35 AM, "W.P. McNeill" wrote: > I've got a directory with a bunch of MapReduce data in it. I want to know > how ma

Re: Can I number output results with a Counter?

2011-05-20 Thread Joey Echeverria
To make sure I understand you correctly, you need a globally unique one-up counter for each output record? If you had an upper bound on the number of records a single reducer could output and you can afford to have gaps, you could just use the task id and multiply that by the max number of records
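A hedged sketch of that gap-tolerant scheme (the types and the per-task bound are assumptions): each reducer starts numbering at taskId * MAX_PER_TASK, so ids are globally unique as long as no reducer emits more than MAX_PER_TASK records, but there will be gaps between the per-reducer ranges.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NumberingReducer extends Reducer<Text, Text, LongWritable, Text> {
  private static final long MAX_PER_TASK = 10000000L; // assumed upper bound on records per reducer
  private long next;

  @Override
  protected void setup(Context context) {
    int taskId = context.getTaskAttemptID().getTaskID().getId(); // 0, 1, 2, ...
    next = (long) taskId * MAX_PER_TASK; // this reducer's private id range
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(new LongWritable(next++), value); // unique across reducers, but with gaps
    }
  }
}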

Re: Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Joey, You understood me perfectly well. I see your first suggestion, but I am not allowed to have gaps. A central service is something I may consider if the single reducer becomes a worse bottleneck than a central service would be. But what are counters for? They seem to be exactly that. Mark On Fri, May 20, 2011 at 12:01 PM,

Re: What's the easiest way to count the number of key/value pairs in a directory?

2011-05-20 Thread W.P. McNeill
The keys are Text and the values are large custom data structures serialized with Avro. I also have counters for the job that generates these files that give me this information, but sometimes... Well, it's a long story. Suffice it to say that it's nice to have a post-hoc method too. :-) The identi

Re: Can I number output results with a Counter?

2011-05-20 Thread Joey Echeverria
Counters are a way to get status from your running job. They don't increment a global state. They locally save increments and periodically report those increments to the central counter. That means that the final count will be correct, but you can't use them to coordinate counts while your job is r
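To make the distinction concrete, here is a tiny illustrative mapper (the group and counter names are made up): a task can only add to a counter, and the aggregated total is only meaningful once the job has completed, which is why counters can't hand out consecutive numbers while the job is running.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Increments are buffered in the task and reported to the JobTracker periodically;
    // there is no way to read the current global value from inside a task.
    context.getCounter("app", "records.seen").increment(1);
    context.write(key, value);
  }
}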

Re: What's the easiest way to count the number of key/value pairs in a directory?

2011-05-20 Thread Joey Echeverria
Are you storing the data in sequence files? -Joey On Fri, May 20, 2011 at 10:33 AM, W.P. McNeill wrote: > The keys are Text and the values are large custom data structures serialized > with Avro. > > I also have counters for the job that generates these files that gives me > this information but

Re: What's the easiest way to count the number of key/value pairs in a directory?

2011-05-20 Thread W.P. McNeill
No.

Re: Can I number output results with a Counter?

2011-05-20 Thread Kai Voigt
Also, with speculative execution enabled, you might see a higher count than you expect while the same task is running multiple times in parallel. When a task gets killed because another instance was quicker, those counters will be removed from the global count, though. Kai On 20.05.2011, at 19:34

Re: Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Thank you, Kai and Joey, for the explanation. That's what I thought about them, but I did not want to miss a "magical" replacement for a central service in the counters. No, there is no magic, just great reality. Mark On Fri, May 20, 2011 at 12:39 PM, Kai Voigt wrote: > Also, with speculative

Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value

2011-05-20 Thread Eduardo Dario Ricci
Hi People, I'm getting started with Hadoop Common and ran into a problem trying to use a cluster. I'm following the steps on this page: http://hadoop.apache.org/common/docs/r0.21.0/cluster_setup.html I did everything, but when I try to format the HDFS, this error happens: I searched for something to help-

Re: Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value

2011-05-20 Thread Todd Lipcon
Hi Eduardo, Sounds like you've configured your dfs.name.dirs to be on HDFS instead of local file paths. -Todd On Fri, May 20, 2011 at 2:20 PM, Eduardo Dario Ricci wrote: > Hy People > > I'm starting in hadoop commom.. and got some problem to try using a > cluster.. > > I'm following the steps of

Using df instead of du to calculate datanode space

2011-05-20 Thread Joe Stein
I came up with a nice little hack to trick hadoop into calculating disk usage with df instead of du http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/ I am running this in production; it works like a charm and I'm already seeing the benefit, woot! I hope it wor

Re: Applications creates bigger output than input?

2011-05-20 Thread elton sky
Thanks Robert, Niels. Yes, I think text manipulation, especially ngrams, is a good application for me. Cheers On Fri, May 20, 2011 at 12:57 AM, Robert Evans wrote: > I'm not sure if this has been mentioned or not but in Machine Learning with > text based documents, the first stage is often a glorif
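As a concrete illustration of output growing past input (purely illustrative code, not from the thread): a mapper that emits every word bigram in a line turns n words into n-1 records, so each word leaves the mapper roughly twice.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BigramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text bigram = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().split("\\s+");
    for (int i = 0; i + 1 < words.length; i++) {
      bigram.set(words[i] + " " + words[i + 1]);
      context.write(bigram, ONE); // every word is emitted in two bigrams, inflating output size
    }
  }
}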

How to see block information on NameNode ?

2011-05-20 Thread praveenesh kumar
Hey! I have a question. If I copy some file onto the HDFS file system, it will get split into blocks and the NameNode will keep all this meta info with it. How can I see that info? I copied a 5 GB file on the NameNode, but I see that file only on the NameNode.. it does not get split into blocks? How can I s
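Two common ways to look at this (the path below is a placeholder): from the shell, hadoop fsck /path/to/file -files -blocks -locations prints each block and the datanodes holding its replicas; programmatically, a small sketch using the client API looks like this.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    // One BlockLocation per block of the file, with the hosts holding each replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i
          + " offset=" + blocks[i].getOffset()
          + " length=" + blocks[i].getLength()
          + " hosts=" + java.util.Arrays.toString(blocks[i].getHosts()));
    }
  }
}

Note that the blocks themselves live on the datanodes' local disks (under dfs.data.dir), so listing the HDFS path will always show one logical file, not its blocks.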