Re: Are hadoop fs commands serial or parallel
What do you mean clunky? IMHO this is a quite elegant, simple, working
solution. Sure, this spawns multiple processes, but it beats any API
over-complication, IMHO.

Dieter

On Wed, 18 May 2011 11:39:36 -0500, Patrick Angeles patr...@cloudera.com wrote:
> Kinda clunky, but you could do this via the shell:
>
>     for FILE in $LIST_OF_FILES; do
>         hadoop fs -copyFromLocal $FILE $DEST_PATH &
>     done
>
> If doing this via the Java API, then yes, you will have to use multiple
> threads.
>
> On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.com wrote:
>> Thanks Harsh! That means basically both the APIs and the hadoop client
>> commands allow only serial writes. I was wondering what other ways there
>> are to write data to HDFS in parallel, other than using multiple threads.
>>
>> Thanks, JJ
>>
>> On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote:
>>> Hello. Adding to Joey's response: copyFromLocal's current implementation
>>> is serial given a list of files.
>>>
>>> On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com wrote:
>>>> Thanks Joey! I will try to find out about copyFromLocal. Looks like the
>>>> Hadoop APIs write serially, as you pointed out.
>>>>
>>>> Thanks, -JJ
>>>>
>>>> On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote:
>>>>> The sequence file writer definitely does it serially, as you can only
>>>>> ever write to the end of a file in Hadoop. Doing copyFromLocal could
>>>>> write multiple files in parallel (I'm not sure if it does or not), but
>>>>> a single file would be written serially.
>>>>>
>>>>> -Joey
>>>>>
>>>>> On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com wrote:
>>>>>> Hi, my question is: when I run a command from an HDFS client, e.g.
>>>>>> hadoop fs -copyFromLocal, or create a sequence file writer in Java
>>>>>> code and append key/values to it through the Hadoop APIs, does it
>>>>>> internally transfer/write data to HDFS serially or in parallel?
>>>>>>
>>>>>> Thanks in advance, -JJ
>>>
>>> --
>>> Harsh J
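For the Java-API route mentioned above, a hedged sketch of parallel copies
with a bounded thread pool (the pool size and error handling are
illustrative; FileSystem.copyFromLocalFile is the call behind
fs -copyFromLocal). Each individual file still streams serially, but
several files are in flight at once without spawning one JVM per file:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelCopy {
      public static void copyAll(List<Path> localFiles, final Path destDir)
          throws Exception {
        final FileSystem fs = FileSystem.get(new Configuration());
        // Bounded pool: 8 concurrent copies instead of one process per file.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (final Path src : localFiles) {
          pool.submit(new Runnable() {
            public void run() {
              try {
                fs.copyFromLocalFile(src, destDir);
              } catch (Exception e) {
                // Illustrative only; real code should collect failures.
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }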
Re: outputCollector vs. Localfile
Mark,

On Fri, May 20, 2011 at 10:17 AM, Mark question markq2...@gmail.com wrote:
> This is puzzling me... With a mapper producing output of size ~400 MB,
> which one is supposed to be faster? 1) The output collector, which will
> write to a local file and then copy to HDFS, since I don't have reducers?

A regular map-only job does not write to the local FS; it writes to HDFS
directly (i.e., to a local DN if one is found).

--
Harsh J
Why Only 1 Reducer is running ??
Hello everyone,

I am using the wordcount application to test on my Hadoop cluster of 5
nodes. The file size is around 5 GB, and execution takes around 2 min 40
sec. But when I check the JobTracker web portal, I see only one reducer
running. Why is that? How can I change the code so that multiple reducers
run as well?

Thanks,
Praveenesh
Re: Why Only 1 Reducer is running ??
The job could be designed to use one reducer.

On 2011-05-20, at 7:19 AM, praveenesh kumar praveen...@gmail.com wrote:
> Hello everyone, I am using the wordcount application to test on my Hadoop
> cluster of 5 nodes. [...]
Re: Why Only 1 Reducer is running ??
I am using the wordcount example that comes along with Hadoop. How can I
configure it to use multiple reducers? I guess multiple reducers will make
it run faster... will they?

On Fri, May 20, 2011 at 6:51 PM, James Seigel Tynt ja...@tynt.com wrote:
> The job could be designed to use one reducer.
>
> On 2011-05-20, at 7:19 AM, praveenesh kumar praveen...@gmail.com wrote:
>> Hello everyone, I am using the wordcount application to test on my
>> Hadoop cluster of 5 nodes. [...]
RE: Why Only 1 Reducer is running ??
Hi Praveenesh,

* You can set the maximum number of reduce tasks per node in your
  mapred-site.xml using mapred.tasktracker.reduce.tasks.maximum
  (default: 2).
* You can set the default number of reduce tasks with mapred.reduce.tasks
  (default: 1 -- this is what causes your single reducer).
* Your job can override this setting by calling Job.setNumReduceTasks(int)
  (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)).

Cheers,
Evert

-----Original Message-----
From: modemide [mailto:modem...@gmail.com]
Sent: Friday, 20 May 2011 15:26
To: common-user@hadoop.apache.org
Subject: Re: Why Only 1 Reducer is running ??

What does your mapred-site.xml file say? I've used wordcount and had close
to 12 reducers running on a 6-datanode cluster with a 3 GB file. I have a
configuration entry in there that says:

    mapred.reduce.tasks = 12

The reason I chose 12 was that it is recommended to use 2x the number of
tasktrackers.

On 5/20/11, praveenesh kumar praveen...@gmail.com wrote:
> Hello everyone, I am using the wordcount application to test on my Hadoop
> cluster of 5 nodes. [...]
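A minimal driver sketch showing the override from code, assuming the stock
WordCount mapper/reducer classes from the Hadoop examples jar are on the
classpath (the reducer count of 12 is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.examples.WordCount;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Overrides mapred.reduce.tasks; e.g. 2x the tasktracker count.
        job.setNumReduceTasks(12);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }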
Re: Are hadoop fs commands serial or parallel
On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:
> What do you mean clunky? IMHO this is a quite elegant, simple, working
> solution.

Try giving it to a user; watch them feed it a list of 10,000 files; watch
the machine swap to death and the disks uselessly thrash.

> Sure, this spawns multiple processes, but it beats any API
> over-complication, IMHO.

Simple doesn't imply scalable, unfortunately.

Brian
Configuring jvm metrics in hadoop-0.20.203.0
Hi,

I am trying to get JVM metrics from the new version of Hadoop. I have read
the migration instructions and come up with the following content for
hadoop-metrics2.properties:

    *.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
    jvm.sink.file.period=2
    jvm.sink.file.filename=/home/ec2-user/jvmmetrics.log

Any help would be appreciated, even if you have a different approach to
getting memory usage from reducers. Thanks in advance.

--
Best Regards,
Matyas Markovics
Re: outputCollector vs. Localfile
I thought it was, because of the FILE_BYTES_WRITTEN counter. Thanks for
the clarification.

Mark

On Fri, May 20, 2011 at 4:23 AM, Harsh J ha...@cloudera.com wrote:
> Mark,
>
> On Fri, May 20, 2011 at 10:17 AM, Mark question markq2...@gmail.com wrote:
>> This is puzzling me... With a mapper producing output of size ~400 MB,
>> which one is supposed to be faster? [...]
>
> A regular map-only job does not write to the local FS; it writes to HDFS
> directly (i.e., to a local DN if one is found).
>
> --
> Harsh J
What's the easiest way to count the number of Key, Value pairs in a directory?
I've got a directory with a bunch of MapReduce data in it. I want to know
how many key/value pairs it contains. I could write a mapper-only process
that takes <Writable, Writable> pairs as input and updates a counter, but
it seems like this utility should already exist. Does it, or do I have to
roll my own?

Bonus question: is there a way to count the number of key/value pairs
without deserializing the values? Deserialization can be expensive for the
data I'm working with.
Can I number output results with a Counter?
Hi,

Can I use a Counter to give each record in all reducers a consecutive
number? Currently I am using a single reducer, but that is an
anti-pattern. I need to assign consecutive numbers to all output records
across all reducers, and it does not matter how, as long as each record
gets its own number.

If it IS possible, how do multiple processes access those counters without
creating race conditions?

Thank you,
Mark
Re: What's the easiest way to count the number of Key, Value pairs in a directory?
What format is the input data in?

At first glance, I would run an identity mapper and use a NullOutputFormat
so you don't get any data written. The built-in counters already count the
number of key/value pairs read in by the mappers.

-Joey

On Fri, May 20, 2011 at 9:34 AM, W.P. McNeill bill...@gmail.com wrote:
> I've got a directory with a bunch of MapReduce data in it. I want to know
> how many key/value pairs it contains. [...]

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
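A minimal sketch of that counting job, assuming a Hadoop 2.x-style API
(Job.getInstance and the TaskCounter enum; on 0.20 the equivalents are
new Job(conf) and Task.Counter) and SequenceFile input -- substitute
whatever InputFormat matches the data:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.TaskCounter;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class CountRecords {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "count kv pairs");
        job.setJarByClass(CountRecords.class);
        job.setMapperClass(Mapper.class);       // base Mapper = identity
        job.setNumReduceTasks(0);               // map-only
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class); // write nothing
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.waitForCompletion(true);
        // The built-in counter already holds the number of records read:
        long records = job.getCounters()
            .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        System.out.println("key/value pairs: " + records);
      }
    }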
Re: What's the easiest way to count the number of Key, Value pairs in a directory?
The cheapest way would be to check the counters as you write the data in
the first place and keep a running score. :)

Sent from my mobile. Please excuse the typos.

On 2011-05-20, at 10:35 AM, W.P. McNeill bill...@gmail.com wrote:
> I've got a directory with a bunch of MapReduce data in it. I want to know
> how many key/value pairs it contains. [...]
Re: Can I number output results with a Counter?
To make sure I understand you correctly: you need a globally unique,
one-up counter for each output record?

If you have an upper bound on the number of records a single reducer can
output, and you can afford to have gaps, you could take the task ID,
multiply it by the maximum number of records, and count one-up from there.

If that doesn't work for you, then you'll need some kind of central
service for allocating numbers, which could become a bottleneck.

-Joey

On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner markkerz...@gmail.com wrote:
> Hi, can I use a Counter to give each record in all reducers a consecutive
> number? [...]

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
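A hedged sketch of that task-ID scheme inside a reducer; MAX_PER_TASK is
an assumed upper bound the job must guarantee itself, and the key/value
types are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NumberingReducer
        extends Reducer<Text, Text, LongWritable, Text> {
      // Assumed upper bound on records any single reducer may emit.
      private static final long MAX_PER_TASK = 10000000L;
      private long next;

      @Override
      protected void setup(Context context) {
        // Reduce task IDs run 0..numReduceTasks-1, so each task owns a
        // disjoint block of numbers (with gaps where a task emits fewer).
        int taskId = context.getTaskAttemptID().getTaskID().getId();
        next = (long) taskId * MAX_PER_TASK;
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          context.write(new LongWritable(next++), value);
        }
      }
    }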
Re: Can I number output results with a Counter?
Joey,

You understood me perfectly well. I see your first suggestion, but I am
not allowed to have gaps. A central service is something I may consider if
a single reducer becomes a worse bottleneck than the service would be.

But what are counters for? They seem to be exactly that.

Mark

On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria j...@cloudera.com wrote:
> To make sure I understand you correctly: you need a globally unique,
> one-up counter for each output record? [...]
Re: What's the easiest way to count the number of Key, Value pairs in a directory?
The keys are Text and the values are large custom data structures
serialized with Avro.

I also have counters from the job that generates these files that give me
this information, but sometimes... well, it's a long story. Suffice it to
say that it's nice to have a post-hoc method too. :-)

The identity mapper sounds like the way to go.
Re: Can I number output results with a Counter?
Counters are a way to get status from your running job. They don't
increment a global state. They locally save increments and periodically
report those increments to the central counter. That means that the final
count will be correct, but you can't use them to coordinate counts while
your job is running.

-Joey

On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner markkerz...@gmail.com wrote:
> Joey, you understood me perfectly well. I see your first suggestion, but
> I am not allowed to have gaps. [...]

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
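To illustrate the "correct at the end, not during" semantics, a minimal
sketch with a custom counter (the group and counter names are
illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper
        extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Increments are buffered task-locally and reported with the
        // periodic heartbeat; other tasks never see them mid-job.
        context.getCounter("app", "records.seen").increment(1);
      }
    }

    // In the driver, the aggregate is only trustworthy after completion:
    //   job.waitForCompletion(true);
    //   long n = job.getCounters()
    //               .findCounter("app", "records.seen").getValue();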
Re: What's the easiest way to count the number of Key, Value pairs in a directory?
Are you storing the data in sequence files?

-Joey

On Fri, May 20, 2011 at 10:33 AM, W.P. McNeill bill...@gmail.com wrote:
> The keys are Text and the values are large custom data structures
> serialized with Avro. [...]

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
Re: What's the easiest way to count the number of Key, Value pairs in a directory?
No.
Re: Can I number output results with a Counter?
Also, with speculative execution enabled, you might see a higher count
than you expect while the same task is running multiple times in parallel.
When a task gets killed because another instance was quicker, its counters
are removed from the global count, though.

Kai

On May 20, 2011, at 19:34, Joey Echeverria wrote:
> Counters are a way to get status from your running job. They don't
> increment a global state. [...]

--
Kai Voigt
k...@123.org
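For jobs where mid-flight counter readings matter, speculative execution
can be switched off so no task ever runs twice in parallel. A hedged
sketch using the 0.20-era property names:

    import org.apache.hadoop.conf.Configuration;

    public class NoSpeculation {
      public static Configuration conf() {
        Configuration conf = new Configuration();
        // No duplicate (speculative) attempts, so live counter readings
        // are never inflated by a task running twice in parallel.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
      }
    }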
Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value
Hi people,

I'm starting out with Hadoop Common and ran into a problem trying to set
up a cluster. I'm following the steps on this page:
http://hadoop.apache.org/common/docs/r0.21.0/cluster_setup.html

I did everything, but when I format the HDFS, this error happens. I
searched for something to help me but didn't find anything. If someone
could help me, I would be thankful.

    Re-format filesystem in /fontes/cluster/namedir ? (Y or N) Y
    11/05/20 16:41:40 ERROR namenode.NameNode: java.io.IOException: Unknown scheme hdfs. It should correspond to a JournalType enumeration value
        at org.apache.hadoop.hdfs.server.namenode.FSImage.checkSchemeConsistency(FSImage.java:269)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.setStorageDirectories(FSImage.java:222)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.init(FSImage.java:178)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1240)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)
    11/05/20 16:41:40 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.217.134
    ************************************************************/

--
Eduardo Dario Ricci
Cel: 14-81354813
MSN: thenigma...@hotmail.com
Re: Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value
Hi Eduardo,

Sounds like you've configured your dfs.name.dir to be an hdfs:// URI
instead of a local file path.

-Todd

On Fri, May 20, 2011 at 2:20 PM, Eduardo Dario Ricci duzas...@gmail.com wrote:
> Hi people, I'm starting out with Hadoop Common and ran into a problem
> trying to set up a cluster. [...]
>
> 11/05/20 16:41:40 ERROR namenode.NameNode: java.io.IOException: Unknown
> scheme hdfs. It should correspond to a JournalType enumeration value
> [...]

--
Todd Lipcon
Software Engineer, Cloudera
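In other words, only the filesystem URI carries the hdfs:// scheme; the
name directories are plain local paths. A sketch of the relevant entries,
reusing the directory from the error message above (host and port are
illustrative; these are the 0.20-era property names, which 0.21 still
accepts as deprecated aliases):

    <!-- core-site.xml: the default filesystem URI does use hdfs:// -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:9000</value>
    </property>

    <!-- hdfs-site.xml: dfs.name.dir takes local paths, not hdfs:// URIs -->
    <property>
      <name>dfs.name.dir</name>
      <value>/fontes/cluster/namedir</value>
    </property>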
Using df instead of du to calculate datanode space
I came up with a nice little hack to trick Hadoop into calculating disk
usage with df instead of du:

http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/

I am running this in production; it works like a charm and I'm already
seeing the benefit, woot! I hope it works well for others too.

/*
Joe Stein
http://www.twitter.com/allthingshadoop
*/
Re: Applications creates bigger output than input?
Thanks Robert, Niels. Yes, I think text manipulation, especially n-grams,
is a good application for me.

Cheers

On Fri, May 20, 2011 at 12:57 AM, Robert Evans ev...@yahoo-inc.com wrote:
> I'm not sure if this has been mentioned or not, but in machine learning
> with text-based documents, the first stage is often a glorified word
> count. Except much of the time they will do N-grams. So:
>
>   Map input:  "Hello this is a test"
>   Map output: Hello / this / is / a / test /
>               Hello this / this is / is a / a test / ...
>
> You may also be extracting all kinds of other features from the text,
> but the tokenization/n-gram step is not that CPU intensive.
>
> --Bobby Evans
>
> On 5/19/11 3:06 AM, elton sky eltonsky9...@gmail.com wrote:
>> Hello, I pick up this topic again, because what I am looking for is
>> something not CPU bound. Augmenting data for ETL and generating an
>> index are good examples; neither requires much CPU time on the map
>> side. The main bottleneck for them is shuffle and merge. Market basket
>> analysis is CPU intensive in the map phase, because it samples all
>> possible combinations of items. I am still looking for more
>> applications that create bigger output and are not CPU bound. Any
>> further ideas? I appreciate it.
>>
>> On Tue, May 3, 2011 at 3:10 AM, Steve Loughran ste...@apache.org wrote:
>>> On 30/04/2011 05:31, elton sky wrote:
>>>> Thank you for the suggestions: weblog analysis, market basket
>>>> analysis and generating a search index. I guess for these
>>>> applications we need more reduces than maps, to handle the large
>>>> intermediate output. Besides, the input split for each map should be
>>>> smaller than usual, because the workload for spill and merge on the
>>>> map's local disk is heavy.
>>>
>>> Any form of rendering can generate very large images. See:
>>> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
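A minimal sketch of such a unigram+bigram mapper, in the spirit of Bobby's
example (the whitespace tokenization and class name are illustrative).
Each input token is emitted roughly twice, so the map output is larger
than the input without the map being CPU bound:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BigramMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text gram = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] tokens = line.toString().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
          gram.set(tokens[i]);                         // unigram
          context.write(gram, ONE);
          if (i + 1 < tokens.length) {
            gram.set(tokens[i] + " " + tokens[i + 1]); // bigram
            context.write(gram, ONE);
          }
        }
      }
    }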