Re: Easy Question
Hi Neil, Thanks for responding. Basically formatting removes all my files; is there a way not to? I didn't think about checking the log. Thanks, Maha

On Oct 4, 2010, at 10:54 PM, Neil Ghosh wrote: Maha, Is there any specific reason you don't want to format the namenode? Did you check the log to see why the namenode is not starting? Thanks, Neil

On Tue, Oct 5, 2010 at 10:54 AM, maha m...@umail.ucsb.edu wrote: Hi folks, I'm sure this is easy for you guys, so please let me know. What's the solution when the NameNode doesn't start, other than formatting it? I also tried stop-dfs.sh and starting it again over and over, with no luck until I format it :( Please help and thank you, Maha

-- Thanks and Regards Neil http://neilghosh.com
Fwd: Easy Question
Sorry Harsh, and thanks for the advice. I'm new to Hadoop and didn't think to read the logs. But you're right. Now the DATANODE is not starting:

2010-10-04 23:09:22,812 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /private/tmp/hadoop-Hadoop/dfs/data: namenode namespaceID = 200395975; datanode namespaceID = 1970823831
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

But again, what is incompatible here? I think it's because I added dfs.data.dir as /user/Hadoop/hdfs/data and dfs.name.dir as user/Hadoop/hdfs/name in core-site.xml. Maha

On Oct 4, 2010, at 11:08 PM, Harsh J wrote: The logs tell you what the problem is precisely 99% of the time. Formatting is not the only solution. How and when does your node go down? Give the list some more information so we can help you better :)
Re: Easy Question
Hi Maha, try the following: go to your dfs.data.dir/current directory. You will find a file named VERSION; edit the namespaceID in it so that it matches the namenode's namespaceID from the log (200395975 in your previous post). Then restart Hadoop (bin/start-all.sh) and check whether all the daemons come up. Regards, Matthew
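For reference, the datanode's VERSION file under dfs.data.dir/current is a small plain-text properties file. A rough sketch of what it typically looks like is below; every value is illustrative except the two namespaceIDs taken from the error above, and the namespaceID line is the only one that should change (to the namenode's value):

#Tue Oct 05 10:00:00 PDT 2010
namespaceID=1970823831    (change this to 200395975, the namenode's value from the log)
storageID=DS-0000000000-127.0.0.1-50010-0000000000000
cTime=0
storageType=DATA_NODE
layoutVersion=-18

This workaround only makes sense when the namenode was reformatted while the datanode kept its old data directory; if the data is disposable, clearing dfs.data.dir and restarting the datanode also works.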
Help!! The problem about Hadoop
Hi all, I built an application using Hadoop. I take 1GB of text data as input, and the results are as follows: (1) on a cluster of 3 PCs, the time consumed is about 1020 seconds; (2) on a cluster of 4 PCs, the time is about 680 seconds. But the same application before I used Hadoop takes about 280 seconds, so at this rate I would need about 8 PCs just to match the original speed. Now my question: is this to be expected? Thanks, Jander
Re: Re: Help!! The problem about Hadoop
Hi Jeff, Thank you very much for your reply. I know Hadoop has overhead, but is it really this large in my case? The 1GB text input produces about 500 map tasks, because the input is composed of many little text files, and each map task takes between 8 and 20 seconds. I already use compression via conf.setCompressMapOutput(true). Thanks, Jander

At 2010-10-05 16:28:55, Jeff Zhang zjf...@gmail.com wrote: Hi Jander, Hadoop has overhead compared to a single-machine solution. How many tasks did you get when you ran your Hadoop job? And how long does each map and reduce task take? There are lots of tips for performance tuning Hadoop, such as compression and JVM reuse. -- Best Regards Jeff Zhang
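For reference, this is roughly what the job-level tuning Jeff mentions looks like in the driver, using the old JobConf API that appears elsewhere in these threads. It is only a sketch: MyJob is a placeholder class name and the codec choice is illustrative.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

// In the job driver, before submitting the job:
JobConf conf = new JobConf(MyJob.class);       // MyJob: placeholder for your driver class
conf.setCompressMapOutput(true);               // compress intermediate map output
conf.setMapOutputCompressorClass(GzipCodec.class);
conf.setNumTasksToExecutePerJvm(-1);           // JVM reuse (mapred.job.reuse.jvm.num.tasks = -1)

Whether JVM reuse helps here depends on how much of the 8-20 seconds per map task is JVM startup versus actual work.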
Re: is there no streaming.jar file in hadoop-0.21.0??
Edward, Yep, you should use the one from contrib/. Alejandro

On Tue, Oct 5, 2010 at 1:55 PM, edward choi mp2...@gmail.com wrote: Thanks, Tom. Didn't expect the author of THE BOOK to answer my question. Very surprised and honored :-) Just one more question if you don't mind. I read on the Internet that in order to use Hadoop Streaming in Hadoop-0.21.0 you should run: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar args (of course, I don't see any hadoop-streaming.jar in $HADOOP_HOME). But according to your reply I should run: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar args. I suppose the latter is the way to go? Ed.

2010/10/5 Tom White t...@cloudera.com: Hi Ed, The directory structure moved around as a result of the project splitting into three subprojects (Common, HDFS, MapReduce). The streaming jar is in mapred/contrib/streaming in the distribution. Cheers, Tom

On Mon, Oct 4, 2010 at 8:03 PM, edward choi mp2...@gmail.com wrote: Hi, I've recently downloaded Hadoop-0.21.0. After the installation, I noticed that the contrib directory that used to exist in Hadoop-0.20.2 is gone, so I was wondering whether there is no hadoop-0.21.0-streaming.jar file in Hadoop-0.21.0. Has anyone had any luck finding it? If the way to use streaming has changed in Hadoop-0.21.0, please tell me how. Appreciate the help, thx. Ed.
Re: Re: Help!! The problem about Hadoop
Or you could try using MultiFileInputFormat for your MR job: http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html Alejandro

On Tue, Oct 5, 2010 at 4:55 PM, Harsh J qwertyman...@gmail.com wrote: 500 small files comprising one gigabyte? Perhaps you should try concatenating them all into one big file and see; a mapper should ideally run for at least a minute, and small files don't make good use of the HDFS block feature. Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ -- Harsh J www.harshj.com
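One common way to do the concatenation the linked post describes is to pack the small inputs into a single SequenceFile keyed by filename and run the job over that instead. A rough sketch, assuming the small files are on local disk; the class name and argument layout are made up for illustration:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);   // first arg: HDFS path for the packed SequenceFile
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (int i = 1; i < args.length; i++) {   // remaining args: local small files to pack
        File f = new File(args[i]);
        byte[] data = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(data);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(data));  // key = filename, value = contents
      }
    } finally {
      writer.close();
    }
  }
}

The job would then read the packed file with SequenceFileInputFormat instead of 500 separate text splits.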
how to set different VM parameters for mappers and reducers?
Hello, I have mappers that do not need much RAM, but my combiners and reducers need a lot. Is it possible to set different VM parameters for mappers and reducers? PS: I often face an interesting problem: on the same set of data I sometimes get java.lang.OutOfMemoryError: Java heap space in the combiner, but not every time. What could be the cause of such behavior? My guess is that because I have mapred.job.reuse.jvm.num.tasks=-1, the JVM GC doesn't always run when it should. Thanks in advance, Vitaliy S
RE: how to set different VM parameters for mappers and reducers?
Hi, You don't say which version of Hadoop you are using. Going from memory, I believe in the CDH3 release from Cloudera there are some specific OPTS settings you can set in hadoop-env.sh. HTH -Mike
Re: how to set different VM parameters for mappers and reducers?
You can set mapred.child.java.opts in mapred-site.xml. BTW, the combiner can be run on both the map side and the reduce side.

-- Best Regards Jeff Zhang
Re: how to set different VM parameters for mappers and reducers?
The following 2 properties should work:
mapred.map.child.java.opts
mapred.reduce.child.java.opts
Alejandro
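If it helps, this is roughly how those properties would be set per job in the driver, using the old JobConf API. A sketch only: MyJob and the -Xmx values are made-up placeholders, and note that stock 0.20.2 generally only honors the combined mapred.child.java.opts, with the split map/reduce variants appearing in later releases and CDH builds.

import org.apache.hadoop.mapred.JobConf;

// In the job driver, before submitting the job:
JobConf conf = new JobConf(MyJob.class);                  // MyJob: placeholder driver class
conf.set("mapred.map.child.java.opts", "-Xmx256m");       // smaller heap for map tasks
conf.set("mapred.reduce.child.java.opts", "-Xmx2048m");   // larger heap for reduce (and reduce-side combine)
conf.set("mapred.child.java.opts", "-Xmx1024m");          // fallback used by versions without the split properties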
Re: how to set different VM parameters for mappers and reducers?
I'm using Apache hadoop-0.20.2 - the most recent version I found in the Maven central repo. Regards, Vitaliy S
Set number of reducers per machine.
Hi, I am trying to run a job on my Hadoop cluster, where I consistently get a heap space error. I increased the heap space to 4 GB in hadoop-env.sh and rebooted the cluster. However, I still get the heap space error. One of the things I want to try is to reduce the number of map/reduce processes per machine. Currently each machine can have 2 map and 2 reduce processes running; I want to configure Hadoop to run 1 map and 1 reduce per machine to give more heap space to each process. How can I configure the number of mappers and reducers per node? Thanks in advance, -- Pramod
Re: Set number of reducers per machine.
You can set the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties in your mapred-site.xml file, but you may also want to check your current mapred.child.java.opts and mapred.child.ulimit values to make sure they aren't overriding the 4GB you set globally. Cheers, Marcos
Re: Problem with DistributedCache after upgrading to CDH3b2
I'm experiencing the same problem. I was hoping there would be a reply to this. Anyone? Bueller? -Kim

On Fri, Jul 16, 2010 at 1:58 AM, Jamie Cockrill jamie.cockr...@gmail.com wrote: Dear All, We recently upgraded from CDH3b1 to b2, and ever since, all our mapreduce jobs that use the DistributedCache have failed. Typically, we add files to the cache prior to job startup using addCacheFile(URI, conf), and then get them on the other side using getLocalCacheFiles(conf). I believe the hadoop-core versions for these are 0.20.2+228 and +320 respectively. We then open the files and read them in using a standard FileReader, using the toString of the path object as the constructor parameter, which has worked fine up to now. However, we're now getting FileNotFound exceptions when the file reader tries to open the file. Unfortunately the cluster is on an airgapped network, but the FileNotFound line comes out like: java.io.FileNotFoundException: /tmp/hadoop-hadoop/mapred/local/taskTracker/archive/master/path/to/my/file/filename.txt/filename.txt Note, the duplication of filename.txt is deliberate. I'm not sure if that's strange or not, as this has previously worked absolutely fine. Has anyone else experienced this? Apologies if this is known; I've only just joined the list. Many thanks, Jamie
Re: Problem with DistributedCache after upgrading to CDH3b2
Hi Kim, We didn't fix it in the end. I just ended up manually writing the files to the cluster using the FileSystem class, and then reading them back out again on the other side. Not terribly efficient: I guess the point of DistributedCache is that the files get distributed to every node, whereas I'm only writing to two or three nodes, and then every map task tries to read back from the two or three nodes the data are stored on. Unfortunately I didn't have the will or inclination to investigate it any further, as I had some pretty tight deadlines to keep to and it hasn't caused me any significant problems yet... Thanks, Jamie
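For anyone hitting this, a minimal sketch of the DistributedCache pattern being described, using the old-style API; the file path is illustrative, and given the duplicated-directory problem above it may be safer to open whatever local paths Hadoop hands back rather than rebuilding them by hand:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// In the job driver, before submitting the job:
JobConf conf = new JobConf(MyJob.class);                    // MyJob: placeholder driver class
DistributedCache.addCacheFile(
    new URI("/path/to/my/file/filename.txt"), conf);        // HDFS path, illustrative

// In the mapper/reducer, e.g. inside configure(JobConf job):
Path[] cached = DistributedCache.getLocalCacheFiles(job);
if (cached != null) {
  for (Path p : cached) {
    // Open with the local path Hadoop returns, instead of reconstructing it manually
    java.io.BufferedReader reader =
        new java.io.BufferedReader(new java.io.FileReader(p.toString()));
    // ... read the file, then close ...
    reader.close();
  }
}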
Re: Datanode Registration DataXceiver java.io.EOFException
We use Ganglia for monitoring our cluster, plus a Nagios plugin that interfaces with the gmetad node to set up various rules around the number of datanodes, missing/corrupted blocks, etc. http://www.cloudera.com/blog/2009/03/hadoop-metrics/ http://exchange.nagios.org/directory/Plugins/Network-and-Systems-Management/Others/check_ganglia/details

On Mon, 04 Oct 2010 15:46:19 +0200, Arthur Caranta art...@caranta.com wrote: On 04/10/10 15:42, Steve Loughran wrote: On 04/10/10 14:30, Arthur Caranta wrote: Damn, I found the answer to this problem, thanks to someone on the #hadoop IRC channel... It was a network check I added for our supervision, so every 5 minutes the supervision connects to the datanode port to check if it is alive and then disconnects. [Steve:] Why not just GET the various local pages and let your HTTP monitoring tools do the work? [Arthur:] True... however the TCP method was the fastest to implement and script with our current supervision system, but I think I might be switching monitoring methods.
Re: conf.setCombinerClass in Map/Reduce
Hi, thanks for the answer, Antonio. I have found one of the main problems. It was because I used MultipleOutputs in the Reduce class, so when I set both the Combiner and the Reducer, the Combiner did not provide the normal data flow to the Reducer; the program stopped at the Combiner and no Reducer actually ran. To solve this, I have to use both outputs:

OutputCollector collector = multipleOutputs.getCollector(stringlabel, keyText, reporter);
collector.collect(keyText, value);
output.collect(key, value);

The collector generates the separate output files, and output.collect makes sure the data still flows on to the Reducer. After this change, both the Combiner and the Reducer work. The remaining question: if I want to use both the Combiner and the Reducer, must the input and output of the Reduce class be the same K2,V2? Otherwise, how do I do it? The use case seems very limited here; for example, what if the Reducer class is a little more complicated, with input K2,V2 and output K3,V3? Thanks again. Shi

On 2010-10-5 23:48, Antonio Piccolboni wrote: On Tue, Oct 5, 2010 at 4:32 PM, Shi Yush...@uchicago.edu wrote: Hi, I am still confused about the effect of using a Combiner in Hadoop Map/Reduce. The performance tips ( http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/) suggest writing a combiner to do initial aggregation before the data hits the reducer, for performance advantages. But in most of the example code or books I have seen, the same reduce class is set as both the reducer and the combiner, such as conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); I don't know the specific reason for doing it like this. In my own code based on Hadoop 0.19.2, if I set the combiner class to the reduce class using MultipleOutputs, the output files are named xxx-m-0, and if there are multiple input paths, the number of output files equals the number of input paths. conf.setNumReduceTasks(int) no longer controls the number of output files. I wonder where the reducer-generated outputs are in this case, because I cannot see them. To see the reducer output, I have to remove the combiner class: //conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); and then I get output files named xxx-r-0 and can control the output file number with conf.setNumReduceTasks(int). So my question is: what is the main advantage of setting the combiner class and reducer class to the same reduce class? [Antonio:] When the calculation performed by the reducer is commutative and associative, with a combiner you get more work done before the shuffle, less sorting and shuffling, and less work in the reducer. Like in the word count app, the mapper emits (the, 1) a billion times, but with a combiner equal to the reducer only (the, 10^9) has to travel to the reducer. If you couldn't use the combiner, not only would the shuffle phase be as heavy as if you had a billion distinct words, but the poor reducer that gets the "the" key would also be very slow, so you would have to go through multiple mapreduce phases to aggregate the data anyway. [Shi:] How do I merge the output files in this case? [Antonio:] While I am not sure what you mean, there is no difference to you; the output is the same. [Shi:] And where can I find a real example using different Combiner/Reducer classes to improve map/reduce performance? [Antonio:] If you want to compute an average, the combiner needs to do only sums; the reducer does the sums and the final division. It would not be OK to divide in the combiner.
See also http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/ The interfaces of the reducer and the combiner are the same, but they need not be the same class. Antonio

Thanks. Shi

-- Postdoctoral Scholar Institute for Genomics and Systems Biology Department of Medicine, the University of Chicago Knapp Center for Biomedical Discovery 900 E. 57th St. Room 10148 Chicago, IL 60637, US Tel: 773-702-6799
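To make the average example above concrete, here is a rough sketch with different combiner and reducer classes, using the old 0.19/0.20 API seen in this thread. The class names and the "sum,count" Text encoding are just one way to keep the combiner's input and output types identical to the map output type; the mapper (not shown) would emit (key, "x,1") for each observation x.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Combiner: only sums, keeping (Text, Text) -> (Text, Text) so its output matches the map output type.
// (Each class would live in its own file, or as static inner classes of the driver.)
public class AvgCombiner extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    output.collect(key, new Text(sum + "," + count));  // partial sum and count
  }
}

// Reducer: sums the partial sums/counts and does the final division.
public class AvgReducer extends MapReduceBase implements Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    output.collect(key, new DoubleWritable(sum / count));  // final average
  }
}

// Driver wiring (partial): conf.setMapOutputKeyClass(Text.class); conf.setMapOutputValueClass(Text.class);
// conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(DoubleWritable.class);
// conf.setCombinerClass(AvgCombiner.class); conf.setReducerClass(AvgReducer.class);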
Re: Efficient query to directory-num-files?
On 2010, Oct 04, at 11:38 AM, Harsh J wrote: On Mon, Oct 4, 2010 at 11:11 PM, Keith Wiley kwi...@keithwiley.com wrote:
- I want to know how many files are in a directory.
- Well, actually, I want to know how many files are in a few thousand directories.
- I anticipate the answer to be approximately four million.
- If I were to pipe hadoop fs -ls | wc, I estimate about 360MB of textual ls output returned to my client (each hadoop ls entry is about 90B, since it is always ls -l style), when all I really want is the file count.
Is there a smarter way to do this? Thanks.

[Harsh:] There's a FileSystem.listStatus(...).length you could use, in Java (cook up a utility for it if you need it on the command line; it's what the FsShell does anyway when you use it via 'hadoop fs/dfs'). But I do not know if this will indeed reduce the querying time, as it seems to create an array of all the entries under a path. I could not find a direct counting command, as even the count given by the FsShell seems to work this way. Trying it on some 50,000 items I created for testing, it seemed quick enough. I wouldn't know about 4 million though! Try it out and wait for better answers, if any! :)

Thanks, I'll take a look at that and see what I can do with it. Cheers!

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com "What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered." -- Keith Wiley
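For what it's worth, a minimal sketch of the listStatus approach Harsh describes; the counting logic is illustrative, and for millions of entries the namenode still has to enumerate them, so this mainly saves client-side text rather than namenode work:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CountFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long total = 0;
    for (String dir : args) {                    // each argument is a directory to count
      FileStatus[] entries = fs.listStatus(new Path(dir));
      if (entries != null) {
        total += entries.length;                 // counts direct children only, not recursive
      }
    }
    System.out.println(total);
  }
}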