Re: Problems with HOD and HDFS
On Monday 14 June 2010 09:51 AM, David Milne wrote: Ok, thanks Jeff. This is pretty surprising though. I would have thought many people would be in my position, where they have to use Hadoop on a general purpose cluster, and need it to play nice with a resource manager? What do other people do in this position, if they don't use HOD? Deprecated normally means there is a better alternative. - Dave It isn't formally deprecated though. Maybe we'll need to do it explicitly; that'll help with putting up proper documentation about what else to use instead. A quick reply is that you start a static cluster on a set of nodes. A static cluster means bringing up the hadoop daemons on a set of nodes using the startup scripts distributed in the bin/ directory. That said, there are no changes to HOD in 0.21 and beyond. Deploying 0.21 clusters should mostly work out of the box. But beyond 0.21, it may not work, because HOD needs to be updated w.r.t. removed/updated Hadoop-specific configuration parameters and the environment variables it generates itself. HTH, +vinod On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey Dave, I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't think HOD is actively used or developed anywhere these days. You're attempting to use a mostly deprecated project, and hence not receiving any support on the mailing list. Thanks, Jeff On Sun, Jun 13, 2010 at 7:33 PM, David Milne d.n.mi...@gmail.com wrote: Anybody? I am completely stuck here. I have no idea who else I can ask or where I can go for more information. Is there somewhere specific where I should be asking about HOD? Thank you, Dave On Thu, Jun 10, 2010 at 2:56 PM, David Milne d.n.mi...@gmail.com wrote: Hi there, I am trying to get Hadoop on Demand up and running, but am having problems with the ringmaster not being able to communicate with HDFS. The output from the hod allocate command ends with this, with full verbosity: [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address. [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated. [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop() [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop() [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7 I've attached the hodrc file below, but briefly HOD is supposed to provision an HDFS cluster as well as a Map/Reduce cluster, and seems to be failing to do so. The ringmaster log looks like this: [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8 [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8 [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found ... and so on, until it gives up Any ideas why? 
One red flag is that when running the allocate command, some of the variables echo-ed back look dodgy: --gridservice-hdfs.fs_port 0 --gridservice-hdfs.host localhost --gridservice-hdfs.info_port 0 These are not what I specified in the hodrc. Are the port numbers just set to 0 because I am not using an external HDFS, or is this a problem? The software versions involved are: - Hadoop 0.20.2 - Python 2.5.2 (no Twisted) - Java 1.6.0_20 - Torque 2.4.5 The hodrc file looks like this: [hod] stream = True java-home = /opt/jdk1.6.0_20 cluster = debian5 cluster-factor = 1.8 xrs-port-range = 32768-65536 debug = 3 allocate-wait-time = 3600 temp-dir= /scratch/local/dmilne/hod [ringmaster] register= True stream = False temp-dir= /scratch/local/dmilne/hod log-dir = /scratch/local/dmilne/hod/log http-port-range = 8000-9000 idleness-limit = 864000 work-dirs = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2 xrs-port-range = 32768-65536 debug = 4 [hodring] stream = False temp-dir= /scratch/local/dmilne/hod log-dir = /scratch/local/dmilne/hod/log register= True java-home
changing my hadoop log level is not getting reflected in logs
Hi, I changed the default log level of hadoop from INFO to ERROR by setting the property hadoop.root.logger to ERROR in hadoop/conf/log4j.properties. But when I start the namenode, INFO logs still appear in the log file. As a workaround I dug into the scripts and found that HADOOP_ROOT_LOGGER is hard-coded to INFO in the hadoop-daemon.sh and hadoop scripts in hadoop/bin. Is there anything to be done about that, or are those values hard-coded for a particular purpose? PS: I am using hadoop 0.20.1 Thanks, Gokul
Re: Appending and seeking files while writing
Hi. Thanks for the clarification. Append will be supported fully in 0.21. Any ETA for this version? Will it work both with Fuse and the HDFS API? Also, append does *not* add random write. It simply adds the ability to re-open a file and add more data to the end. Just to clarify, even with append it won't be possible to: 1) Pause writing of a new file, skip to any position, and update the data. 2) Open an existing file, skip to any position and update the data. This will be the case even with FUSE. Is this correct? Regards.
Re: Appending and seeking files while writing
By the way, what about the ability for one node to read a file which is being written by another node? Or must the file be written and closed completely before it becomes available to other nodes? (AFAIK in 0.18.3 the file appeared as 0 size until it was closed). Regards.
Re: Problems with HOD and HDFS
Dave, Yes, many others have the same situation, the recommended solution is either to use the Fair Share Scheduler or the Capacity Scheduler. These schedulers are much better than HOD since they take data locality into consideration (they don't just spin up 20 TT nodes on machines that have nothing to do with your data). They also don't lock down the nodes just for you, so as TTs are freed other jobs can use them immediately (as opposed to nobody being able to use them until your entire job is done). Also, if you are brave and want to try something spanking new, then I recommend you reach out to the Mesos guys, they have a scheduler layer under Hadoop that is data locality aware: http://mesos.berkeley.edu/ -- amr On Sun, Jun 13, 2010 at 9:21 PM, David Milne d.n.mi...@gmail.com wrote: Ok, thanks Jeff. This is pretty surprising though. I would have thought many people would be in my position, where they have to use Hadoop on a general purpose cluster, and need it to play nice with a resource manager? What do other people do in this position, if they don't use HOD? Deprecated normally means there is a better alternative. - Dave On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey Dave, I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't think HOD is actively used or developed anywhere these days. You're attempting to use a mostly deprecated project, and hence not receiving any support on the mailing list. Thanks, Jeff On Sun, Jun 13, 2010 at 7:33 PM, David Milne d.n.mi...@gmail.com wrote: Anybody? I am completely stuck here. I have no idea who else I can ask or where I can go for more information. Is there somewhere specific where I should be asking about HOD? Thank you, Dave On Thu, Jun 10, 2010 at 2:56 PM, David Milne d.n.mi...@gmail.com wrote: Hi there, I am trying to get Hadoop on Demand up and running, but am having problems with the ringmaster not being able to communicate with HDFS. The output from the hod allocate command ends with this, with full verbosity: [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address. [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated. [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop() [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop() [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7 I've attached the hodrc file below, but briefly HOD is supposed to provision an HDFS cluster as well as a Map/Reduce cluster, and seems to be failing to do so. The ringmaster log looks like this: [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8 [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8 [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found ... and so on, until it gives up Any ideas why? 
One red flag is that when running the allocate command, some of the variables echo-ed back look dodgy: --gridservice-hdfs.fs_port 0 --gridservice-hdfs.host localhost --gridservice-hdfs.info_port 0 These are not what I specified in the hodrc. Are the port numbers just set to 0 because I am not using an external HDFS, or is this a problem? The software versions involved are: - Hadoop 0.20.2 - Python 2.5.2 (no Twisted) - Java 1.6.0_20 - Torque 2.4.5 The hodrc file looks like this: [hod] stream = True java-home = /opt/jdk1.6.0_20 cluster = debian5 cluster-factor = 1.8 xrs-port-range = 32768-65536 debug = 3 allocate-wait-time = 3600 temp-dir= /scratch/local/dmilne/hod [ringmaster] register= True stream = False temp-dir= /scratch/local/dmilne/hod log-dir = /scratch/local/dmilne/hod/log http-port-range = 8000-9000 idleness-limit = 864000 work-dirs = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2 xrs-port-range = 32768-65536 debug
No KeyValueTextInputFormat in hadoop-0.20.2?
Hi, I am upgrading my code from hadoop-0.19.2 to hadoop-0.20.2, during the process I found that there was no KeyValueTextInputFormat class which exists in hadoop-0.19.2. It's so strange that this version of hadoop does not come with this commonly used InputFormat. I have taken a look at the SecondarySort.java example code, it uses TextInputFormat and StringTokenizer to split each line, it is ok but kinda awkward to me. Do I have to implement a new InputFormat myself or there's a KeyValueTextInputFormat that exists somewhere I didn't notice? Thank you. Kevin Tse
Re: No KeyValueTextInputFormat in hadoop-0.20.2?
Have you checked src/mapred/org/apache/hadoop/mapred/KeyValueTextInputFormat.java ? On Mon, Jun 14, 2010 at 6:51 AM, Kevin Tse kevintse.on...@gmail.com wrote: Hi, I am upgrading my code from hadoop-0.19.2 to hadoop-0.20.2, during the process I found that there was no KeyValueTextInputFormat class which exists in hadoop-0.19.2. It's so strange that this version of hadoop does not come with this commonly used InputFormat. I have taken a look at the SecondarySort.java example code, it uses TextInputFormat and StringTokenizer to split each line, it is ok but kinda awkward to me. Do I have to implement a new InputFormat myself or there's a KeyValueTextInputFormat that exists somewhere I didn't notice? Thank you. Kevin Tse
Re: No KeyValueTextInputFormat in hadoop-0.20.2?
Hi Ted, I mean the new API: org.apache.hadoop.mapreduce.Job.setInputFormatClass(org.apache.hadoop.mapreduce.InputFormat) Job.setInputFormatClass() only accepts subclasses of org.apache.hadoop.mapreduce.InputFormat (of which there are several, but KeyValueTextInputFormat is not one of them) as its parameter. On Mon, Jun 14, 2010 at 10:03 PM, Ted Yu yuzhih...@gmail.com wrote: Have you checked src/mapred/org/apache/hadoop/mapred/KeyValueTextInputFormat.java ? On Mon, Jun 14, 2010 at 6:51 AM, Kevin Tse kevintse.on...@gmail.com wrote: Hi, I am upgrading my code from hadoop-0.19.2 to hadoop-0.20.2, during the process I found that there was no KeyValueTextInputFormat class which exists in hadoop-0.19.2. It's so strange that this version of hadoop does not come with this commonly used InputFormat. I have taken a look at the SecondarySort.java example code, it uses TextInputFormat and StringTokenizer to split each line, it is ok but kinda awkward to me. Do I have to implement a new InputFormat myself or there's a KeyValueTextInputFormat that exists somewhere I didn't notice? Thank you. Kevin Tse
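For what it's worth, until KeyValueTextInputFormat is ported to the new API, one workaround is to keep TextInputFormat and split each line at the first tab inside the mapper; this is roughly what the SecondarySort example does, just factored into a reusable mapper. A minimal, untested sketch (the class name is made up):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emulates KeyValueTextInputFormat on top of TextInputFormat for the new API:
    // everything before the first tab becomes the key, the rest becomes the value.
    public class KeyValueLineMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String s = line.toString();
        int tab = s.indexOf('\t');
        if (tab < 0) {
          context.write(new Text(s), new Text(""));  // no tab: the whole line is the key
        } else {
          context.write(new Text(s.substring(0, tab)), new Text(s.substring(tab + 1)));
        }
      }
    }

The job would then use job.setInputFormatClass(TextInputFormat.class) together with this mapper.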
Re: Caching in HDFS C API Client
Indeed. On the terasort benchmark, I had to run intermediate jobs that were larger than ram on the cluster to ensure that the data was not coming from the file cache. -- Owen
Re: Caching in HDFS C API Client
Hey Owen, all, I find this one handy if you have root access: http://linux-mm.org/Drop_Caches echo 3 > /proc/sys/vm/drop_caches Drops the pagecache, dentries, and inodes. Without this, you can still get caching effects when doing normal reads and writes of large files if the linux pagecache outsmarts you (and I don't know about you, but it often outsmarts me...). Brian On Jun 14, 2010, at 9:35 AM, Owen O'Malley wrote: Indeed. On the terasort benchmark, I had to run intermediate jobs that were larger than ram on the cluster to ensure that the data was not coming from the file cache. -- Owen
Re: Problems with HOD and HDFS
On Mon, Jun 14, 2010 at 8:37 AM, Amr Awadallah a...@cloudera.com wrote: Dave, Yes, many others have the same situation, the recommended solution is either to use the Fair Share Scheduler or the Capacity Scheduler. These schedulers are much better than HOD since they take data locality into consideration (they don't just spin up 20 TT nodes on machines that have nothing to do with your data). They also don't lock down the nodes just for you, so as TT are freed other jobs can use them immediately (as opposed to no body can use them till your entire job is done). Also, if you are brave and want to try something spanking new, then I recommend you reach out to the Mesos guys, they have a scheduler layer under Hadoop that is data locality aware: http://mesos.berkeley.edu/ -- amr On Sun, Jun 13, 2010 at 9:21 PM, David Milne d.n.mi...@gmail.com wrote: Ok, thanks Jeff. This is pretty surprising though. I would have thought many people would be in my position, where they have to use Hadoop on a general purpose cluster, and need it to play nice with a resource manager? What do other people do in this position, if they don't use HOD? Deprecated normally means there is a better alternative. - Dave On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey Dave, I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't think HOD is actively used or developed anywhere these days. You're attempting to use a mostly deprecated project, and hence not receiving any support on the mailing list. Thanks, Jeff On Sun, Jun 13, 2010 at 7:33 PM, David Milne d.n.mi...@gmail.com wrote: Anybody? I am completely stuck here. I have no idea who else I can ask or where I can go for more information. Is there somewhere specific where I should be asking about HOD? Thank you, Dave On Thu, Jun 10, 2010 at 2:56 PM, David Milne d.n.mi...@gmail.com wrote: Hi there, I am trying to get Hadoop on Demand up and running, but am having problems with the ringmaster not being able to communicate with HDFS. The output from the hod allocate command ends with this, with full verbosity: [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address. [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated. [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop() [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop() [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7 I've attached the hodrc file below, but briefly HOD is supposed to provision an HDFS cluster as well as a Map/Reduce cluster, and seems to be failing to do so. The ringmaster log looks like this: [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8 [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8 [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found ... and so on, until it gives up Any ideas why? 
One red flag is that when running the allocate command, some of the variables echo-ed back look dodgy: --gridservice-hdfs.fs_port 0 --gridservice-hdfs.host localhost --gridservice-hdfs.info_port 0 These are not what I specified in the hodrc. Are the port numbers just set to 0 because I am not using an external HDFS, or is this a problem? The software versions involved are: - Hadoop 0.20.2 - Python 2.5.2 (no Twisted) - Java 1.6.0_20 - Torque 2.4.5 The hodrc file looks like this: [hod] stream = True java-home = /opt/jdk1.6.0_20 cluster = debian5 cluster-factor = 1.8 xrs-port-range = 32768-65536 debug = 3 allocate-wait-time = 3600 temp-dir= /scratch/local/dmilne/hod [ringmaster] register= True stream = False temp-dir= /scratch/local/dmilne/hod log-dir = /scratch/local/dmilne/hod/log http-port-range =
Re: Appending and seeking files while writing
On Mon, Jun 14, 2010 at 4:00 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. Thanks for clarification. Append will be supported fully in 0.21. Any ETA for this version? Should be out soon - Tom White is working hard on the release. Note that the first release, 0.21.0, will be somewhat of a development quality release not recommended for production use. Of course, the way it will become production-worthy is by less risk-averse people trying it and finding the bugs :) Will it work both with Fuse and HDFS API? I don't know that the Fuse code has been updated to call append. My guess is that a small patch would be required. Also, append does *not* add random write. It simply adds the ability to re-open a file and add more data to the end. Just to clarify, even with append it won't be possible to: 1) Pause writing of new file, skip to any position, and update the data. 2) Open existing file, skip to any position and update the data. Correct, neither of those are allowed. This will be even with FUSE. Is this correct? Regards. -- Todd Lipcon Software Engineer, Cloudera
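To make the semantics concrete, here is a minimal sketch of what append does allow, assuming a release where append is enabled (0.21, or an earlier build with dfs.support.append turned on); the path used is just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Re-open an existing file; all new bytes go at the current end of the file.
        // There is no way to seek the output stream backwards and overwrite data.
        FSDataOutputStream out = fs.append(new Path("/user/demo/events.log"));
        out.write("one more record\n".getBytes("UTF-8"));
        out.close();
      }
    }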
Re: Appending and seeking files while writing
On Mon, Jun 14, 2010 at 4:28 AM, Stas Oskin stas.os...@gmail.com wrote: By the way, what about an ability for node to read file which is being written by another node? This is allowed, though there are some remaining bugs to be ironed out here. See https://issues.apache.org/jira/browse/HDFS-1057 for example. Or the file must be written and closed completely, before it becomes available for other nodes? (AFAIK in 0.18.3 the file appeared as 0 size until it was closed). Regards. -- Todd Lipcon Software Engineer, Cloudera
Re: Problems with HOD and HDFS
Edward Capriolo wrote: I have not used it much, but I think HOD is pretty cool. I guess most people who are looking to (spin up, run job, transfer off, spin down) are using EC2. HOD does something like make private hadoop clouds on your hardware and many probably do not have that use case. As schedulers advance and get better HOD becomes less attractive, but I can always see a place for it. I don't know who is using it, or maintaining it; we've been bringing up short-lived Hadoop clusters differently. I think I should write a little article on the topic; I presented about it at Berlin Buzzwords last week. Short-lived Hadoop clusters on VMs are fine if you don't have enough data or CPU load to justify a set of dedicated physical machines, and they are a good way of experimenting with Hadoop at scale. You can maybe lock down the network better too, though that depends on your VM infrastructure. Where VMs are weak is in disk IO performance, but there's no reason why the VM infrastructure can't take a list of filenames/directories as a hint for VM placement (placement is the new scheduling, incidentally), and virtualized IO can only improve. If you can run Hadoop MapReduce directly against SAN-mounted storage then you can stop worrying about locality of data and still gain from parallelisation of the operations. -steve
Re: Appending and seeking files while writing
Hi. Should be out soon - Tom White is working hard on the release. Note that the first release, 0.21.0, will be somewhat of a development quality release not recommended for production use. Of course, the way it will become production-worthy is by less risk-averse people trying it and finding the bugs :) Will it work both with Fuse and HDFS API? I don't know that the Fuse code has been updated to call append. My guess is that a small patch would be required. Also, append does *not* add random write. It simply adds the ability to re-open a file and add more data to the end. Just to clarify, even with append it won't be possible to: 1) Pause writing of new file, skip to any position, and update the data. 2) Open existing file, skip to any position and update the data. Correct, neither of those are allowed. Thanks for clarification.
job execution
Hi, According to the doc, JobControl can maintain the dependency among different jobs and only jobs without dependency can execute. How does JobControl maintain the dependency and how can we indicate the dependency? Thanks, -Gang
Re: job execution
Use the ControlledJob class from Hadoop trunk and run it through JobControl. Regards Akash Deep Shakya OpenAK FOSS Nepal Community akashakya at gmail dot com ~ Failure to prepare is preparing to fail ~ On Mon, Jun 14, 2010 at 10:40 PM, Gang Luo lgpub...@yahoo.com.cn wrote: Hi, According to the doc, JobControl can maintain the dependency among different jobs and only jobs without dependency can execute. How does JobControl maintain the dependency and how can we indicate the dependency? Thanks, -Gang
Task process exit with nonzero status of 1 - deleting userlogs helps
Hi, I have a 4-node cluster running hadoop-0.20.2. I suddenly ran into a situation where every task scheduled on 2 of the 4 nodes failed. It seems like the child JVM crashes. There are no child logs under logs/userlogs. The tasktracker gives this: 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201006091425_0049_m_-946174604 spawned. 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201006091425_0049_m_003179_0 Child Error java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job created logs/userlogs again and no error occurred anymore on this host. The permissions of userlogs and userlogsOLD are exactly the same. userlogsOLD contains about 378M in 132747 files. When copying the content of userlogsOLD back into userlogs, the tasks on that node start failing again. Some questions: - this seems to me like a problem with too many files in one folder - any thoughts on this? - is the content of logs/userlogs cleaned up by hadoop regularly? - the stdout log files of the tasks do not exist, and the .out files of the tasktracker have no specific message (other than the one posted above) - is there any log file left where an error message could be found? best regards Johannes
Re: Task process exit with nonzero status of 1 - deleting userlogs helps
On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann jzillm...@googlemail.com wrote: Hi, i have running a 4-node cluster with hadoop-0.20.2. Now i suddenly run into a situation where every task scheduled on 2 of the 4 nodes failed. Seems like the child jvm crashes. There are no child logs under logs/userlogs. Tasktracker gives this: 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201006091425_0049_m_-946174604 spawned. 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201006091425_0049_m_003179_0 Child Error java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) At some point i simply renamed logs/userlogs to logs/userlogsOLD. A new job created the logs/userlogs again and no error ocuured anymore on this host. The permissions of userlogs and userlogsOLD are exactly the same. userlogsOLD contains about 378M in 132747 files. When copying the content of userlogsOLD into userlogs, the tasks of the belonging node starts failing again. Some questions: - this seems to me like a problem with too many files in one folder - any thoughts on this ? - is the content of logs/userlogs cleaned up by hadoop regularly ? - the logs/stdout file of the tasks are not existent, the logs/out fiels of the tasktracker hasn't any specific message (other then message posted above) - is there any log file left where an error message could be found ? best regards Johannes Most file systems have an upper limit on number of subfiles/folders in a folder. You have probably hit the EXT3 limit. If you launch lots and lots of jobs you can hit the limit before any cleanup happens. You can experiment with cleanup and other filesystems. The following log related issue might be relevant. https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877614#action_12877614 Regards, Edward
Hadoop and IP on InfiniBand (IPoIB)
I'm a new user of Hadoop. I have a Linux cluster with both gigabit ethernet and InfiniBand communications interfaces. Could someone please tell me how to switch IP communication from ethernet (the default) to InfiniBand? Thanks. -- Russell A. Brown| Oracle russ.br...@oracle.com | UMPK14-260 (650) 786-3011 (office) | 14 Network Circle (650) 786-3453 (fax)| Menlo Park, CA 94025
Re: Caching in HDFS C API Client
Nice, thanks Brian! On Jun 14, 2010, at 7:39 AM, Brian Bockelman wrote: Hey Owen, all, I find this one handy if you have root access: http://linux-mm.org/Drop_Caches echo 3 /proc/sys/vm/drop_caches Drops the pagecache, dentries, and inodes. Without this, you can still get caching effects doing the normal read and write large files if the linux pagecache outsmarts you (and I don't know about you, but it often outsmarts me...). Brian On Jun 14, 2010, at 9:35 AM, Owen O'Malley wrote: Indeed. On the terasort benchmark, I had to run intermediate jobs that were larger than ram on the cluster to ensure that the data was not coming from the file cache. -- Owen
Re: Problems with HOD and HDFS
Thanks everyone for your replies. Even though HOD looks like a dead-end I would prefer to use it. I am just one user of the cluster among many, and currently the only one using Hadoop. The jobs I need to run are pretty much one-off: they are big jobs that I can't do without Hadoop, but I might need to run them once a month or less. The ability to provision MapReduce and HDFS when I need it sounds ideal. Following Vinod's advice, I have rolled back to Hadoop 0.20.1 (the last version that HOD kept up with) and taken a closer look at the ringmaster logs. However, I am still getting the same problems as before, and I can't find anything in the logs to help me identify the NameNode. The full ringmaster log is below. It's a pretty repetitive song, so I've identified the chorus. [2010-06-15 10:07:40,236] DEBUG/10 ringMaster:569 - Getting service ID. [2010-06-15 10:07:40,237] DEBUG/10 ringMaster:573 - Got service ID: 34350.symphony.cs.waikato.ac.nz [2010-06-15 10:07:40,239] DEBUG/10 ringMaster:756 - Command to execute: /bin/cp /home/dmilne/hadoop/hadoop-0.20.1.tar.gz /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster [2010-06-15 10:07:42,314] DEBUG/10 ringMaster:762 - Completed command execution. Exit Code: 0. [2010-06-15 10:07:42,315] DEBUG/10 ringMaster:591 - Service registry @ http://symphony.cs.waikato.ac.nz:36372 [2010-06-15 10:07:47,503] DEBUG/10 ringMaster:726 - tarball name : /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz hadoop package name : hadoop-0.20.1/ [2010-06-15 10:07:47,505] DEBUG/10 ringMaster:716 - Returning Hadoop directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/ [2010-06-15 10:07:47,515] DEBUG/10 util:215 - Executing command /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop version to find hadoop version [2010-06-15 10:07:48,241] DEBUG/10 util:224 - Version from hadoop command: Hadoop 0.20.1 [2010-06-15 10:07:48,244] DEBUG/10 ringMaster:117 - Using max-connect value 30 [2010-06-15 10:07:48,246] INFO/20 ringMaster:61 - Twisted interface not found. Using hodXMLRPCServer. [2010-06-15 10:07:48,257] DEBUG/10 ringMaster:73 - Ringmaster RPC Server at 33771 [2010-06-15 10:07:48,265] DEBUG/10 ringMaster:121 - registering: http://cn71:8030/hadoop-0.20.1.tar.gz [2010-06-15 10:07:48,275] DEBUG/10 ringMaster:658 - dmilne 34350.symphony.cs.waikato.ac.nz cn71.symphony.cs.waikato.ac.nz ringmaster hod [2010-06-15 10:07:48,307] DEBUG/10 ringMaster:670 - Registered with serivce registry: http://symphony.cs.waikato.ac.nz:36372. 
//chorus start [2010-06-15 10:07:48,393] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs [2010-06-15 10:07:48,394] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.hdfs.Hdfs instance at 0xc9e050 [2010-06-15 10:07:48,395] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found //chorus end //chorus (3x) [2010-06-15 10:07:51,461] DEBUG/10 ringMaster:726 - tarball name : /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz hadoop package name : hadoop-0.20.1/ [2010-06-15 10:07:51,463] DEBUG/10 ringMaster:716 - Returning Hadoop directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/ [2010-06-15 10:07:51,465] DEBUG/10 ringMaster:690 - hadoopdir=/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/, java-home=/opt/jdk1.6.0_20 [2010-06-15 10:07:51,470] DEBUG/10 util:215 - Executing command /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop version to find hadoop version //chorus (1x) [2010-06-15 10:07:52,448] DEBUG/10 util:224 - Version from hadoop command: Hadoop 0.20.1 [2010-06-15 10:07:52,450] DEBUG/10 ringMaster:697 - starting jt monitor [2010-06-15 10:07:52,453] DEBUG/10 ringMaster:913 - Entered start method. [2010-06-15 10:07:52,455] DEBUG/10 ringMaster:924 - /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring --hodring.tarball-retry-initial-time 1.0 --hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0 --hodring.service-id 34350.symphony.cs.waikato.ac.nz --hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range 8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20 --hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372 --hodring.download-addr h:t --hodring.tarball-retry-interval 3.0 --hodring.log-dir /scratch/local/dmilne/hod/log --hodring.mapred-system-dir-root /mapredsystem --hodring.xrs-port-range 32768-65536 --hodring.debug 4 --hodring.ringmaster-xrs-addr cn71:33771 --hodring.register [2010-06-15 10:07:52,456] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred [2010-06-15 10:07:52,458] DEBUG/10 ringMaster:487 - getServiceAddr service: hodlib.GridServices.mapred.MapReduce instance at 0xc9e098 [2010-06-15 10:07:52,460] DEBUG/10
Re: Problems with HOD and HDFS
Is there something else I could read about setting up short-lived Hadoop clusters on virtual machines? I have no experience with VMs at all. I see there is quite a bit of material about using them to get Hadoop up and running with a pseudo-cluster on a single machine, but I don't follow how this stretches out to using multiple machines allocated by Torque. Thanks, Dave On Tue, Jun 15, 2010 at 3:49 AM, Steve Loughran ste...@apache.org wrote: Edward Capriolo wrote: I have not used it much, but I think HOD is pretty cool. I guess most people who are looking to (spin up, run job ,transfer off, spin down) are using EC2. HOD does something like make private hadoop clouds on your hardware and many probably do not have that use case. As schedulers advance and get better HOD becomes less attractive, but I can always see a place for it. I don't know who is using it, or maintaining it; we've been bringing up short-lived Hadoop clusters different. I think I should write a little article on the topic; I presented about it at Berlin Buzzwords last week. Short lived Hadoop clusters on VMs are fine if you don't have enough data or CPU load to justify a set of dedicated physical machines, and is a good way of experimenting with Hadoop at scale. You can maybe lock down the network better too, though that depends on your VM infrastructure. Where VMs are weak is in disk IO performance, but there's no reason why the VM infrastructure can't take a list of filenames/directories as a hint for VM placement (placement is the new scheduling, incidentally), and virtualized IO can only improve. If you can run Hadoop MapReduce directly against SAN-mounted storage then you can stop worrying about locality of data and still gain from parallelisation of the operations. -steve
How do I configure 'SkipBadRecords' in Hadoop Streaming?
Hi, I am trying to use hadoop streaming and there seem to be a few bad records in my data. I'd like to use SkipBadRecords but I can't find how to use it in hadoop streaming. Is it at all possible? Thanks in advance.
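No first-hand experience with this in streaming, but the SkipBadRecords Java API is only a thin helper that writes ordinary job configuration properties, so the equivalent settings should be reachable from a streaming job by passing the corresponding mapred.skip.* properties with -D (or -jobconf on older releases). A rough sketch of the Java side, for reference; the values chosen are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkipBadRecordsSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Each setter below just writes a mapred.skip.* property into the job conf.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2); // enter skipping mode after 2 failed attempts
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);    // tolerate losing at most 1 map record per bad spot
        SkipBadRecords.setReducerMaxSkipGroups(conf, 1);    // same idea for reduce input groups
      }
    }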
Re: job execution
There's a class org.apache.hadoop.mapred.jobcontrol.Job which is a wrapper around JobConf. You add dependent jobs to it, then put it into a JobControl. On Mon, Jun 14, 2010 at 9:55 AM, Gang Luo lgpub...@yahoo.com.cn wrote: Hi, According to the doc, JobControl can maintain the dependency among different jobs and only jobs without dependency can execute. How does JobControl maintain the dependency and how can we indicate the dependency? Thanks, -Gang -- Best Regards Jeff Zhang
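For example (a quick, untested sketch with the actual job configuration omitted), the dependency is indicated with addDependingJob() and the whole group is driven by running the JobControl in a thread:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class TwoStagePipeline {
      public static void main(String[] args) throws Exception {
        JobConf firstConf = new JobConf();    // configure input/output/mapper/reducer as usual
        JobConf secondConf = new JobConf();

        Job first = new Job(firstConf);
        Job second = new Job(secondConf);
        second.addDependingJob(first);        // "second" is held back until "first" succeeds

        JobControl control = new JobControl("pipeline");
        control.addJob(first);
        control.addJob(second);

        Thread runner = new Thread(control);  // JobControl implements Runnable
        runner.start();
        while (!control.allFinished()) {
          Thread.sleep(1000);
        }
        control.stop();
      }
    }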
CFP for Surge Scalability Conference 2010
We're excited to announce Surge, the Scalability and Performance Conference, to be held in Baltimore on Sept 30 and Oct 1, 2010. The event focuses on case studies that demonstrate successes (and failures) in Web applications and Internet architectures. Our Keynote speakers include John Allspaw and Theo Schlossnagle. We are currently accepting submissions for the Call For Papers through July 9th. You can find more information, including our current list of speakers, online: http://omniti.com/surge/2010 If you've been to Velocity, or wanted to but couldn't afford it, then Surge is just what you've been waiting for. For more information, including CFP, sponsorship of the event, or participating as an exhibitor, please contact us at su...@omniti.com. Thanks, -- Jason Dixon OmniTI Computer Consulting, Inc. jdi...@omniti.com 443.325.1357 x.241
setting up hadoop 0.20.1 development environment
Hi, I am trying to set up a development cluster for hadoop 0.20.1 in eclipse. I used this url http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1/ to check out the build. I ran the compile, compile-core-test, and eclipse-files targets using ant. Then when I build the project, I am getting errors in the bin/benchmarks directory. I have followed the screencast from cloudera http://www.cloudera.com/blog/2009/04/configuring-eclipse-for-hadoop-development-a-screencast/. Thanks, Vidur
running elephant-bird in eclipse codec property
Hi peeps, I'm trying to run elephant-bird code in eclipse, specifically ( http://github.com/kevinweil/elephant-bird/blob/master/examples/src/pig/json_word_count.pig), but I'm not sure how to set the core-site.xml properties via eclipse. I tried adding them to VM args but am still getting the following error: 10/06/14 21:23:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 10/06/14 21:23:34 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 10/06/14 21:23:34 INFO input.FileInputFormat: Total input paths to process : 2 10/06/14 21:23:34 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library 10/06/14 21:23:34 INFO lzo.LzoCodec: Successfully loaded initialized native-lzo library [hadoop-lzo rev 916aeae88ceb6734a679ebf9b48a93bea4cd9a06] 10/06/14 21:23:34 INFO input.LzoInputFormat: Added LZO split for file:/home/kim/code/data/jsonData/json.txt.lzo[start=0, length=100] 10/06/14 21:23:34 INFO mapred.JobClient: Running job: job_local_0001 10/06/14 21:23:34 INFO input.FileInputFormat: Total input paths to process : 2 10/06/14 21:23:34 INFO input.LzoInputFormat: Added LZO split for file:/home/kim/code/data/jsonData/json.txt.lzo[start=0, length=100] 10/06/14 21:23:34 INFO mapred.MapTask: io.sort.mb = 100 10/06/14 21:23:34 INFO mapred.MapTask: data buffer = 79691776/99614720 10/06/14 21:23:34 INFO mapred.MapTask: record buffer = 262144/327680 10/06/14 21:23:34 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: No codec for file file:/home/kim/code/data/jsonData/json.txt.lzo not found, cannot run at com.twitter.elephantbird.mapreduce.input.LzoRecordReader.initialize(LzoRecordReader.java:64) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176) 10/06/14 21:23:35 INFO mapred.JobClient: map 0% reduce 0% 10/06/14 21:23:35 INFO mapred.JobClient: Job complete: job_local_0001 10/06/14 21:23:35 INFO mapred.JobClient: Counters: 0 Help appreciated :-) Thanks! -Kim
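One guess, not verified: when the job runs inside Eclipse there may be no core-site.xml on the classpath at all, so no compression codec is registered for .lzo files and LzoRecordReader cannot find one. Setting the codec properties programmatically on the job's Configuration might get past the error; the class names below are the usual hadoop-lzo ones and should be treated as assumptions:

    import org.apache.hadoop.conf.Configuration;

    public class RegisterLzoCodecs {
      public static Configuration withLzo(Configuration conf) {
        // Register the hadoop-lzo codecs alongside the built-in ones so that
        // CompressionCodecFactory can resolve files ending in .lzo.
        conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
          + "org.apache.hadoop.io.compress.GzipCodec,"
          + "com.hadoop.compression.lzo.LzoCodec,"
          + "com.hadoop.compression.lzo.LzopCodec");
        conf.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");
        return conf;
      }
    }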
Re: job execution
@Jeff, I think JobConf is already deprecated; org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob and org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl can be used instead. Regards Akash Deep Shakya OpenAK FOSS Nepal Community akashakya at gmail dot com ~ Failure to prepare is preparing to fail ~ On Tue, Jun 15, 2010 at 7:28 AM, Jeff Zhang zjf...@gmail.com wrote: There's a class org.apache.hadoop.mapred.jobcontrol.Job which is a wapper of JobConf. And You and dependent jobs to it. Then put it to JobControl. On Mon, Jun 14, 2010 at 9:55 AM, Gang Luo lgpub...@yahoo.com.cn wrote: Hi, According to the doc, JobControl can maintain the dependency among different jobs and only jobs without dependency can execute. How does JobControl maintain the dependency and how can we indicate the dependency? Thanks, -Gang -- Best Regards Jeff Zhang