Re: Hadoop performance - xfs and ext4
On 4/23/2010 6:17 AM, stephen mulcahy wrote: Steve Loughran wrote: That's really interesting. Do you want to update the bits of the Hadoop wiki that talk about filesystems? I can if people think that would be useful. Absolutely. +1 Thanks, --Konstantin I'm not sure that my results will necessarily reflect what happens on other people's systems and configs, though - what's the best way of addressing that? Do my Apache credentials work for the wiki, or do I need to explicitly create a new account for the Hadoop wiki? -stephen
Chaining M/R Jobs
Hi, I'm trying to find a way to control the output file names. I need this because I have a situation where I need to run a Job and then use its output in the DistributedCache. So far the only way I've seen that makes this possible is rewriting the OutputFormat class, but that seems like a lot of work for such a simple task. Is there any way to do what I'm looking for? Tiago Veloso ti.vel...@gmail.com
Re: Chaining M/R Jobs
The easiest way to do this is to write your job outputs to a known place and then use the FileSystem APIs to rename the part-* files to what you want them to be. On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso ti.vel...@gmail.com wrote: Hi, I'm trying to find a way to control the output file names. I need this because I have a situation where I need to run a Job and then use it's output in the DistributedCache. So far the only way I've seen that makes it possible is rewriting the OutputFormat class but that seems a lot of work for such a simple task. Is there any way to do what I'm looking for? Tiago Veloso ti.vel...@gmail.com -- Eric Sammer phone: +1-917-287-2675 twitter: esammer data: www.cloudera.com
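A minimal sketch of this rename approach (not code from the thread; the output path and the "cache-file-" prefix are placeholders), assuming the 0.20 FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenameJobOutputs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outDir = new Path("/user/tiago/job1-out");   // placeholder: previous job's output dir
        int i = 0;
        for (FileStatus status : fs.listStatus(outDir)) {
          // Rename only the part-* files; skip directories such as _logs.
          if (!status.isDir() && status.getPath().getName().startsWith("part-")) {
            fs.rename(status.getPath(), new Path(outDir, "cache-file-" + i++));
          }
        }
      }
    }

The renamed files can then be added to the DistributedCache under predictable names.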
RE: Chaining M/R Jobs
I don't usually bother renaming the files. If you know you want all of the files, you just iterate over the files in the output directory from the previous job. And then add those to distributed cache. If the data is fairly small you can set the number of reducers to 1 on the previous step as well. -Xavier -Original Message- From: Eric Sammer [mailto:esam...@cloudera.com] Sent: Monday, April 26, 2010 11:33 AM To: common-user@hadoop.apache.org Subject: Re: Chaining M/R Jobs The easiest way to do this is to write your job outputs to a known place and then use the FileSystem APIs to rename the part-* files to what you want them to be. On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso ti.vel...@gmail.com wrote: Hi, I'm trying to find a way to control the output file names. I need this because I have a situation where I need to run a Job and then use it's output in the DistributedCache. So far the only way I've seen that makes it possible is rewriting the OutputFormat class but that seems a lot of work for such a simple task. Is there any way to do what I'm looking for? Tiago Veloso ti.vel...@gmail.com -- Eric Sammer phone: +1-917-287-2675 twitter: esammer data: www.cloudera.com
Re: Hadoop Log Collection
It should actually be straightforward to do this with Chukwa. Chukwa has a bunch of other pieces, but at its core, it does basically what you describe. The one complexity is that instead of storing each file separately, Chukwa runs them together into larger sequence files. This turns out to be important if you want good filesystem performance, if you have large data volumes, or if you want to keep metadata telling you which machine your file came from. --Ari On Fri, Apr 23, 2010 at 5:38 AM, Patrick Datko patrick.da...@ymc.ch wrote: Hey everyone, I have been working with Hadoop for a few weeks now, building up a cluster with HDFS. I looked at several monitoring tools to observe my cluster and found a good solution with Ganglia + Nagios. To complete the monitoring part of the cluster, I am looking for a log collection tool that stores the log files of the nodes centrally. I have tested Chukwa and Facebook's Scribe, but neither is the kind of simple log-file store I have in mind; in my opinion they are too big for such a job. So I've been thinking about writing my own LogCollector. I don't want anything special. My idea is to build a daemon that can be installed on every node in the cluster, plus an XML file that describes which log files have to be collected. The daemon would collect all needed log files at a configured time interval and store them in HDFS using the Java API. This is just an idea for a simple LogCollector, and it would be cool if you could give me some opinions on it, or tell me whether such a LogCollector already exists. Kind regards, Patrick -- Ari Rabkin asrab...@gmail.com UC Berkeley Computer Science Department
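Purely as an illustration of the idea (none of this is from the thread; the class name and paths are made up), the core of the daemon Patrick describes could be little more than a periodic copy of configured log files into a per-host HDFS directory via the FileSystem API:

    import java.net.InetAddress;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SimpleLogCollector {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String host = InetAddress.getLocalHost().getHostName();
        // In a real daemon the source files would come from the XML config
        // and this copy would run on a timer.
        Path src = new Path("/var/log/syslog");   // placeholder local log file
        Path dst = new Path("/logs/" + host + "/syslog-" + System.currentTimeMillis());
        fs.copyFromLocalFile(false, true, src, dst);   // keep the local file, overwrite the target
      }
    }

As Ari notes, storing each small file separately like this is exactly where Chukwa's approach of rolling them into larger sequence files starts to pay off.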
Re: Chaining M/R Jobs
You can use MultipleOutputs for this purpose, even though it was not designed for this and a few people on this list are going to raise an eyebrow. Alex K On Mon, Apr 26, 2010 at 11:39 AM, Xavier Stevens xavier.stev...@fox.comwrote: I don't usually bother renaming the files. If you know you want all of the files, you just iterate over the files in the output directory from the previous job. And then add those to distributed cache. If the data is fairly small you can set the number of reducers to 1 on the previous step as well. -Xavier -Original Message- From: Eric Sammer [mailto:esam...@cloudera.com] Sent: Monday, April 26, 2010 11:33 AM To: common-user@hadoop.apache.org Subject: Re: Chaining M/R Jobs The easiest way to do this is to write your job outputs to a known place and then use the FileSystem APIs to rename the part-* files to what you want them to be. On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso ti.vel...@gmail.com wrote: Hi, I'm trying to find a way to control the output file names. I need this because I have a situation where I need to run a Job and then use it's output in the DistributedCache. So far the only way I've seen that makes it possible is rewriting the OutputFormat class but that seems a lot of work for such a simple task. Is there any way to do what I'm looking for? Tiago Veloso ti.vel...@gmail.com -- Eric Sammer phone: +1-917-287-2675 twitter: esammer data: www.cloudera.com
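For reference, a sketch of how the old-API (org.apache.hadoop.mapred) MultipleOutputs can be used this way - this is my illustration, not Alex's code, and the named output "lookup" is a placeholder. It only controls the file-name prefix; reducers still emit numbered files such as lookup-r-00000:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    // Driver side (once, when setting up the JobConf):
    //   MultipleOutputs.addNamedOutput(conf, "lookup", TextOutputFormat.class, Text.class, Text.class);

    public class LookupReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      private MultipleOutputs mos;

      public void configure(JobConf job) {
        mos = new MultipleOutputs(job);
      }

      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
          // Written to files named lookup-r-NNNNN instead of the default part-NNNNN.
          mos.getCollector("lookup", reporter).collect(key, values.next());
        }
      }

      public void close() throws IOException {
        mos.close();
      }
    }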
Re: Chaining M/R Jobs
On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote: I don't usually bother renaming the files. If you know you want all of the files, you just iterate over the files in the output directory from the previous job. And then add those to distributed cache. If the data is fairly small you can set the number of reducers to 1 on the previous step as well. And how do I iterate over a directory? Could you give me some sample code? If relevant, I am using Hadoop 0.20.2. Tiago Veloso ti.vel...@gmail.com
RE: Chaining M/R Jobs
I know this works for 0.18.x. I'm not using 0.20 yet, but as long as the API hasn't changed too much this should be pretty straightforward:

    FileSystem hdfs = FileSystem.get(conf);          // conf is the job's Configuration
    Path prevOutputPath = new Path(...);             // output directory of the previous job
    for (FileStatus fstatus : hdfs.listStatus(prevOutputPath)) {
      if (!fstatus.isDir()) {                        // skip subdirectories (e.g. _logs)
        DistributedCache.addCacheFile(fstatus.getPath().toUri(), conf);
      }
    }

-Original Message- From: Tiago Veloso [mailto:ti.vel...@gmail.com] Sent: Monday, April 26, 2010 12:11 PM To: common-user@hadoop.apache.org Cc: Tiago Veloso Subject: Re: Chaining M/R Jobs On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote: I don't usually bother renaming the files. If you know you want all of the files, you just iterate over the files in the output directory from the previous job. And then add those to distributed cache. If the data is fairly small you can set the number of reducers to 1 on the previous step as well. And how do I iterate over a directory? Could you give me some sample code? If relevant, I am using Hadoop 0.20.2. Tiago Veloso ti.vel...@gmail.com
Re: Chaining M/R Jobs
It worked thanks. Tiago Veloso ti.vel...@gmail.com On Apr 26, 2010, at 8:57 PM, Xavier Stevens wrote: I know this works for 0.18.x. I'm not using 0.20 yet but as long as the API hasn't changed to much this should be pretty straightforward. Path prevOutputPath = new Path(...); for (FileStatus fstatus : hdfs.listStatus(prevOutputPath)) { if (!fstatus.isDir()) { DistributedCache.addCacheFile(fstatus.getPath().toUri(), conf); } } -Original Message- From: Tiago Veloso [mailto:ti.vel...@gmail.com] Sent: Monday, April 26, 2010 12:11 PM To: common-user@hadoop.apache.org Cc: Tiago Veloso Subject: Re: Chaining M/R Jobs On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote: I don't usually bother renaming the files. If you know you want all of the files, you just iterate over the files in the output directory from the previous job. And then add those to distributed cache. If the data is fairly small you can set the number of reducers to 1 on the previous step as well. And how do I Iterate on a directory? Could you give me a sample code? If relevant I am using hadoop 0.20.2. Tiago Veloso ti.vel...@gmail.com
Re: Try to mount HDFS
The issue that required you changing ports is HDFS-961. Thanks, Eli

On Fri, Apr 23, 2010 at 6:30 AM, Christian Baun c...@unix-ag.uni-kl.de wrote: Brian, You got it!!! :-) It works (partly)! I switched to port 9000. core-site.xml now includes:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com:9000</value>
      <final>true</final>
    </property>

    $ hadoop fs -ls /
    Found 1 items
    drwxr-xr-x - hadoop supergroup 0 2010-04-23 05:18 /mnt
    $ hadoop fs -ls /mnt/
    Found 1 items
    drwxr-xr-x - hadoop supergroup 0 2010-04-23 13:00 /mnt/mapred
    # ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:9000 /mnt/hdfs/
    port=9000,server=ec2-75-101-210-65.compute-1.amazonaws.com
    fuse-dfs didn't recognize /mnt/hdfs/,-2

This tiny error message remains.

    # mount | grep fuse
    fuse_dfs on /hdfs type fuse.fuse_dfs (rw,nosuid,nodev,allow_other,default_permissions)
    # ls /mnt/hdfs/
    mnt
    # mkdir /mnt/hdfs/testverzeichnis
    # touch /mnt/hdfs/testdatei
    # ls -l /mnt/hdfs/
    total 8
    drwxr-xr-x 3 hadoop 99 4096 2010-04-23 05:18 mnt
    -rw-r--r-- 1 root 99 0 2010-04-23 13:07 testdatei
    drwxr-xr-x 2 root 99 4096 2010-04-23 13:05 testverzeichnis

In /var/log/messages there was no information about hdfs/fuse. Only /var/log/user.log had these lines:

    Apr 23 13:04:34 ip-10-242-231-63 fuse_dfs: mounting dfs://ec2-75-101-210-65.compute-1.amazonaws.com:9000/

mkdir and touch work, but I cannot write data into files(?!). They are all read-only. When I try to copy files from outside into HDFS, only an empty file is created and these error messages appear in user.log:

    Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: ERROR: fuse problem - could not write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:60
    Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: WARN: fuse problem - could not write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:64
    Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: ERROR: fuse problem - could not write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:60
    Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: WARN: fuse problem - could not write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:64
    Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: ERROR: dfs problem - could not close file_handle(23486496) for /testordner/testfile fuse_impls_release.c:58

Weird... But this is a big step forward. Thanks a lot!!! Best Regards Christian

On Friday, 23 April 2010, Brian Bockelman wrote: Hm, ok, now you have me stumped. One last hunch - can you include the port information, but also switch to port 9000? Additionally, can you do the following: 1) Look in /var/log/messages, copy out the hdfs/fuse-related messages, and post them 2) Using the hadoop clients, run: hadoop fs -ls / Brian

On Apr 23, 2010, at 12:33 AM, Christian Baun wrote: Hi, When adding the port information inside core-site.xml, the problem remains:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020</value>
      <final>true</final>
    </property>

    # ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/
    port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
    fuse-dfs didn't recognize /mnt/hdfs/,-2
    # ls /mnt/hdfs
    ls: cannot access /mnt/hdfs/®1 : No such file or directory

Best Regards, Christian

On Friday, 23 April 2010, Christian Baun wrote: Hi Brian, this is inside my core-site.xml configuration:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com/</value>
      <final>true</final>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/mnt</value>
      <description>A base for other temporary directories.</description>
    </property>
    </configuration>

Do I need to give the port here? This is inside my hdfs-site.xml configuration:

    <property>
      <name>dfs.name.dir</name>
      <value>${hadoop.tmp.dir}/dfs/name</value>
      <final>true</final>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>${hadoop.tmp.dir}/dfs/data</value>
    </property>
    <property>
      <name>fs.checkpoint.dir</name>
      <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
      <final>true</final>
      <final>true</final>
    </property>
    </configuration>

These directories all exist:

    # ls -l /mnt/dfs/
    total 12
    drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 data
    drwxr-xr-x 4 hadoop hadoop 4096 2010-04-23 05:17 name
    drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 namesecondary

I don't have the config file hadoop-site.xml in /etc/... In the source directory of hadoop I have a hadoop-site.xml but with
Shared library error
My Java mapper drops down to C++ through JNI. The C++ side then runs various code which in some cases links to shared libraries that I put in the distributed cache. The problem isn't that the library isn't found or something to that effect... but I'm unsure what the problem *is*. This is all I'm getting:

    java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:471)
    Caused by: java.io.IOException: Task process exit with nonzero status of 127.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:458)

I do believe that 127 indicates a failure to load the library, but I don't understand why. The library is of the correct architecture. I'm unsure what else to do. Any ideas? Thanks. Keith Wiley kwi...@keithwiley.com www.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
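Not an answer from the thread, but one way to make this kind of failure visible: wrap the JNI load so the effective search paths and any UnsatisfiedLinkError land in the task's stderr log rather than just a nonzero exit. The class and library names below are placeholders:

    public class NativeBridge {
      static {
        try {
          System.loadLibrary("mynative");   // hypothetical name, i.e. libmynative.so
        } catch (UnsatisfiedLinkError e) {
          System.err.println("java.library.path = " + System.getProperty("java.library.path"));
          System.err.println("LD_LIBRARY_PATH   = " + System.getenv("LD_LIBRARY_PATH"));
          throw e;   // still fail, but with a useful message in the task logs
        }
      }
    }

If the exit code really comes from the child process itself, 127 is the conventional shell/dynamic-linker status for "not found", which can also point at a missing dependency of the shared library rather than the library itself.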
Re: Reducer ID
JobConf.get("mapred.task.id") gives you everything (including the attempt id). -Gang ----- Original Message ----- From: Farhan Husain farhan.hus...@csebuet.org To: common-user@hadoop.apache.org Sent: 2010/4/26 (Mon) 7:13:03 PM Subject: Reducer ID Hello, Is it possible to know the unique id of a reducer inside the reduce or setup method of a reducer class? I tried to find any method of the context class which might help in this regard but could not find one. Thanks, Farhan
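A small old-API sketch of this (my code, with a made-up class name); in 0.20 the property value is the attempt id string, which TaskAttemptID can parse back into its parts:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.TaskAttemptID;

    public class IdAwareBase extends MapReduceBase {
      protected int reducerNumber;

      public void configure(JobConf job) {
        // e.g. "attempt_201004261530_0007_r_000003_0"
        String attempt = job.get("mapred.task.id");
        reducerNumber = TaskAttemptID.forName(attempt).getTaskID().getId();   // 3 in the example above
      }
    }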
Re: Reducer ID
context.getTaskAttemptID() gives the task attempt id, and context.getTaskAttemptID().getTaskID() gives the task id of the reducer. context.getTaskAttemptID().getTaskID().getId() gives the reducer number. Thanks Amareshwari On 4/27/10 5:34 AM, Gang Luo lgpub...@yahoo.com.cn wrote: JobConf.get("mapred.task.id") gives you everything (including the attempt id). -Gang ----- Original Message ----- From: Farhan Husain farhan.hus...@csebuet.org To: common-user@hadoop.apache.org Sent: 2010/4/26 (Mon) 7:13:03 PM Subject: Reducer ID Hello, Is it possible to know the unique id of a reducer inside the reduce or setup method of a reducer class? I tried to find any method of the context class which might help in this regard but could not find one. Thanks, Farhan
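And the equivalent in the new (0.20 mapreduce) API, as a minimal sketch with made-up class name and type parameters:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IdAwareReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        String attemptId = context.getTaskAttemptID().toString();            // full attempt id
        int reducerNumber = context.getTaskAttemptID().getTaskID().getId();  // 0-based reducer index
        System.err.println("reducer " + reducerNumber + ", attempt " + attemptId);
      }
    }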