Re: Hadoop performance - xfs and ext4

2010-04-26 Thread Konstantin Shvachko



On 4/23/2010 6:17 AM, stephen mulcahy wrote:

Steve Loughran wrote:

That's really interesting. Do you want to update the bits of the
Hadoop wiki that talk about filesystems?


I can if people think that would be useful.


Absolutely.
+1

Thanks,
--Konstantin


I'm not sure if my results are necessarily going to reflect what will
happen on other people's systems and configs though - what's the best way
of addressing that?

Do my apache credentials work for the wiki or do I need to explicitly
have a new account for the hadoop wiki?

-stephen





Chaining M/R Jobs

2010-04-26 Thread Tiago Veloso
Hi,

I'm trying to find a way to control the output file names. I need this because 
I have a situation where I need to run a Job and then use its output in the 
DistributedCache.

So far the only way I've seen that makes it possible is rewriting the 
OutputFormat class but that seems a lot of work for such a simple task. Is 
there any way to do what I'm looking for?

Tiago Veloso
ti.vel...@gmail.com





Re: Chaining M/R Jobs

2010-04-26 Thread Eric Sammer
The easiest way to do this is to write your job outputs to a known
place and then use the FileSystem APIs to rename the part-* files to
what you want them to be.
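
For illustration, a minimal sketch of that approach; the directory and file
names below are only placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameJobOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path outDir = new Path("/user/tiago/job-output");   // hypothetical output dir

    // Match the reducer output files and move each one to a friendlier name.
    FileStatus[] parts = fs.globStatus(new Path(outDir, "part-*"));
    for (int i = 0; i < parts.length; i++) {
      Path dst = new Path(outDir, "renamed-" + i);      // hypothetical target name
      fs.rename(parts[i].getPath(), dst);
    }
  }
}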

On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso ti.vel...@gmail.com wrote:
 Hi,

 I'm trying to find a way to control the output file names. I need this 
 because I have a situation where I need to run a Job and then use its output 
 in the DistributedCache.

 So far the only way I've seen that makes it possible is rewriting the 
 OutputFormat class but that seems a lot of work for such a simple task. Is 
 there any way to do what I'm looking for?

 Tiago Veloso
 ti.vel...@gmail.com







-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com


RE: Chaining M/R Jobs

2010-04-26 Thread Xavier Stevens
I don't usually bother renaming the files.  If you know you want all of
the files, you just iterate over the files in the output directory from
the previous job.  And then add those to distributed cache.  If the data
is fairly small you can set the number of reducers to 1 on the previous
step as well.


-Xavier


-Original Message-
From: Eric Sammer [mailto:esam...@cloudera.com] 
Sent: Monday, April 26, 2010 11:33 AM
To: common-user@hadoop.apache.org
Subject: Re: Chaining M/R Jobs

The easiest way to do this is to write your job outputs to a known
place and then use the FileSystem APIs to rename the part-* files to
what you want them to be.

On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso ti.vel...@gmail.com
wrote:
 Hi,

 I'm trying to find a way to control the output file names. I need this
because I have a situation where I need to run a Job and then use its
output in the DistributedCache.

 So far the only way I've seen that makes it possible is rewriting the
OutputFormat class but that seems a lot of work for such a simple task.
Is there any way to do what I'm looking for?

 Tiago Veloso
 ti.vel...@gmail.com







-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com




Re: Hadoop Log Collection

2010-04-26 Thread Ariel Rabkin
It should actually be straightforward to do this with Chukwa.  Chukwa
has a bunch of other pieces, but at its core, it does basically what
you describe.

The one complexity is that instead of storing each file separately,
Chukwa runs them together into larger sequence files.  This turns out
to be important if you want good filesystem performance, if you have
large data volumes, or if you want to keep metadata telling you which
machine each file came from.
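
Not Chukwa's actual code, but a rough sketch of that idea (the path and record
below are made up): log chunks from many machines get appended into one large
SequenceFile, keyed by the source host so that provenance survives the merge.

import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LogRollupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One large sequence file instead of many small per-host files.
    Path out = new Path("/logs/rollup/2010-04-26.seq");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    String host = InetAddress.getLocalHost().getHostName();
    // Key each record by its source machine so that metadata is preserved.
    writer.append(new Text(host), new Text("one chunk of collected log data"));
    writer.close();
  }
}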

--Ari

On Fri, Apr 23, 2010 at 5:38 AM, Patrick Datko patrick.da...@ymc.ch wrote:
 Hey everyone,

 I have been working with Hadoop for a few weeks to build up a cluster with HDFS. I
 was looking at several monitoring tools to observe my cluster and found
 a good solution with ganglia+nagios. To complete the monitoring part of
 the cluster, I am looking for a log collection tool which stores the
 log files of the nodes centrally. I have tested Chukwa and Facebook's
 Scribe, but neither is the kind of simple log-file store I want; in my
 opinion they are too big for such a job.

 So I've been thinking about writing my own LogCollector. I don't want
 anything special. My idea is to build a daemon which could be
 installed on every node in the cluster, plus an XML file which describes
 which log files have to be collected. The daemon would collect, at a
 configured time interval, all needed log files and store them using the
 Java API in HDFS.

 This is just an idea for a simple LogCollector, and it would be cool if you
 could give me some opinions on it, or tell me whether such a LogCollector
 already exists.

 Kind regards,
 Patrick





-- 
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department


Re: Chaining M/R Jobs

2010-04-26 Thread Alex Kozlov
You can use MultipleOutputs for this purpose, even though it was not
designed for this and a few people on this list are going to raise an
eyebrow.
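
If it helps, a rough sketch with the old (org.apache.hadoop.mapred) API; the
class name, the "summary" named output, and the key/value types are just
placeholders, and the job would still need its usual input/output and mapper
setup:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class NamedOutputSketch {

  // Driver side: declare a named output; its name becomes the prefix of the
  // files it produces, so they are easy to pick out for the DistributedCache.
  public static JobConf configureJob() {
    JobConf conf = new JobConf(NamedOutputSketch.class);
    MultipleOutputs.addNamedOutput(conf, "summary",
        TextOutputFormat.class, Text.class, Text.class);
    return conf;
  }

  public static class SummaryReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    private MultipleOutputs mos;

    public void configure(JobConf job) {
      mos = new MultipleOutputs(job);
    }

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // Records written here go to the "summary" named output files rather
      // than the default part-* files.
      mos.getCollector("summary", reporter).collect(key, values.next());
    }

    public void close() throws IOException {
      mos.close();
    }
  }
}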

Alex K

On Mon, Apr 26, 2010 at 11:39 AM, Xavier Stevens xavier.stev...@fox.comwrote:

 I don't usually bother renaming the files.  If you know you want all of
 the files, you just iterate over the files in the output directory from
 the previous job.  And then add those to distributed cache.  If the data
 is fairly small you can set the number of reducers to 1 on the previous
 step as well.


 -Xavier


 -Original Message-
 From: Eric Sammer [mailto:esam...@cloudera.com]
 Sent: Monday, April 26, 2010 11:33 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Chaining M/R Jobs

 The easiest way to do this is to write your job outputs to a known
 place and then use the FileSystem APIs to rename the part-* files to
 what you want them to be.

 On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso ti.vel...@gmail.com
 wrote:
  Hi,
 
  I'm trying to find a way to control the output file names. I need this
 because I have a situation where I need to run a Job and then use its
 output in the DistributedCache.
 
  So far the only way I've seen that makes it possible is rewriting the
 OutputFormat class but that seems a lot of work for such a simple task.
 Is there any way to do what I'm looking for?
 
  Tiago Veloso
  ti.vel...@gmail.com
 
 
 
 



 --
 Eric Sammer
 phone: +1-917-287-2675
 twitter: esammer
 data: www.cloudera.com





Re: Chaining M/R Jobs

2010-04-26 Thread Tiago Veloso
On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote:

 I don't usually bother renaming the files.  If you know you want all of
 the files, you just iterate over the files in the output directory from
 the previous job.  And then add those to distributed cache.  If the data
 is fairly small you can set the number of reducers to 1 on the previous
 step as well.


And how do I iterate over a directory? Could you give me some sample code?

If relevant I am using hadoop 0.20.2.

Tiago Veloso
ti.vel...@gmail.com


RE: Chaining M/R Jobs

2010-04-26 Thread Xavier Stevens
I know this works for 0.18.x.  I'm not using 0.20 yet, but as long as the API 
hasn't changed too much this should be pretty straightforward.


// 'hdfs' is assumed to be the FileSystem holding the previous job's output,
// e.g. FileSystem.get(conf), and 'conf' the JobConf of the job that follows.
Path prevOutputPath = new Path(...);
for (FileStatus fstatus : hdfs.listStatus(prevOutputPath)) {
    if (!fstatus.isDir()) {
        DistributedCache.addCacheFile(fstatus.getPath().toUri(), conf);
    }
}


-Original Message-
From: Tiago Veloso [mailto:ti.vel...@gmail.com] 
Sent: Monday, April 26, 2010 12:11 PM
To: common-user@hadoop.apache.org
Cc: Tiago Veloso
Subject: Re: Chaining M/R Jobs

On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote:

 I don't usually bother renaming the files.  If you know you want all of
 the files, you just iterate over the files in the output directory from
 the previous job.  And then add those to distributed cache.  If the data
 is fairly small you can set the number of reducers to 1 on the previous
 step as well.


 And how do I iterate over a directory? Could you give me some sample code?

If relevant I am using hadoop 0.20.2.

Tiago Veloso
ti.vel...@gmail.com



Re: Chaining M/R Jobs

2010-04-26 Thread Tiago Veloso
It worked thanks.

Tiago Veloso
ti.vel...@gmail.com



On Apr 26, 2010, at 8:57 PM, Xavier Stevens wrote:

 I know this works for 0.18.x.  I'm not using 0.20 yet, but as long as the API 
 hasn't changed too much this should be pretty straightforward.
 
 
 Path prevOutputPath = new Path(...);
 for (FileStatus fstatus : hdfs.listStatus(prevOutputPath)) {
   if (!fstatus.isDir()) {
   DistributedCache.addCacheFile(fstatus.getPath().toUri(), conf);
   }
 }
 
 
 -Original Message-
 From: Tiago Veloso [mailto:ti.vel...@gmail.com] 
 Sent: Monday, April 26, 2010 12:11 PM
 To: common-user@hadoop.apache.org
 Cc: Tiago Veloso
 Subject: Re: Chaining M/R Jobs
 
 On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote:
 
 I don't usually bother renaming the files.  If you know you want all of
 the files, you just iterate over the files in the output directory from
 the previous job.  And then add those to distributed cache.  If the data
 is fairly small you can set the number of reducers to 1 on the previous
 step as well.
 
 
 And how do I iterate over a directory? Could you give me some sample code?
 
 If relevant I am using hadoop 0.20.2.
 
 Tiago Veloso
 ti.vel...@gmail.com
 



Re: Try to mount HDFS

2010-04-26 Thread Eli Collins
The issue that required you to change ports is HDFS-961.

Thanks,
Eli


On Fri, Apr 23, 2010 at 6:30 AM, Christian Baun c...@unix-ag.uni-kl.de wrote:
 Brian,

 You got it!!! :-)
 It works (partly)!

 I switched to port 9000. core-site.xml now includes:

        <property>
                <name>fs.default.name</name>
                <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com:9000</value>
                <final>true</final>
        </property>


 $ hadoop fs -ls /
 Found 1 items
 drwxr-xr-x   - hadoop supergroup          0 2010-04-23 05:18 /mnt

 $ hadoop fs -ls /mnt/
 Found 1 items
 drwxr-xr-x   - hadoop supergroup          0 2010-04-23 13:00 /mnt/mapred

 # ./fuse_dfs_wrapper.sh dfs://ec2-75-101-210-65.compute-1.amazonaws.com:9000 
 /mnt/hdfs/
 port=9000,server=ec2-75-101-210-65.compute-1.amazonaws.com
 fuse-dfs didn't recognize /mnt/hdfs/,-2

 This tiny error message remains.

 # mount | grep fuse
 fuse_dfs on /hdfs type fuse.fuse_dfs 
 (rw,nosuid,nodev,allow_other,default_permissions)

 # ls /mnt/hdfs/
 mnt
 # mkdir /mnt/hdfs/testverzeichnis
 # touch /mnt/hdfs/testdatei
 # ls -l /mnt/hdfs/
 total 8
 drwxr-xr-x 3 hadoop 99 4096 2010-04-23 05:18 mnt
 -rw-r--r-- 1 root   99    0 2010-04-23 13:07 testdatei
 drwxr-xr-x 2 root   99 4096 2010-04-23 13:05 testverzeichnis

 In /var/log/messages there was no information about hdfs/fuse.

 Only in /var/log/user.log were these lines:
 Apr 23 13:04:34 ip-10-242-231-63 fuse_dfs: mounting 
 dfs://ec2-75-101-210-65.compute-1.amazonaws.com:9000/

 mkdir and touch work, but I cannot write data into files(?!); they are all
 read-only.
 When I try to copy files from outside into HDFS, only an empty file is
 created, and these error messages appear in user.log:

 Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: ERROR: fuse problem - could not 
 write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:60
 Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: WARN: fuse problem - could not 
 write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:64
 Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: ERROR: fuse problem - could not 
 write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:60
 Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: WARN: fuse problem - could not 
 write all the bytes for /testordner/testfile -1!=4096fuse_impls_write.c:64
 Apr 23 13:18:46 ip-10-242-231-63 fuse_dfs: ERROR: dfs problem - could not 
 close file_handle(23486496) for /testordner/testfile fuse_impls_release.c:58

 Weird...

 But this is a big step forward.

 Thanks a lot!!!

 Best Regards
    Christian


 Am Freitag, 23. April 2010 schrieb Brian Bockelman:
 Hm, ok, now you have me stumped.

 One last hunch - can you include the port information, but also switch to 
 port 9000?

 Additionally, can you do the following:

 1) Look in /var/log/messages, copy out the hdfs/fuse-related messages, and
 post them
 2) Using the hadoop client, run:
 hadoop fs -ls /

 Brian

 On Apr 23, 2010, at 12:33 AM, Christian Baun wrote:

  Hi,
 
  When adding the port information inside core-site.xml, the problem remains:
 
       <property>
               <name>fs.default.name</name>
               <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020</value>
               <final>true</final>
       </property>
 
  # ./fuse_dfs_wrapper.sh 
  dfs://ec2-75-101-210-65.compute-1.amazonaws.com:8020 /mnt/hdfs/
  port=8020,server=ec2-75-101-210-65.compute-1.amazonaws.com
  fuse-dfs didn't recognize /mnt/hdfs/,-2
 
  # ls /mnt/hdfs
  ls: cannot access /mnt/hdfs/®1  : No such file or directory
 
  Best Regards,
    Christian
 
 
  Am Freitag, 23. April 2010 schrieb Christian Baun:
  Hi Brian,
 
  this is inside my core-site.xml
 
   <configuration>
      <property>
              <name>fs.default.name</name>
              <value>hdfs://ec2-75-101-210-65.compute-1.amazonaws.com/</value>
              <final>true</final>
      </property>
      <property>
              <name>hadoop.tmp.dir</name>
              <value>/mnt</value>
              <description>A base for other temporary directories.</description>
      </property>
   </configuration>
 
  Do I need to give the port here?
 
  this is inside my hdfs-site.xml
 
   <configuration>
      <property>
              <name>dfs.name.dir</name>
              <value>${hadoop.tmp.dir}/dfs/name</value>
              <final>true</final>
      </property>
      <property>
              <name>dfs.data.dir</name>
              <value>${hadoop.tmp.dir}/dfs/data</value>
      </property>
      <property>
              <name>fs.checkpoint.dir</name>
              <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
              <final>true</final>
      </property>
   </configuration>
 
 These directories all exist:
 
  # ls -l /mnt/dfs/
  total 12
  drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 data
  drwxr-xr-x 4 hadoop hadoop 4096 2010-04-23 05:17 name
  drwxr-xr-x 2 hadoop hadoop 4096 2010-04-23 05:08 namesecondary
 
  I don't have the config file hadoop-site.xml in /etc/...
  In the source directory of hadoop I have a hadoop-site.xml but with 

Shared library error

2010-04-26 Thread Keith Wiley
My Java mapper drops down to C++ through JNI.  The C++ side then runs various 
code which in some cases links to shared libraries that I put in the 
distributed cache.  The problem isn't that the library isn't found or something 
to that effect... but I'm unsure what the problem *is*.  This is all I'm getting:

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:471)
Caused by: java.io.IOException: Task process exit with nonzero status of 127.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:458)

I do believe that 127 indicates a failure to load the library, but I don't 
understand why.  The library is of the correct architecture.  I'm unsure what 
else to do.

Any ideas?

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
  -- Abe (Grandpa) Simpson






Re: Reducer ID

2010-04-26 Thread Gang Luo
JobConf.get("mapred.task.id") gives you everything (including the attempt id).
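
For example, a sketch only (old mapred API; the class and field names are made
up):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.TaskAttemptID;

public class ReducerIdOldApiSketch extends MapReduceBase {
  private int reducerNumber;

  @Override
  public void configure(JobConf job) {
    // mapred.task.id holds the full attempt id,
    // e.g. attempt_201004261200_0001_r_000003_0 (a made-up value).
    TaskAttemptID attempt = TaskAttemptID.forName(job.get("mapred.task.id"));
    reducerNumber = attempt.getTaskID().getId();   // 3 in the example above
  }
}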

-Gang


- Original Message -
From: Farhan Husain farhan.hus...@csebuet.org
To: common-user@hadoop.apache.org
Sent: 2010/4/26 (Mon) 7:13:03 PM
Subject: Reducer ID

Hello,

Is it possible to know the unique id of a reducer inside the reduce or setup
method of a reducer class? I tried to find a method of the context class
which might help in this regard but could not find one.

Thanks,
Farhan






Re: Reducer ID

2010-04-26 Thread Amareshwari Sri Ramadasu
context.getTaskAttemptID() gives the task attempt id, and
context.getTaskAttemptID().getTaskID() gives the task id of the reducer.
context.getTaskAttemptID().getTaskID().getId() gives the reducer number.
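
For example, a minimal sketch inside a 0.20 (mapreduce API) reducer; the class
name and key/value types are placeholders:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReducerIdSketch extends Reducer<Text, Text, Text, Text> {
  private int reducerNumber;

  @Override
  protected void setup(Context context) {
    // Attempt id -> task id -> numeric id of this reducer (0, 1, 2, ...).
    reducerNumber = context.getTaskAttemptID().getTaskID().getId();
  }
}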

Thanks
Amareshwari

On 4/27/10 5:34 AM, Gang Luo lgpub...@yahoo.com.cn wrote:

JobConf.get("mapred.task.id") gives you everything (including the attempt id).

-Gang


- Original Message -
From: Farhan Husain farhan.hus...@csebuet.org
To: common-user@hadoop.apache.org
Sent: 2010/4/26 (Mon) 7:13:03 PM
Subject: Reducer ID

Hello,

Is it possible to know the unique id of a reducer inside the reduce or setup
method of a reducer class? I tried to find any method of the context class
which might help in this regard but could not get any.

Thanks,
Farhan