Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Pallavi Palleti
Hi Alex, Thanks for the reply. I have already created a logger (from log4j.Logger) and configured it to log to a file, and it is logging all the log statements in my client code. However, the error/info logs of DFSClient are going to stdout. The DFSClient code is using

Single datanode setup

2010-03-30 Thread Ed Mazur
Hi, I have a 12 node cluster where instead of running a DN on each compute node, I'm running just one DN backed by a large RAID (with a dfs.replication of 1). The compute node storage is limited, so the idea behind this was to free up more space for intermediate job data. So the cluster has that o

Re: Single datanode setup

2010-03-30 Thread Ankur C. Goel
M/R performance is known to be better when using just a bunch of disks (JBOD) instead of RAID. From your setup it looks like your single datanode must be running hot on I/O activity. The parameter dfs.datanode.handler.count only controls the number of datanode threads serving IPC requests.
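
For reference, that parameter is set in hdfs-site.xml; a minimal sketch (the value of 10 is purely illustrative; the 0.20-era default is 3):

  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>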

Re: java.io.IOException: Function not implemented

2010-03-30 Thread Steve Loughran
Edson Ramiro wrote: I'm not involved with the Debian community :( I think you are now...

Re: why does 'jps' lose track of hadoop processes ?

2010-03-30 Thread Steve Loughran
Marcos Medrado Rubinelli wrote: jps gets its information from the files stored under /tmp/hsperfdata_*, so when a cron job clears your /tmp directory, it also erases these files. You can submit jobs as long as your jobtracker and namenode are responding to requests over TCP, though. I never k

Re: Single datanode setup

2010-03-30 Thread Ed Mazur
I set dfs.datanode.max.xcievers to 4096, but this didn't seem to have any effect on performance. Here are some benchmarks (not sure what typical values are):

  ----- TestDFSIO ----- : write
  Date & time: Tue Mar 30 04:53:18 EDT 2010
  Number of files: 10
  Total MBytes processed: 1

Re: Single datanode setup

2010-03-30 Thread Steve Loughran
Ed Mazur wrote: Hi, I have a 12 node cluster where instead of running a DN on each compute node, I'm running just one DN backed by a large RAID (with a dfs.replication of 1). The compute node storage is limited, so the idea behind this was to free up more space for intermediate job data. So the

Query over DFSClient

2010-03-30 Thread Pallavi Palleti
Hi, Could someone kindly let me know whether the DFSClient takes care of datanode failures and attempts to write to another datanode if the primary datanode (and its replica datanodes) fail? I looked into the source code of DFSClient and figured out that it attempts to write to one of the datanodes in p

Listing subdirectories in Hadoop

2010-03-30 Thread Santiago Pérez
Hi, I've been checking the API and searching the internet, but I have not found any method for listing the subdirectories of a given directory in HDFS. Can anybody show me how to get the list of subdirectories, or even how to implement the method? (I guess that it should be possible and not very hard.) Than

a question about automatic restart of the NameNode

2010-03-30 Thread 毛宏
Hi all, Is automatic restart and failover of the NameNode software to another machine available in Hadoop 0.20.2?

Re: Listing subdirectories in Hadoop

2010-03-30 Thread Ted Yu
Does this get what you want? hadoop dfs -ls | grep drwx On Tue, Mar 30, 2010 at 8:24 AM, Santiago Pérez wrote: > > Hi > > I've been checking the API and searching the internet, but I have not found any method for > listing the subdirectories of a given directory in HDFS. > > Can anybody show me how to get

C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
I'm confused as to how to run a C++ pipes program on a full HDFS system. First off, I have everything working in pseudo-distributed mode so that's a good start...but full HDFS has no concept of an executable file (to the best of my understanding, O'Reilly/White, p.47). I haven't even been succ

Re: a question about automatic restart of the NameNode

2010-03-30 Thread Ted Yu
Please refer to the high-availability contrib in 0.20.2: HDFS-976 http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html On Tue, Mar 30, 2010 at 8:51 AM, 毛宏 wrote: > Hi all, > Is automatic restart and failover of the NameNode software to > another machine available in

Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Alex Kozlov
Hi Pallavi, DFSClient uses log4j.properties for configuration. What is your classpath? I need to know exactly how you invoke your program (java, hadoop script, etc.). The log level and appender are driven by the hadoop.root.logger config variable. I would also recommend using one logging syste
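
For what it's worth, a minimal log4j.properties sketch along these lines, placed ahead of the Hadoop jars on the client classpath (the file path and layout pattern are illustrative):

  hadoop.root.logger=INFO,file
  log4j.rootLogger=${hadoop.root.logger}
  log4j.appender.file=org.apache.log4j.FileAppender
  log4j.appender.file.File=/tmp/hadoop-client.log
  log4j.appender.file.layout=org.apache.log4j.PatternLayout
  log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

With this in place, DFSClient's INFO/ERROR messages should land in the file rather than on stdout.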

C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
I'm confused as to how to run a C++ pipes program on a full HDFS system. I have everything working in pseudo-distributed mode so that's a good start...but I can't figure out the full cluster mode. As I see it, there are two basic approaches: upload the executable directly to HDFS or specify it

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
Please disregard this thread. I started another thread which is more specific and pertinent to my problem...but if you have any helpful information, please respond to the other thread. I need to get this figured out. Thank you.

swapping on hadoop

2010-03-30 Thread Vasilis Liaskovitis
Hi all, I've noticed swapping for a single terasort job on a small 8-node cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have back-to-back runs of the same job from the same hdfs input data and get swapping only on 1 out of 4 identical runs. I've noticed this swapping

Re: Listing subdirectories in Hadoop

2010-03-30 Thread A Levine
If you were talking about looking at directories within a Java program, here is what has worked for me.

  FileSystem fs;
  FileStatus[] fileStat;
  Path[] fileList;
  SequenceFile.Reader reader = null;
  try {
    // connect to the file system
    fs = FileSystem.get(conf);
    // get the stat on all fil
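
For completeness, a self-contained sketch of the same idea against the 0.20 FileSystem API (the argument path is illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListSubdirs {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // stat every entry under the given directory, keep only directories
      for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
        if (stat.isDir()) {
          System.out.println(stat.getPath());
        }
      }
    }
  }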

CfP with Extended Deadline 5th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'10)

2010-03-30 Thread Michael Alexander
Apologies if you received multiple copies of this message. CALL FOR PAPERS: 5th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'10), as part of Euro-Par 2010, Island of Ischia-Naples, Italy

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
No responses yet, although I admit it's only been a few hours. As a follow-up, permit me to pose the following question: Is it, in fact, impossible to run C++ pipes on a fully-distributed system (as opposed to a pseudo-distributed system)? I haven't found any definitive clarification on this t

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
Hello. Did you try following the tutorial in http://wiki.apache.org/hadoop/C++WordCount ? We use C++ pipes in production on a large cluster, and it works. --gianluigi On Tue, 2010-03-30 at 13:28 -0700, Keith Wiley wrote: > No responses yet, although I admit it's only been a few hours. > > As

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
Yep, tried and tried and tried it. Works perfectly on a pseudo-distributed cluster which is why I didn't think the example or the code was the problem, but rather that the cluster was the problem. I have only just (in the last two minutes) heard back from the administrator of our cluster and h

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
My cluster admin noticed that there is some additional pipes package he could add to the cluster configuration, but he admits to knowing very little about how the C++ pipes component of Hadoop works. Can you offer any insight into this cluster configuration package? What exactly does it do tha

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
What are the symptoms? Pipes should run out of the box in a standard installation. BTW what version of bash are you using? Is it bash 4.0 by any chance? See https://issues.apache.org/jira/browse/HADOOP-6388 --gianluigi On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote: > My cluster admin not

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
The closest I've gotten so far is for the job to basically try to start up but to get an error complaining about the permissions on the executable binary...which makes perfect sense, since the permissions are not "executable". Problem is, the hdfs chmod command ignores the executable bits. For

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output HDFSPATH/output -program HDFSPATH/EXECUTABLE
Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
10/03/30 14:56:55 WARN mapred.JobC

Hadoop DFS IO Performance measurement

2010-03-30 Thread sagar naik
Hi All, I am trying to measure DFS I/O performance. I used TestDFSIO from the Hadoop test jars. The results were about 100 Mbps read and write. I think it should be more than this. Please share some stats to compare. Either I am missing something like config params, or something else. -Sagar
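
For anyone comparing numbers, the usual invocation looks roughly like this (jar name varies by release; file counts and sizes are illustrative):

  hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
  hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
  hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -clean

Note that TestDFSIO reports throughput in MB/sec, so be careful not to confuse megabytes with megabits when comparing.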

Re: java.io.IOException: Function not implemented

2010-03-30 Thread Edson Ramiro
Hi all, Thanks for the help, Todd and Steve. I configured Hadoop (0.20.2) again and I'm getting the same error (Function not implemented). Do you think it's a Hadoop bug? This is the situation: I've 28 nodes where just four are running the datanode. On all the other nodes the tasktracker is running ok

Re: java.io.IOException: Function not implemented

2010-03-30 Thread Todd Lipcon
Hi Edson, I noticed that only the h01 nodes are running 2.6.32.9, the other broken DNs are 2.6.32.10. Is there some reason you are running a kernel that is literally 2 weeks old? I wouldn't be at all surprised if there were a bug here, or some issue with your Debian "unstable" distribution... -T

Re: java.io.IOException: Function not implemented

2010-03-30 Thread Edson Ramiro
Maybe it's a bug. I'm not the admin :( so I'll talk to him, and maybe he can install 2.6.32.9 on another node to test :) Thanks Edson Ramiro On 30 March 2010 20:00, Todd Lipcon wrote: > Hi Edson, > > I noticed that only the h01 nodes are running 2.6.32.9, the other broken > DNs > are 2.

Re: Hadoop DFS IO Performance measurement

2010-03-30 Thread Edson Ramiro
Hi Sagar, What hardware did you run it on? Edson Ramiro On 30 March 2010 19:41, sagar naik wrote: > Hi All, > > I am trying to measure DFS I/O performance. > I used TestDFSIO from the Hadoop test jars. > The results were about 100 Mbps read and write. > I think it should be more than this > > Please share some

question on shuffle and sort

2010-03-30 Thread Cui tony
Hi, Will all key-value pairs of the map output which have the same key be sent to the same reducer task node?

Re: question on shuffle and sort

2010-03-30 Thread Ed Mazur
On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote: > Will all key-value pairs of the map output which have the same key > be sent to the same reducer task node? Yes, this is at the core of the MapReduce model. There is one call to the user reduce function per unique map output key. This groupi
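
The grouping described here is decided by the partitioner; the stock HashPartitioner does essentially the following (a sketch of its well-known logic):

  // every record whose key hashes to the same value lands in the same
  // partition, and therefore on the same reduce task
  public class HashPartitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
      // mask the sign bit so the modulus is non-negative
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }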

Re: question on shuffle and sort

2010-03-30 Thread 毛宏
Yes, indeed. On Wed, 2010-03-31 at 09:56 +0800, Cui tony wrote: > Hi, > Will all key-value pairs of the map output which have the same key > be sent to the same reducer task node?

Re: question on shuffle and sort

2010-03-30 Thread Jones, Nick
Something to keep in mind, though: sorting is specific to the key type. Text will be sorted lexicographically. Nick Jones - Original Message - From: Ed Mazur To: common-user@hadoop.apache.org Sent: Tue Mar 30 21:07:29 2010 Subject: Re: question on shuffle and sort On Tue, Mar 30, 2
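
A quick illustration of the point (a sketch; Text compares raw bytes, so numeric strings sort lexicographically, while IntWritable compares numerically):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;

  public class SortOrderDemo {
    public static void main(String[] args) {
      // byte-wise comparison: "10" sorts before "9"
      System.out.println(new Text("10").compareTo(new Text("9")) < 0);           // true
      // numeric comparison: 10 sorts after 9
      System.out.println(new IntWritable(10).compareTo(new IntWritable(9)) > 0); // true
    }
  }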

Re: question on shuffle and sort

2010-03-30 Thread Cui tony
Consider this extreme situation: the input data is very large, and so is the map output. If 90% of the map output records have the same key, then all of them will be sent to one reducer task node, so 90% of the reduce-phase work has to be done on a single node, not the cluster. That is very inefficient and less s

Re: question on shuffle and sort

2010-03-30 Thread Jones, Nick
I ran into an issue where lots of data was passing from mappers to a single reducer. Enabling a combiner saved quite a bit of processing time by reducing mapper disk writes and data movement to the reducer. Nick Jones - Original Message - From: Cui tony To: common-user@hadoop.apache.
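
Wiring in a combiner with the 0.20 mapred API is one line on the job conf; below is a word-count-style sketch (class names are illustrative). Reusing the reducer as the combiner is valid here because its input and output key/value types are identical:

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCountWithCombiner {

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          out.collect(word, ONE);
        }
      }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) sum += values.next().get();
        out.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCountWithCombiner.class);
      conf.setJobName("wordcount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      // partial sums are computed map-side, cutting disk writes
      // and shuffle traffic to the reducer
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }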

Re: question on shuffle and sort

2010-03-30 Thread Cui tony
Hi Jones, As you have met the situation I am worried about, I have my answer now. Maybe redesigning the map function or adding a combiner is the only way to deal with this kind of input data. 2010/3/31 Jones, Nick > I ran into an issue where lots of data was passing from mappers to a single > reduce

is there any way we can limit Hadoop Datanode's disk usage?

2010-03-30 Thread steven zhuang
Hi guys, we have some machines with a 1 TB disk and some with a 100 GB disk. Is there any means by which we can limit the disk usage of the datanodes on those machines with smaller disks? Thanks!

Re: is there any way we can limit Hadoop Data node's disk usage?

2010-03-30 Thread Ravi Phulari
Hello Steven, You can use the dfs.datanode.du.reserved configuration value in $HADOOP_HOME/conf/hdfs-site.xml to limit disk usage.

  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>182400</value>
    <description>Reserved space in bytes per volume. Always leave
    this much space free for non dfs use.</description>
  </property>

Ravi Hadoop @ Yahoo

log

2010-03-30 Thread Gang Luo
Hi all, I find there is a directory "_logs/history/..." under the output directory of a mapreduce job. Is the file in that directory a log file? Is the information there sufficient to allow me to figure out what nodes the job runs on? Besides, not every job has such a directory. Is there such set