Redirecting hadoop log messages to a log file at client side
Hi, I am copying certain data from a client machine (which is not part of the cluster) to HDFS using DFSClient. During this process, I am encountering some issues, and the error/info logs are going to stdout. Is there a way I can configure a property on the client side so that the error/info logs are appended to the existing log file (created using a logger in the client code) rather than written to stdout? Thanks Pallavi
Re: hadoop.log.dir
HADOOP_LOG_DIR is used to set hadoop.log.dir (see bin/hadoop). It is passed to the JVM via the -D java flag (or set it in the log4j.properties file). The best way for you would be to set this variable in bin/hadoop-env.sh (essentially, uncomment the prepared stub). Alex K On Mon, Mar 29, 2010 at 10:55 PM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com wrote: hadoop.log.dir is not a config parameter; it is a system property. You can specify the log directory in the environment variable HADOOP_LOG_DIR. Thanks Amareshwari On 3/30/10 11:17 AM, Vasilis Liaskovitis vlias...@gmail.com wrote: Hi all, is there a config option that controls placement of all hadoop logs? I'd like to put all hadoop logs under a specific directory, e.g. /tmp, on the namenode and all datanodes. Is hadoop.log.dir the right config? Can I change this in the log4j.properties file, or pass it e.g. in the JVM opts as -Dhadoop.log.dir=/tmp ? I am using hadoop-0.20.1 or hadoop-0.20.2. thanks, - Vasilis
Re: Redirecting hadoop log messages to a log file at client side
Hi Pallavi, It depends what logging configuration you are using. If it's log4j, you need to modify (or create) a log4j.properties file and point your code (via the classpath) to it. A sample log4j.properties is in the conf directory (in either the Apache or CDH distributions). Alex K On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote: [...]
Re: Redirecting hadoop log messages to a log file at client side
Hi Alex, Thanks for the reply. I have already created a logger (from log4j's Logger) and configured it to log to a file, and it is logging all the log statements that I have in my client code. However, the error/info logs of DFSClient are going to stdout. The DFSClient code uses a log from commons-logging.jar. I am wondering how to redirect those logs (which are right now going to stdout) so that they append to the existing logger in the client code. Thanks Pallavi On 03/30/2010 12:06 PM, Alex Kozlov wrote: [...]
Re: Single datanode setup
M/R performance is known to be better when using just a bunch of disks (JBOD) instead of RAID. From your setup it looks like your single datanode must be running hot on I/O activity. The parameter dfs.datanode.handler.count only controls the number of datanode threads serving IPC requests. These are NOT used for actual block transfer. Try upping dfs.datanode.max.xcievers. You can then run the I/O benchmarks to measure the I/O throughput:

hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

-Ankur

On 3/30/10 12:46 PM, Ed Mazur ma...@cs.umass.edu wrote: Hi, I have a 12 node cluster where instead of running a DN on each compute node, I'm running just one DN backed by a large RAID (with a dfs.replication of 1). The compute node storage is limited, so the idea behind this was to free up more space for intermediate job data. So the cluster has that one node with the DN, a master node with the JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68 from Cloudera. The problem is that MR job performance is now worse than with a traditional HDFS setup. A job that took 76 minutes before now takes 169 minutes. I've used this single-DN setup before on a similarly-sized cluster without any problems, so what can I do to find the bottleneck?
- Loading data into HDFS was fast, under 30 minutes to load ~240GB, so I'm thinking this is a DN - map task communication problem.
- With a traditional HDFS setup, map tasks were taking 10-30 seconds, but they now take 45-90 seconds or more.
- I grep'd the DN logs to find how long the size-67633152 HDFS reads (map inputs) were taking. With the central DN, the reads were an order of magnitude slower than with traditional HDFS (e.g. 82008147000 vs. 8238455000).
- I tried increasing dfs.datanode.handler.count to 10, but this didn't seem to have any effect.
- Could low memory be an issue? The machine the DN is running on only has 2GB, and there is less than 100MB free without the DN running. I haven't observed any swapping going on though.
- I looked at netstat during a job. I wasn't too sure what to look for, but I didn't see any substantial send/receive buffering.
I've tried everything I can think of, so I'd really appreciate any tips. Thanks. Ed
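For reference, a minimal hdfs-site.xml sketch for raising that limit; the value 1024 here is only an illustrative starting point, not a figure from this thread:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
</property>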
Re: java.io.IOException: Function not implemented
Edson Ramiro wrote: I'm not involved with Debian community :( I think you are now...
Re: why does 'jps' lose track of hadoop processes ?
Marcos Medrado Rubinelli wrote: jps gets its information from the files stored under /tmp/hsperfdata_*, so when a cron job clears your /tmp directory, it also erases these files. You can submit jobs as long as your jobtracker and namenode are responding to requests over TCP, though. I never knew that. ps -ef | grep java works quite well; jps has fairly steep startup costs, and if a JVM is playing up, jps can hang too.
Listing subdirectories in Hadoop
Hej, I've been checking the API and the internet, but I have not found any method for listing the subdirectories of a given directory in HDFS. Can anybody show me how to get the list of subdirectories, or even how to implement the method? (I guess that it should be possible and not very hard.) Thanks in advance ;)
a question about automatic restart of the NameNode
Hi all, Is automatic restart and failover of the NameNode software to another machine available in Hadoop 0.20.2?
Re: Listing subdirectories in Hadoop
Does this get what you want? hadoop dfs -ls path | grep drwx On Tue, Mar 30, 2010 at 8:24 AM, Santiago Pérez elara...@gmail.com wrote: [...]
Re: a question about automatic restart of the NameNode
Please refer to the highavailability contrib of 0.20.2 (HDFS-976): http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html On Tue, Mar 30, 2010 at 8:51 AM, 毛宏 maohong1...@gmail.com wrote: [...]
Re: Redirecting hadoop log messages to a log file at client side
Hi Pallavi, DFSClient uses log4j.properties for configuration. What is your classpath? I need to know how exactly you invoke your program (java, hadoop script, etc.). The log level and appender are driven by the hadoop.root.logger config variable. I would also recommend using one logging system in the code, which would be commons-logging in this case. Alex K On Tue, Mar 30, 2010 at 12:12 AM, Pallavi Palleti pallavi.pall...@corp.aol.com wrote: [...]
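For reference, a minimal log4j.properties sketch along these lines (the appender name and file path are placeholders, assuming the log4j 1.2 that ships with Hadoop); putting a file like this on the client classpath routes DFSClient's commons-logging output, which falls through to the log4j root logger, into a file instead of stdout:

# send the root logger (which DFSClient's commons-logging calls reach) to a rolling file
hadoop.root.logger=INFO,DRFA
log4j.rootLogger=${hadoop.root.logger}
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=/var/log/myclient/client.log
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c: %m%n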
Re: C++ pipes on full (nonpseudo) cluster
Please disregard this thread. I started another thread which is more specific and pertinent to my problem...but if you have any helpful information, please respond to the other thread. I need to get this figured out. Thank you. Keith Wiley kwi...@keithwiley.com www.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
swapping on hadoop
Hi all, I've noticed swapping for a single terasort job on a small 8-node cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have back-to-back runs of the same job from the same HDFS input data and get swapping on only 1 out of 4 identical runs. I've noticed this swapping behaviour on both terasort jobs and Hive query jobs.
- Focusing on a single job config: is there a rule of thumb about how much node memory should be left for use outside of the child JVMs? I make sure that, per node, there is free memory: (#maxMapTasksPerTaskTracker + #maxReduceTasksPerTaskTracker) * JVMHeapSize < PhysicalMemoryOnNode. The total JVM heap size per node per job from the above equation currently accounts for 65%-75% of the node's memory. (I've tried allocating a riskier 90% of the node's memory, with similar swapping observations.)
- Could there be an issue with HDFS data or metadata taking up memory? I am not cleaning outputs or intermediate outputs from HDFS between runs. Is this possible?
- Do people use any specific Java flags (particularly garbage collection flags) for production environments where one job runs (or possibly more jobs run simultaneously)?
- What are the memory requirements for the jobtracker/namenode and tasktracker/datanode JVMs?
- I am setting io.sort.mb to about half of the JVM heap size (half of -Xmx in the Java opts). Should this be set to a different ratio? (This setting doesn't sound like it should be causing swapping in the first place.)
- The buffer cache is cleaned before each run (flush and echo 3 > /proc/sys/vm/drop_caches).
Any empirical advice and suggestions to solve this are appreciated. thanks, - Vasilis
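As a hypothetical worked example of that rule of thumb (the slot counts and heap size are invented for illustration): with 8 map slots and 4 reduce slots per tasktracker and -Xmx1024m per child JVM, the children alone can commit (8 + 4) * 1 GB = 12 GB; on a 16 GB node that is 75% of physical memory before the tasktracker and datanode daemons, the OS, and the buffer cache take their share, so a burst of fully-used heaps can push the node into swap.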
Re: Listing subdirectories in Hadoop
If you were talking about looking at directories within a Java program, here is what has worked for me.

FileSystem fs;
FileStatus[] fileStat;
Path[] fileList;
try {
    // connect to the file system
    fs = FileSystem.get(conf);
    // get the status of all entries in the source directory
    fileStat = fs.listStatus(sourceDir);
    // get paths to the entries in the source directory
    fileList = FileUtil.stat2Paths(fileStat);
    // then you can do something like
    for (int x = 0; x < fileList.length; x++) {
        System.out.println(x + " " + fileList[x]);
    }
} catch (IOException ioe) {
    // do something
}

Hope this helps. andrew -- On Tue, Mar 30, 2010 at 11:54 AM, Ted Yu yuzhih...@gmail.com wrote: [...]
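If only the subdirectories are wanted (as the original question asked), the FileStatus entries can be filtered before printing; a small sketch under the same assumptions (fs and sourceDir as above, and the 0.20-era API where the check is isDir()):

for (FileStatus stat : fs.listStatus(sourceDir)) {
    // keep only the entries that are directories
    if (stat.isDir()) {
        System.out.println(stat.getPath());
    }
}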
CfP with Extended Deadline 5th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'10)
Apologies if you received multiple copies of this message.

= CALL FOR PAPERS =
5th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'10)
as part of Euro-Par 2010, Island of Ischia-Naples, Italy

Date: August 31, 2010
Euro-Par 2010: http://www.europar2010.org/
Workshop URL: http://vhpc.org

SUBMISSION DEADLINE:
Abstracts: April 4, 2010 (extended)
Full Paper: June 19, 2010 (extended)

Scope: Virtualization has become a common abstraction layer in modern data centers, enabling resource owners to manage complex infrastructure independently of their applications. Conjointly, virtualization is becoming a driving technology for a manifold of industry-grade IT services. Piloted by the Amazon Elastic Computing Cloud services, the cloud concept includes the notion of a separation between resource owners and users, adding services such as hosted application frameworks and queuing. Utilizing the same infrastructure, clouds carry significant potential for use in high-performance scientific computing. The ability of clouds to provide for requests and releases of vast computing resources dynamically and close to the marginal cost of providing the services is unprecedented in the history of scientific and commercial computing. Distributed computing concepts that leverage federated resource access are popular within the grid community, but have not seen the previously desired deployment levels so far. Also, many of the scientific datacenters have not adopted virtualization or cloud concepts yet. This workshop aims to bring together industrial providers with the scientific community in order to foster discussion, collaboration and mutual exchange of knowledge and experience. The workshop will be one day in length, composed of 20 min paper presentations, each followed by 10 min discussion sections. Presentations may be accompanied by interactive demonstrations. It concludes with a 30 min panel discussion by presenters.

TOPICS
Topics include, but are not limited to, the following subjects:
- Virtualization in cloud, cluster and grid HPC environments
- VM cloud, cluster load distribution algorithms
- Cloud, cluster and grid filesystems
- QoS and service level guarantees
- Cloud programming models, APIs and databases
- Software as a service (SaaS)
- Cloud provisioning
- Virtualized I/O
- VMMs and storage virtualization
- MPI, PVM on virtual machines
- High-performance network virtualization
- High-speed interconnects
- Hypervisor extensions
- Tools for cluster and grid computing
- Xen/other VMM cloud/cluster/grid tools
- Raw device access from VMs
- Cloud reliability, fault-tolerance, and security
- Cloud load balancing
- VMs - power efficiency
- Network architectures for VM-based environments
- VMMs/Hypervisors
- Hardware support for virtualization
- Fault tolerant VM environments
- Workload characterizations for VM-based environments
- Bottleneck management
- Metering
- VM-based cloud performance modeling
- Cloud security, access control and data integrity
- Performance management and tuning hosts and guest VMs
- VMM performance tuning on various load types
- Research and education use cases
- Cloud use cases
- Management of VM environments and clouds
- Deployment of VM-based environments

PAPER SUBMISSION
Papers submitted to the workshop will be reviewed by at least two members of the program committee and external reviewers.
Submissions should include an abstract, keywords, and the e-mail address of the corresponding author, and must not exceed 10 pages, including tables and figures, at a main font size no smaller than 11 point. Submission of a paper should be regarded as a commitment that, should the paper be accepted, at least one of the authors will register and attend the conference to present the work. Accepted papers will be published in the Springer LNCS series; the format must be according to the Springer LNCS style. Initial submissions are in PDF; accepted papers will be requested to provide source files.

Format Guidelines: http://www.springer.de/comp/lncs/authors.html
Submission Link: http://edas.info/newPaper.php?c=8553

IMPORTANT DATES
April 4 - Abstract submission due (extended)
May 19 - Full paper submission (extended)
July 14 - Acceptance notification
August 3 - Camera-ready version due
August 31 - September 3 - conference

CHAIR
Michael Alexander (chair), scaledinfra technologies GmbH, Austria
Gianluigi Zanetti (co-chair), CRS4, Italy

PROGRAM COMMITTEE
Padmashree Apparao, Intel Corp., USA
Volker Buege, University of Karlsruhe, Germany
Roberto Canonico, University of Napoli Federico II, Italy
Tommaso Cucinotta, Scuola Superiore Sant'Anna, Italy
Werner Fischer, Thomas Krenn AG, Germany
William Gardner, University of Guelph, Canada
Wolfgang Gentzsch, DEISA, Max Planck Gesellschaft, Germany
Derek Groen, UVA, The Netherlands
Marcus Hardt,
Re: C++ pipes on full (nonpseudo) cluster
No responses yet, although I admit it's only been a few hours. As a follow-up, permit me to pose the following question: Is it, in fact, impossible to run C++ pipes on a fully-distributed system (as opposed to a pseudo-distributed system)? I haven't found any definitive clarification on this topic one way or the other. The only statement that I found in the least bit illuminating is in the O'Reilly book (not official Hadoop documentation mind you), p.38, which states: To run a Pipes job, we need to run Hadoop in pseudo-distributed mode...Pipes doesn't run in standalone (local) mode, since it relies on Hadoop's distributed cache mechanism, which works only when HDFS is running. The phrasing of those statements is a little unclear in that the distinction being made appears to be between standalone and pseudo-distributed mode, without any specific reference to fully-distributed mode. Namely, the section that qualifies the need for pseudo-distributed mode (the need for HDFS) would obviously also apply to full distributed mode despite the lack of mention of fully distributed mode in the quoted section. So can pipes run in fully distributed mode or not? Bottom line, I can't get C++ pipes to work on a fully distributed cluster yet and I don't know if I am wasting my time, if this is a truly impossible effort or if it can be done and I simply haven't figured out how to do it yet. Thanks for any help. Keith Wiley kwi...@keithwiley.com www.keithwiley.com The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain
Re: C++ pipes on full (nonpseudo) cluster
Hello. Did you try following the tutorial in http://wiki.apache.org/hadoop/C++WordCount ? We use C++ pipes in production on a large cluster, and it works. --gianluigi On Tue, 2010-03-30 at 13:28 -0700, Keith Wiley wrote: [...]
Re: C++ pipes on full (nonpseudo) cluster
Yep, tried and tried and tried it. Works perfectly on a pseudo-distributed cluster which is why I didn't think the example or the code was the problem, but rather that the cluster was the problem. I have only just (in the last two minutes) heard back from the administrator of our cluster and he says the pipes package is not installed on the cluster...so that kinda explains it, although I'm still unclear what the symptoms would be for various kinds of problems. In other words, I'm not sure if the errors I got were the result of the lack of a pipes package on the cluster or if I still wasn't doing it right. At any rate, it sounds like pipes is an additional extraneous add-on during cluster configuration and that our cluster didn't add it. Does that make sense to you?...that pipes needs to be enabled on the cluster, not merely run properly by the user? Thanks. Cheers! On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote: [...] Keith Wiley kwi...@keithwiley.com www.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland
Re: C++ pipes on full (nonpseudo) cluster
My cluster admin noticed that there is some additional pipes package he could add to the cluster configuration, but he admits to knowing very little about how the C++ pipes component of Hadoop works. Can you offer any insight into this cluster configuration package? What exactly does it do that makes a cluster capable of running pipes programs (and what symptom should its absence present from a user's point of view)? On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote: [...] Keith Wiley kwi...@keithwiley.com www.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
Re: C++ pipes on full (nonpseudo) cluster
What are the symptoms? Pipes should run out of the box in a standard installation. BTW what version of bash are you using? Is it bash 4.0 by any chance? See https://issues.apache.org/jira/browse/HADOOP-6388 --gianluigi On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote: [...]
Re: C++ pipes on full (nonpseudo) cluster
The closest I've gotten so far is for the job to basically try to start up but to get an error complaining about the permissions on the executable binary...which makes perfect sense since the permissions are not executable. Problem is, the HDFS chmod command ignores the executable bits. For example, hd fs -chmod 755 somefile yields -rw-r--r--. The x is simply dropped from the mode. This makes sense to me in light of documentation (O'Reilly p.47) that indicates HDFS doesn't support executable file permissions, but it leaves me perplexed how any file could ever be executable under HDFS or Hadoop in general. Using slightly different attempts at the pipes command I usually get errors that the executable is not found. This occurs when I point to a local file for the executable instead of one uploaded to HDFS. In other words, I haven't found any way to run pipes such that the executable starts out on the local machine and is automatically distributed to the cluster as a component of the pipes command. Rather, it seems that the executable must already reside in HDFS and be indicated during the pipes command (a la -program or hadoop.pipes.executable, of course). I have even tried adding the -files option to pipes, but so far to no positive effect. I'll send another post with some specific transcripts of what I'm seeing. One could ask, w.r.t. the -program flag for pipes, should that indicate a local path, an HDFS path, or are both options possible? As to bash, I'm running on a 10.6.2 Mac, thus:

$ bash --version
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin10.0)
Copyright (C) 2007 Free Software Foundation, Inc.

...so not v4.0 as you asked. On Mar 30, 2010, at 14:29 , Gianluigi Zanetti wrote: [...] Keith Wiley kwi...@keithwiley.com www.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
Re: C++ pipes on full (nonpseudo) cluster
$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output HDFSPATH/output -program HDFSPATH/EXECUTABLE
Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
10/03/30 14:56:55 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
10/03/30 14:56:55 INFO mapred.FileInputFormat: Total input paths to process : 1
10/03/30 14:57:05 INFO mapred.JobClient: Running job: job_201003241650_1076
10/03/30 14:57:06 INFO mapred.JobClient: map 0% reduce 0%
^C
$

At that point the terminal hung, so I eventually ctrl-Ced to break it. Now if I investigate the Hadoop task logs for the mapper, I see this:

stderr logs
bash: /data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic: cannot execute binary file

...which makes perfect sense in light of the following:

$ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
Found 1 items
-rw-r--r-- 1 kwiley uwphysics 211808 2010-03-30 10:26 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
$ hd fs -chmod 755 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
$ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
Found 1 items
-rw-r--r-- 1 kwiley uwphysics 211808 2010-03-30 10:26 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
$

Note that this is all in an attempt to run an executable that was uploaded to HDFS in advance. In this example I am not attempting to run an executable stored on my local machine. Any attempt to do that results in a file-not-found error:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output HDFSPATH/output -program LOCALPATH/EXECUTABLE
Deleted hdfs://mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/output
Exception in thread main java.io.FileNotFoundException: File does not exist: /Users/kwiley/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
$

It's clearly looking for the executable in HDFS, not on the local system, thus the file-not-found error. Keith Wiley kwi...@keithwiley.com www.keithwiley.com What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered. -- Keith Wiley
Hadoop DFS IO Performance measurement
Hi All, I am trying to measure DFS I/O performance. I used TestDFSIO from the Hadoop jars. The results were about 100 Mbps read and write. I think it should be more than this. Please share some stats to compare; either I am missing something, like config params, or it is something else. -Sagar
Re: java.io.IOException: Function not implemented
Hi all, Thanks for the help, Todd and Steve. I configured Hadoop (0.20.2) again and I'm getting the same error (Function not implemented). Do you think it's a Hadoop bug? This is the situation: I've 28 nodes where just four are running the datanode. On all the other nodes the tasktracker is running ok. The NN and JT are running ok. The configuration of the machines is the same; it's an NFS-shared home. On all machines the Java version is 1.6.0_17. This is the kernel version of the nodes; note that there are two kernel versions on which the datanode doesn't work, and it works just on the h0* machines.

ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh uname -a | sort
a01: Linux a01 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a02: Linux a02 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a03: Linux a03 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a04: Linux a04 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a05: Linux a05 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a06: Linux a06 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a07: Linux a07 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a09: Linux a09 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a10: Linux a10 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
ag06: Linux ag06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
ag07: Linux ag07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
bl02: Linux bl02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
bl03: Linux bl03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
bl04: Linux bl04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
bl06: Linux bl06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
bl07: Linux bl07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
ct02: Linux ct02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
ct03: Linux ct03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
ct04: Linux ct04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
ct06: Linux ct06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
h01: Linux h01 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h02: Linux h02 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h03: Linux h03 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h04: Linux h04 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
sd02: Linux sd02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
sd05: Linux sd05 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
sd06: Linux sd06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux
sd07: Linux sd07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64 GNU/Linux

These are the Java processes running on each client; just the h0* machines are running ok.

ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh pgrep -lc java | sort
a01: 1
a02: 1
a03: 1
a04: 1
a05: 1
a06: 1
a07: 1
a09: 1
a10: 1
ag06: 1
ag07: 1
bl02: 1
bl03: 1
bl04: 1
bl06: 1
bl07: 1
ct02: 1
ct03: 1
ct04: 1
ct06: 1
h01: 2
h02: 2
h03: 2
h04: 2
sd02: 1
sd05: 1
sd06: 1
sd07: 1

This is my configuration:

ram...@lcpad:~/hadoop-0.20.2$ cat conf/*site*
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://lcpad:9000</value>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>lcpad:9001</value>
  </property>
</configuration>

Thanks in Advance, Edson Ramiro On 30 March 2010 05:58, Steve Loughran ste...@apache.org wrote: Edson Ramiro wrote: I'm not involved with Debian community :( I think you are now...
Re: java.io.IOException: Function not implemented
Hi Edson, I noticed that only the h01 nodes are running 2.6.32.9; the other broken DNs are 2.6.32.10. Is there some reason you are running a kernel that is literally 2 weeks old? I wouldn't be at all surprised if there were a bug here, or some issue with your Debian unstable distribution... -Todd On Tue, Mar 30, 2010 at 3:54 PM, Edson Ramiro erlfi...@gmail.com wrote: [...] -- Todd Lipcon Software Engineer, Cloudera
Re: java.io.IOException: Function not implemented
Maybe it's a bug. I'm not the admin :( so I'll talk to him, and maybe he'll install 2.6.32.9 on another node to test :) Thanks Edson Ramiro On 30 March 2010 20:00, Todd Lipcon t...@cloudera.com wrote: [...]
Re: Hadoop DFS IO Performance measurement
Hi Sagar, What hardware did you run it on? Edson Ramiro On 30 March 2010 19:41, sagar naik sn...@attributor.com wrote: [...]
question on shuffle and sort
Hi, Will all key-value pairs of the map output that have the same key be sent to the same reducer task node?
Re: question on shuffle and sort
On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote: Will all key-value pairs of the map output that have the same key be sent to the same reducer task node? Yes, this is at the core of the MapReduce model. There is one call to the user reduce function per unique map output key. This grouping is achieved by sorting, which means you see keys in increasing order. Ed
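Which reduce task a key goes to is decided by the job's partitioner. Hadoop's default HashPartitioner does essentially the following (paraphrased against the old mapred API), so equal keys always land in the same partition and therefore on the same reducer:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
    public void configure(JobConf job) {}

    // Equal keys have equal hash codes, so they always map to the
    // same partition number, i.e. the same reduce task.
    public int getPartition(K2 key, V2 value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}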
Re: question on shuffle and sort
Yes, indeed. On Wed, 2010-03-31 at 09:56 +0800, Cui tony wrote: Hi, Will all key-value pairs of the map output that have the same key be sent to the same reducer task node?
Re: question on shuffle and sort
Something to keep in mind though: sorting follows the key type. Text will be sorted lexicographically. Nick Jones - Original Message - From: Ed Mazur ma...@cs.umass.edu To: common-user@hadoop.apache.org Sent: Tue Mar 30 21:07:29 2010 Subject: Re: question on shuffle and sort [...]
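A concrete illustration with hypothetical keys: if the numbers 9 and 10 are emitted as Text, the reducer sees "10" before "9", because lexicographic comparison looks at the first character; emitting them as IntWritable gives the numeric order 9, 10.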
Re: question on shuffle and sort
Consider this extreme situation: the input data is very large, and so is the map output. If 90% of the map output records have the same key, then all of them will be sent to one reducer task node. So 90% of the work of the reduce phase has to be done on a single node, not across the cluster. That is very ineffective and less scalable. 2010/3/31 Jones, Nick nick.jo...@amd.com [...]
Re: question on shuffle and sort
I ran into an issue where lots of data was passing from the mappers to a single reducer. Enabling a combiner saved quite a bit of processing time by reducing mapper disk writes and data movement to the reducer. Nick Jones - Original Message - From: Cui tony tony.cui1...@gmail.com To: common-user@hadoop.apache.org Sent: Tue Mar 30 21:24:18 2010 Subject: Re: question on shuffle and sort [...]
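A sketch of how a combiner is wired in with the 0.20-era mapred API; MyJob, MyMapper, and MyReducer are hypothetical classes standing in for a job's own, and a reducer can double as the combiner only when its reduce function is associative and commutative (sums, counts, and the like):

import org.apache.hadoop.mapred.JobConf;

public class CombinerSetup {
    static JobConf configure() {
        JobConf conf = new JobConf(MyJob.class);
        conf.setMapperClass(MyMapper.class);
        // The combiner runs on map-side output before the shuffle, so
        // records sharing a key collapse locally instead of all
        // travelling to a single reducer.
        conf.setCombinerClass(MyReducer.class);
        conf.setReducerClass(MyReducer.class);
        return conf;
    }
}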
Re: question on shuffle and sort
Hi Jones, As you have met the situation I was worried about, I have my answer now. Maybe redesigning the map function or adding a combiner is the only way to deal with this kind of input data. 2010/3/31 Jones, Nick nick.jo...@amd.com [...]
is there any way we can limit Hadoop Datanode's disk usage?
hi, guys, we have some machines with a 1 TB disk and some with a 100 GB disk. I have this question: is there any means by which we can limit the disk usage of the datanodes on the machines with smaller disks? thanks!
Re: is there any way we can limit Hadoop Data node's disk usage?
Hello Steven, You can use the dfs.datanode.du.reserved configuration value in $HADOOP_HOME/conf/hdfs-site.xml to limit disk usage:

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- cluster variant -->
  <value>182400</value>
  <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.</description>
</property>

Ravi Hadoop @ Yahoo! On 3/30/10 8:12 PM, steven zhuang steven.zhuang.1...@gmail.com wrote: [...]
log
Hi all, I find there is a directory _log/history/... under the output directory of a MapReduce job. Are the files in that directory log files? Is the information there sufficient to allow me to figure out which nodes the job ran on? Besides, not every job has such a directory; are there settings controlling this? Or is there another way to get the nodes my job runs on? Thanks, -Gang