Profiling with Hadoop 0.17.2.1
I'm trying to profile my map/reduce processes under Hadoop 0.17.2. From looking at hadoop-default.xml, the property "mapred.task.profile.params" did not yet exist back then, so I'm trying to add the following to the property "mapred.child.java.opts": -Xmx512m -verbose:gc -Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/@tas...@.txt -Xloggc:/tmp/@tas...@.gc The resulting JVMs don't have the hprof parameters when I look at them via ps, the files are never created, and there is no mention of dumping stats in the logs. Am I missing something? I'd switch to 0.19.1, but I haven't had time to set up a migration plan for my data yet. Jimmy Wan
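For reference, the override being attempted would sit in hadoop-site.xml roughly as sketched below (the @tas...@ placeholders are exactly as quoted above; paths and heap size are the poster's values, not recommendations):

```xml
<!-- Sketch: overriding the child-JVM options in hadoop-site.xml.
     The property name comes from hadoop-default.xml of that era. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -verbose:gc -Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/@tas...@.txt -Xloggc:/tmp/@tas...@.gc</value>
</property>
```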
Re: Batch processing map reduce jobs
Check out Cascading; it worked great for me. http://www.cascading.org/ Jimmy Wan On Thu, Mar 5, 2009 at 17:53, Richa Khandelwal wrote: > Hi All, > Does anyone know how to run map reduce jobs using pipes or batch process map > reduce jobs?
Re: Running 0.19.2 branch in production before release
Has anyone set up this dual-cluster approach and come up with a complete scheme for being able to test a new hadoop revision and migrate data from one datanode revision to the next? It would be useful if someone could share the list of gotchas they have experienced. In the past, pretty much every attempt to upgrade Hadoop DataNodes via the documented (but outdated?) approaches mentioned in the FAQs and wikis has not worked well for me. That's why I'm still on 0.17.2.1. Jimmy Wan On Fri, Mar 6, 2009 at 00:17, Aaron Kimball wrote: > Right, there's no sense in freezing your Hadoop version forever :) > > But if you're an ops team tasked with keeping a production cluster running > 24/7, running on 0.19 (or even more daringly, TRUNK) is not something that I > would consider a Best Practice. Ideally you'll be able to carve out some > spare capacity (maybe 3--5 nodes) to use as a staging cluster that runs on > 0.19 or TRUNK that you can use to evaluate the next version. Then when you > are convinced that it's stable, and your staging cluster passes your > internal tests (e.g., running test versions of your critical nightly jobs > successfully), you can move that to production. > > - Aaron > > > On Thu, Mar 5, 2009 at 2:33 AM, Steve Loughran wrote: > >> Aaron Kimball wrote: >> >>> I recommend 0.18.3 for production use and avoid the 19 branch entirely. If >>> your priority is stability, then stay a full minor version behind, not >>> just >>> a revision. >>> >> >> Of course, if everyone stays that far behind, they don't get to find the >> bugs for other people. >> >> * If you play with the latest releases early, while they are in the beta >> phase -you will encounter the problems specific to your >> applications/datacentres, and get them fixed fast. >> >> * If you work with stuff further back you get stability, but not only are >> you behind on features, you can't be sure that all "fixes" that matter to >> you get pushed back. 
>> >> * If you plan on making changes, or adding features, get onto SVN_HEAD >> >> * If you want to catch changes being made that break your site, SVN_HEAD. >> Better yet, have a private Hudson server checking out SVN_HEAD hadoop *then* >> building and testing your app against it. >> >> Normally I work with stable releases of things I don't depend on, and >> SVN_HEAD of OSS stuff whose code I have any intent to change; there is a >> price -merge time, the odd change breaking your code- but you get to make >> changes that help you long term. >> >> Where Hadoop is different is that it is a filesystem, and you don't want to >> hit bugs that delete files that matter. I'm only bringing up transient >> clusters on VMs, pulling in data from elsewhere, so this isn't an issue. All >> that remains is changing APIs. >> >> -Steve >> >
Re: Batching key/value pairs to map
I've got a program that starts by hooking into a legacy system to look up its maps out of a db, and the keys are sparse. I'm sure there might be another way to do this, but this was by far the easiest/simplest solution. Jimmy Wan On Mon, Feb 23, 2009 at 19:39, Edward Capriolo wrote: > We have a MR program that collects once for each token on a line. What > types of applications can benefit from batch mapping?
Re: Batching key/value pairs to map
Great, thanks Owen. I actually ran into the object reuse problem a long time ago. The output of my MR processes gets turned into a series of large INSERT statements that weren't performing well unless I batched them in inserts of several thousand entries. I'm not sure if this is possible, but it would certainly be nice to either: 1) pass the OutputCollector and Reporter to the close() method, or 2) provide accessors to the OutputCollector and the Reporter. Now every single one of my maps is going to have a pair of 1-2 extra no-ops. I'll check to see if that's on the list of outstanding FRs. On Mon, Feb 23, 2009 at 15:30, Owen O'Malley wrote: > On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan wrote: > >> part of my map/reduce process could be greatly sped up by mapping >> key/value pairs in batches instead of mapping them one by one. >> Can I safely hang onto my OutputCollector and Reporter from calls to map? > > Yes. You can even use them in the close, so that you can process the last > batch of records. *smile* One problem that you will quickly hit is that > Hadoop reuses the objects that are passed to map and reduce. So, you'll need > to clone them before putting them into the collection. > > I'm currently running Hadoop 0.17.2.1. Is this something I could do in >> Hadoop 0.19.X? > > I don't think any of this changed between 0.17 and 0.19, other than in 0.17 > the reduce's inputs were always new objects. In 0.18 and after, the reduce's > inputs are reused.
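For anyone else who hits the object-reuse issue mentioned above: the clone can be done with a serialization round-trip. A stripped-down illustration in plain Java (IntRecord here is a made-up stand-in for a real Writable, so this compiles without the Hadoop jars):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hadoop reuses the instance it passes to map(), so storing the
// reference stores whatever the *last* call wrote into it.
// A deep copy before collecting avoids that.
class IntRecord {
    int value;

    void write(DataOutput out) throws IOException { out.writeInt(value); }

    void readFields(DataInput in) throws IOException { value = in.readInt(); }

    // Deep copy via a write/readFields round-trip -- essentially the
    // trick Hadoop's WritableUtils.clone performs for real Writables.
    IntRecord copy() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf));
            IntRecord fresh = new IntRecord();
            fresh.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return fresh;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```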
Batching key/value pairs to map
part of my map/reduce process could be greatly sped up by mapping key/value pairs in batches instead of mapping them one by one. I'd like to do the following:

    protected abstract void batchMap(OutputCollector<K2, V2> output, Reporter reporter) throws IOException;

    public void map(K1 key1, V1 value1, OutputCollector<K2, V2> output, Reporter reporter) throws IOException {
        keys.add(key1.copy());
        values.add(value1.copy());
        if (++currentSize == batchSize) {
            batchMap(output, reporter);
            clear();
        }
    }

    public void close() throws IOException {
        if (currentSize > 0) {
            // I don't have access to my OutputCollector or Reporter here!
            batchMap(output, reporter);
            clear();
        }
    }

Can I safely hang onto my OutputCollector and Reporter from calls to map? I'm currently running Hadoop 0.17.2.1. Is this something I could do in Hadoop 0.19.X?
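For the archives, the batching pattern itself can be sketched without the Hadoop API (Collector below is a hypothetical stand-in for Hadoop's OutputCollector, and the class names are made up): hold a reference to the collector on each map() call and flush the remainder in close().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's OutputCollector, so the
// batching pattern can be shown without the Hadoop jars.
interface Collector<K, V> {
    void collect(K key, V value);
}

// Accumulates pairs and hands them to batchMap() in groups of
// batchSize; close() flushes whatever is left over.
abstract class BatchingMapper<K, V> {
    private final int batchSize;
    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();
    private Collector<K, V> savedOutput; // saved from map() for use in close()

    BatchingMapper(int batchSize) { this.batchSize = batchSize; }

    protected abstract void batchMap(List<K> keys, List<V> values, Collector<K, V> output);

    public void map(K key, V value, Collector<K, V> output) {
        savedOutput = output; // hang onto the collector for close()
        keys.add(key);        // NOTE: with real Writables these would need to be
        values.add(value);    // cloned first, since Hadoop reuses the objects
        if (keys.size() == batchSize) {
            batchMap(keys, values, output);
            keys.clear();
            values.clear();
        }
    }

    public void close() {
        if (!keys.isEmpty() && savedOutput != null) {
            batchMap(keys, values, savedOutput); // process the final partial batch
            keys.clear();
            values.clear();
        }
    }
}
```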
Re: Recommendations on Job Status and Dependency Management
Figured I should respond to my own question and list the solution for the archives: Since I already had a bunch of existing MapReduce jobs created, I was able to quickly migrate my code to Cascading to take care of all the inter-hadoop job dependencies. By making use of the MapReduceFlow and dumping those flows into a Cascade with a CascadeConnector, I was able to throw out several hundred lines of hand-created Thread and dependency management code in favor of an automated solution that actually worked a wee bit better in terms of concurrency. I saw an immediate increase in the utilization of my cluster. I covered how I worked out the initial HDFS dependencies in the other reply to this message. As for determining whether the trigger conditions are met (reliance on outside processes for which there is no easy way to read a signal), I'm currently polling a database for that data, and I'm working with Chris to add a hook into Cascade to allow pluggable predicates to specify that condition. So yeah, I'm sold on Cascading. =) Relevant links: http://www.cascading.org Relevant API docs: http://www.cascading.org/javadoc/cascading/flow/MapReduceFlow.html http://www.cascading.org/javadoc/cascading/cascade/CascadeConnector.html http://www.cascading.org/javadoc/cascading/cascade/Cascade.html On Tue, 11 Nov 2008, Jimmy Wan wrote: >I'd like to take my prototype batch processing of hadoop jobs and implement >some type of "real" dependency management and scheduling in order to better >utilize my cluster as well as spread out more work over time. I was thinking >of adopting one of the existing packages (Cascading, Zookeeper, existing >JobControl?) and I was hoping to find some better advice from the mailing >list. I tried to find a more direct comparison of Cascading and Zookeeper but >I couldn't find one. 
> >This is a grossly simplified description my current completely naive >approach: > >1) for each day in a month, spawn N threads that each contain a dependent >series of map/reduce jobs. > >2) for each day in a month, spawn N threads that each contain a dependent >series of map/reduce jobs that are dependent on the output of step #1. These >are currently separated from the tasks in step #1 mainly because it's easier >to group them up this way in the event of a failure, but I expect this >separation to go away. > >3) At the end of the month, serially run a series of jobs outside of >Map/Reduce that basically consist of a single SQL query (I could easily >convert these to be very simple map/reduce jobs, and probably will, if it >makes my job processing easier). > >The main problems I have are the following: >1) right now I have a hard time determining which processes need to be run >in the event of a failure. > >Every job has an expected input/output in HDFS so if I have to rerun >something I usually just use something like "hadoop dfs -rmr " in a >shell script then hand edit the jobs that need to be rerun. > >Is there an example somewhere of code that can read HDFS in order to >determine if files exist? I poked around a bit and couldn't find one. >Ideally, my code would be able to read the HDFS config info right out of the >standard config files so I wouldn't need to create additional configuration >information. > >The job dependencies while enumerated well are not isolated all that well. >Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just >that one process and any dependent processes, but not have to rerun >everything again. > >2) I typically run everything 1 month at a time, but I want to keep the >option of doing rollups by day. On the 2nd of the month, I'd like to be able >to run anything that requires data from the 1st of the month. 
On the 1st of >the month, I'd like to run anything that requires a full month of data from >the previous month. > >I'd also like my process to be able to account for system failures on >previous days. i.e. On any given day I'd like to be able to run everything >for which data is available. > >3) Certain types of jobs have external dependencies (ex. MySQL) and I don't >want to run too many of those types of jobs at the same time since it affects >my MySQL performance. I'd like some way of describing some type of >lock on external resources that can be shared across jobs. > >Any recommendations on how to best model these things? > >I'm thinking that something like Cascading or Zookeeper could help me here. >My initial take was that Zookeeper was more heavyweight than Cascading, >requiring additional processes to be running at all times. However, it seems >like Zookeeper would be better suited to describing mutual exclusions on
re: Recommendations on Job Status and Dependency Management
I was able to answer one of my own questions: "Is there an example somewhere of code that can read HDFS in order to determine if files exist? I poked around a bit and couldn't find one. Ideally, my code would be able to read the HDFS config info right out of the standard config files so I wouldn't need to create additional configuration information." The following code was all that I needed:

    Configuration configuration = new Configuration();
    FileSystem fileSystem = FileSystem.get(configuration);
    Path path = new Path(filename);
    boolean fileExists = fileSystem.exists(path);

At first, the code didn't work as I expected because my working shell scripts that made use of "hadoop/bin/hadoop jar my.jar" did not explicitly include HADOOP_CONF_DIR in my classpath. Once I did that, everything worked just fine. On Tue, 11 Nov 2008, Jimmy Wan wrote: >I'd like to take my prototype batch processing of hadoop jobs and implement >some type of "real" dependency management and scheduling in order to better >utilize my cluster as well as spread out more work over time. I was thinking >of adopting one of the existing packages (Cascading, Zookeeper, existing >JobControl?) and I was hoping to find some better advice from the mailing >list. I tried to find a more direct comparison of Cascading and Zookeeper but >I couldn't find one. > >This is a grossly simplified description my current completely naive >approach: > >1) for each day in a month, spawn N threads that each contain a dependent >series of map/reduce jobs. > >2) for each day in a month, spawn N threads that each contain a dependent >series of map/reduce jobs that are dependent on the output of step #1. These >are currently separated from the tasks in step #1 mainly because it's easier >to group them up this way in the event of a failure, but I expect this >separation to go away. 
> >3) At the end of the month, serially run a series of jobs outside of >Map/Reduce that basically consist of a single SQL query (I could easily >convert these to be very simple map/reduce jobs, and probably will, if it >makes my job processing easier). > >The main problems I have are the following: >1) right now I have a hard time determining which processes need to be run >in the event of a failure. > >Every job has an expected input/output in HDFS so if I have to rerun >something I usually just use something like "hadoop dfs -rmr " in a >shell script then hand edit the jobs that need to be rerun. > >Is there an example somewhere of code that can read HDFS in order to >determine if files exist? I poked around a bit and couldn't find one. >Ideally, my code would be able to read the HDFS config info right out of the >standard config files so I wouldn't need to create additional configuration >information. > >The job dependencies while enumerated well are not isolated all that well. >Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just >that one process and any dependent processes, but not have to rerun >everything again. > >2) I typically run everything 1 month at a time, but I want to keep the >option of doing rollups by day. On the 2nd of the month, I'd like to be able >to run anything that requires data from the 1st of the month. On the 1st of >the month, I'd like to run anything that requires a full month of data from >the previous month. > >I'd also like my process to be able to account for system failures on >previous days. i.e. On any given day I'd like to be able to run everything >for which data is available. > >3) Certain types of jobs have external dependencies (ex. MySQL) and I don't >want to run too many of those types of jobs at the same time since it affects >my MySQL performance. I'd like some way of describing some type of >lock on external resources that can be shared across jobs. 
> >Any recommendations on how to best model these things? > >I'm thinking that something like Cascading or Zookeeper could help me here. >My initial take was that Zookeeper was more heavyweight than Cascading, >requiring additional processes to be running at all times. However, it seems >like Zookeeper would be better suited to describing mutual exclusions on >usage of external resources. Can Cascading even do this? > >I'd also appreciate any recommendations on how best to tune the hadoop >processes. My hadoop 0.16.4 cluster is currently relatively small (<10 nodes) >so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker >might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point >in the near future. > --
Recommendations on Job Status and Dependency Management
I'd like to take my prototype batch processing of hadoop jobs and implement some type of "real" dependency management and scheduling in order to better utilize my cluster as well as spread out more work over time. I was thinking of adopting one of the existing packages (Cascading, Zookeeper, existing JobControl?) and I was hoping to find some better advice from the mailing list. I tried to find a more direct comparison of Cascading and Zookeeper but I couldn't find one. This is a grossly simplified description of my current, completely naive approach: 1) for each day in a month, spawn N threads that each contain a dependent series of map/reduce jobs. 2) for each day in a month, spawn N threads that each contain a dependent series of map/reduce jobs that are dependent on the output of step #1. These are currently separated from the tasks in step #1 mainly because it's easier to group them up this way in the event of a failure, but I expect this separation to go away. 3) At the end of the month, serially run a series of jobs outside of Map/Reduce that basically consist of a single SQL query (I could easily convert these to be very simple map/reduce jobs, and probably will, if it makes my job processing easier). The main problems I have are the following: 1) right now I have a hard time determining which processes need to be run in the event of a failure. Every job has an expected input/output in HDFS, so if I have to rerun something I usually just use something like "hadoop dfs -rmr " in a shell script, then hand edit the jobs that need to be rerun. Is there an example somewhere of code that can read HDFS in order to determine if files exist? I poked around a bit and couldn't find one. Ideally, my code would be able to read the HDFS config info right out of the standard config files so I wouldn't need to create additional configuration information. The job dependencies, while enumerated well, are not isolated all that well. 
Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just that one process and any dependent processes, but not have to rerun everything again. 2) I typically run everything 1 month at a time, but I want to keep the option of doing rollups by day. On the 2nd of the month, I'd like to be able to run anything that requires data from the 1st of the month. On the 1st of the month, I'd like to run anything that requires a full month of data from the previous month. I'd also like my process to be able to account for system failures on previous days. i.e. On any given day I'd like to be able to run everything for which data is available. 3) Certain types of jobs have external dependencies (ex. MySQL) and I don't want to run too many of those types of jobs at the same time since it affects my MySQL performance. I'd like some way of describing some type of lock on external resources that can be shared across jobs. Any recommendations on how to best model these things? I'm thinking that something like Cascading or Zookeeper could help me here. My initial take was that Zookeeper was more heavyweight than Cascading, requiring additional processes to be running at all times. However, it seems like Zookeeper would be better suited to describing mutual exclusions on usage of external resources. Can Cascading even do this? I'd also appreciate any recommendations on how best to tune the hadoop processes. My hadoop 0.16.4 cluster is currently relatively small (<10 nodes) so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point in the near future.
Re: Changing my NameNode
Anyone out there with a suggestion? Is my best option to export data out of the cluster, reformat a new namenode, and reimport it all? On Tue, 10 Jun 2008, Jimmy Wan wrote: >I've got an HDFS cluster with 2 boxes in it. > >Server A is serving as the NameNode and also as a DataNode. >Server B is a DataNode. > >After successfully decommissioning the HDFS storage of Server A using a >dfs.hosts.exclude file and dfadmin -refreshNodes: > >Server A is serving as the NameNode. >Server B is serving as a DataNode. > >How do I change my NameNode to be Server B? > >Can I simply change the hadoop-site.xml, masters, and slaves files for all >machines in my cluster? I could have sworn that I tried that and it failed. > >P.S. This link is wrong wrt how to decommission a Dattanode. >http://hadoop.apache.org/core/docs/r0.16.4/hdfs_design.html#DFSAdmin
Changing my NameNode
I've got an HDFS cluster with 2 boxes in it. Server A is serving as the NameNode and also as a DataNode. Server B is a DataNode. After successfully decommissioning the HDFS storage of Server A using a dfs.hosts.exclude file and dfsadmin -refreshNodes: Server A is serving as the NameNode. Server B is serving as a DataNode. How do I change my NameNode to be Server B? Can I simply change the hadoop-site.xml, masters, and slaves files for all machines in my cluster? I could have sworn that I tried that and it failed. P.S. This link is wrong wrt how to decommission a DataNode. http://hadoop.apache.org/core/docs/r0.16.4/hdfs_design.html#DFSAdmin
Re: Limiting Total # of TaskTracker threads
On Tue, 18 Mar 2008 19:53:04 -0500, Ted Dunning <[EMAIL PROTECTED]> wrote:
> I think the original request was to limit the sum of maps and reduces rather than limiting the two parameters independently.

Ted, yes, this is exactly what I'm looking for. I just found an issue that seems to state that the old deprecated property is still there, but it is not documented: https://issues.apache.org/jira/browse/HADOOP-2300 I tried using the max tasks setting in combination with setting the new values, but that didn't seem to work. =( My machine labelled "LIMITED MACHINE" had 2 maps and 1 reduce running at the same time. The scenario I have is that I want to run multiple concurrent jobs through my cluster and have the CPU usage for that node be bounded. Should I file a new issue? This was all with Hadoop 0.16.0.

LIMITED MACHINE:
  mapred.tasktracker.tasks.maximum = 2 (the maximum number of total tasks that will be run simultaneously by a task tracker)
  mapred.tasktracker.map.tasks.maximum = 1 (the maximum number of map tasks that will be run simultaneously by a task tracker)
  mapred.tasktracker.reduce.tasks.maximum = 1 (the maximum number of reduce tasks that will be run simultaneously by a task tracker)

OTHER CLUSTER MACHINES:
  mapred.tasktracker.tasks.maximum = 8
  mapred.tasktracker.map.tasks.maximum = 4
  mapred.tasktracker.reduce.tasks.maximum = 4

On 3/18/08 5:26 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:
> The map/reduce tasks are not threads, they are run in separate JVMs which are forked by the tasktracker.

Arun, yes, I did mean tasks, not threads. -- Jimmy
Limiting Total # of TaskTracker threads
The properties mentioned here: http://wiki.apache.org/hadoop/FAQ#13 have been deprecated in favor of two separate properties: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum I'd like to limit the total # of threads on a task tracker (think limited resources on a given compute node) to a given number, and there does not appear to be a way to do that anymore. Am I correct in my understanding that there is no capability to do this? -- Jimmy
Configuring Nodes on Windows in Distributed Environment
Has anyone succeeded in doing this? I've successfully got a cluster with a few linux nodes running, but I'd like to add my desktop machine (JIMMY) to the mix for some spare compute cycles. I can happily run standalone apps with the loopback config, but I can't quite get my machine to play nicely in a distributed conf. The log output is nonsensical to me. It appears to be adding CRs where they don't belong. Any ideas? Configuration details and output from start-all.sh: I've got the conf/logs/tmp/hadoop directories created in C:/home/hadoop. I've got symlinks from /home/hadoop/* to C:/home/hadoop (I wasn't sure how the unix-style paths play nice with windows paths, so I was trying to cover all the bases). I created C:/tmp just for kicks (it's empty). My .bashrc contains: export JAVA_HOME="/cygdrive/c/Dev/jdk1.6.0_04" export HADOOP_IDENT_STRING=`hostname` export HADOOP_INSTALL=/cygdrive/c/home/hadoop export HADOOP_HOME=/cygdrive/c/home/hadoop/hadoop export HADOOP_LOG_DIR=/cygdrive/c/home/hadoop/logs export HADOOP_CONF_DIR=/cygdrive/c/home/hadoop/conf export HADOOP_HEAPSIZE=1000 [EMAIL PROTECTED] ~]$ hadoop/bin/start-all.sh starting namenode, logging to /home/hadoop/logs/hadoop-mainserver-namenode-mainserver.out mainserver: starting datanode, logging to /home/hadoop/logs/hadoop-mainserver-datanode-mainserver.out slavenode1: starting datanode, logging to /home/hadoop/logs/hadoop-slavenode1-datanode-slavenode1.out slavenode2: starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-slavenode2.out -datanode-JIMMY.outanode, logging to /home/hadoop/logs/hadoop-JIMMY -datanode.pid: No such file or directoryemon.sh: line 117: /tmp/hadoop-JIMMY -datanode-JIMMY.out: No such file or directoryh: line 116: /home/hadoop/logs/hadoop-JIMMY jimmy: head: cannot open `/home/hadoop/logs/hadoop-JIMMY\r-datanode-JIMMY.out' for reading: No such file or directory mainserver: starting secondarynamenode, logging to 
/home/hadoop/logs/hadoop-mainserver-secondarynamenode-mainserver.out starting jobtracker, logging to /home/hadoop/logs/hadoop-mainserver-jobtracker-mainserver.out slavenode2: starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-slavenode2.out slavenode1: starting tasktracker, logging to /home/hadoop/logs/hadoop-slavenode1-tasktracker-slavenode1.out mainserver: starting tasktracker, logging to /home/hadoop/logs/hadoop-mainserver-tasktracker-mainserver.out -tasktracker-JIMMY.outacker, logging to /home/hadoop/logs/hadoop-JIMMY -tasktracker.pid: No such file or directoryn.sh: line 117: /tmp/hadoop-JIMMY -tasktracker-JIMMY.out: No such file or directoryline 116: /home/hadoop/logs/hadoop-JIMMY jimmy: head: cannot open `/home/hadoop/logs/hadoop-JIMMY\r-tasktracker-JIMMY.out' for reading: No such file or directory -- Jimmy
Splitting compressed input from a single job to multiple map tasks
Is it possible to split compressed input from a single job to multiple map tasks? My current configuration has several task trackers, but the job I kick off results in a single map task. I'm launching these jobs in sequence via a shell script, so they end up going through a pipeline of 1 concurrent map, which is kinda suboptimal. When I run this task via a full local hadoop stack, it does seem to split the file into multiple small task chunks. -- Jimmy
Re: Does Hadoop Honor Reserved Space?
Unfortunately, I had to clean up my HDFS in order to get some work done, but I was running Hadoop 0.16.0 on a Linux box. My configuration is two machines. One has the JobTracker/NameNode and a TaskTracker instance all running on the same machine. The other machine is just running a TaskTracker. Replication was set to 2 for the default and the max. -- Jimmy On Thu, 06 Mar 2008 16:01:16 -0600, Hairong Kuang <[EMAIL PROTECTED]> wrote: In addition to the version, could you please send us a copy of the datanode report by running the command bin/hadoop dfsadmin -report? Thanks, Hairong On 3/6/08 11:56 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote: but intermediate data is stored in a different directory from dfs/data (something like mapred/local by default i think). what version are u running? -Original Message- From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED] Sent: Thu 3/6/2008 10:14 AM To: core-user@hadoop.apache.org Subject: RE: Does Hadoop Honor Reserved Space? I've run into a similar issue in the past. From what I understand, this parameter only controls the HDFS space usage. However, the intermediate data in the map reduce job is stored on the local file system (not HDFS) and is not subject to this configuration. In the past I have used mapred.local.dir.minspacekill and mapred.local.dir.minspacestart to control the amount of space that is allowable for use by this temporary data. Not sure if that is the best approach though, so I'd love to hear what other people have done. In your case, you have a map-red job that will consume too much space (without setting a limit, you didn't have enough disk capacity for the job), so looking at mapred.output.compress and mapred.compress.map.output might be useful to decrease the job's disk requirements. --Ash -Original Message- From: Jimmy Wan [mailto:[EMAIL PROTECTED] Sent: Thursday, March 06, 2008 9:56 AM To: core-user@hadoop.apache.org Subject: Does Hadoop Honor Reserved Space? 
I've got 2 datanodes set up with the following configuration parameter: dfs.datanode.du.reserved = 429496729600 (Reserved space in bytes per volume. Always leave this much space free for non dfs use.) Both are housed on 800GB volumes, so I thought this would keep about half the volume free for non-HDFS usage. After some long running jobs last night, both disk volumes were completely filled. The bulk of the data was in: ${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data This is running as the user hadoop. Am I interpreting these parameters incorrectly? I noticed this issue, but it is marked as closed: http://issues.apache.org/jira/browse/HADOOP-2549
Equivalent of cmdline head or tail?
I've got some jobs where I'd like to just pull out the top N or bottom N values. It seems like I can't do this from the map or combine phases (due to not having enough data), but I could aggregate this data during the reduce phase. The problem I have is that I won't know when to actually write them out until I've gone through the entire set, at which point reduce isn't called anymore. It's easy enough to post-process with some combination of sort, head, and tail, but I was wondering if I was missing something obvious. -- Jimmy
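For the archives: one common workaround is to keep a bounded sorted structure during the reduce phase and emit it in close(), once all input has been seen. The core bookkeeping, sketched in plain Java (the class and method names are illustrative, not a Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Bounded "top N" accumulator: a reducer could feed this from
// reduce() and emit the survivors in close(). (Assumes distinct
// values, for brevity -- a TreeMap keyed on value overwrites ties.)
class TopN {
    private final int n;
    private final TreeMap<Long, String> best = new TreeMap<>();

    TopN(int n) { this.n = n; }

    void offer(String key, long value) {
        best.put(value, key);
        if (best.size() > n) {
            best.pollFirstEntry(); // evict the current smallest
        }
    }

    // Surviving keys, largest value first.
    List<String> result() {
        return new ArrayList<>(best.descendingMap().values());
    }
}
```

The memory cost stays at N entries per reducer regardless of input size, which is what makes this viable inside a single reduce task.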
Does Hadoop Honor Reserved Space?
I've got 2 datanodes setup with the following configuration parameter: dfs.datanode.du.reserved 429496729600 Reserved space in bytes per volume. Always leave this much space free for non dfs use. Both are housed on 800GB volumes, so I thought this would keep about half the volume free for non-HDFS usage. After some long running jobs last night, both disk volumes were completely filled. The bulk of the data was in: ${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data This is running as the user hadoop. Am I interpretting these parameters incorrectly? I noticed this issue, but it is marked as closed: http://issues.apache.org/jira/browse/HADOOP-2549 -- Jimmy
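A quick unit check on that value (plain arithmetic, nothing Hadoop-specific): 429496729600 bytes is exactly 400 GiB, which is slightly more than half of a decimal 800 GB volume, so the value itself matches the intent of keeping roughly half the disk free.

```java
// Sanity-checking dfs.datanode.du.reserved = 429496729600.
class ReservedSpaceMath {
    static final long RESERVED = 429496729600L;

    // Reserved space expressed in binary gigabytes (GiB).
    static long reservedGiB() {
        return RESERVED / (1024L * 1024L * 1024L);
    }

    // Percent of a decimal 800 GB volume that the setting reserves.
    static long percentOf800GB() {
        return 100L * RESERVED / (800L * 1_000_000_000L);
    }
}
```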