Profiling with Hadoop 0.17.2.1

2009-04-01 Thread Jimmy Wan
I'm trying to profile my map/reduce processes under Hadoop 0.17.2.1.
From looking at hadoop-default.xml, the property
"mapred.task.profile.params" did not exist yet in that version, so I'm
trying to append the following to the property "mapred.child.java.opts":
-Xmx512m -verbose:gc
-Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/@tas...@.txt
-Xloggc:/tmp/@tas...@.gc

The resulting JVMs don't show the hprof parameters when I look at them
via ps, the files are never created, and there is no mention of dumping
stats in the logs.

Am I missing something?

I'd switch to 0.19.1, but I haven't had time to set up a migration plan
for my data yet.
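
For reference, this is roughly the override in my hadoop-site.xml (a
sketch; the hprof and GC log paths below are placeholders, since the
task-id token is masked above):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -verbose:gc -Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/profile.txt -Xloggc:/tmp/profile.gc</value>
</property>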

Jimmy Wan


Re: Batch processing map reduce jobs

2009-03-10 Thread Jimmy Wan
Check out Cascading, it worked great for me.
http://www.cascading.org/

Jimmy Wan

On Thu, Mar 5, 2009 at 17:53, Richa Khandelwal  wrote:
> Hi All,
> Does anyone know how to run map reduce jobs using pipes or batch process map
> reduce jobs?


Re: Running 0.19.2 branch in production before release

2009-03-07 Thread Jimmy Wan
Has anyone set up this dual-cluster approach and come up with a
complete scheme for testing a new Hadoop revision and migrating data
from one DataNode revision to the next? It would be useful if someone
could share the list of gotchas they have run into.

In the past, pretty much every attempt to upgrade Hadoop DataNodes via
the documented (but outdated?) approaches mentioned in the FAQs and
wikis has not worked well for me. That's why I'm still on 0.17.2.1.

Jimmy Wan

On Fri, Mar 6, 2009 at 00:17, Aaron Kimball  wrote:
> Right, there's no sense in freezing your Hadoop version forever :)
>
> But if you're an ops team tasked with keeping a production cluster running
> 24/7, running on 0.19 (or even more daringly, TRUNK) is not something that I
> would consider a Best Practice. Ideally you'll be able to carve out some
> spare capacity (maybe 3--5 nodes) to use as a staging cluster that runs on
> 0.19 or TRUNK that you can use to evaluate the next version. Then when you
> are convinced that it's stable, and your staging cluster passes your
> internal tests (e.g., running test versions of your critical nightly jobs
> successfully), you can move that to production.
>
> - Aaron
>
>
> On Thu, Mar 5, 2009 at 2:33 AM, Steve Loughran  wrote:
>
>> Aaron Kimball wrote:
>>
>>> I recommend 0.18.3 for production use and avoid the 19 branch entirely. If
>>> your priority is stability, then stay a full minor version behind, not
>>> just
>>> a revision.
>>>
>>
>> Of course, if everyone stays that far behind, they don't get to find the
>> bugs for other people.
>>
>> * If you play with the latest releases early, while they are in the beta
>> phase -you will encounter the problems specific to your
>> applications/datacentres, and get them fixed fast.
>>
>> * If you work with stuff further back you get stability, but not only are
>> you behind on features, you can't be sure that all "fixes" that matter to
>> you get pushed back.
>>
>> * If you plan on making changes, or adding features, get onto SVN_HEAD
>>
>> * If you want to catch changes being made that break your site, SVN_HEAD.
>> Better yet, have a private Hudson server checking out SVN_HEAD hadoop *then*
>> building and testing your app against it.
>>
>> Normally I work with stable releases of things I don't depend on, and
>> SVN_HEAD of OSS stuff whose code I have any intent to change; there is a
>> price -merge time, the odd change breaking your code- but you get to make
>> changes that help you long term.
>>
>> Where Hadoop is different is that it is a filesystem, and you don't want to
>> hit bugs that delete files that matter. I'm only bringing up transient
>> clusters on VMs, pulling in data from elsewhere, so this isn't an issue. All
>> that remains is changing APIs.
>>
>> -Steve
>>
>


Re: Batching key/value pairs to map

2009-02-24 Thread Jimmy Wan
I've got a program that starts by hooking into a legacy system to look
up its maps out of a DB, and the keys are sparse. I'm sure there might
be another way to do this, but this was by far the easiest/simplest
solution.

Jimmy Wan

On Mon, Feb 23, 2009 at 19:39, Edward Capriolo  wrote:
> We have a MR program that collects once for each token on a line. What
> types of applications can benefit from batch mapping?


Re: Batching key/value pairs to map

2009-02-23 Thread Jimmy Wan
Great, thanks Owen. I actually ran into the object reuse problem a
long time ago. The output of my MR processes gets turned into a series
of large INSERT statements, which weren't performing unless I batched
them in inserts of several thousand entries. I'm not sure if this is
possible, but it would certainly be nice to either:
1) Pass the OutputCollector and Reporter to the close() method, or
2) Provide accessors to the OutputCollector and the Reporter.

As it stands, every single one of my maps is going to carry an extra 1-2 no-ops.

I'll check to see if that's on the list of outstanding FRs.
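
For the archives, a rough sketch of the pattern Owen describes below:
hang onto the collector and reporter from map(), copy the reused
key/value objects before batching, and flush the final partial batch in
close(). (This assumes Text keys/values and the 0.17-era
org.apache.hadoop.mapred API; batchMap() is the hypothetical hook from
the original post.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public abstract class BatchingMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private static final int BATCH_SIZE = 1000;

  private final List<Text> keys = new ArrayList<Text>();
  private final List<Text> values = new ArrayList<Text>();

  // Held from map() so that close() can flush the final partial batch.
  private OutputCollector<Text, Text> output;
  private Reporter reporter;

  // Subclasses process one batch of copied keys/values here.
  protected abstract void batchMap(List<Text> keys, List<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException;

  public void map(Text key, Text value, OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    this.output = output;
    this.reporter = reporter;
    // Hadoop reuses the key/value objects it passes in, so copy them
    // before holding onto them across calls.
    keys.add(new Text(key));
    values.add(new Text(value));
    if (keys.size() >= BATCH_SIZE) {
      flush();
    }
  }

  public void close() throws IOException {
    flush();  // the collector and reporter are still usable here
  }

  private void flush() throws IOException {
    if (!keys.isEmpty()) {
      batchMap(keys, values, output, reporter);
      keys.clear();
      values.clear();
    }
  }
}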

On Mon, Feb 23, 2009 at 15:30, Owen O'Malley  wrote:
> On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan  wrote:
>
>> part of my map/reduce process could be greatly sped up by mapping
>> key/value pairs in batches instead of mapping them one by one.
>> Can I safely hang onto my OutputCollector and Reporter from calls to map?
>
> Yes. You can even use them in the close, so that you can process the last
> batch of records. *smile* One problem that you will quickly hit is that
> Hadoop reuses the objects that are passed to map and reduce. So, you'll need
> to clone them before putting them into the collection.
>
>> I'm currently running Hadoop 0.17.2.1. Is this something I could do in
>> Hadoop 0.19.X?
>
> I don't think any of this changed between 0.17 and 0.19, other than in 0.17
> the reduce's inputs were always new objects. In 0.18 and after, the reduce's
> inputs are reused.


Batching key/value pairs to map

2009-02-23 Thread Jimmy Wan
Part of my map/reduce process could be greatly sped up by mapping
key/value pairs in batches instead of mapping them one by one. I'd
like to do the following:

protected abstract void batchMap(OutputCollector<K2, V2> k2V2OutputCollector,
    Reporter reporter) throws IOException;

public void map(K1 key1, V1 value1, OutputCollector<K2, V2> output,
    Reporter reporter) throws IOException {
    keys.add(key1.copy());
    values.add(value1.copy());
    if (++currentSize == batchSize) {
        batchMap(output, reporter);
        clear();
    }
}

public void close() throws IOException {
    if (currentSize > 0) {
        // I don't have access to my OutputCollector or Reporter here!
        batchMap(output, reporter);
        clear();
    }
}

Can I safely hang onto my OutputCollector and Reporter from calls to map?

I'm currently running Hadoop 0.17.2.1. Is this something I could do in
Hadoop 0.19.X?


Re: Recommendations on Job Status and Dependency Management

2008-11-14 Thread Jimmy Wan
Figured I should respond to my own question and list the solution for
the archives:

Since I already had a bunch of existing MapReduce jobs created, I was able to
quickly migrate my code to Cascading to take care of all the dependencies
between my Hadoop jobs.

By making use of the MapReduceFlow and dumping those flows into a Cascade
with a CascadeConnector, I was able to throw out several hundred lines of
hand-created Thread and dependency management code in favor of an automated
solution that actually worked a wee bit better in terms of concurrency. I was
able to see an immediate increase in the utilization of my cluster.
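
For the archives, a rough sketch of that wiring (Cascading 1.0-era API,
per the javadocs linked below; the per-job JobConf setup is elided and the
flow names are made up):

import org.apache.hadoop.mapred.JobConf;

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.MapReduceFlow;

public class NightlyCascade {
  public static void main(String[] args) {
    // Each pre-existing MapReduce job is wrapped as-is; Cascading works out
    // the ordering from the jobs' input and output paths.
    JobConf dailyConf = createDailyRollupConf();     // hypothetical helper
    JobConf monthlyConf = createMonthlyRollupConf(); // hypothetical helper

    Flow dailyFlow = new MapReduceFlow("daily-rollup", dailyConf);
    Flow monthlyFlow = new MapReduceFlow("monthly-rollup", monthlyConf);

    // The CascadeConnector topologically sorts the flows and runs
    // independent ones concurrently.
    Cascade cascade = new CascadeConnector().connect(dailyFlow, monthlyFlow);
    cascade.complete();  // blocks until every flow has finished
  }

  private static JobConf createDailyRollupConf() {
    return new JobConf();   // existing per-job setup goes here
  }

  private static JobConf createMonthlyRollupConf() {
    return new JobConf();   // existing per-job setup goes here
  }
}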

I covered how I worked out the initial HDFS dependencies in the other reply
to this message.

As for determining whether the trigger conditions are met (they rely on
outside processes for which there is no easy way to read a signal), I'm
currently polling a database for that data, and I'm working with Chris to
add a hook into Cascade to allow pluggable predicates to specify that
condition.

So yeah, I'm sold on Cascading. =)

Relevant links:
http://www.cascading.org

Relevant API Docs
http://www.cascading.org/javadoc/cascading/flow/MapReduceFlow.html
http://www.cascading.org/javadoc/cascading/cascade/CascadeConnector.html
http://www.cascading.org/javadoc/cascading/cascade/Cascade.html

On Tue, 11 Nov 2008, Jimmy Wan wrote:

>I'd like to take my prototype batch processing of hadoop jobs and implement
>some type of "real" dependency management and scheduling in order to better
>utilize my cluster as well as spread out more work over time. I was thinking
>of adopting one of the existing packages (Cascading, Zookeeper, existing
>JobControl?) and I was hoping to find some better advice from the mailing
>list. I tried to find a more direct comparison of Cascading and Zookeeper but
>I couldn't find one.
>
>This is a grossly simplified description of my current, completely naive
>approach:
>
>1) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs.
>
>2) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs that are dependent on the output of step #1. These
>are currently separated from the tasks in step #1 mainly because it's easier
>to group them up this way in the event of a failure, but I expect this
>separation to go away.
>
>3) At the end of the month, serially run a series of jobs outside of
>Map/Reduce that basically consist of a single SQL query (I could easily
>convert these to be very simple map/reduce jobs, and probably will, if it
>makes my job processing easier).
>
>The main problems I have are the following:
>1) right now I have a hard time determining which processes need to be run
>in the event of a failure.
>
>Every job has an expected input/output in HDFS so if I have to rerun
>something I usually just use something like "hadoop dfs -rmr " in a
>shell script then hand edit the jobs that need to be rerun.
>
>Is there an example somewhere of code that can read HDFS in order to
>determine if files exist? I poked around a bit and couldn't find one.
>Ideally, my code would be able to read the HDFS config info right out of the
>standard config files so I wouldn't need to create additional configuration
>information.
>
>The job dependencies, while well enumerated, are not isolated all that well.
>Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just
>that one process and any dependent processes, but not have to rerun
>everything again.
>
>2) I typically run everything 1 month at a time, but I want to keep the
>option of doing rollups by day. On the 2nd of the month, I'd like to be able
>to run anything that requires data from the 1st of the month. On the 1st of
>the month, I'd like to run anything that requires a full month of data from
>the previous month.
>
>I'd also like my process to be able to account for system failures on
>previous days. i.e. On any given day I'd like to be able to run everything
>for which data is available.
>
>3) Certain types of jobs have external dependencies (ex. MySQL) and I don't
>want to run too many of those types of jobs at the same time since it affects
>my MySQL performance. I'd like some way of describing some type of
>lock on external resources that can be shared across jobs.
>
>Any recommendations on how to best model these things?
>
>I'm thinking that something like Cascading or Zookeeper could help me here.
>My initial take was that Zookeeper was more heavyweight than Cascading,
>requiring additional processes to be running at all times. However, it seems
>like Zookeeper would be better suited to describing mutual exclusions on
>usage of external resources. Can Cascading even do this?

re: Recommendations on Job Status and Dependency Management

2008-11-12 Thread Jimmy Wan
I was able to answer one of my own questions:

"Is there an example somewhere of code that can read HDFS in order to
determine if files exist? I poked around a bit and couldn't find one.
Ideally, my code would be able to read the HDFS config info right out of the
standard config files so I wouldn't need to create additional configuration
information."

The following code was all that I needed (imports included for completeness):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration configuration = new Configuration();
FileSystem fileSystem = FileSystem.get(configuration);
Path path = new Path(filename);
boolean fileExists = fileSystem.exists(path);

At first, the code didn't work as I expected because my working shell scripts
that made use of "hadoop/bin/hadoop jar my.jar" did not explicitly include
HADOOP_CONF_DIR in my classpath. Once I did that, everything worked just
fine.

On Tue, 11 Nov 2008, Jimmy Wan wrote:

>I'd like to take my prototype batch processing of hadoop jobs and implement
>some type of "real" dependency management and scheduling in order to better
>utilize my cluster as well as spread out more work over time. I was thinking
>of adopting one of the existing packages (Cascading, Zookeeper, existing
>JobControl?) and I was hoping to find some better advice from the mailing
>list. I tried to find a more direct comparison of Cascading and Zookeeper but
>I couldn't find one.
>
>This is a grossly simplified description of my current, completely naive
>approach:
>
>1) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs.
>
>2) for each day in a month, spawn N threads that each contain a dependent
>series of map/reduce jobs that are dependent on the output of step #1. These
>are currently separated from the tasks in step #1 mainly because it's easier
>to group them up this way in the event of a failure, but I expect this
>separation to go away.
>
>3) At the end of the month, serially run a series of jobs outside of
>Map/Reduce that basically consist of a single SQL query (I could easily
>convert these to be very simple map/reduce jobs, and probably will, if it
>makes my job processing easier).
>
>The main problems I have are the following:
>1) right now I have a hard time determining which processes need to be run
>in the event of a failure.
>
>Every job has an expected input/output in HDFS so if I have to rerun
>something I usually just use something like "hadoop dfs -rmr " in a
>shell script then hand edit the jobs that need to be rerun.
>
>Is there an example somewhere of code that can read HDFS in order to
>determine if files exist? I poked around a bit and couldn't find one.
>Ideally, my code would be able to read the HDFS config info right out of the
>standard config files so I wouldn't need to create additional configuration
>information.
>
>The job dependencies, while well enumerated, are not isolated all that well.
>Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just
>that one process and any dependent processes, but not have to rerun
>everything again.
>
>2) I typically run everything 1 month at a time, but I want to keep the
>option of doing rollups by day. On the 2nd of the month, I'd like to be able
>to run anything that requires data from the 1st of the month. On the 1st of
>the month, I'd like to run anything that requires a full month of data from
>the previous month.
>
>I'd also like my process to be able to account for system failures on
>previous days. i.e. On any given day I'd like to be able to run everything
>for which data is available.
>
>3) Certain types of jobs have external dependencies (ex. MySQL) and I don't
>want to run too many of those types of jobs at the same time since it affects
>my MySQL performance. I'd like some way of describing some type of
>lock on external resources that can be shared across jobs.
>
>Any recommendations on how to best model these things?
>
>I'm thinking that something like Cascading or Zookeeper could help me here.
>My initial take was that Zookeeper was more heavyweight than Cascading,
>requiring additional processes to be running at all times. However, it seems
>like Zookeeper would be better suited to describing mutual exclusions on
>usage of external resources. Can Cascading even do this?
>
>I'd also appreciate any recommendations on how best to tune the hadoop
>processes. My hadoop 0.16.4 cluster is currently relatively small (<10 nodes)
>so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker
>might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point
>in the near future.
>



Recommendations on Job Status and Dependency Management

2008-11-11 Thread Jimmy Wan
I'd like to take my prototype batch processing of hadoop jobs and implement
some type of "real" dependency management and scheduling in order to better
utilize my cluster as well as spread out more work over time. I was thinking
of adopting one of the existing packages (Cascading, Zookeeper, existing
JobControl?) and I was hoping to find some better advice from the mailing
list. I tried to find a more direct comparison of Cascading and Zookeeper but
I couldn't find one.

This is a grossly simplified description of my current, completely naive
approach:

1) for each day in a month, spawn N threads that each contain a dependent
series of map/reduce jobs.

2) for each day in a month, spawn N threads that each contain a dependent
series of map/reduce jobs that are dependent on the output of step #1. These
are currently separated from the tasks in step #1 mainly because it's easier
to group them up this way in the event of a failure, but I expect this
separation to go away.

3) At the end of the month, serially run a series of jobs outside of
Map/Reduce that basically consist of a single SQL query (I could easily
convert these to be very simple map/reduce jobs, and probably will, if it
makes my job processing easier).

The main problems I have are the following:
1) right now I have a hard time determining which processes need to be run
in the event of a failure.

Every job has an expected input/output in HDFS so if I have to rerun
something I usually just use something like "hadoop dfs -rmr " in a
shell script then hand edit the jobs that need to be rerun.

Is there an example somewhere of code that can read HDFS in order to
determine if files exist? I poked around a bit and couldn't find one.
Ideally, my code would be able to read the HDFS config info right out of the
standard config files so I wouldn't need to create additional configuration
information.

The job dependencies, while well enumerated, are not isolated all that well.
Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just
that one process and any dependent processes, but not have to rerun
everything again.

2) I typically run everything 1 month at a time, but I want to keep the
option of doing rollups by day. On the 2nd of the month, I'd like to be able
to run anything that requires data from the 1st of the month. On the 1st of
the month, I'd like to run anything that requires a full month of data from
the previous month.

I'd also like my process to be able to account for system failures on
previous days. i.e. On any given day I'd like to be able to run everything
for which data is available.

3) Certain types of jobs have external dependencies (ex. MySQL) and I don't
want to run too many of those types of jobs at the same time since it affects
my MySQL performance. I'd like some way of describing some type of
lock on external resources that can be shared across jobs.

Any recommendations on how to best model these things?

I'm thinking that something like Cascading or Zookeeper could help me here.
My initial take was that Zookeeper was more heavyweight than Cascading,
requiring additional processes to be running at all times. However, it seems
like Zookeeper would be better suited to describing mutual exclusions on
usage of external resources. Can Cascading even do this?

I'd also appreciate any recommendations on how best to tune the hadoop
processes. My hadoop 0.16.4 cluster is currently relatively small (<10 nodes)
so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker
might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point
in the near future.


Re: Changing my NameNode

2008-07-02 Thread Jimmy Wan
Anyone out there with a suggestion?

Is my best option to export the data out of the cluster, reformat a new
NameNode, and reimport it all?

On Tue, 10 Jun 2008, Jimmy Wan wrote:

>I've got an HDFS cluster with 2 boxes in it.
>
>Server A is serving as the NameNode and also as a DataNode.
>Server B is a DataNode.
>
>After successfully decommissioning the HDFS storage of Server A using a
>dfs.hosts.exclude file and dfsadmin -refreshNodes:
>
>Server A is serving as the NameNode.
>Server B is serving as a DataNode.
>
>How do I change my NameNode to be Server B?
>
>Can I simply change the hadoop-site.xml, masters, and slaves files for all
>machines in my cluster? I could have sworn that I tried that and it failed.
>
>P.S. This link is wrong wrt how to decommission a DataNode.
>http://hadoop.apache.org/core/docs/r0.16.4/hdfs_design.html#DFSAdmin


Changing my NameNode

2008-06-10 Thread Jimmy Wan
I've got an HDFS cluster with 2 boxes in it.

Server A is serving as the NameNode and also as a DataNode.
Server B is a DataNode.

After successfully decommissioning the HDFS storage of Server A using
a dfs.hosts.exclude file and dfsadmin -refreshNodes:

Server A is serving as the NameNode.
Server B is serving as the DataNode.

How do I change my NameNode to be Server B?

Can I simply change the hadoop-site.xml, masters, and slaves files for all
machines in my cluster? I could have sworn that I tried that and it failed.

P.S. This link is wrong wrt how to decommission a DataNode.
http://hadoop.apache.org/core/docs/r0.16.4/hdfs_design.html#DFSAdmin
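
For reference, the decommission setup mentioned above looks roughly like
this in hadoop-site.xml (a sketch; the exclude-file path is a placeholder,
and the file just lists one hostname per line), followed by running
"hadoop dfsadmin -refreshNodes" so the NameNode re-reads it:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/conf/dfs.exclude</value>
  <description>Names a file containing the list of hosts that should be
  decommissioned and no longer permitted to connect to the NameNode as
  DataNodes.</description>
</property>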


Re: Limiting Total # of TaskTracker threads

2008-03-20 Thread Jimmy Wan

On Tue, 18 Mar 2008 19:53:04 -0500, Ted Dunning <[EMAIL PROTECTED]> wrote:

> I think the original request was to limit the sum of maps and reduces
> rather than limiting the two parameters independently.


Ted, yes this is exactly what I'm looking for. I just found an issue that  
seems to state that the old deprecated property is there, but it is not  
documented:


https://issues.apache.org/jira/browse/HADOOP-2300

I tried using the max tasks in combination with setting the new values,  
but that didn't seem to work. =( My machine labelled as "LIMITED MACHINE"  
had 2 maps and 1 reduce running at the same time.


The scenario I have is that I want to run multiple concurrent jobs through
my cluster and have the CPU usage for that node be bounded. Should I file a
new issue?


This was all with Hadoop 0.16.0

LIMITED MACHINE:

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of total tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

OTHER CLUSTER MACHINES:

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of total tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>



On 3/18/08 5:26 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:


> The map/reduce tasks are not threads, they are run in separate JVMs
> which are forked by the tasktracker.


Arun, yes, I did mean tasks, not threads.


--
Jimmy


Limiting Total # of TaskTracker threads

2008-03-18 Thread Jimmy Wan

The properties mentioned here: http://wiki.apache.org/hadoop/FAQ#13

have been deprecated in favor of two separate properties:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

I'd like to limit the total # of threads on a task tracker (think limited  
resources on a given compute node) to a given number, and there does not  
appear to be a way to do that anymore. Am I correct in my understanding  
that there is no capability to do this?


--
Jimmy


Configuring Nodes on Windows in Distributed Environment

2008-03-14 Thread Jimmy Wan
Has anyone succeeded in doing this? I've successfully got a cluster with a
few Linux nodes running, but I'd like to add my desktop machine (JIMMY) to
the mix for some spare compute cycles. I can happily run standalone apps
with the loopback config, but I can't quite get my machine to play nicely
in a distributed conf. The log output is nonsensical to me. It appears to
be adding CRs where they don't belong. Any ideas?


Configuration details and output from start-all.sh:

I've got the conf/logs/tmp/hadoop directories created in C:/home/hadoop.
I've got symlinks from /home/hadoop/* to C:/home/hadoop (I wasn't sure how
the Unix-style paths play nice with Windows paths, so I was trying to cover
all the bases).

I created C:/tmp just for kicks (it's empty)

My .bashrc contains:
export JAVA_HOME="/cygdrive/c/Dev/jdk1.6.0_04"
export HADOOP_IDENT_STRING=`hostname`
export HADOOP_INSTALL=/cygdrive/c/home/hadoop
export HADOOP_HOME=/cygdrive/c/home/hadoop/hadoop
export HADOOP_LOG_DIR=/cygdrive/c/home/hadoop/logs
export HADOOP_CONF_DIR=/cygdrive/c/home/hadoop/conf
export HADOOP_HEAPSIZE=1000

[EMAIL PROTECTED] ~]$ hadoop/bin/start-all.sh
starting namenode, logging to  
/home/hadoop/logs/hadoop-mainserver-namenode-mainserver.out
mainserver: starting datanode, logging to  
/home/hadoop/logs/hadoop-mainserver-datanode-mainserver.out
slavenode1: starting datanode, logging to  
/home/hadoop/logs/hadoop-slavenode1-datanode-slavenode1.out
slavenode2: starting datanode, logging to  
/home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-slavenode2.out

-datanode-JIMMY.outanode, logging to /home/hadoop/logs/hadoop-JIMMY
-datanode.pid: No such file or directoryemon.sh: line 117:  
/tmp/hadoop-JIMMY
-datanode-JIMMY.out: No such file or directoryh: line 116:  
/home/hadoop/logs/hadoop-JIMMY
jimmy: head: cannot open  
`/home/hadoop/logs/hadoop-JIMMY\r-datanode-JIMMY.out' for reading: No such  
file or directory
mainserver: starting secondarynamenode, logging to  
/home/hadoop/logs/hadoop-mainserver-secondarynamenode-mainserver.out
starting jobtracker, logging to  
/home/hadoop/logs/hadoop-mainserver-jobtracker-mainserver.out
slavenode2: starting tasktracker, logging to  
/home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-slavenode2.out
slavenode1: starting tasktracker, logging to  
/home/hadoop/logs/hadoop-slavenode1-tasktracker-slavenode1.out
mainserver: starting tasktracker, logging to  
/home/hadoop/logs/hadoop-mainserver-tasktracker-mainserver.out

-tasktracker-JIMMY.outacker, logging to /home/hadoop/logs/hadoop-JIMMY
-tasktracker.pid: No such file or directoryn.sh: line 117:  
/tmp/hadoop-JIMMY
-tasktracker-JIMMY.out: No such file or directoryline 116:  
/home/hadoop/logs/hadoop-JIMMY
jimmy: head: cannot open  
`/home/hadoop/logs/hadoop-JIMMY\r-tasktracker-JIMMY.out' for reading: No  
such file or directory


--
Jimmy


Splitting compressed input from a single job to multiple map tasks

2008-03-11 Thread Jimmy Wan
Is it possible to split compressed input from a single job to multiple map  
tasks? My current configuration has several task trackers but the job I  
kick off results in a single map task. I'm launching these jobs in  
sequence via a shell script, so they end up going through a pipeline of 1  
concurrent map which is kinda suboptimal.


When I run this task via a full local hadoop stack it does seem to split  
the file into multiple small task chunks.


--
Jimmy


Re: Does Hadoop Honor Reserved Space?

2008-03-07 Thread Jimmy Wan
Unfortunately, I had to clean up my HDFS in order to get some work done,
but I was running Hadoop 0.16.0 on a Linux box. My configuration is
two machines: one has the JobTracker/NameNode and a TaskTracker instance
all running on the same machine; the other machine is just running a
TaskTracker.

Replication was set to 2 for both the default and the max.

--
Jimmy

On Thu, 06 Mar 2008 16:01:16 -0600, Hairong Kuang <[EMAIL PROTECTED]>  
wrote:


In addition to the version, could you please send us a copy of the datanode
report by running the command bin/hadoop dfsadmin -report?

Thanks,
Hairong


On 3/6/08 11:56 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:


but intermediate data is stored in a different directory from dfs/data
(something like mapred/local by default i think).

what version are u running?


-Original Message-
From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
Sent: Thu 3/6/2008 10:14 AM
To: core-user@hadoop.apache.org
Subject: RE: Does Hadoop Honor Reserved Space?

I've run into a similar issue in the past. From what I understand, this
parameter only controls the HDFS space usage. However, the intermediate  
data in the map reduce job is stored on the local file system (not
HDFS) and is not subject to this configuration.

In the past I have used mapred.local.dir.minspacekill and
mapred.local.dir.minspacestart to control the amount of space that is
allowable for use by this temporary data.

Not sure if that is the best approach though, so I'd love to hear what  
other people have done. In your case, you have a map-red job that will  
consume too much space (without setting a limit, you didn't have enough
disk capacity for the job), so looking at mapred.output.compress and
mapred.compress.map.output might be useful to decrease the job's disk
requirements.

--Ash

-Original Message-
From: Jimmy Wan [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 06, 2008 9:56 AM
To: core-user@hadoop.apache.org
Subject: Does Hadoop Honor Reserved Space?

I've got 2 datanodes setup with the following configuration parameter:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>429496729600</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.</description>
</property>


Both are housed on 800GB volumes, so I thought this would keep about half
the volume free for non-HDFS usage.

After some long running jobs last night, both disk volumes were completely
filled. The bulk of the data was in:
${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data

This is running as the user hadoop.

Am I interpreting these parameters incorrectly?

I noticed this issue, but it is marked as closed:
http://issues.apache.org/jira/browse/HADOOP-2549
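
For the archives, the knobs mentioned in this thread look roughly like this
in hadoop-site.xml (a sketch; the values are only illustrative, not
recommendations):

<property>
  <name>mapred.local.dir.minspacestart</name>
  <value>10737418240</value>
  <description>Do not ask for new tasks if this much local space (in bytes,
  roughly 10GB here) is not free in mapred.local.dir.</description>
</property>

<property>
  <name>mapred.local.dir.minspacekill</name>
  <value>5368709120</value>
  <description>If free space in mapred.local.dir drops below this many bytes
  (roughly 5GB here), start killing tasks to reclaim space.</description>
</property>

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
  <description>Compress the job output to cut disk usage.</description>
</property>

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Compress the intermediate map output.</description>
</property>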


Equivalent of cmdline head or tail?

2008-03-06 Thread Jimmy Wan
I've got some jobs where I'd like to just pull out the top N or bottom N  
values.


It seems like I can't do this from the map or combine phases (due to not  
having enough data), but I could aggregate this data during the reduce  
phase. The problem I have is that I won't know when to actually write them  
out until I've gone through the entire set, at which point reduce isn't  
called anymore.


It's easy enough to post-process with some combination of sort, head, and  
tail, but I was wondering if I was missing something obvious.
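
For the archives, a rough sketch of one way to do this inside the reducer:
keep a bounded, sorted buffer, hold onto the OutputCollector, and emit the
survivors in close() (the same hold-the-collector and object-reuse caveats
from the batching thread apply; the Text/LongWritable types and N=10 are
just an example, and ties on the value overwrite each other in this sketch):

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TopNReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  private static final int N = 10;

  // Sorted by total; the smallest entry is evicted once we exceed N.
  private final TreeMap<Long, String> topN = new TreeMap<Long, String>();
  private OutputCollector<Text, LongWritable> output;

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    this.output = output;  // held so close() can emit the final answer
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    topN.put(sum, key.toString());  // copy the key; Hadoop reuses the object
    if (topN.size() > N) {
      topN.remove(topN.firstKey());  // drop the current smallest total
    }
  }

  public void close() throws IOException {
    if (output == null) {
      return;  // no input ever reached this reducer
    }
    // Only now do we know the overall top N, so emit it here.
    for (Map.Entry<Long, String> e : topN.entrySet()) {
      output.collect(new Text(e.getValue()), new LongWritable(e.getKey()));
    }
  }
}

Note that with more than one reduce task this gives a per-reducer top N, so
it still needs a tiny post-process (or a single reducer) to get the global
answer.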


--
Jimmy


Does Hadoop Honor Reserved Space?

2008-03-06 Thread Jimmy Wan

I've got 2 DataNodes set up with the following configuration parameter:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>429496729600</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.</description>
</property>


Both are housed on 800GB volumes, so I thought this would keep about half  
the volume free for non-HDFS usage.


After some long running jobs last night, both disk volumes were completely  
filled. The bulk of the data was in:

${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data

This is running as the user hadoop.

Am I interpreting these parameters incorrectly?

I noticed this issue, but it is marked as closed:  
http://issues.apache.org/jira/browse/HADOOP-2549


--
Jimmy