Re: reduce task failing after 24 hours waiting

2009-03-26 Thread Billy Pearson

mapred.jobtracker.retirejob.interval
is not in the default config

should this not be in the config?

Billy
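
A minimal sketch, assuming the 0.18/0.19-era JobConf API, for checking which
values the client-side configuration actually resolves for the two properties
discussed in this thread (as far as I know both are read by the daemons from
their own hadoop-site.xml at startup, so raising them requires a restart):

import org.apache.hadoop.mapred.JobConf;

public class RetireSettingsCheck {
    public static void main(String[] args) {
        // Loads hadoop-default.xml / hadoop-site.xml from the classpath.
        JobConf conf = new JobConf();
        // The thread below says both default to 24 hours, even though the
        // first may not appear in hadoop-default.xml.
        System.out.println("mapred.jobtracker.retirejob.interval = "
                + conf.get("mapred.jobtracker.retirejob.interval", "(not set)"));
        System.out.println("mapred.userlog.retain.hours = "
                + conf.get("mapred.userlog.retain.hours", "(not set)"));
    }
}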



"Amar Kamat"  wrote in 
message news:49caff11.8070...@yahoo-inc.com...

Amar Kamat wrote:

Amareshwari Sriramadasu wrote:

Set mapred.jobtracker.retirejob.interval

This is used to retire completed jobs.

and mapred.userlog.retain.hours to higher value.

This is used to discard user logs.
As Amareshwari pointed out, this might be the cause. Can you increase this 
value and try?

Amar
By default, their values are 24 hours. These might be the reason for 
failure, though I'm not sure.


Thanks
Amareshwari

Billy Pearson wrote:
I am seeing on one of my long-running jobs (about 50-60 hours) that after 24 hours all
active reduce tasks fail with the error message

java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 minute of the 24-hour mark, they all fail at the same time,
wasting a lot of resources downloading the map outputs and merging them
again.
What is the state of the reducer (copy or sort)? Check the
jobtracker/tasktracker logs to see what state these reducers are in
and whether a kill signal was issued. Either the jobtracker/tasktracker is
issuing a kill signal or the reducers are committing suicide. Were there
any failures on the reducer side while pulling the map output? Also, what
is the nature of the job? How fast do the maps finish?

Amar


Billy

Re: Super-long reduce task timeouts in hadoop-0.19.0

2009-03-26 Thread schubert zhang
Can you try branch-0.19? I am using it and it seems fine.
0.19.1 also has the following issues:

https://issues.apache.org/jira/browse/HADOOP-5269

https://issues.apache.org/jira/browse/HADOOP-5235

https://issues.apache.org/jira/browse/HADOOP-5367

On Fri, Mar 27, 2009 at 12:09 AM, John Lee  wrote:

> Woops. Sorry about the extra noise.
>


Re: how to mount specification-path of hdfs with fuse-dfs

2009-03-26 Thread jacky_ji

yes, thanks for your response.

Craig Macdonald wrote:
> 
> Hi Jacky,
> 
> Pleased to hear that fuse-dfs is working for you.
> 
> Do you mean that you want to mount dfs://localhost:9000/users at /mnt/hdfs
> ?
> 
> If so, fuse-dfs doesn't currently support this, but it would be a good 
> idea for a future improvement.
> 
> Craig
> 
> jacky_ji wrote:
>> I can use fuse-dfs to mount HDFS, just like this: ./fuse-dfs
>> dfs://localhost:9000 /mnt/hdfs -d
>> but I want to mount a specific path in HDFS now, and I have no idea
>> about it; any advice will be appreciated.
>>   
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-mount-specification-path-of-hdfs-with-fuse-dfs-tp22716393p22734500.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: about dfsadmin -report

2009-03-26 Thread stchu
But when the web UI shows the node as dead, -report still shows it "in service"
and lists 3 live nodes (the web UI shows living=2, dead=1).

stchu

2009/3/26 Raghu Angadi 

> stchu wrote:
>
>> Hi,
>>
>> I did a test of a datanode crash. I stopped the networking on one of the
>> datanodes.
>> The web UI and fsck reported that datanode dead after 10 minutes, but dfsadmin
>> -report
>> still did not report it after 25 minutes. Is this correct?
>>
>
> Nope. Both the web UI and '-report' use the same source of info. They should
> be consistent.
>
> Raghu.
>
>  Thanks for your guide.
>>
>> stchu
>>
>>
>


Re: Using HDFS to serve www requests

2009-03-26 Thread Jimmy Lin

Brian---

Can you share some performance figures for typical workloads with your 
HDFS/Fuse setup?  Obviously, latency is going to be bad but throughput 
will probably be reasonable... but I'm curious to hear about concrete 
latency/throughput numbers.  And, of course, I'm interested in these 
numbers as a function of concurrent clients... ;)


Somewhat independent of file size is the workload... you can have huge 
TB-size files, but still have a seek-heavy workload (in which case HDFS 
is probably a sub-optimal choice).  But if seek-heavy loads are 
reasonable, one can solve the lots-of-little-files problem by simple 
concatenation.


Finally, I'm curious about the Fuse overhead (vs. directly using the 
Java API).


Thanks in advance for your insights!

-Jimmy

Brian Bockelman wrote:


On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:


In general, Hadoop is unsuitable for the application you're suggesting.
Systems like Fuse HDFS do exist, though they're not widely used.


We use FUSE on a 270TB cluster to serve up physics data because the 
client (2.5M lines of C++) doesn't understand how to connect to HDFS 
directly.


Brian


I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have huge images, how big is "huge?" It might be
useful if these images are 1 GB or larger. But in general, "huge" on Hadoop
means 10s of GBs up to TBs.  If you have a large number of moderately-sized
files, you'll find that HDFS responds very poorly for your needs.

It sounds like glusterfs is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer  wrote:


This is somewhat of a noob question I know, but after learning about
Hadoop, testing it in a small cluster and running Map Reduce jobs on
it, I'm still not sure if Hadoop is the right distributed file system
to serve web requests.  In other words, can, or is it right to, serve
Images and data from HDFS using something like FUSE to mount a
filesystem where Apache could serve images from it?  We have huge
images, thus the need for a distributed file system, and they go in,
get stored with lots of metadata, and are redundant with Hadoop/HDFS -
but is it the right way to serve web content?

I looked at glusterfs before, they had an Apache and Lighttpd module
which made it simple, does HDFS have something like this, do people
just use a FUSE option as I described, or is this not a good use of
Hadoop?

Thanks

P






How many nodes does one man want?

2009-03-26 Thread Sid123

Hi,
I am working of implementing some machine learning algorithms using Map Red.
I want to know that If I have data that takes 5-6 hours to train on a normal
machine. Will putting in 2-3 more nodes have an effect? I read in the yahoo
hadoop tutorial.
"Executing Hadoop on a limited amount of data on a small number of nodes may
not demonstrate particularly stellar performance as the overhead involved in
starting Hadoop programs is relatively high. Other parallel/distributed
programming paradigms such as MPI (Message Passing Interface) may perform
much better on two, four, or perhaps a dozen machines."

I have at my disposal 3 laptops, each with 4 GB RAM and 150 GB of hard disk
space... I have 600 MB of training data.
-- 
View this message in context: 
http://www.nabble.com/How-many-nodes-does-one-man-want--tp22733399p22733399.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using HDFS to serve www requests

2009-03-26 Thread Jeff Hammerbacher
HDFS itself has some facilities for serving data over HTTP:
https://issues.apache.org/jira/browse/HADOOP-5010. YMMV.

On Thu, Mar 26, 2009 at 3:47 PM, Brian Bockelman wrote:

>
> On Mar 26, 2009, at 8:55 PM, phil cryer wrote:
>
>  When you say that you have huge images, how big is "huge?"
>>>
>>
>> Yes, we're looking at some images that are 100Megs in size, but
>> nothing like what you're speaking of.  This helps me understand
>> Hadoop's usage better and unfortunately it won't be the fit I was
>> hoping for.
>>
>>
> I wouldn't split hairs between 100MB and 1GB.  However, it may be less
> reliable due to the extra layer via FUSE if you want to serve it via apache.
>  It wouldn't be too bad to whip up a tomcat webapp that goes through
> Hadoop...
>
> It really depends on your hardware level and redundancy.  If you have the
> money to get the hardware necessary to go with a Lustre-based solution, do
> that.  If you have enough money to load up your pre-existing cluster with
> lots of disk, HDFS might be better.  Certainly it will be outperformed by
> lustre if you have lots of reliable hardware, especially in terms of
> latency.
>
> Brian
>
>
>  You can use the API or the FUSE module to mount hadoop but that is not
>>> a direct goal of hadoop. Hope that helps.
>>>
>>
>> Very interesting, and yes, that indeed does help, not to veer off
>> thread too much, but does Sun's Lustre follow in the steps of Gluster
>> then?  I know Lustre requires kernel patches to install, so it's at a
>> different level than the others, but I have seen some articles about
>> large scale clusters built with Lustre and want to look at that as
>> another option.
>>
>> Again, thanks for the info, if anyone has general information on
>> cluster software, or know of a more appropriate list, I'd appreciate
>> the advice.
>>
>> Thanks
>>
>> P
>>
>> On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo 
>> wrote:
>>
>>> It is a little more natural to connect to HDFS from apache tomcat.
>>> This will allow you to skip the FUSE mounts and just use the HDFS-API.
>>>
>>> I have modified this code to run inside tomcat.
>>> http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
>>>
>>> I will not testify to how well this setup will perform under internet
>>> traffic, but it does work.
>>>
>>> GlusterFS is more like a traditional POSIX filesystem. It supports
>>> locking and appends and you can do things like put the mysql data
>>> directory on it.
>>>
>>> GLUSTERFS is geared for storing data to be accessed with low latency.
>>> Nodes (Bricks) are normally connected via GIG-E or infiniban. The
>>> GlusterFS volume is mounted directly on a unix system.
>>>
>>> Hadoop is a user space file system. The latency is higher. Nodes are
>>> connected by GIG-E. It is closely coupled with MAP/REDUCE.
>>>
>>> You can use the API or the FUSE module to mount hadoop but that is not
>>> a direct goal of hadoop. Hope that helps.
>>>
>>>
>


Re: Using HDFS to serve www requests

2009-03-26 Thread Brian Bockelman


On Mar 26, 2009, at 8:55 PM, phil cryer wrote:


When you say that you have huge images, how big is "huge?"


Yes, we're looking at some images that are 100Megs in size, but
nothing like what you're speaking of.  This helps me understand
Hadoop's usage better and unfortunately it won't be the fit I was
hoping for.



I wouldn't split hairs between 100MB and 1GB.  However, it may be less  
reliable due to the extra layer via FUSE if you want to serve it via  
apache.  It wouldn't be too bad to whip up a tomcat webapp that goes  
through Hadoop...


It really depends on your hardware level and redundancy.  If you have  
the money to get the hardware necessary to go with a Lustre-based  
solution, do that.  If you have enough money to load up your pre-existing
cluster with lots of disk, HDFS might be better.  Certainly
it will be outperformed by lustre if you have lots of reliable  
hardware, especially in terms of latency.


Brian

You can use the API or the FUSE module to mount hadoop but that is not
a direct goal of hadoop. Hope that helps.


Very interesting, and yes, that indeed does help, not to veer off
thread too much, but does Sun's Lustre follow in the steps of Gluster
then?  I know Lustre requires kernel patches to install, so it's at a
different level than the others, but I have seen some articles about
large scale clusters built with Lustre and want to look at that as
another option.

Again, thanks for the info, if anyone has general information on
cluster software, or know of a more appropriate list, I'd appreciate
the advice.

Thanks

P

On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo wrote:

It is a little more natural to connect to HDFS from apache tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS-API.


I have modified this code to run inside tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.

GlusterFS is more like a traditional POSIX filesystem. It supports
locking and appends and you can do things like put the mysql data
directory on it.

GLUSTERFS is geared for storing data to be accessed with low latency.
Nodes (Bricks) are normally connected via GIG-E or infiniban. The
GlusterFS volume is mounted directly on a unix system.

Hadoop is a user space file system. The latency is higher. Nodes are
connected by GIG-E. It is closely coupled with MAP/REDUCE.

You can use the API or the FUSE module to mount hadoop but that is not
a direct goal of hadoop. Hope that helps.





Re: Using HDFS to serve www requests

2009-03-26 Thread Norbert Burger
Have you looked into MogileFS already?  Seems like a good fit, based
on your description.  This question has come up more than once here,
and MogileFS is an oft-recommended solution.

Norbert

On 3/26/09, phil cryer  wrote:
> > When you say that you have huge images, how big is "huge?"
>
>
> Yes, we're looking at some images that are 100Megs in size, but
>  nothing like what you're speaking of.  This helps me understand
>  Hadoop's usage better and unfortunately it won't be the fit I was
>  hoping for.
>
>
>  > You can use the API or the FUSE module to mount hadoop but that is not
>  > a direct goal of hadoop. Hope that helps.
>
>
> Very interesting, and yes, that indeed does help, not to veer off
>  thread too much, but does Sun's Lustre follow in the steps of Gluster
>  then?  I know Lustre requires kernel patches to install, so it's at a
>  different level than the others, but I have seen some articles about
>  large scale clusters built with Lustre and want to look at that as
>  another option.
>
>  Again, thanks for the info, if anyone has general information on
>  cluster software, or know of a more appropriate list, I'd appreciate
>  the advice.
>
>  Thanks
>
>
>  P
>
>
>  On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo  
> wrote:
>  > It is a little more natural to connect to HDFS from apache tomcat.
>  > This will allow you to skip the FUSE mounts and just use the HDFS-API.
>  >
>  > I have modified this code to run inside tomcat.
>  > http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
>  >
>  > I will not testify to how well this setup will perform under internet
>  > traffic, but it does work.
>  >
>  > GlusterFS is more like a traditional POSIX filesystem. It supports
>  > locking and appends and you can do things like put the mysql data
>  > directory on it.
>  >
>  > GLUSTERFS is geared for storing data to be accessed with low latency.
>  > Nodes (Bricks) are normally connected via GIG-E or infiniban. The
>  > GlusterFS volume is mounted directly on a unix system.
>  >
>  > Hadoop is a user space file system. The latency is higher. Nodes are
>  > connected by GIG-E. It is closely coupled with MAP/REDUCE.
>  >
>  > You can use the API or the FUSE module to mount hadoop but that is not
>  > a direct goal of hadoop. Hope that helps.
>  >
>


Re: Using HDFS to serve www requests

2009-03-26 Thread phil cryer
> When you say that you have huge images, how big is "huge?"

Yes, we're looking at some images that are 100Megs in size, but
nothing like what you're speaking of.  This helps me understand
Hadoop's usage better and unfortunately it won't be the fit I was
hoping for.

> You can use the API or the FUSE module to mount hadoop but that is not
> a direct goal of hadoop. Hope that helps.

Very interesting, and yes, that indeed does help, not to veer off
thread too much, but does Sun's Lustre follow in the steps of Gluster
then?  I know Lustre requires kernel patches to install, so it's at a
different level than the others, but I have seen some articles about
large scale clusters built with Lustre and want to look at that as
another option.

Again, thanks for the info, if anyone has general information on
cluster software, or know of a more appropriate list, I'd appreciate
the advice.

Thanks

P

On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo  wrote:
> It is a little more natural to connect to HDFS from apache tomcat.
> This will allow you to skip the FUSE mounts and just use the HDFS-API.
>
> I have modified this code to run inside tomcat.
> http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
>
> I will not testify to how well this setup will perform under internet
> traffic, but it does work.
>
> GlusterFS is more like a traditional POSIX filesystem. It supports
> locking and appends and you can do things like put the mysql data
> directory on it.
>
> GLUSTERFS is geared for storing data to be accessed with low latency.
> Nodes (Bricks) are normally connected via GIG-E or infiniban. The
> GlusterFS volume is mounted directly on a unix system.
>
> Hadoop is a user space file system. The latency is higher. Nodes are
> connected by GIG-E. It is closely coupled with MAP/REDUCE.
>
> You can use the API or the FUSE module to mount hadoop but that is not
> a direct goal of hadoop. Hope that helps.
>


Re: Using HDFS to serve www requests

2009-03-26 Thread Edward Capriolo
It is a little more natural to connect to HDFS from apache tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS-API.

I have modified this code to run inside tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.

GlusterFS is more like a traditional POSIX filesystem. It supports
locking and appends and you can do things like put the mysql data
directory on it.

GlusterFS is geared for storing data to be accessed with low latency.
Nodes (bricks) are normally connected via GigE or InfiniBand. The
GlusterFS volume is mounted directly on a Unix system.

Hadoop is a user-space file system. The latency is higher. Nodes are
connected by GigE. It is closely coupled with MapReduce.

You can use the API or the FUSE module to mount hadoop but that is not
a direct goal of hadoop. Hope that helps.
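
For anyone wondering what "just use the HDFS-API" from tomcat might look like,
here is a minimal hypothetical sketch (not Edward's actual code; the servlet
name, the /images path layout, and the error handling are assumptions). It
follows the same open/copy pattern as the HadoopDfsReadWriteExample page
linked above.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical servlet: maps the request path onto an HDFS path and streams it back.
public class HdfsImageServlet extends HttpServlet {

    private FileSystem fs;

    public void init() throws ServletException {
        try {
            // fs.default.name in the Configuration must point at the namenode,
            // e.g. hdfs://namenode:9000
            fs = FileSystem.get(new Configuration());
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // The /images/<name> layout is an assumption; adjust to your own tree.
        Path p = new Path("/images" + req.getPathInfo());
        if (!fs.exists(p)) {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }
        InputStream in = fs.open(p);
        OutputStream out = resp.getOutputStream();
        try {
            // 4 KB buffer; false = do not close the streams inside copyBytes.
            IOUtils.copyBytes(in, out, 4096, false);
        } finally {
            in.close();
            out.close();
        }
    }
}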


RE: corrupt unreplicated block in dfs (0.18.3)

2009-03-26 Thread Koji Noguchi
Mike, you might want to look at -move option in fsck.

bash-3.00$ hadoop fsck
Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks
[-locations | -racks]]]
<path>  start checking from this path
-move   move corrupted files to /lost+found
-delete delete corrupted files
-files  print out files being checked
-openforwrite   print out files opened for write
-blocks print out block report
-locations  print out locations for every block
-racks  print out network topology for data-node locations



I never use it since I would rather have users' jobs fail than have jobs
succeed with incomplete inputs.

Koji


-Original Message-
From: Aaron Kimball [mailto:aa...@cloudera.com] 
Sent: Thursday, March 26, 2009 9:41 AM
To: core-user@hadoop.apache.org
Subject: Re: corrupt unreplicated block in dfs (0.18.3)

Just because a block is corrupt doesn't mean the entire file is corrupt.
Furthermore, the presence/absence of a file in the namespace is a
completely
separate issue to the data in the file. I think it would be a surprising
interface change if files suddenly disappeared just because 1 out of
potentially many blocks were corrupt.

- Aaron

On Thu, Mar 26, 2009 at 1:21 PM, Mike Andrews  wrote:

> i noticed that when a file with no replication (i.e., replication=1)
> develops a corrupt block, hadoop takes no action aside from the
> datanode throwing an exception to the client trying to read the file.
> i manually corrupted a block in order to observe this.
>
> obviously, with replication=1 its impossible to fix the block, but i
> thought perhaps hadoop would take some other action, such as deleting
> the file outright, or moving it to a "corrupt" directory, or marking
> it or keeping track of it somehow to note that there's un-fixable
> corruption in the filesystem? thus, the current behaviour seems to
> sweep the corruption under the rug and allows its continued existence,
> aside from notifying the specific client doing the read with an
> exception.
>
> if anyone has any information about this issue or how to work around
> it, please let me know.
>
> on the other hand, i tested that corrupting a block in a replication=3
> file causes hadoop to re-replicate the block from another existing
> copy, which is good and is i what i expected.
>
> best,
> mike
>
>
> --
> permanent contact information at http://mikerandrews.com
>


Re: Using HDFS to serve www requests

2009-03-26 Thread Brian Bockelman


On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:

In general, Hadoop is unsuitable for the application you're suggesting.

Systems like Fuse HDFS do exist, though they're not widely used.


We use FUSE on a 270TB cluster to serve up physics data because the  
client (2.5M lines of C++) doesn't understand how to connect to HDFS  
directly.


Brian


I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have huge images, how big is "huge?" It might be
useful if these images are 1 GB or larger. But in general, "huge" on Hadoop
means 10s of GBs up to TBs.  If you have a large number of moderately-sized
files, you'll find that HDFS responds very poorly for your needs.

It sounds like glusterfs is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer  wrote:


This is somewhat of a noob question I know, but after learning about
Hadoop, testing it in a small cluster and running Map Reduce jobs on
it, I'm still not sure if Hadoop is the right distributed file system
to serve web requests.  In other words, can, or is it right to, serve
Images and data from HDFS using something like FUSE to mount a
filesystem where Apache could serve images from it?  We have huge
images, thus the need for a distributed file system, and they go in,
get stored with lots of metadata, and are redundant with Hadoop/HDFS -
but is it the right way to serve web content?

I looked at glusterfs before, they had an Apache and Lighttpd module
which made it simple, does HDFS have something like this, do people
just use a FUSE option as I described, or is this not a good use of
Hadoop?

Thanks

P





Re: Using HDFS to serve www requests

2009-03-26 Thread Aaron Kimball
In general, Hadoop is unsuitable for the application you're suggesting.
Systems like Fuse HDFS do exist, though they're not widely used. I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have huge images, how big is "huge?" It might be
useful if these images are 1 GB or larger. But in general, "huge" on Hadoop
means 10s of GBs up to TBs.  If you have a large number of moderately-sized
files, you'll find that HDFS responds very poorly for your needs.

It sounds like glusterfs is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer  wrote:

> This is somewhat of a noob question I know, but after learning about
> Hadoop, testing it in a small cluster and running Map Reduce jobs on
> it, I'm still not sure if Hadoop is the right distributed file system
> to serve web requests.  In other words, can, or is it right to, serve
> Images and data from HDFS using something like FUSE to mount a
> filesystem where Apache could serve images from it?  We have huge
> images, thus the need for a distributed file system, and they go in,
> get stored with lots of metadata, and are redundant with Hadoop/HDFS -
> but is it the right way to serve web content?
>
> I looked at glusterfs before, they had an Apache and Lighttpd module
> which made it simple, does HDFS have something like this, do people
> just use a FUSE option as I described, or is this not a good use of
> Hadoop?
>
> Thanks
>
> P
>


Re: comma delimited files

2009-03-26 Thread Aaron Kimball
Sure. Put all your comma-delimited data into the output key as a Text
object, and set the output value to the empty string. It'll dump the output
key, as text, to the reducer output files.
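
A minimal sketch of that with the 0.18-era mapred API (the class name and
record layout here are illustrative assumptions, not standard Hadoop classes):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: joins the fields of each record with commas and emits
// the whole line as the key, so TextOutputFormat writes plain CSV lines.
public class CsvExportReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, NullWritable> {

    private final Text line = new Text();

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            // key and value are assumed to already hold the fields you want;
            // adjust the formatting to your own record layout.
            line.set(key.toString() + "," + values.next().toString());
            // A NullWritable value (or an empty Text, as suggested above) keeps
            // the value side of the output line empty.
            output.collect(line, NullWritable.get());
        }
    }
}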

- Aaron

On Thu, Mar 26, 2009 at 4:14 PM, nga pham  wrote:

> Hi all,
>
>
> Can Hadoop export into comma delimited files?
>
>
> Thank you,
>
> Nga P.
>


Re: corrupt unreplicated block in dfs (0.18.3)

2009-03-26 Thread Aaron Kimball
Just because a block is corrupt doesn't mean the entire file is corrupt.
Furthermore, the presence/absence of a file in the namespace is a completely
separate issue to the data in the file. I think it would be a surprising
interface change if files suddenly disappeared just because 1 out of
potentially many blocks were corrupt.

- Aaron

On Thu, Mar 26, 2009 at 1:21 PM, Mike Andrews  wrote:

> i noticed that when a file with no replication (i.e., replication=1)
> develops a corrupt block, hadoop takes no action aside from the
> datanode throwing an exception to the client trying to read the file.
> i manually corrupted a block in order to observe this.
>
> obviously, with replication=1 its impossible to fix the block, but i
> thought perhaps hadoop would take some other action, such as deleting
> the file outright, or moving it to a "corrupt" directory, or marking
> it or keeping track of it somehow to note that there's un-fixable
> corruption in the filesystem? thus, the current behaviour seems to
> sweep the corruption under the rug and allows its continued existence,
> aside from notifying the specific client doing the read with an
> exception.
>
> if anyone has any information about this issue or how to work around
> it, please let me know.
>
> on the other hand, i tested that corrupting a block in a replication=3
> file causes hadoop to re-replicate the block from another existing
> copy, which is good and is i what i expected.
>
> best,
> mike
>
>
> --
> permanent contact information at http://mikerandrews.com
>


Re: Super-long reduce task timeouts in hadoop-0.19.0

2009-03-26 Thread John Lee
Woops. Sorry about the extra noise.


Re: Super-long reduce task timeouts in hadoop-0.19.0

2009-03-26 Thread John Lee
Sorry, I should add the TaskTracker log messages I'm seeing relating
to such a hung task

2009-03-26 08:16:32,152 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_20090331_0766_r_07_0
2009-03-26 08:16:32,986 INFO org.apache.hadoop.mapred.TaskTracker:
Trying to launch : attempt_20090331_0766_r_07_0
2009-03-26 08:51:59,146 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_20090331_0766_r_07_0
2009-03-26 08:51:59,361 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_20090331_0766_r_07_0: Task
attempt_20090331_0766_r_07_0 failed to report status for 2127
seconds. Killing!
2009-03-26 08:51:59,374 INFO org.apache.hadoop.mapred.TaskTracker:
About to purge task: attempt_20090331_0766_r_07_0
2009-03-26 08:51:59,375 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_20090331_0766_r_07_0 done; removing files.
2009-03-26 08:51:59,464 WARN org.apache.hadoop.mapred.TaskTracker:
Unknown child task finshed: attempt_20090331_0766_r_07_0.
Ignored.

And the relevant JobTracker logs.

2009-03-26 08:16:32,150 INFO org.apache.hadoop.mapred.JobTracker:
Adding task 'attempt_20090331_0766_r_07_0' to tip
task_20090331_0766_r_07, for tracker
'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'
2009-03-26 08:52:04,365 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from attempt_20090331_0766_r_07_0: Task
attempt_20090331_0766_r_07_0 failed to report status for 2127
seconds. Killing!
2009-03-26 08:52:04,367 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'attempt_20090331_0766_r_07_0' from
'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'


On Thu, Mar 26, 2009 at 11:51 AM, John Lee  wrote:
> I have this same issue re: lots of failed reduce tasks.
>
> From the WebUI, it looks like the jobs are failing in the shuffle
> phase. The shuffle phase for the failed attempts took about a third
> the time of the successful attempts.
>
> I have also noted that in 0.19.0, my reduces often get started but
> then remained in the "unassigned" state for a long time before timing
> out. No evidence of these tasks in the local taskTracker dir.
>
> The latter problem sounds like HADOOP-5407, but is the former problem
> (reduces timing out) just a secondary symptom of HADOOP-5407? My
> TaskTrackers aren't hanging, though (other reduce tasks in the same
> job run to completion).
>
> - John
>
> On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das  wrote:
>> Bryan, the message
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/
>> attempt_200902061117_3388_r_66_0/output/file.out in any of the
>> configured local directories
>>
>> is spurious. That had been reported in 
>> https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is there in 
>> the trunk. I guess I should commit that fix to the 0.20 and 0.19 branches 
>> too. Meanwhile, please apply the patch on your repository if you can.
>> Regarding the tasks timing out, do you know whether the reduce tasks were in 
>> the shuffle phase or the reducer phase? That you can deduce by looking at 
>> the task web UI for the failed tasks, or, the task logs.
>> Also, from your reduce method, do you ensure that progress reports are sent 
>> every so often? By default, progress reports are sent for every record-group 
>> that the reducer method is invoked with, and, for every record that the 
>> reducer emits. If the timeout is not happening in the shuffle, then the 
>> problematic part is the reduce method where the timeout could be happening 
>> because a lot of time is spent in the processing of a particular 
>> record-group, or, the write of the output record to the hdfs is taking a 
>> long time.
>>
>>
>> On 2/21/09 5:28 AM, "Bryan Duxbury"  wrote:
>>
>> (Repost from the dev list)
>>
>> I noticed some really odd behavior today while reviewing the job
>> history of some of our jobs. Our Ganglia graphs showed really long
>> periods of inactivity across the entire cluster, which should
>> definitely not be the case - we have a really long string of jobs in
>> our workflow that should execute one after another. I figured out
>> which jobs were running during those periods of inactivity, and
>> discovered that almost all of them had 4-5 failed reduce tasks, with
>> the reason for failure being something like:
>>
>> Task attempt_200902061117_3382_r_38_0 failed to report status for
>> 1282 seconds. Killing!
>>
>> The actual timeout reported varies from 700-5000 seconds. Virtually
>> all of our longer-running jobs were affected by this problem. The
>> period of inactivity on the cluster seems to correspond to the amount
>> of time the job waited for these reduce tasks to fail.
>>
>> I checked out the tasktracker log for the machines with timed-out

Re: Super-long reduce task timeouts in hadoop-0.19.0

2009-03-26 Thread John Lee
I have this same issue re: lots of failed reduce tasks.

>From the WebUI, it looks like the jobs are failing in the shuffle
phase. The shuffle phase for the failed attempts took about a third
the time of the successful attempts.

I have also noted that in 0.19.0, my reduces often get started but
then remained in the "unassigned" state for a long time before timing
out. No evidence of these tasks in the local taskTracker dir.

The latter problem sounds like HADOOP-5407, but is the former problem
(reduces timing out) just a secondary symptom of HADOOP-5407? My
TaskTrackers aren't hanging, though (other reduce tasks in the same
job run to completion).

- John

On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das  wrote:
> Bryan, the message
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/
> attempt_200902061117_3388_r_66_0/output/file.out in any of the
> configured local directories
>
> is spurious. That had been reported in 
> https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is there in the 
> trunk. I guess I should commit that fix to the 0.20 and 0.19 branches too. 
> Meanwhile, please apply the patch on your repository if you can.
> Regarding the tasks timing out, do you know whether the reduce tasks were in 
> the shuffle phase or the reducer phase? That you can deduce by looking at the 
> task web UI for the failed tasks, or, the task logs.
> Also, from your reduce method, do you ensure that progress reports are sent 
> every so often? By default, progress reports are sent for every record-group 
> that the reducer method is invoked with, and, for every record that the 
> reducer emits. If the timeout is not happening in the shuffle, then the 
> problematic part is the reduce method where the timeout could be happening 
> because a lot of time is spent in the processing of a particular 
> record-group, or, the write of the output record to the hdfs is taking a long 
> time.
>
>
> On 2/21/09 5:28 AM, "Bryan Duxbury"  wrote:
>
> (Repost from the dev list)
>
> I noticed some really odd behavior today while reviewing the job
> history of some of our jobs. Our Ganglia graphs showed really long
> periods of inactivity across the entire cluster, which should
> definitely not be the case - we have a really long string of jobs in
> our workflow that should execute one after another. I figured out
> which jobs were running during those periods of inactivity, and
> discovered that almost all of them had 4-5 failed reduce tasks, with
> the reason for failure being something like:
>
> Task attempt_200902061117_3382_r_38_0 failed to report status for
> 1282 seconds. Killing!
>
> The actual timeout reported varies from 700-5000 seconds. Virtually
> all of our longer-running jobs were affected by this problem. The
> period of inactivity on the cluster seems to correspond to the amount
> of time the job waited for these reduce tasks to fail.
>
> I checked out the tasktracker log for the machines with timed-out
> reduce tasks looking for something that might explain the problem,
> but the only thing I came up with that actually referenced the failed
> task was this log message, which was repeated many times:
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/
> attempt_200902061117_3388_r_66_0/output/file.out in any of the
> configured local directories
>
> I'm not sure what this means; can anyone shed some light on this
> message?
>
> Further confusing the issue, on the affected machines, I looked in
> logs/userlogs/, and to my surprise, the directory and log
> files existed, and the syslog file seemed to contain logs of a
> perfectly good reduce task!
>
> Overall, this seems like a pretty critical bug. It's consuming up to
> 50% of the runtime of our jobs in some instances, killing our
> throughput. At the very least, it seems like the reduce task timeout
> period should be MUCH shorter than the current 10-20 minutes.
>
> -Bryan
>
>
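
As a concrete illustration of the progress-report point quoted above, here is a
minimal sketch with the old mapred API (the class name and the report interval
of 1000 values are arbitrary choices, not anything from this thread):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer skeleton: if a single record-group takes a long time to
// process, call reporter.progress() periodically so the TaskTracker does not
// kill the attempt for failing to report status.
public class LongRunningReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        long seen = 0;
        while (values.hasNext()) {
            Text value = values.next();
            // ... expensive per-record work would go here ...
            if (++seen % 1000 == 0) {
                reporter.progress();  // heartbeat so the attempt is not timed out
                reporter.setStatus("processed " + seen + " values for " + key);
            }
            output.collect(key, value);
        }
    }
}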


Re: virtualization with hadoop

2009-03-26 Thread Edward Capriolo
I use linux-vserver http://linux-vserver.org/

The Linux-VServer technology is a soft partitioning concept based on
Security Contexts which permits the creation of many independent
Virtual Private Servers (VPS) that run simultaneously on a single
physical server at full speed, efficiently sharing hardware resources.

Usually whenever people talk about virtual machines, I always hear
about VMware, Xen, QEMU. For MY purposes Linux Vserver is far superior
to all of them and its very helpful for the hadoop work I do. (I only
want linux guests)

No emulation overhead - I installed VMWare server on my laptop and was
able to get 3 linux instances running before the system was unusable,
the instances were not even doing anything.

With VServer my system is not wasting cycles emulating devices. VMs
are securely sharing a kernel and memory. You can effectively run many
more VMs at once. This leaves the processor for user processes
(hadoop), not emulation overhead.

A minimal installation is 50 MB. I do not need a multi GB Linux
install just to test a version of hadoop. This allows me to recklessly
make VMs for whatever I want and not have to worry about GB chunks of
my hard drive going with each VM.

I can tar up a VM and use it as a template to install another VM. Thus
I can deploy a new system in under 30 seconds. The HTTP RPM install
takes about 2 minutes.

The guest is chroot 'ed. I can easily copy files into the guest using
copy commands. Think ant deploy -DTARGETDIR=/path/to/guest.

>>But it is horribly slow if you do not have enough RAM and multiple
>>disks since all I/O operations go to the same disk.

VServer will not solve this problem, but at least you won't be losing
I/O to 'emulation'.

If you are working with hadoop and you need to be able to have
multiple versions running, with different configurations, take a look
at VServer.


comma delimited files

2009-03-26 Thread nga pham
Hi all,


Can Hadoop export into comma delimited files?


Thank you,

Nga P.


Using HDFS to serve www requests

2009-03-26 Thread phil cryer
This is somewhat of a noob question I know, but after learning about
Hadoop, testing it in a small cluster and running Map Reduce jobs on
it, I'm still not sure if Hadoop is the right distributed file system
to serve web requests.  In other words, can, or is it right to, serve
images and data from HDFS using something like FUSE to mount a
filesystem where Apache could serve images from it?  We have huge
images, thus the need for a distributed file system, and they go in,
get stored with lots of metadata, and are redundant with Hadoop/HDFS -
but is it the right way to serve web content?

I looked at glusterfs before, they had an Apache and Lighttpd module
which made it simple, does HDFS have something like this, do people
just use a FUSE option as I described, or is this not a good use of
Hadoop?

Thanks

P


Re: Join Variation

2009-03-26 Thread jason hadoop
For the classic map/reduce job, you have 3 requirements.

1) a comparator that provide the keys in ip address order, such that all
keys in one of your ranges, would be contiguous, when sorted with the
comparator
2) a partitioner that ensures that all keys that should be together end up
in the same partition
3) an output value grouping comparator that considers all keys in a
specified range equal.

The comparator only sorts by the first part of the key; the search file has
a 2-part key (begin/end) while the input data has just a 1-part key.

A partitioner that knew ahead of time the group sets in your search set, in
the way that the terasort example works, would be ideal:
i.e., it builds an index of ranges from your search set so that the ranges get
roughly evenly split between your reduces.
This requires a pass over the search file to write out a summary file, which
is then loaded by the partitioner.

The output value grouping comparator, will get the keys in order of the
first token, and will define the start of a group by the presence of a 2
part key, and consider the group ended when either another 2 part key
appears, or when the key value is larger than the second part of the
starting key. - This does require that the grouping comparator maintain
state.

At this point, your reduce will be called with the first key in the key
equivalence group of (3), with the values of all of the keys

In your map, any address that is not in a range of interest is not passed to
output.collect.

For the map side join code, you have to define a comparator on the key type
that defines your definition of equivalence and ordering, and call
WritableComparator.define( Key.class, comparator.class ), to force the join
code to use your comparator.

For tables with duplicates, per the key comparator, in map side join, your
map function will receive a row for every permutation of the duplicate keys:
if you have one table a, 1; a, 2; and another table with a, 3; a, 4; your
map will receive 4 rows: a, 1, 3; a, 1, 4; a, 2, 3; a, 2, 4.
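
To make requirement (1) above concrete, here is a minimal sketch of such a
composite key and its first-part sort comparator (IpKey and its field names are
assumptions for illustration, not code from this thread); the range-aware
partitioner and the stateful grouping comparator would be built around the same
toIp field.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical composite key for this join: main-file records carry only "ip"
// (toIp stays -1); search-file records carry ip (= from-ip) and toIp (= to-ip).
public class IpKey implements WritableComparable {

    public long ip;        // the address itself, or the start of a range
    public long toIp = -1; // end of the range, -1 for plain 1-part keys

    public void write(DataOutput out) throws IOException {
        out.writeLong(ip);
        out.writeLong(toIp);
    }

    public void readFields(DataInput in) throws IOException {
        ip = in.readLong();
        toIp = in.readLong();
    }

    // Requirement (1): order by the first part of the key only, so a range key
    // and the plain addresses that fall inside it end up adjacent after the sort.
    public int compareTo(Object other) {
        IpKey o = (IpKey) other;
        return ip < o.ip ? -1 : (ip == o.ip ? 0 : 1);
    }

    // For the map-side join case, this is the comparator you would register via
    // WritableComparator.define(IpKey.class, new FirstPartComparator()).
    public static class FirstPartComparator extends WritableComparator {
        public FirstPartComparator() {
            super(IpKey.class);
        }
        public int compare(WritableComparable a, WritableComparable b) {
            return a.compareTo(b);
        }
    }
}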


On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara wrote:

> Thanks to all who replied.
>
> Stefan -
> I'm unable to see how converting IP ranges to network masks would help
> because different ranges can have the same network mask and with that I
> still have to do a comparison of two fields: the searched IP with
> from-IP&mask.
>
> Pig - I'm familiar with Pig and use it often, but I can't think of a
> way to write a Pig script that will do this type of "join". I'll ask the
> Pig users group.
>
> The search file is indeed large in terms of the number of records. However, I
> don't see this as an issue yet, because I'm still puzzled about how to write
> the job in plain MR. The join code looks for an exact match on the keys,
> and that is not what I need. Would a custom comparator which looks for a
> match between the ranges be the right choice to do this?
>
> Thanks,
> Tamir
>
> On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop wrote:
>
> > If the search file data set is large, the issue becomes ensuring that
> only
> > the required portion of search file is actually read, and that those
> reads
> > are ordered, in search file's key order.
> >
> > If the data set is small, most any of the common patterns will work.
> >
> > I haven't looked at pig for a while, does pig now use indexes in map
> files,
> > and take into account that a data set is sorted?
> > Out of the box, the map side join code, org.apache.hadoop.mapred.join
> will
> > do a decent job of this, but the entire search file set will be read.
> > To stop reading the entire search file, a record reader or join type,
> would
> > need to be put together to:
> > a) skip to the first key of interest, using the index if available
> > b) finish when the last possible key of interest has been delivered.
> >
> > On Wed, Mar 25, 2009 at 6:05 AM, John Lee 
> wrote:
> >
> > > In addition to other suggestions, you could also take a look at
> > > building a Cascading job with a custom Joiner class.
> > >
> > > - John
> > >
> > > On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara 
> > > wrote:
> > > > Hi,
> > > >
> > > > We need to implement a Join with a between operator instead of an
> > equal.
> > > > What we are trying to do is search a file for a key where the key
> falls
> > > > between two fields in the search file like this:
> > > >
> > > > main file (ip, a, b):
> > > > (80, zz, yy)
> > > > (125, vv, bb)
> > > >
> > > > search file (from-ip, to-ip, d, e):
> > > > (52, 75, xxx, yyy)
> > > > (78, 98, aaa, bbb)
> > > > (99, 115, xxx, ddd)
> > > > (125, 130, hhh, aaa)
> > > > (150, 162, qqq, sss)
> > > >
> > > > the outcome should be in the form (ip, a, b, d, e):
> > > > (80, zz, yy, aaa, bbb)
> > > > (125, vv, bb, eee, hhh)
> > > >
> > > > We could convert the ip ranges in the search file to single record
> ips
> > > and
> > > > then do a regular join, but the number of single ips is huge and this
> > is
> > > > probably not a good way.
> > > > What 

Re: hadoop need help please suggest

2009-03-26 Thread Snehal Nagmote


Sorry for the inconvenience caused... I will not spam core-dev.
The scale we are thinking of is more nodes in the coming future; it can go to
petabytes of data.
Can you please give some pointers for handling this issue? I am quite
new to Hadoop.

Regards,
Snehal




Raghu Angadi wrote:
> 
> 
> What is scale you are thinking of? (10s, 100s or more nodes)?
> 
> The memory for metadata at NameNode you mentioned is that main issue 
> with small files. There are multiple alternatives for the dealing with 
> that. This issue is discussed many times here.
> 
> Also please use core-user@ id alone for asking for help.. you don't need 
> to send to core-devel@
> 
> Raghu.
> 
> snehal nagmote wrote:
>> Hello Sir,
>> 
>> I have some doubts, please help me.
>> We have a requirement for a scalable storage system. We have developed an
>> agro-advisory system in which farmers send crop pictures in a sequential
>> manner; some 6-7 photos of 3-4 KB each would be stored on the storage
>> server, and these photos would be read sequentially by a scientist to
>> detect the problem. Writing to the images would not be done.
>> 
>> So for storing these images we are using the Hadoop file system. Is it
>> feasible to use the Hadoop file system for this purpose?
>> 
>> Also, as the images are only 3-4 KB and Hadoop reads data in blocks of
>> 64 MB, how can we increase the performance? What tricks and tweaks
>> should be done to use Hadoop for this kind of purpose?
>> 
>> The next problem is that, since Hadoop stores all the metadata in memory,
>> can we use some mechanism to store the files in blocks of some greater
>> size? Because the files are small, the namenode will store lots of
>> metadata and may overflow main memory.
>> Please suggest what could be done.
>> 
>> 
>> regards,
>> Snehal
>> 
> 
> 
> 
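
One of the alternatives that usually comes up for the small-files /
namenode-memory problem is packing the images into a container format such as
a SequenceFile. A minimal sketch of that idea follows (the class name, the
Text/BytesWritable key-value choice, and the path layout are assumptions, not
something prescribed by Hadoop or by this thread):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical packer: rolls many small local image files into one SequenceFile
// keyed by file name, so the namenode tracks one large file instead of
// thousands of tiny ones.
public class ImagePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]); // e.g. /farm/images-batch-001.seq (illustrative)
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) {
                File img = new File(args[i]);
                byte[] buf = new byte[(int) img.length()];
                FileInputStream in = new FileInputStream(img);
                try {
                    int off = 0;
                    while (off < buf.length) {
                        int n = in.read(buf, off, buf.length - off);
                        if (n < 0) {
                            break; // unexpected EOF; keep what was read
                        }
                        off += n;
                    }
                } finally {
                    in.close();
                }
                // Key = original file name, value = raw image bytes.
                writer.append(new Text(img.getName()), new BytesWritable(buf));
            }
        } finally {
            writer.close();
        }
    }
}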

-- 
View this message in context: 
http://www.nabble.com/hadoop-need-help-please-suggest-tp22666530p22721718.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



corrupt unreplicated block in dfs (0.18.3)

2009-03-26 Thread Mike Andrews
i noticed that when a file with no replication (i.e., replication=1)
develops a corrupt block, hadoop takes no action aside from the
datanode throwing an exception to the client trying to read the file.
i manually corrupted a block in order to observe this.

obviously, with replication=1 it's impossible to fix the block, but i
thought perhaps hadoop would take some other action, such as deleting
the file outright, or moving it to a "corrupt" directory, or marking
it or keeping track of it somehow to note that there's un-fixable
corruption in the filesystem? thus, the current behaviour seems to
sweep the corruption under the rug and allows its continued existence,
aside from notifying the specific client doing the read with an
exception.

if anyone has any information about this issue or how to work around
it, please let me know.

on the other hand, i tested that corrupting a block in a replication=3
file causes hadoop to re-replicate the block from another existing
copy, which is good and is what I expected.

best,
mike


-- 
permanent contact information at http://mikerandrews.com


Re: how to mount specification-path of hdfs with fuse-dfs

2009-03-26 Thread Craig Macdonald

Hi Jacky,

Pleased to hear that fuse-dfs is working for you.

Do you mean that you want to mount dfs://localhost:9000/users at /mnt/hdfs ?

If so, fuse-dfs doesn't currently support this, but it would be a good 
idea for a future improvement.


Craig

jacky_ji wrote:

I can use fuse-dfs to mount HDFS, just like this: ./fuse-dfs
dfs://localhost:9000 /mnt/hdfs -d
but I want to mount a specific path in HDFS now, and I have no idea
about it; any advice will be appreciated.




Identify the input file for a failed mapper/reducer

2009-03-26 Thread Jason Fennell
Is there a way to identify the input file a mapper was running on when
it failed?  When a large job fails because of bad input lines I have
to resort to rerunning the entire job to isolate a single bad line
(since the log doesn't contain information on the file that that
mapper was running on).

Basically, I would like to be able to do one of the following:
1. Find the file that a mapper was running on when it failed
2. Find the block that a mapper was running on when it failed (and be
able to find file names from block ids)

I haven't been able to find any documentation on facilities to
accomplish either (1) or (2), so I'm hoping someone on this list will
have a suggestion.

I am using the Hadoop streaming API on hadoop 0.18.2.

-Jason


Re: reduce task failing after 24 hours waiting

2009-03-26 Thread Billy Pearson
There are many maps finishing, taking from 4 minutes to 15 minutes (less time 
closer to the end of the job), so no timeout there. The state of the reduce 
tasks is shuffle; they are grabbing the map outputs as they finish. The current 
job took 50:43:37, and each of the reduce tasks failed twice in that time, once 
at 24 hours in and a second time at 48 hours in. On the next run, in a few days, 
I will test setting mapred.jobtracker.retirejob.interval and 
mapred.userlog.retain.hours to 72 hours and see if that solves the problem. So 
not a bad guess, though it seems odd that within 5 minutes of 24 hours, both 
times, all the tasks failed at the same time.



It looks like from the tasktracker logs I get the WARN below:
org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_01_1 
Child Error



Grepping the tasktracker log for one of the reduces that failed (I do not have 
debug turned on, so all I have are the info logs):


2009-03-25 18:37:45,473 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_200903212204_0005_r_01_1 0.3083758% reduce > copy (2360 of 2551 
at 0.87 MB/s) >
2009-03-25 18:37:48,476 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_200903212204_0005_r_01_1 0.3083758% reduce > copy (2360 of 2551 
at 0.87 MB/s) >
2009-03-25 18:37:49,194 INFO org.apache.hadoop.mapred.TaskTracker: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_01_1/output/file.out 
in any of the configured local directories
2009-03-25 18:37:49,480 INFO org.apache.hadoop.mapred.TaskTracker: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_01_1/output/file.out 
in any of the configured local directories
2009-03-25 18:37:51,481 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_200903212204_0005_r_01_1 0.3083758% reduce > copy (2360 of 2551 
at 0.87 MB/s) >
2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner: 
attempt_200903212204_0005_r_01_1 Child Error
2009-03-25 18:37:54,497 INFO org.apache.hadoop.mapred.TaskTracker: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_01_1/output/file.out 
in any of the configured local directories
2009-03-25 18:37:57,400 INFO org.apache.hadoop.mapred.TaskRunner: 
attempt_200903212204_0005_r_01_1 done; removing files.
2009-03-25 18:42:25,191 INFO org.apache.hadoop.mapred.TaskTracker: 
LaunchTaskAction (registerTask): attempt_200903212204_0005_r_01_1 task's 
state:FAILED_UNCLEAN
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
launch : attempt_200903212204_0005_r_01_1
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: In 
TaskLauncher, current free slots : 1 and trying to launch 
attempt_200903212204_0005_r_01_1
2009-03-25 18:42:30,134 INFO org.apache.hadoop.mapred.TaskTracker: JVM with 
ID: jvm_200903212204_0005_r_437314552 given task: 
attempt_200903212204_0005_r_01_1
2009-03-25 18:42:30,196 INFO org.apache.hadoop.mapred.TaskTracker: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_01_1/output/file.out 
in any of the configured local directories
2009-03-25 18:42:32,530 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_200903212204_0005_r_01_1 0.0%
2009-03-25 18:42:32,555 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_200903212204_0005_r_01_1 0.0% cleanup
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: Task 
attempt_200903212204_0005_r_01_1 is done.
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: reported 
output size for attempt_200903212204_0005_r_01_1  was 0
2009-03-25 18:42:32,568 INFO org.apache.hadoop.mapred.TaskRunner: 
attempt_200903212204_0005_r_01_1 done; removing files.



Grepping the jobtracker log for the same task:

2009-03-25 18:37:54,500 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
from attempt_200903212204_0005_r_01_1: java.io.IOException: Task process 
exit with nonzero status of 255.
2009-03-25 18:42:25,186 INFO org.apache.hadoop.mapred.JobTracker: Adding 
task (cleanup)'attempt_200903212204_0005_r_01_1' to tip 
task_200903212204_0005_r_01, for tracker 
'tracker_server-1:localhost.localdomain/127.0.0.1:38816'
2009-03-25 18:42:32,589 INFO org.apache.hadoop.mapred.JobTracker: Removed 
completed task 'attempt_200903212204_0005_r_01_1' from 
'tracker_server-1:localhost.localdomain/127.0.0.1:38816'







"Amar Kamat"  wrote in 
message news:49cafd8e.8010...@yahoo-inc.com...

Amareshwari Sriramadasu wrote:

Set mapred.jobtracker.retirejob.interval

This is used to retire completed jobs.

and mapred.userlog.retain.hours to higher value.

This is used to discard user logs.
By default, their values are 24 hours. These might be the reason for 
failure, though I'm not sure.


Tha

Re: HadoopConfig problem -Datanode not able to connect to the server

2009-03-26 Thread mingyang
Check that your iptables firewall is off.

2009/3/26 snehal nagmote 

> hello,
> We configured Hadoop successfully, but after some days the configuration
> file on the datanode (hadoop-site.xml) went missing, and the datanode was not
> coming up, so we did the same configuration again. It now shows one datanode,
> and its name as localhost rather than, as expected, either the name of the
> respective datanode machine or the IP address of the actual datanode in the
> Hadoop UI.
>
> But the capacity shows as 80.0 GB (we have one namenode (40 GB) and one
> datanode (40 GB)), which means the capacity is updated. We can browse the
> filesystem, and it shows whatever directories we create on the namenode.
>
> But when we try to access the same through the datanode machine, i.e.,
> doing ssh and executing a series of commands, it is not able to connect to
> the server, saying "retrying connect to server":
>
> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: /
> 172.16.6.102:21011. Already tried 0 time(s).
>
> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: /
> 172.16.6.102:21011. Already tried 1 time(s)
>
>
> Moreover, we added one more datanode and formatted the namenode, but that
> datanode is not getting added. We do not understand what the problem is.
>
> Can the configuration files on a datanode be automatically lost after some
> days?
>
> I have one more doubt. According to my understanding, the namenode doesn't
> store any data; it stores metadata for all the data. So when I execute mkdir
> on the namenode machine and copy some files into it, the data is actually
> stored on the datanode attached to it. Please correct me if I am wrong; I am
> very new to Hadoop.
> So if I am able to view the data through the interface, it is properly
> storing data on the respective datanode. So why does it show localhost as
> the datanode name rather than the respective datanode's name?
>
> can you please help.
>
>
> Regards,
> Snehal Nagmote
> IIIT hyderabad
>



-- 
Regards,

王明阳