Re: how to upload files by web page

2009-03-09 Thread Yang Zhou
:-) I am afraid you have to solve both of your questions yourself.
1. submit the urls to your own servlet.
2. develop your own code to read the input bytes from those URLs and save them
to HDFS.
There is no ready-made tool.

Good Luck.
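
A minimal sketch of step 2, using the standard FileSystem API to stream the
bytes from a URL into HDFS. The URL, target path and class name are made up
for illustration, and error handling is kept to a minimum:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UrlToHdfs {
  public static void main(String[] args) throws Exception {
    // Hypothetical source URL and HDFS destination.
    URL source = new URL("http://example.com/data/input.dat");
    Path target = new Path("hdfs://namenode:9000/uploads/input.dat");

    Configuration conf = new Configuration();           // picks up hadoop-site.xml
    FileSystem fs = FileSystem.get(target.toUri(), conf);

    InputStream in = source.openStream();
    FSDataOutputStream out = fs.create(target);
    try {
      // Copy the bytes from the URL into the HDFS file, 4 KB at a time.
      IOUtils.copyBytes(in, out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}

The servlet from step 1 could run the same copy loop once per submitted URL.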

2009/3/10 李睿 

> Thanks:)
>
>
>
> Could you give some more detail about your solution?
>
> I have some questions below:
>
> 1. Where can I submit the URLs to?
>
> 2. What is the backend service? Does it belong to HDFS?
>
>
>
>
>
>
> 2009/3/10 Yang Zhou 
>
> > Hi,
> >
> > I have done that before.
> >
> > My solution is :
> > 1. submit some FTP/SFTP/GridFTP urls of what you want to upload
> > 2. backend service will fetch those files/directories from FTP to HDFS
> > directly.
> >
> > Of course you can upload those files to the web server machine and then
> > move
> > them to HDFS. But since Hadoop is designed to process vast amounts of
> data,
> > I do think my solution is more efficient. :-)
> >
> > You can find how to make directory and save files to HDFS in the source
> > code
> > of "org.apache.hadoop.fs.FsShell".
> > 2009/3/9 
> >
> > >
> > >
> > > Hi, all,
> > >
> > >   I’m new to HDFS and want to upload files via JSP.
> > >
> > > Are there any APIs I can use?  Are there any demos?
> > >
> > >   Thanks for your help:)
> > >
> > >
> >
>


Re: Support for zipped input files

2009-03-09 Thread jason hadoop
Hadoop has support for S3; the compression support is handled at another
level and should also work.
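
If the zip format turns out not to be handled transparently, one possible
approach is to open the file straight from the S3 file system and walk its
entries with java.util.zip. A rough sketch, assuming an s3n:// URI and that
the AWS credentials are already set in the configuration (the bucket and key
are hypothetical):

import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadZipFromS3 {
  public static void main(String[] args) throws Exception {
    // Hypothetical bucket/key; credentials are assumed to be configured via
    // fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey.
    Path zipPath = new Path("s3n://my-bucket/archives/data.zip");

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(zipPath.toUri(), conf);

    InputStream raw = fs.open(zipPath);
    ZipInputStream zip = new ZipInputStream(raw);
    try {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        // The uncompressed bytes of each entry can now be read from 'zip',
        // for example to be written into HDFS or parsed into records.
        System.out.println(entry.getName());
      }
    } finally {
      zip.close();
    }
  }
}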


On Mon, Mar 9, 2009 at 9:05 PM, Ken Weiner  wrote:

> I have a lot of large zipped (not gzipped) files sitting in an Amazon S3
> bucket that I want to process.  What is the easiest way to process them
> with
> a Hadoop map-reduce job?  Do I need to write code to transfer them out of
> S3, unzip them, and then move them to HDFS before running my job, or does
> Hadoop have support for processing zipped input files directly from S3?
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Support for zipped input files

2009-03-09 Thread Ken Weiner
I have a lot of large zipped (not gzipped) files sitting in an Amazon S3
bucket that I want to process.  What is the easiest way to process them with
a Hadoop map-reduce job?  Do I need to write code to transfer them out of
S3, unzip them, and then move them to HDFS before running my job, or does
Hadoop have support for processing zipped input files directly from S3?


Re: how to upload files by web page

2009-03-09 Thread 李睿
Thanks:)



Could you give some more detail about your solution?

I have some questions below:

1. Where can I submit the URLs to?

2. What is the backend service? Does it belong to HDFS?






2009/3/10 Yang Zhou 

> Hi,
>
> I have done that before.
>
> My solution is :
> 1. submit some FTP/SFTP/GridFTP urls of what you want to upload
> 2. backend service will fetch those files/directories from FTP to HDFS
> directly.
>
> Of course you can upload those files to the web server machine and then
> move
> them to HDFS. But since Hadoop is designed to process vast amounts of data,
> I do think my solution is more efficient. :-)
>
> You can find how to make directory and save files to HDFS in the source
> code
> of "org.apache.hadoop.fs.FsShell".
> 2009/3/9 
>
> >
> >
> > Hi, all,
> >
> >   I’m new to HDFS and want to upload files via JSP.
> >
> > Are there any APIs I can use?  Are there any demos?
> >
> >   Thanks for your help:)
> >
> >
>


Re: how to upload files by web page

2009-03-09 Thread Yang Zhou
Hi,

I have done that before.

My solution is :
1. submit some FTP/SFTP/GridFTP urls of what you want to upload
2. backend service will fetch those files/directories from FTP to HDFS
directly.

Of course you can upload those files to the web server machine and then move
them to HDFS. But since Hadoop is designed to process vast amounts of data,
I do think my solution is more efficient. :-)

You can find how to make a directory and save files to HDFS in the source code
of "org.apache.hadoop.fs.FsShell".
2009/3/9 

>
>
> Hi, all,
>
>   I’m new to HDFS and want to upload files via JSP.
>
> Are there any APIs I can use?  Are there any demos?
>
>   Thanks for your help:)
>
>


Re: Reducer goes past 100% complete?

2009-03-09 Thread Devaraj Das
There is a JIRA for this: https://issues.apache.org/jira/browse/HADOOP-5210.
There was also a JIRA to address this problem with intermediate compression turned on,
and that one is fixed: https://issues.apache.org/jira/browse/HADOOP-3131.


On 3/9/09 9:15 PM, "Doug Cook"  wrote:



Hi folks,

I've recently upgraded to Hadoop 0.19.1 from a much, much older version of
Hadoop.

Most things in my application (a highly modified version of Nutch) are
working just fine, but one of them is bombing out with odd symptoms. The map
works just fine, but then reduce phase (a) runs extremely slowly and (b) the
"percentage complete" reporting for each reduce task doesn't stop at 100%,
it just keeps going on past that.

I figure I'll start by understanding the percentage-complete reporting
issue, since it's pretty concrete and may have some bearing on the
performance issue. It seems likely that my application is mis-configuring
the job, or otherwise not correctly using the Hadoop API. I don't think I'm
doing anything way out of the ordinary; my reducer simply creates an object,
wraps it in an ObjectWritable, and calls output.collect(), and I have a
local class that implements OutputFormat to take the object and put it in a
Lucene index. It does actually create correct output, at least for small
indices; on large indices, the performance problems are killing me.

I can and will start rummaging around in the Hadoop code to figure out how
it calculates percentage complete, and see what I'm not doing correctly, but
thought I'd ask here, too, to see if someone has good suggestions off the
top of their head.

Many thanks-

Doug Cook
--
View this message in context: 
http://www.nabble.com/Reducer-goes-past-100--complete--tp22413589p22413589.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: HDFS is corrupt, need to salvage the data.

2009-03-09 Thread lohit

How many datanodes do you have?
From the output it looks like at the point when you ran fsck, you had only one
datanode connected to your NameNode. Did you have others?
Also, I see that your default replication is set to 1. Can you check if your
datanodes are up and running?
Lohit
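
One quick way to check is "hadoop dfsadmin -report", or programmatically along
these lines; this is a sketch against the 0.19-era API, so the exact classes
may differ slightly by release:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDatanodes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    if (fs instanceof DistributedFileSystem) {
      // One DatanodeInfo per datanode the NameNode currently knows about.
      DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
      System.out.println("Datanodes reporting: " + nodes.length);
      for (DatanodeInfo node : nodes) {
        System.out.println(node.getName() + "  remaining=" + node.getRemaining() + " B");
      }
    }
    fs.close();
  }
}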



- Original Message 
From: Mayuran Yogarajah 
To: core-user@hadoop.apache.org
Sent: Monday, March 9, 2009 5:20:37 PM
Subject: HDFS is corrupt, need to salvage the data.

Hello, it seems the HDFS in my cluster is corrupt.  This is the output from 
hadoop fsck:
Total size:9196815693 B
Total dirs:17
Total files:   157
Total blocks:  157 (avg. block size 58578443 B)

CORRUPT FILES:157
MISSING BLOCKS:   157
MISSING SIZE: 9196815693 B

Minimally replicated blocks:   0 (0.0 %)
Over-replicated blocks:0 (0.0 %)
Under-replicated blocks:   0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor:1
Average block replication: 0.0
Missing replicas:  0
Number of data-nodes:  1
Number of racks:   1

It seems to say that there is 1 block missing from every file that was in the 
cluster..

I'm not sure how to proceed so any guidance would be much appreciated.  My 
primary
concern is recovering the data.

thanks



Re: Reducer goes past 100% complete?

2009-03-09 Thread jason hadoop
I noticed this getting much worse with block compression on the intermediate
map outputs, in the cloudera patched 18.3. I just assumed it was speculative
execution.
I wonder if one of the patches in the cloudera version has had an effect on
this.


On Mon, Mar 9, 2009 at 2:34 PM, Owen O'Malley  wrote:

>
> On Mar 9, 2009, at 1:00 PM, james warren wrote:
>
>  Speculative execution has existed far before 0.19.x, but AFAIK the > 100%
>> issue has appeared (at least with greater frequency) since 0.19.0 came
>> out.
>> Are you saying there are changes in how task progress is being tracked?
>>
>
> In the past, it has usually been when the input is compressed and the code
> is doing something like uncompressed / total compressed to figure out the
> done percent.
>
> -- Owen
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


HDFS is corrupt, need to salvage the data.

2009-03-09 Thread Mayuran Yogarajah
Hello, it seems the HDFS in my cluster is corrupt.  This is the output 
from hadoop fsck:

Total size:9196815693 B
Total dirs:17
Total files:   157
Total blocks:  157 (avg. block size 58578443 B)
 
 CORRUPT FILES:157
 MISSING BLOCKS:   157
 MISSING SIZE: 9196815693 B
 
Minimally replicated blocks:   0 (0.0 %)
Over-replicated blocks:0 (0.0 %)
Under-replicated blocks:   0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor:1
Average block replication: 0.0
Missing replicas:  0
Number of data-nodes:  1
Number of racks:   1

It seems to say that there is 1 block missing from every file that was
in the cluster.

I'm not sure how to proceed, so any guidance would be much appreciated.
My primary concern is recovering the data.

thanks


RE: DataNode stops cleaning disk?

2009-03-09 Thread Igor Bolotin
My mistake about the 'current' directory - that's the one that consumes all
the disk space, and 'du' on that directory matches exactly the size reported
in the NameNode web UI.
I'm waiting for the next time this happens to collect more details, but
ever since I wrote the first email everything has been working perfectly well
(another application of Murphy's law).

Thanks,
Igor

-Original Message-
From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
Sent: Thursday, March 05, 2009 12:06 PM
To: core-user@hadoop.apache.org
Subject: Re: DataNode stops cleaning disk?

Igor Bolotin wrote:
> That's what I saw just yesterday on one of the data nodes with this
> situation (will confirm also next time it happens):
> - Tmp and current were either empty or almost empty last time I
checked.
> - du on the entire data directory matched exactly with reported used
> space in NameNode web UI and it did report that it uses some most of
the
> available disk space. 
> - nothing else was using disk space (actually - it's dedicated DFS
> cluster).

If the 'du' command (which you can run in the shell) counts properly, then you
should be able to see which files are taking up space.

If 'du' can't, but 'df' reports very little space available, then it is
possible (though I have never seen it) that the datanode is keeping a lot of
these files open. 'ls -l /proc/<datanode pid>/fd' lists these files. If it is
not the datanode, then check lsof to find who is holding these files.

hope this helps.
Raghu.

> Thank you for help!
> Igor
> 
> -Original Message-
> From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
> Sent: Thursday, March 05, 2009 11:05 AM
> To: core-user@hadoop.apache.org
> Subject: Re: DataNode stops cleaning disk?
> 
> 
> This is unexpected unless some other process is eating up space.
> 
> Couple of things to collect next time (along with log):
> 
>   - All the contents under datanode-directory/ (especially including 
> 'tmp' and 'current')
>   - Does 'du' of this directory match with what is reported to
NameNode 
> (shown on webui) by this DataNode.
>   - Is there anything else taking disk space on the machine?
> 
> Raghu.
> 
> Igor Bolotin wrote:
>> Normally I dislike writing about problems without being able to
> provide
>> some more information, but unfortunately in this case I just can't
> find
>> anything.
>>
>>  
>>
>> Here is the situation - DFS cluster running Hadoop version 0.19.0.
The
>> cluster is running on multiple servers with practically identical
>> hardware. Everything works perfectly well, except for one thing -
from
>> time to time one of the data nodes (every time it's a different node)
>> starts to consume more and more disk space. The node keeps going and
> if
>> we don't do anything - it runs out of space completely (ignoring 20GB
>> reserved space settings). Once restarted - it cleans disk rapidly and
>> goes back to approximately the same utilization as the rest of data
>> nodes in the cluster.
>>
>>  
>>
>> Scanning datanodes and namenode logs and comparing thread dumps
> (stacks)
>> from nodes experiencing problem and those that run normally didn't
>> produce any clues. Running balancer tool didn't help at all. FSCK
> shows
>> that everything is healthy and number of over-replicated blocks is
not
>> significant.
>>
>>  
>>
>> To me - it just looks like at some point the data node stops cleaning
>> invalidated/deleted blocks, but keeps reporting space consumed by
> these
>> blocks as "not used", but I'm not familiar enough with the internals
> and
>> just plain don't have enough free time to start digging deeper.
>>
>>  
>>
>> Anyone has an idea what is wrong or what else we can do to find out
>> what's wrong or maybe where to start looking in the code?
>>
>>  
>>
>> Thanks,
>>
>> Igor
>>
>>  
>>
>>
> 



Re: Reducer goes past 100% complete?

2009-03-09 Thread Owen O'Malley


On Mar 9, 2009, at 1:00 PM, james warren wrote:

Speculative execution has existed far before 0.19.x, but AFAIK the >  
100%
issue has appeared (at least with greater frequency) since 0.19.0  
came out.
Are you saying there are changes in how task progress is being  
tracked?


In the past, it has usually been when the input is compressed and the  
code is doing something like uncompressed / total compressed to figure  
out the done percent.


-- Owen
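
A toy illustration of the effect Owen describes: if progress is computed as
decompressed bytes consumed over compressed file length, it passes 100% as soon
as the data expands (the numbers are made up):

public class ProgressOverflowDemo {
  public static void main(String[] args) {
    // Hypothetical: a 100 MB compressed split that inflates to 350 MB.
    long compressedLength = 100L * 1024 * 1024;
    long decompressedBytesRead = 350L * 1024 * 1024;  // what the record reader has consumed

    float progress = (float) decompressedBytesRead / compressedLength;
    System.out.printf("progress = %.0f%%%n", progress * 100);  // prints "progress = 350%"
  }
}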


Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread jason hadoop
I think the bug probably still exists and the fix only covers one trigger.

On Mon, Mar 9, 2009 at 12:45 PM, Brian Bockelman wrote:

> Hey Jason,
>
> Looks like you're on the right track.  We are getting stuck after the JVM
> forks but before it exec's.  It appears to be this bug:
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6671051
>
> However, this is using Java 1.6.0_11, and that bug was marked as fixed in
> 1.6.0_6 :(
>
> Any other ideas?
>
> Brian
>
>
> On Mar 9, 2009, at 2:21 PM, jason hadoop wrote:
>
>  There were a couple of fork timing errors in the jdk 1.5 that occasionally
>> caused a sub process fork to go bad, this could be the du/df being forked
>> off by the datanode and dying.
>>
>> I can't find the references I had saved away at one point, from the java
>> forums, but perhaps this will get you started.
>>
>> http://forums.sun.com/thread.jspa?threadID=5297465&tstart=0
>>
>> On Mon, Mar 9, 2009 at 11:23 AM, Brian Bockelman > >wrote:
>>
>>  It's very strange.  It appears that the second process is the result of a
>>> fork call, yet has only one thread running whose gdb backtrace looks like
>>> this:
>>>
>>> (gdb) bt
>>> #0  0x003e10c0af8b in __lll_mutex_lock_wait () from
>>> /lib64/tls/libpthread.so.0
>>> #1  0x in ?? ()
>>>
>>> Not very helpful!  I'd normally suspect some strange memory issue, but
>>> I've
>>> checked - there was plenty of memory available on the host when the
>>> second
>>> process was spawned and we weren't close to the file descriptor limit.
>>>
>>>
>>> Looking at this issue,
>>> https://issues.apache.org/jira/browse/HADOOP-2231
>>>
>>> it seems that the "df" call is avoidable now that we're in Java 1.6.
>>> However, the issue was closed and marked as a duplicate, but without
>>> noting
>>> what it was a duplicate of (grrr).  Is there an updated version of that
>>> patch?
>>>
>>> Brian
>>>
>>>
>>> On Mar 9, 2009, at 12:48 PM, Steve Loughran wrote:
>>>
>>> Philip Zeyliger wrote:
>>>

  Very naively looking at the stack traces, a common theme is that
> there's
> a
> call out to "df" to find the system capacity.  If you see two data node
> processes, perhaps the fork/exec to call out to "df" is failing in some
> strange way.
>
>
 that's deep into Java code. OpenJDK gives you more of that source. One
 option here is to consider some kind of timeouts in the exec, but it's
 pretty tricky to tack that on round the Java runtime APIs, because the
 process APIs weren't really designed to be interrupted by other threads.

 -steve

 "DataNode: [/hadoop-data/dfs/data]" daemon prio=10

> tid=0x002ae2c0d400 nid=0x21cf in Object.wait()
> [0x42c54000..0x42c54b30]
> java.lang.Thread.State: WAITING (on object monitor)
>  at java.lang.Object.wait(Native Method)
>  at java.lang.Object.wait(Object.java:485)
>  at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
>  - locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate)
>  at java.lang.UNIXProcess.(UNIXProcess.java:145)
>  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>  at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>  at org.apache.hadoop.util.Shell.run(Shell.java:134)
>  at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
>  at
>
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
>  at
>
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
>  - locked <0x002a9ed97078> (a
> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>  at
>
> org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
>  at
>
> org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
>  at
> org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
>  at java.lang.Thread.run(Thread.java:619)
> On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury 
>> wrote:
>>
>
>  On a ~100 node cluster running HDFS (we just use HDFS + fuse, no
>> job/task
>> trackers) I've noticed many datanodes get 'stuck'. The nodes
>> themselves
>> seem
>> fine with no network/memory problems, but in every instance I see two
>> DataNode processes running, and the NameNode logs indicate the
>> datanode
>> in
>> question simply stopped responding. This state persists until I come
>> along
>> and kill the DataNode processes and restart the DataNode on that
>> particular
>> machine.
>>
>> I'm at a loss as to why this is happening, so here's all the relevant
>> information I can think of sharing:
>>
>> hadoop version = 0.19.1-dev, r (we possibly have some custom 

Re: Reducer goes past 100% complete?

2009-03-09 Thread james warren
Speculative execution has existed since long before 0.19.x, but AFAIK the >100%
issue has appeared (at least with greater frequency) since 0.19.0 came out.
Are you saying there are changes in how task progress is being tracked?
cheers,
-jw

On Mon, Mar 9, 2009 at 12:21 PM, jason hadoop wrote:

> speculative execution.
>
>
> On Mon, Mar 9, 2009 at 12:19 PM, Nathan Marz  wrote:
>
> > I have the same problem with reducers going past 100% on some jobs. I've
> > seen reducers go as high as 120%. Would love to know what the issue is.
> >
> >
> > On Mar 9, 2009, at 8:45 AM, Doug Cook wrote:
> >
> >
> >> Hi folks,
> >>
> >> I've recently upgraded to Hadoop 0.19.1 from a much, much older version
> of
> >> Hadoop.
> >>
> >> Most things in my application (a highly modified version of Nutch) are
> >> working just fine, but one of them is bombing out with odd symptoms. The
> >> map
> >> works just fine, but then reduce phase (a) runs extremely slowly and (b)
> >> the
> >> "percentage complete" reporting for each reduce task doesn't stop at
> 100%,
> >> it just keeps going on past that.
> >>
> >> I figure I'll start by understanding the percentage-complete reporting
> >> issue, since it's pretty concrete and may have some bearing on the
> >> performance issue. It seems likely that my application is
> mis-configuring
> >> the job, or otherwise not correctly using the Hadoop API. I don't think
> >> I'm
> >> doing anything way out of the ordinary; my reducer simply creates an
> >> object,
> >> wraps it in an ObjectWritable, and calls output.collect(), and I have a
> >> local class that implements OutputFormat to take the object and put it
> in
> >> a
> >> Lucene index. It does actually create correct output, at least for small
> >> indices; on large indices, the performance problems are killing me.
> >>
> >> I can and will start rummaging around in the Hadoop code to figure out
> how
> >> it calculates percentage complete, and see what I'm not doing correctly,
> >> but
> >> thought I'd ask here, too, to see if someone has good suggestions off
> the
> >> top of their head.
> >>
> >> Many thanks-
> >>
> >> Doug Cook
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Reducer-goes-past-100--complete--tp22413589p22413589.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
>


Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread Brian Bockelman

Hey Jason,

Looks like you're on the right track.  We are getting stuck after the  
JVM forks but before it exec's.  It appears to be this bug:


http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6671051

However, this is using Java 1.6.0_11, and that bug was marked as fixed  
in 1.6.0_6 :(


Any other ideas?

Brian

On Mar 9, 2009, at 2:21 PM, jason hadoop wrote:

There were a couple of fork timing errors in the jdk 1.5 that  
occasionally
caused a sub process fork to go bad, this could be the du/df being  
forked

off by the datanode and dying.

I can't find the references I had saved away at one point, from the  
java

forums, but perhaps this will get you started.

http://forums.sun.com/thread.jspa?threadID=5297465&tstart=0

On Mon, Mar 9, 2009 at 11:23 AM, Brian Bockelman  
wrote:


It's very strange.  It appears that the second process is the  
result of a
fork call, yet has only one thread running whose gdb backtrace  
looks like

this:

(gdb) bt
#0  0x003e10c0af8b in __lll_mutex_lock_wait () from
/lib64/tls/libpthread.so.0
#1  0x in ?? ()

Not very helpful!  I'd normally suspect some strange memory issue,  
but I've
checked - there was plenty of memory available on the host when the  
second
process was spawned and we weren't close to the file descriptor  
limit.



Looking at this issue,
https://issues.apache.org/jira/browse/HADOOP-2231

it seems that the "df" call is avoidable now that we're in Java 1.6.
However, the issue was closed and marked as a duplicate, but  
without noting
what it was a duplicate of (grrr).  Is there an updated version of  
that

patch?

Brian


On Mar 9, 2009, at 12:48 PM, Steve Loughran wrote:

Philip Zeyliger wrote:


Very naively looking at the stack traces, a common theme is that  
there's

a
call out to "df" to find the system capacity.  If you see two  
data node
processes, perhaps the fork/exec to call out to "df" is failing  
in some

strange way.



that's deep into Java code. OpenJDK gives you more of that source.  
One
option here is to consider some kind of timeouts in the exec, but  
it's
pretty tricky to tack that on round the Java runtime APIs, because  
the
process APIs weren't really designed to be interrupted by other  
threads.


-steve

"DataNode: [/hadoop-data/dfs/data]" daemon prio=10

tid=0x002ae2c0d400 nid=0x21cf in Object.wait()
[0x42c54000..0x42c54b30]
java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:485)
  at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java: 
64)

  - locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate)
  at java.lang.UNIXProcess.(UNIXProcess.java:145)
  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
  at org.apache.hadoop.util.Shell.run(Shell.java:134)
  at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
  at
org.apache.hadoop.hdfs.server.datanode.FSDataset 
$FSVolume.getCapacity(FSDataset.java:341)

  at
org.apache.hadoop.hdfs.server.datanode.FSDataset 
$FSVolumeSet.getCapacity(FSDataset.java:501)

  - locked <0x002a9ed97078> (a
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
  at
org 
.apache 
.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java: 
697)

  at
org 
.apache 
.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java: 
671)

  at
org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java: 
1105)

  at java.lang.Thread.run(Thread.java:619)
On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury 
wrote:



On a ~100 node cluster running HDFS (we just use HDFS + fuse, no
job/task
trackers) I've noticed many datanodes get 'stuck'. The nodes  
themselves

seem
fine with no network/memory problems, but in every instance I  
see two
DataNode processes running, and the NameNode logs indicate the  
datanode

in
question simply stopped responding. This state persists until I  
come

along
and kill the DataNode processes and restart the DataNode on that
particular
machine.

I'm at a loss as to why this is happening, so here's all the  
relevant

information I can think of sharing:

hadoop version = 0.19.1-dev, r (we possibly have some custom  
patches

running, but nothing which would affect HDFS that I'm aware of)
number of nodes = ~100
HDFS size = ~230TB
Java version =
OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
respectively

I managed to grab a stack dump via "kill -3" from two of these  
problem

instances and threw up the logs at
http://cse.unl.edu/~attebury/datanode_problem/

.
The .log files honestly show nothing out of the ordinary, and  
having

very
little Java developing experience the .out files mean nothing to  
me.

It's
als

Re: Does "hadoop-default.xml" + "hadoop-site.xml" matter for whole cluster or each node?

2009-03-09 Thread Doug Cutting

Owen O'Malley wrote:
It depends on the property whether they come from the job's 
configuration or the system's. Some  like io.sort.mb and 
mapred.map.tasks come from the job, while others like 
mapred.tasktracker.map.tasks.maximum come from the system.


There is some method to the madness.

Things that are only set programmatically, like most job parameters, 
e.g., the mapper, reducer, etc., are not listed in hadoop-default.xml, 
since they don't make sense to configure cluster-wide.


Defaults are overridden by hadoop-site.xml, but a job can then override 
hadoop-site.xml unless hadoop-site.xml declares it to be final, in which 
case any value specified in a job is ignored.


There are a few odd cases of things that jobs might want to override but 
they cannot.  For example, a job might wish to override 
mapred.tasktracker.map.tasks.maximum, but, if you think a bit more, this 
is read by the tasktracker at startup and cannot be reasonably changed 
per job, since a tasktracker can run tasks from different jobs 
simultaneously.


So things that make sense per-job and are not declared final in your 
hadoop-site.xml can generally be overridden by the job.


Doug
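
A small sketch of what this looks like from the job side; whether each override
takes effect depends on your cluster's hadoop-site.xml, as Doug explains:

import org.apache.hadoop.mapred.JobConf;

public class ConfigOverrideExample {
  public static void main(String[] args) {
    JobConf job = new JobConf(ConfigOverrideExample.class);

    // Per-job overrides: these beat hadoop-default.xml and hadoop-site.xml,
    // unless the site file marks the property final, in which case the site value wins.
    job.set("io.sort.mb", "200");
    job.setNumMapTasks(20);   // mapred.map.tasks

    // Read by each TaskTracker at startup, so setting it per job has no effect.
    job.set("mapred.tasktracker.map.tasks.maximum", "4");
  }
}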


Re: Reducer goes past 100% complete?

2009-03-09 Thread jason hadoop
speculative execution.


On Mon, Mar 9, 2009 at 12:19 PM, Nathan Marz  wrote:

> I have the same problem with reducers going past 100% on some jobs. I've
> seen reducers go as high as 120%. Would love to know what the issue is.
>
>
> On Mar 9, 2009, at 8:45 AM, Doug Cook wrote:
>
>
>> Hi folks,
>>
>> I've recently upgraded to Hadoop 0.19.1 from a much, much older version of
>> Hadoop.
>>
>> Most things in my application (a highly modified version of Nutch) are
>> working just fine, but one of them is bombing out with odd symptoms. The
>> map
>> works just fine, but then reduce phase (a) runs extremely slowly and (b)
>> the
>> "percentage complete" reporting for each reduce task doesn't stop at 100%,
>> it just keeps going on past that.
>>
>> I figure I'll start by understanding the percentage-complete reporting
>> issue, since it's pretty concrete and may have some bearing on the
>> performance issue. It seems likely that my application is mis-configuring
>> the job, or otherwise not correctly using the Hadoop API. I don't think
>> I'm
>> doing anything way out of the ordinary; my reducer simply creates an
>> object,
>> wraps it in an ObjectWritable, and calls output.collect(), and I have a
>> local class that implements OutputFormat to take the object and put it in
>> a
>> Lucene index. It does actually create correct output, at least for small
>> indices; on large indices, the performance problems are killing me.
>>
>> I can and will start rummaging around in the Hadoop code to figure out how
>> it calculates percentage complete, and see what I'm not doing correctly,
>> but
>> thought I'd ask here, too, to see if someone has good suggestions off the
>> top of their head.
>>
>> Many thanks-
>>
>> Doug Cook
>> --
>> View this message in context:
>> http://www.nabble.com/Reducer-goes-past-100--complete--tp22413589p22413589.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread jason hadoop
There were a couple of fork timing errors in JDK 1.5 that occasionally
caused a subprocess fork to go bad; this could be the du/df being forked
off by the datanode and dying.

I can't find the references I had saved away at one point, from the java
forums, but perhaps this will get you started.

http://forums.sun.com/thread.jspa?threadID=5297465&tstart=0

On Mon, Mar 9, 2009 at 11:23 AM, Brian Bockelman wrote:

> It's very strange.  It appears that the second process is the result of a
> fork call, yet has only one thread running whose gdb backtrace looks like
> this:
>
> (gdb) bt
> #0  0x003e10c0af8b in __lll_mutex_lock_wait () from
> /lib64/tls/libpthread.so.0
> #1  0x in ?? ()
>
> Not very helpful!  I'd normally suspect some strange memory issue, but I've
> checked - there was plenty of memory available on the host when the second
> process was spawned and we weren't close to the file descriptor limit.
>
>
> Looking at this issue,
> https://issues.apache.org/jira/browse/HADOOP-2231
>
> it seems that the "df" call is avoidable now that we're in Java 1.6.
>  However, the issue was closed and marked as a duplicate, but without noting
> what it was a duplicate of (grrr).  Is there an updated version of that
> patch?
>
> Brian
>
>
> On Mar 9, 2009, at 12:48 PM, Steve Loughran wrote:
>
>  Philip Zeyliger wrote:
>>
>>> Very naively looking at the stack traces, a common theme is that there's
>>> a
>>> call out to "df" to find the system capacity.  If you see two data node
>>> processes, perhaps the fork/exec to call out to "df" is failing in some
>>> strange way.
>>>
>>
>> that's deep into Java code. OpenJDK gives you more of that source. One
>> option here is to consider some kind of timeouts in the exec, but it's
>> pretty tricky to tack that on round the Java runtime APIs, because the
>> process APIs weren't really designed to be interrupted by other threads.
>>
>> -steve
>>
>>  "DataNode: [/hadoop-data/dfs/data]" daemon prio=10
>>> tid=0x002ae2c0d400 nid=0x21cf in Object.wait()
>>> [0x42c54000..0x42c54b30]
>>>  java.lang.Thread.State: WAITING (on object monitor)
>>>at java.lang.Object.wait(Native Method)
>>>at java.lang.Object.wait(Object.java:485)
>>>at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
>>>- locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate)
>>>at java.lang.UNIXProcess.(UNIXProcess.java:145)
>>>at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>>>at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>>>at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>>>at org.apache.hadoop.util.Shell.run(Shell.java:134)
>>>at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
>>>at
>>> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
>>>at
>>> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
>>>- locked <0x002a9ed97078> (a
>>> org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>>>at
>>> org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
>>>at
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
>>>at
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
>>>at java.lang.Thread.run(Thread.java:619)
>>> On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury >> >wrote:
>>>
 On a ~100 node cluster running HDFS (we just use HDFS + fuse, no
 job/task
 trackers) I've noticed many datanodes get 'stuck'. The nodes themselves
 seem
 fine with no network/memory problems, but in every instance I see two
 DataNode processes running, and the NameNode logs indicate the datanode
 in
 question simply stopped responding. This state persists until I come
 along
 and kill the DataNode processes and restart the DataNode on that
 particular
 machine.

 I'm at a loss as to why this is happening, so here's all the relevant
 information I can think of sharing:

 hadoop version = 0.19.1-dev, r (we possibly have some custom patches
 running, but nothing which would affect HDFS that I'm aware of)
 number of nodes = ~100
 HDFS size = ~230TB
 Java version =
 OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
 respectively

 I managed to grab a stack dump via "kill -3" from two of these problem
 instances and threw up the logs at
 http://cse.unl.edu/~attebury/datanode_problem/
 .
 The .log files honestly show nothing out of the ordinary, and having
 very
 little Java developing experience the .out files mean nothing to me.
 It's
 also worth mentioning that the NameNode logs at the time when these
 DataNodes go

Re: Reducer goes past 100% complete?

2009-03-09 Thread Nathan Marz
I have the same problem with reducers going past 100% on some jobs.  
I've seen reducers go as high as 120%. Would love to know what the  
issue is.


On Mar 9, 2009, at 8:45 AM, Doug Cook wrote:



Hi folks,

I've recently upgraded to Hadoop 0.19.1 from a much, much older  
version of

Hadoop.

Most things in my application (a highly modified version of Nutch) are
working just fine, but one of them is bombing out with odd symptoms.  
The map
works just fine, but then reduce phase (a) runs extremely slowly and  
(b) the
"percentage complete" reporting for each reduce task doesn't stop at  
100%,

it just keeps going on past that.

I figure I'll start by understanding the percentage-complete reporting
issue, since it's pretty concrete and may have some bearing on the
performance issue. It seems likely that my application is mis- 
configuring
the job, or otherwise not correctly using the Hadoop API. I don't  
think I'm
doing anything way out of the ordinary; my reducer simply creates an  
object,
wraps it in an ObjectWritable, and calls output.collect(), and I  
have a
local class that implements OutputFormat to take the object and put  
it in a
Lucene index. It does actually create correct output, at least for  
small

indices; on large indices, the performance problems are killing me.

I can and will start rummaging around in the Hadoop code to figure  
out how
it calculates percentage complete, and see what I'm not doing  
correctly, but
thought I'd ask here, too, to see if someone has good suggestions  
off the

top of their head.

Many thanks-

Doug Cook
--
View this message in context: 
http://www.nabble.com/Reducer-goes-past-100--complete--tp22413589p22413589.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread Brian Bockelman
It's very strange.  It appears that the second process is the result  
of a fork call, yet has only one thread running whose gdb backtrace  
looks like this:


(gdb) bt
#0  0x003e10c0af8b in __lll_mutex_lock_wait () from /lib64/tls/ 
libpthread.so.0

#1  0x in ?? ()

Not very helpful!  I'd normally suspect some strange memory issue, but  
I've checked - there was plenty of memory available on the host when  
the second process was spawned and we weren't close to the file  
descriptor limit.



Looking at this issue,
https://issues.apache.org/jira/browse/HADOOP-2231

it seems that the "df" call is avoidable now that we're in Java 1.6.   
However, the issue was closed and marked as a duplicate, but without  
noting what it was a duplicate of (grrr).  Is there an updated version  
of that patch?


Brian
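
For reference, the Java 6 alternative that makes the "df" fork avoidable looks
roughly like this (using the data directory from the stack traces above):

import java.io.File;

public class CapacityWithoutDf {
  public static void main(String[] args) {
    File dataDir = new File("/hadoop-data/dfs/data");

    // Java 6 File methods: no external "df" process is forked.
    long capacity = dataDir.getTotalSpace();    // size of the partition, in bytes
    long available = dataDir.getUsableSpace();  // bytes available to this JVM

    System.out.println("capacity  = " + capacity);
    System.out.println("available = " + available);
  }
}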

On Mar 9, 2009, at 12:48 PM, Steve Loughran wrote:


Philip Zeyliger wrote:
Very naively looking at the stack traces, a common theme is that  
there's a
call out to "df" to find the system capacity.  If you see two data  
node
processes, perhaps the fork/exec to call out to "df" is failing in  
some

strange way.


that's deep into Java code. OpenJDK gives you more of that source.  
One option here is to consider some kind of timeouts in the exec,  
but it's pretty tricky to tack that on round the Java runtime APIs,  
because the process APIs weren't really designed to be interrupted  
by other threads.


-steve


"DataNode: [/hadoop-data/dfs/data]" daemon prio=10
tid=0x002ae2c0d400 nid=0x21cf in Object.wait()
[0x42c54000..0x42c54b30]
  java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
- locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate)
at java.lang.UNIXProcess.(UNIXProcess.java:145)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
	at org.apache.hadoop.hdfs.server.datanode.FSDataset 
$FSVolume.getCapacity(FSDataset.java:341)
	at org.apache.hadoop.hdfs.server.datanode.FSDataset 
$FSVolumeSet.getCapacity(FSDataset.java:501)

- locked <0x002a9ed97078> (a
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
	at  
org 
.apache 
.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java: 
697)
	at  
org 
.apache 
.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
	at  
org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java: 
1105)

at java.lang.Thread.run(Thread.java:619)
On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury  
wrote:
On a ~100 node cluster running HDFS (we just use HDFS + fuse, no  
job/task
trackers) I've noticed many datanodes get 'stuck'. The nodes  
themselves seem
fine with no network/memory problems, but in every instance I see  
two
DataNode processes running, and the NameNode logs indicate the  
datanode in
question simply stopped responding. This state persists until I  
come along
and kill the DataNode processes and restart the DataNode on that  
particular

machine.

I'm at a loss as to why this is happening, so here's all the  
relevant

information I can think of sharing:

hadoop version = 0.19.1-dev, r (we possibly have some custom patches
running, but nothing which would affect HDFS that I'm aware of)
number of nodes = ~100
HDFS size = ~230TB
Java version =
OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
respectively

I managed to grab a stack dump via "kill -3" from two of these  
problem

instances and threw up the logs at
http://cse.unl.edu/~attebury/datanode_problem/.
The .log files honestly show nothing out of the ordinary, and  
having very
little Java developing experience the .out files mean nothing to  
me. It's

also worth mentioning that the NameNode logs at the time when these
DataNodes got stuck show nothing out of the ordinary either --  
just the
expected "lost heartbeat from node " message. The DataNode  
daemon (the
original process, not the second mysterious one) continues to  
respond to web

requests like browsing the log directory during this time.

Whenever this happens I've just manually done a "kill -9" to  
remove the two
stuck DataNode processes (I'm not even sure why there's two of  
them, as
under normal operation there's only one). After killing the stuck  
ones, I
simply do a "hadoop-daemon.sh start datanode" and all is normal  
again. I've

not seen any dataloss or corruption as a result of this problem.

Has anyone seen anything like this happen before? Out of our ~100  
node
cluster I see this problem around once a day, and it seems

Re: Profiling Map/Reduce Tasks

2009-03-09 Thread Chris Douglas

I use YourKit (http://yourkit.com/).

You'll also want to look at the following parameters:

  mapred.task.profile.params (e.g. -agentlib:yjpagent=sampling=onexit=snapshot,dir=%s)

  mapred.task.profile  (true)
  mapred.task.profile.reduces (0-2 (or whatever))
  mapred.task.profile.maps (0-2)

By default, YK filters org.apache.*, so be sure to disable that. It's
simplest to profile individual tasks in pseudo-distributed mode, though
clearly that has its drawbacks. -C
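
In code, that amounts to setting the same properties on the JobConf; a sketch
using the names above, with the agent options string taken verbatim from this
post (adjust it for your YourKit installation):

import org.apache.hadoop.mapred.JobConf;

public class ProfilingConfig {
  public static void enableProfiling(JobConf job) {
    // Profile only the first few map and reduce task attempts.
    job.setBoolean("mapred.task.profile", true);
    job.set("mapred.task.profile.maps", "0-2");
    job.set("mapred.task.profile.reduces", "0-2");

    // Options passed to each profiled task JVM; %s becomes the profile output location.
    job.set("mapred.task.profile.params",
            "-agentlib:yjpagent=sampling=onexit=snapshot,dir=%s");
  }
}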


On Mar 9, 2009, at 12:05 AM, Rasit OZDAS wrote:


I note System.currentTimeMillis() at the beginning of main function,
then at the end I use a while loop to wait for the job,

while (!runningJob.isComplete())
 Thread.sleep(1000);

Then again I note the system time. But this only gives the total  
amount of

time passed.

Rasit

2009/3/8 Richa Khandelwal 


Hi,
Does Map/Reduce profiles jobs down to milliseconds. From what I can  
see in
the logs, there is no time specified for the job. Although CPU TIME  
is an
information that should be present in the logs, it was not profiled  
and the

response time can only be noted in down to seconds from the runtime
progress
of the jobs.

Does someone know how to efficiently profile map reduce jobs?

Thanks,
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763





--
M. Raşit ÖZDAŞ




Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread Steve Loughran

Philip Zeyliger wrote:

Very naively looking at the stack traces, a common theme is that there's a
call out to "df" to find the system capacity.  If you see two data node
processes, perhaps the fork/exec to call out to "df" is failing in some
strange way.


that's deep into Java code. OpenJDK gives you more of that source. One 
option here is to consider some kind of timeouts in the exec, but it's 
pretty tricky to tack that on round the Java runtime APIs, because the 
process APIs weren't really designed to be interrupted by other threads.


-steve



"DataNode: [/hadoop-data/dfs/data]" daemon prio=10
tid=0x002ae2c0d400 nid=0x21cf in Object.wait()
[0x42c54000..0x42c54b30]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
- locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate)
at java.lang.UNIXProcess.(UNIXProcess.java:145)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
- locked <0x002a9ed97078> (a
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
at java.lang.Thread.run(Thread.java:619)



On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury wrote:


On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task
trackers) I've noticed many datanodes get 'stuck'. The nodes themselves seem
fine with no network/memory problems, but in every instance I see two
DataNode processes running, and the NameNode logs indicate the datanode in
question simply stopped responding. This state persists until I come along
and kill the DataNode processes and restart the DataNode on that particular
machine.

I'm at a loss as to why this is happening, so here's all the relevant
information I can think of sharing:

hadoop version = 0.19.1-dev, r (we possibly have some custom patches
running, but nothing which would affect HDFS that I'm aware of)
number of nodes = ~100
HDFS size = ~230TB
Java version =
OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
respectively

I managed to grab a stack dump via "kill -3" from two of these problem
instances and threw up the logs at
http://cse.unl.edu/~attebury/datanode_problem/.
The .log files honestly show nothing out of the ordinary, and having very
little Java developing experience the .out files mean nothing to me. It's
also worth mentioning that the NameNode logs at the time when these
DataNodes got stuck show nothing out of the ordinary either -- just the
expected "lost heartbeat from node " message. The DataNode daemon (the
original process, not the second mysterious one) continues to respond to web
requests like browsing the log directory during this time.

Whenever this happens I've just manually done a "kill -9" to remove the two
stuck DataNode processes (I'm not even sure why there's two of them, as
under normal operation there's only one). After killing the stuck ones, I
simply do a "hadoop-daemon.sh start datanode" and all is normal again. I've
not seen any dataloss or corruption as a result of this problem.

Has anyone seen anything like this happen before? Out of our ~100 node
cluster I see this problem around once a day, and it seems to just strike
random nodes at random times. It happens often enough that I would be happy
to do additional debugging if anyone can tell me how. I'm not a developer at
all, so I'm at the end of my knowledge on how to solve this problem. Thanks
for any help!


===
Garhan Attebury
Systems Administrator
UNL Research Computing Facility
402-472-7761
===







--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread Philip Zeyliger
Very naively looking at the stack traces, a common theme is that there's a
call out to "df" to find the system capacity.  If you see two data node
processes, perhaps the fork/exec to call out to "df" is failing in some
strange way.

"DataNode: [/hadoop-data/dfs/data]" daemon prio=10
tid=0x002ae2c0d400 nid=0x21cf in Object.wait()
[0x42c54000..0x42c54b30]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
- locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate)
at java.lang.UNIXProcess.(UNIXProcess.java:145)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
- locked <0x002a9ed97078> (a
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
at java.lang.Thread.run(Thread.java:619)



On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury wrote:

> On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task
> trackers) I've noticed many datanodes get 'stuck'. The nodes themselves seem
> fine with no network/memory problems, but in every instance I see two
> DataNode processes running, and the NameNode logs indicate the datanode in
> question simply stopped responding. This state persists until I come along
> and kill the DataNode processes and restart the DataNode on that particular
> machine.
>
> I'm at a loss as to why this is happening, so here's all the relevant
> information I can think of sharing:
>
> hadoop version = 0.19.1-dev, r (we possibly have some custom patches
> running, but nothing which would affect HDFS that I'm aware of)
> number of nodes = ~100
> HDFS size = ~230TB
> Java version =
> OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
> respectively
>
> I managed to grab a stack dump via "kill -3" from two of these problem
> instances and threw up the logs at
> http://cse.unl.edu/~attebury/datanode_problem/.
> The .log files honestly show nothing out of the ordinary, and having very
> little Java developing experience the .out files mean nothing to me. It's
> also worth mentioning that the NameNode logs at the time when these
> DataNodes got stuck show nothing out of the ordinary either -- just the
> expected "lost heartbeat from node " message. The DataNode daemon (the
> original process, not the second mysterious one) continues to respond to web
> requests like browsing the log directory during this time.
>
> Whenever this happens I've just manually done a "kill -9" to remove the two
> stuck DataNode processes (I'm not even sure why there's two of them, as
> under normal operation there's only one). After killing the stuck ones, I
> simply do a "hadoop-daemon.sh start datanode" and all is normal again. I've
> not seen any dataloss or corruption as a result of this problem.
>
> Has anyone seen anything like this happen before? Out of our ~100 node
> cluster I see this problem around once a day, and it seems to just strike
> random nodes at random times. It happens often enough that I would be happy
> to do additional debugging if anyone can tell me how. I'm not a developer at
> all, so I'm at the end of my knowledge on how to solve this problem. Thanks
> for any help!
>
>
> ===
> Garhan Attebury
> Systems Administrator
> UNL Research Computing Facility
> 402-472-7761
> ===
>
>


Re: Does "hadoop-default.xml" + "hadoop-site.xml" matter for whole cluster or each node?

2009-03-09 Thread Owen O'Malley


On Mar 9, 2009, at 8:10 AM, Nick Cen wrote:

A clear naming convention will make it easier to configure. But besides the
system and job levels, I think there are also some parameters that take
effect at the node level, like mapred.tasktracker.map.tasks.maximum; as far
as I can remember, we can set this differently for different nodes.


There are only a few that are actually pushed around by the system.  
The system directory and the heartbeat interval are the only ones that  
readily come to mind. For the most part, the other ones that act like  
that are only used by the job tracker and therefore only need to be  
present on the job tracker.


And probably a better structure would be:

mapred.job.* -- job specific
mapred.system.* -- used by both master and slaves
mapred.master.* -- used by the JobTracker
mapred.slave.*  -- used by the TaskTrackers

-- Owen


Re: question about released version id

2009-03-09 Thread Owen O'Malley

On Mar 2, 2009, at 11:46 PM, 鞠適存 wrote:

I wonder how the Hadoop version number is decided.


Each of 0.18, 0.19 and 0.20 has its own branch. The first release on
each branch is 0.X.0, then 0.X.1 and so on. New features are only
put into trunk, and only important bug fixes are put into the branches.
So there will be no new functionality going from 0.X.1 to 0.X.2, but
there will be when going from a release of 0.X to 0.X+1.


-- Owen

Reducer goes past 100% complete?

2009-03-09 Thread Doug Cook

Hi folks,

I've recently upgraded to Hadoop 0.19.1 from a much, much older version of
Hadoop. 

Most things in my application (a highly modified version of Nutch) are
working just fine, but one of them is bombing out with odd symptoms. The map
works just fine, but then reduce phase (a) runs extremely slowly and (b) the
"percentage complete" reporting for each reduce task doesn't stop at 100%,
it just keeps going on past that.

I figure I'll start by understanding the percentage-complete reporting
issue, since it's pretty concrete and may have some bearing on the
performance issue. It seems likely that my application is mis-configuring
the job, or otherwise not correctly using the Hadoop API. I don't think I'm
doing anything way out of the ordinary; my reducer simply creates an object,
wraps it in an ObjectWritable, and calls output.collect(), and I have a
local class that implements OutputFormat to take the object and put it in a
Lucene index. It does actually create correct output, at least for small
indices; on large indices, the performance problems are killing me.
 
I can and will start rummaging around in the Hadoop code to figure out how
it calculates percentage complete, and see what I'm not doing correctly, but
thought I'd ask here, too, to see if someone has good suggestions off the
top of their head.

Many thanks-

Doug Cook
-- 
View this message in context: 
http://www.nabble.com/Reducer-goes-past-100--complete--tp22413589p22413589.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
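
For reference, the reducer pattern described above boils down to something like
this minimal sketch (old mapred API; the input value type and the per-key
aggregation are placeholders, and the Lucene-writing OutputFormat is not shown):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class IndexingReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, ObjectWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, ObjectWritable> output,
                     Reporter reporter) throws IOException {
    // Build one object per key (here just a sum), wrap it in an ObjectWritable,
    // and hand it to the collector; a custom OutputFormat then indexes it.
    long total = 0;
    while (values.hasNext()) {
      total += values.next().get();
    }
    output.collect(key, new ObjectWritable(Long.valueOf(total)));
  }
}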



Re: MultipleOutputFormat with sorting functionality

2009-03-09 Thread Nick Cen
When you are using the default sorting, it will use string comparison, which I
think may not be what you intend. You can call JobConf's
setKeyFieldComparatorOptions("-n") to do a numeric comparison.
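
A sketch of that call in a job setup method, assuming the key-field comparator
support is present in your release (the rest of the job configuration is
omitted):

import org.apache.hadoop.mapred.JobConf;

public class NumericSortSetup {
  public static void configureSort(JobConf conf) {
    // Compare whole keys numerically ("-n", as in Unix sort) instead of as
    // plain strings, so "10" sorts after "9" rather than before "2".
    conf.setKeyFieldComparatorOptions("-n");
  }
}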


2009/3/9 Rasit OZDAS 

> Thanks, Nick!
>
> It seems that sorting takes place in map, not in reduce :)
> I've added double values in front of every map key, the problem is solved
> now.
> I know it's more like a workaround rather than a real solution,
> and I don't know if it has performance problems.. Have an idea? I'm not
> familiar with what hadoop does exactly when I do this.
>
> Rasit
>
> 2009/3/9 Nick Cen 
>
> > I think the sort is not related to the output format.
> >
> > I previously tried this class, but with a small difference compared to
> > your code: I extended the MultipleTextOutputFormat class and overrode
> > its generateFileNameForKeyValue()
> > method, and everything seems to work fine.
> >
> > 2009/3/9 Rasit OZDAS 
> >
> > > Hi, all!
> > >
> > > I'm using multiple output format to write out 4 different files, each
> one
> > > has the same type.
> > > But it seems that outputs aren't being sorted.
> > >
> > > Should they be sorted? Or isn't it implemented for multiple output
> > format?
> > >
> > > Here is some code:
> > >
> > > // in main function
> > > MultipleOutputs.addMultiNamedOutput(conf, "text",
> TextOutputFormat.class,
> > > DoubleWritable.class, Text.class);
> > >
> > > // in Reducer.configure()
> > > mos = new MultipleOutputs(conf);
> > >
> > > // in Reducer.reduce()
> > > if (keystr.equalsIgnoreCase("BreachFace"))
> > >mos.getCollector("text", "BreachFace",
> > reporter).collect(new
> > > Text(key), dbl);
> > >else if (keystr.equalsIgnoreCase("Ejector"))
> > >mos.getCollector("text", "Ejector",
> reporter).collect(new
> > > Text(key), dbl);
> > >else if (keystr.equalsIgnoreCase("FiringPin"))
> > >mos.getCollector("text", "FiringPin",
> > reporter).collect(new
> > > Text(key), dbl);
> > >else if (keystr.equalsIgnoreCase("WeightedSum"))
> > >mos.getCollector("text", "WeightedSum",
> > > reporter).collect(new Text(key), dbl);
> > >else
> > >mos.getCollector("text", "Diger", reporter).collect(new
> > > Text(key), dbl);
> > >
> > >
> > > --
> > > M. Raşit ÖZDAŞ
> > >
> >
> >
> >
> > --
> > http://daily.appspot.com/food/
> >
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
http://daily.appspot.com/food/


Re: MultipleOutputFormat with sorting functionality

2009-03-09 Thread Rasit OZDAS
Thanks, Nick!

It seems that sorting takes place in the map, not in the reduce :)
I've added double values in front of every map key, and the problem is solved
now.
I know it's more of a workaround than a real solution,
and I don't know if it has performance problems. Any ideas? I'm not
familiar with what Hadoop does exactly when I do this.

Rasit

2009/3/9 Nick Cen 

> I think the sort is not related to the output format.
>
> I previously tried this class, but with a small difference compared to
> your code: I extended the MultipleTextOutputFormat class and overrode
> its generateFileNameForKeyValue()
> method, and everything seems to work fine.
>
> 2009/3/9 Rasit OZDAS 
>
> > Hi, all!
> >
> > I'm using multiple output format to write out 4 different files, each one
> > has the same type.
> > But it seems that outputs aren't being sorted.
> >
> > Should they be sorted? Or isn't it implemented for multiple output
> format?
> >
> > Here is some code:
> >
> > // in main function
> > MultipleOutputs.addMultiNamedOutput(conf, "text", TextOutputFormat.class,
> > DoubleWritable.class, Text.class);
> >
> > // in Reducer.configure()
> > mos = new MultipleOutputs(conf);
> >
> > // in Reducer.reduce()
> > if (keystr.equalsIgnoreCase("BreachFace"))
> >mos.getCollector("text", "BreachFace",
> reporter).collect(new
> > Text(key), dbl);
> >else if (keystr.equalsIgnoreCase("Ejector"))
> >mos.getCollector("text", "Ejector", reporter).collect(new
> > Text(key), dbl);
> >else if (keystr.equalsIgnoreCase("FiringPin"))
> >mos.getCollector("text", "FiringPin",
> reporter).collect(new
> > Text(key), dbl);
> >else if (keystr.equalsIgnoreCase("WeightedSum"))
> >mos.getCollector("text", "WeightedSum",
> > reporter).collect(new Text(key), dbl);
> >else
> >mos.getCollector("text", "Diger", reporter).collect(new
> > Text(key), dbl);
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
>
>
>
> --
> http://daily.appspot.com/food/
>



-- 
M. Raşit ÖZDAŞ


DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread Garhan Attebury
On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task
trackers) I've noticed many datanodes get 'stuck'. The nodes
themselves seem fine with no network/memory problems, but in every  
instance I see two DataNode processes running, and the NameNode logs  
indicate the datanode in question simply stopped responding. This  
state persists until I come along and kill the DataNode processes and  
restart the DataNode on that particular machine.


I'm at a loss as to why this is happening, so here's all the relevant  
information I can think of sharing:


hadoop version = 0.19.1-dev, r (we possibly have some custom patches  
running, but nothing which would affect HDFS that I'm aware of)

number of nodes = ~100
HDFS size = ~230TB
Java version =
OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory  
respectively


I managed to grab a stack dump via "kill -3" from two of these problem
instances and threw up the logs at http://cse.unl.edu/~attebury/datanode_problem/.
The .log files honestly show nothing out of the ordinary, and having
very little Java development experience, the .out files mean nothing to
me. It's also worth mentioning that the NameNode logs at the time when
these DataNodes got stuck show nothing out of the ordinary either --  
just the expected "lost heartbeat from node " message. The DataNode  
daemon (the original process, not the second mysterious one) continues  
to respond to web requests like browsing the log directory during this  
time.


Whenever this happens I've just manually done a "kill -9" to remove  
the two stuck DataNode processes (I'm not even sure why there are two of
them, as under normal operation there's only one). After killing the
stuck ones, I simply do a "hadoop-daemon.sh start datanode" and all is
normal again. I've not seen any data loss or corruption as a result of
this problem.


Has anyone seen anything like this happen before? Out of our ~100 node  
cluster I see this problem around once a day, and it seems to just  
strike random nodes at random times. It happens often enough that I  
would be happy to do additional debugging if anyone can tell me how.  
I'm not a developer at all, so I'm at the end of my knowledge on how  
to solve this problem. Thanks for any help!



===
Garhan Attebury
Systems Administrator
UNL Research Computing Facility
402-472-7761
===



Re: Does "hadoop-default.xml" + "hadoop-site.xml" matter for whole cluster or each node?

2009-03-09 Thread Nick Cen
A clear naming convention would make configuration easier. But I think that
besides the system and job levels, there are also some parameters that
take effect at the node level, like mapred.tasktracker.map.tasks.maximum; as far
as I can remember, we can set this differently for each node.

2009/3/9 Owen O'Malley 

> On Mar 7, 2009, at 10:56 PM, pavelkolo...@gmail.com wrote:
>
>
>> Does "hadoop-default.xml" + "hadoop-site.xml" of master host matter for
>> whole Job
>> or they matter for each node independently?
>>
>
> Please never modify hadoop-default. That is for the system defaults. Please
> use hadoop-site for your configuration.
>
> It depends on the property whether they come from the job's configuration
> or the system's. Some  like io.sort.mb and mapred.map.tasks come from the
> job, while others like mapred.tasktracker.map.tasks.maximum come from the
> system. The job parameters come from the submitting client, while the system
> parameters need to be distributed to each worker node.
>
> -- Owen
>
>  For example, if one of them (or both) contains:
>> <property>
>>  <name>mapred.map.tasks</name>
>>  <value>6</value>
>> </property>
>>
>> then is it means that six mappers will be executed on all nodes or 6 on
>> each node?
>>
>
> That means that your job will default to 6 maps.
> mapred.tasktracker.map.tasks.maximum specifies the number of maps running on
> each node.
>
> And yes, we really should do a cleanup of the property names to do
> something like:
>
> mapred.job.*
> mapred.system.*
>
> to separate the job from the system parameters.
>
> -- Owen
>



-- 
http://daily.appspot.com/food/


Re: Does "hadoop-default.xml" + "hadoop-site.xml" matter for whole cluster or each node?

2009-03-09 Thread Owen O'Malley

On Mar 7, 2009, at 10:56 PM, pavelkolo...@gmail.com wrote:



Does "hadoop-default.xml" + "hadoop-site.xml" of master host matter  
for whole Job

or they matter for each node independently?


Please never modify hadoop-default. That is for the system defaults.  
Please use hadoop-site for your configuration.


It depends on the property whether they come from the job's
configuration or the system's. Some, like io.sort.mb and
mapred.map.tasks, come from the job, while others, like
mapred.tasktracker.map.tasks.maximum, come from the system. The job
parameters come from the submitting client, while the system
parameters need to be distributed to each worker node.


-- Owen


For example, if one of them (or both) contains:
<property>
 <name>mapred.map.tasks</name>
 <value>6</value>
</property>

then is it means that six mappers will be executed on all nodes or 6  
on each node?


That means that your job will default to 6 maps.  
mapred.tasktracker.map.tasks.maximum specifies the number of maps  
running on each node.
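
To make the distinction concrete, a minimal sketch of setting the job-level
value from the submitting client (old JobConf API; the class name is only a
placeholder):

import org.apache.hadoop.mapred.JobConf;

public class JobLevelConfigExample {
    public static void main(String[] args) {
        // Job-level parameter: set by the submitting client and applies
        // to this job only (equivalent to mapred.map.tasks = 6).
        JobConf conf = new JobConf(JobLevelConfigExample.class);
        conf.setNumMapTasks(6);

        // A node-level parameter such as
        // mapred.tasktracker.map.tasks.maximum is not picked up from here;
        // it has to be set in hadoop-site.xml on each tasktracker machine.
    }
}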


And yes, we really should do a cleanup of the property names to do  
something like:


mapred.job.*
mapred.system.*

to separate the job from the system parameters.

-- Owen


Re: MultipleOutputFormat with sorting functionality

2009-03-09 Thread Nick Cen
I think the sort is not related to the output format.

I have tried this class before, but a little differently from
your code: I extend the MultipleTextOutputFormat class and override
its generateFileNameForKeyValue() method, and everything seems to work fine.
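
In case it helps, a rough sketch of that approach (0.19-era API). The class
name and the per-key path are illustrative, and the key/value types are
assumed to be Text/DoubleWritable as in your reducer:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedTextOutputFormat
        extends MultipleTextOutputFormat<Text, DoubleWritable> {

    @Override
    protected String generateFileNameForKeyValue(Text key,
                                                 DoubleWritable value,
                                                 String name) {
        // Route each record to an output file named after its key,
        // e.g. "BreachFace/part-00000", "Ejector/part-00000", ...
        return key.toString() + "/" + name;
    }
}

You would then register it in the driver with
conf.setOutputFormat(KeyBasedTextOutputFormat.class).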

2009/3/9 Rasit OZDAS 

> Hi, all!
>
> I'm using multiple output format to write out 4 different files, each one
> has the same type.
> But it seems that outputs aren't being sorted.
>
> Should they be sorted? Or isn't it implemented for multiple output format?
>
> Here is some code:
>
> // in main function
> MultipleOutputs.addMultiNamedOutput(conf, "text", TextOutputFormat.class,
> DoubleWritable.class, Text.class);
>
> // in Reducer.configure()
> mos = new MultipleOutputs(conf);
>
> // in Reducer.reduce()
> if (keystr.equalsIgnoreCase("BreachFace"))
>mos.getCollector("text", "BreachFace", reporter).collect(new
> Text(key), dbl);
>else if (keystr.equalsIgnoreCase("Ejector"))
>mos.getCollector("text", "Ejector", reporter).collect(new
> Text(key), dbl);
>else if (keystr.equalsIgnoreCase("FiringPin"))
>mos.getCollector("text", "FiringPin", reporter).collect(new
> Text(key), dbl);
>else if (keystr.equalsIgnoreCase("WeightedSum"))
>mos.getCollector("text", "WeightedSum",
> reporter).collect(new Text(key), dbl);
>else
>mos.getCollector("text", "Diger", reporter).collect(new
> Text(key), dbl);
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
http://daily.appspot.com/food/


MultipleOutputFormat with sorting functionality

2009-03-09 Thread Rasit OZDAS
Hi, all!

I'm using multiple output format to write out 4 different files, each one
has the same type.
But it seems that outputs aren't being sorted.

Should they be sorted? Or isn't it implemented for multiple output format?

Here is some code:

// in main function
MultipleOutputs.addMultiNamedOutput(conf, "text", TextOutputFormat.class,
        DoubleWritable.class, Text.class);

// in Reducer.configure()
mos = new MultipleOutputs(conf);

// in Reducer.reduce()
if (keystr.equalsIgnoreCase("BreachFace"))
    mos.getCollector("text", "BreachFace", reporter).collect(new Text(key), dbl);
else if (keystr.equalsIgnoreCase("Ejector"))
    mos.getCollector("text", "Ejector", reporter).collect(new Text(key), dbl);
else if (keystr.equalsIgnoreCase("FiringPin"))
    mos.getCollector("text", "FiringPin", reporter).collect(new Text(key), dbl);
else if (keystr.equalsIgnoreCase("WeightedSum"))
    mos.getCollector("text", "WeightedSum", reporter).collect(new Text(key), dbl);
else
    mos.getCollector("text", "Diger", reporter).collect(new Text(key), dbl);


-- 
M. Raşit ÖZDAŞ


Re: master trying fetch data from slave using "localhost" hostname :)

2009-03-09 Thread pavelkolodin




what does /etc/host look like now?

I hit some problems with ubuntu and localhost last week; the hostname  
was set up in /etc/hosts not just to point to the loopback address, but  
to a different loopback address (127.0.1.1) from the normal value  
(127.0.0.1), so breaking everything.


http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52


"/etc/hosts" now on both machines:

192.168.0.28    master1
192.168.0.199   slave1


Re: master trying fetch data from slave using "localhost" hostname :)

2009-03-09 Thread Steve Loughran

pavelkolo...@gmail.com wrote:
On Fri, 06 Mar 2009 14:41:57 -, jason hadoop 
 wrote:


I see that when the host name of the node is also on the localhost 
line in

/etc/hosts



I erased all records with "localhost" from all "/etc/hosts" files and 
all fine now :)

Thank you :)



what does /etc/host look like now?

I hit some problems with ubuntu and localhost last week; the hostname 
was set up in /etc/hosts not just to point to the loopback address, but 
to a different loopback address (127.0.1.1) from the normal value 
(127.0.0.1), so breaking everything.


http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52


Re: question about released version id

2009-03-09 Thread Rasit OZDAS
Hi, here is the versioning methodology of the Apache Portable Runtime;
I think Hadoop's is more or less the same.

http://apr.apache.org/versioning.html

Rasit


2009/3/3 鞠適存 

> hi,
>
> I wonder how to make the hadoop version number.
> The HowToRelease page on the hadoop web site just describes
> the process about new release but not mentions the rules on
> assigning the version number. Are there any criteria for version number?
> For example,under what condition the next version of 0.18.0 would be call
> as
> 0.19.0, and
> under what condtion  the next version of 0.18.0 would be call as 0.18.1?
> In addition, did the other Apache projects (such as hbase) use the same
> criteria to decide the
> version number?
>
> Thank you in advance for any pointers.
>
> Chu, ShihTsun
>



-- 
M. Raşit ÖZDAŞ


Re: Profiling Map/Reduce Tasks

2009-03-09 Thread Rasit OZDAS
I note System.currentTimeMillis() at the beginning of the main function,
then at the end I use a while loop to wait for the job:

while (!runningJob.isComplete())
  Thread.sleep(1000);

Then I note the system time again. But this only gives the total amount of
time that passed.
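
For completeness, a self-contained sketch of that timing loop against the old
JobClient API (the job configuration itself is left as a placeholder):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class TimedJobSubmit {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // ... configure mapper/reducer, input and output paths here ...

        long start = System.currentTimeMillis();
        RunningJob job = new JobClient(conf).submitJob(conf);

        // Poll once a second until the job finishes.
        while (!job.isComplete()) {
            Thread.sleep(1000);
        }

        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println("Total job wall-clock time: " + elapsedMs + " ms");
    }
}

If you need per-task timing, the jobtracker web UI shows start and finish
times for individual task attempts, as far as I remember.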

Rasit

2009/3/8 Richa Khandelwal 

> Hi,
> Does Map/Reduce profiles jobs down to milliseconds. From what I can see in
> the logs, there is no time specified for the job. Although CPU TIME is an
> information that should be present in the logs, it was not profiled and the
> response time can only be noted in down to seconds from the runtime
> progress
> of the jobs.
>
> Does someone know how to efficiently profile map reduce jobs?
>
> Thanks,
> Richa Khandelwal
>
>
> University Of California,
> Santa Cruz.
> Ph:425-241-7763
>



-- 
M. Raşit ÖZDAŞ


Re: Does "hadoop-default.xml" + "hadoop-site.xml" matter for whole cluster or each node?

2009-03-09 Thread Rasit OZDAS
Some parameters are global (I can't give an example right now);
they are cluster-wide even if they're defined in hadoop-site.xml.

Rasit

2009/3/9 Nick Cen 

> for Q1: i think so , but i think it is a good practice to keep the
> hadoop-default.xml untouched.
> for Q2: i use this property for debugging in eclipse.
>
>
>
> 2009/3/9 
>
> >
> >
> >  The hadoop-site.xml will take effect only on that specified node. So
> each
> >> node can have its own configuration with hadoop-site.xml.
> >>
> >>
> > As i understand, parameters in "hadoop-site" overwrites these ones in
> > "hadoop-default".
> > So "hadoop-default" also individual for each node?
> >
> > Q2: what means "local" as value of "mapred.job.tracker"?
> >
> > thanks
> >
>
>
>
> --
> http://daily.appspot.com/food/
>



-- 
M. Raşit ÖZDAŞ