Re: do counters seriously degrade performance?

2011-03-29 Thread Steve Loughran

On 28/03/11 23:34, JunYoung Kim wrote:

hi,

this linke is about hadoop usage for the good practices.

http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
 by Arun C Murthy

if I want to use about 50,000 counters for a job, will it cause a serious
performance hit?





Yes, you will use up lots of JT memory and so put limits on the overall 
size of your cluster.


If you have a small cluster and can crank up the memory settings on the 
JT to 48 GB, this isn't going to be an issue, but as Y! are topping out 
at these numbers anyway, lots of counters just overload the JT.
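
For context, a counter is typically bumped from inside a task roughly like
this (a minimal sketch against the 0.20-era "new" API; the class and enum
names are made up for illustration). Every distinct counter name is reported
back by each task and aggregated in JobTracker memory for the life of the
job, so a handful of enum values is cheap, while deriving counter names from
the data (one per user id, say) is how a job ends up with 50,000 of them.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      // A fixed, small set of counters: cheap to track.
      enum MyCounters { RECORDS_SEEN }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Incrementing is cheap locally; the cost is the per-counter
        // bookkeeping the JobTracker keeps for every task and job.
        context.getCounter(MyCounters.RECORDS_SEEN).increment(1);
        context.write(value, new IntWritable(1));
      }
    }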





Re: changing node's rack

2011-03-29 Thread Rita
I think I tried this. I have a data file which has the mapping (ip address:rack,
hostname:rack). I changed that and did a refreshNodes. Is this what you mean,
or something else? I would be more than happy to test it.
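
(For reference, the mapping described above is normally supplied through a
script named by the topology.script.file.name property, typically in
core-site.xml; the namenode invokes it with one or more hostnames or IPs as
arguments and expects one rack path per argument on stdout. A minimal sketch,
assuming a flat map file at a hypothetical path:

    #!/bin/sh
    # Hypothetical rack map with lines like: 10.0.0.12 /rack1
    RACKMAP=/etc/hadoop/rackmap
    for host in "$@"; do
      rack=$(awk -v h="$host" '$1 == h { print $2 }' "$RACKMAP")
      # Unknown hosts fall back to a default rack.
      echo "${rack:-/default-rack}"
    done

If I recall correctly, 'hadoop dfsadmin -refreshNodes' only re-reads the
include/exclude host files, not the cached rack resolution, which would be
consistent with Ted's observation below that the namenode remembers the rack.)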



On Mon, Mar 28, 2011 at 4:15 PM, Michael Segel michael_se...@hotmail.com wrote:

  This may be weird, but I could have sworn that the script is called
 repeatedly.
 One simple test would be to change the rack aware script and print a
 message out when the script is called.
 Then change the script and see if it catches the change without restarting
 the cluster.

 -Mike


  From: tdunn...@maprtech.com
  Date: Sat, 26 Mar 2011 15:50:58 -0700
  Subject: Re: changing node's rack
  To: common-user@hadoop.apache.org
  CC: rmorgan...@gmail.com

 
  I think that the namenode remembers the rack. Restarting the datanode
  doesn't make it forget.
 
  On Sat, Mar 26, 2011 at 7:34 AM, Rita rmorgan...@gmail.com wrote:
 
   What is the best way to change the rack of a node?
  
   I have tried the following: killed the datanode process, changed the
   rackmap file so the node and ip address entries reflect the new rack, did a
   '-refreshNodes', and restarted the datanode. But it seems the datanode
   keeps getting registered to the old rack.
  
   --
   --- Get your facts first, then you can distort them as you please.--
  




-- 
--- Get your facts first, then you can distort them as you please.--


live/dead node problem

2011-03-29 Thread Rita
Hello All,

Is there a parameter or procedure to check more aggressively for a live/dead
node? Despite me killing the hadoop process, I see the node active for more
than 10+ minutes in the Live Nodes page.  Fortunately, the last contact
increments.


Using, branch-0.21, 0985326

-- 
--- Get your facts first, then you can distort them as you please.--


Re: live/dead node problem

2011-03-29 Thread Harsh J
I'm not too sure about it, but I think dfs.client.socket-timeout and
dfs.datanode.socket.write.timeout keys control the timeout values
for reading/writing sockets (Defaults set by HdfsConstants.* values)
in 0.21.

On Tue, Mar 29, 2011 at 5:43 PM, Rita rmorgan...@gmail.com wrote:
 Hello All,

 Is there a parameter or procedure to check more aggressively for a live/dead
 node? Despite me killing the hadoop process, I see the node active for more
 than 10+ minutes in the Live Nodes page.  Fortunately, the last contact
 increments.


 Using, branch-0.21, 0985326

 --
 --- Get your facts first, then you can distort them as you please.--




-- 
Harsh J
http://harshj.com


Hadoop for Bioinformatics

2011-03-29 Thread Franco Nazareno
Good day everyone!

 

First, I want to congratulate the group for this wonderful project. It did
open up new ideas and solutions in computing and technology-wise. I'm
excited to learn more about it and discover possibilities using Hadoop and
its components. 

 

Well, I just want to ask this with regard to my study. I'm currently studying
for my PhD in Bioinformatics, and my question is: can you give me a (rough)
idea of whether it's possible to use a Hadoop cluster for DNA sequence
alignment? My basic idea goes something like a string search over huge data
files stored in HDFS, with the application using MapReduce for the searching
and computing. As the Hadoop paradigm implies, it doesn't serve interactive
applications well, and I think this kind of searching is a write-once,
read-many application.

 

I hope you don't mind my question. And it'll be great hearing your comments
or suggestions about this.

 

Thanks and more power!

Franco



Re: Hadoop for Bioinformatics

2011-03-29 Thread Bibek Paudel
On Mon, Mar 28, 2011 at 4:51 AM, Franco Nazareno
franco.nazar...@gmail.com wrote:
 Good day everyone!



 First, I want to congratulate the group for this wonderful project. It did
 open up new ideas and solutions in computing and technology-wise. I'm
 excited to learn more about it and discover possibilities using Hadoop and
 its components.



 Well I just want to ask this with regards to my study. Currently I'm
 studying my PhD course in Bioinformatics, and my question is that can you
 give me a (rough) idea if it's possible to use Hadoop cluster in achieving a
 DNA sequence alignment? My basic idea for this goes something like a string
 search out of huge data files stored in HDFS, and the application uses
 MapReduce in searching and computing. As the Hadoop paradigm implies, it
 doesn't serve well in interactive applications, and I think this kind of
 searching is a write-once, read-many application.

Are you looking for something like a distributed grep? The hadoop
package comes with some examples, and 'grep' is one of them.

Please see: http://wiki.apache.org/hadoop/Grep and
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html .
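
For instance, with the sequence data already copied into HDFS, the bundled
example can be run roughly like this (a sketch only; the examples jar name
varies by release, and the pattern is just an illustrative motif):

    hadoop jar hadoop-*-examples.jar grep input output 'GATTACA'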

Let us know if you are looking for something else.

-b




 I hope you don't mind my question. And it'll be great hearing your comments
 or suggestions about this.



 Thanks and more power!

 Franco




Hadoop in Canada

2011-03-29 Thread James Seigel
Hello,

You might remember me from a couple of weeks back asking if there were any 
Calgary people interested in a “meetup” about #bigdata or using hadoop.  Well, 
I’ve expanded my search a little to see if any of my Canadian brothers and 
sisters are using the elephant for good or for evil.  It might be harder to 
grab coffee, but it would be fun to see where everyone is.

Shout out if you’d like or ping me, I think it’d be fun to chat!

Cheers
James Seigel
Captain Hammer at Tynt.com

Re: Hadoop for Bioinformatics

2011-03-29 Thread Kiss Tibor
Hi Franco,

We are using Hadoop for next-gen sequence alignment.
Earlier we had a classic programming-model solution, but currently we are
upgrading our software services to an M/R model based on Hadoop.
We transferred most of our classic algorithms to Hadoop and I can say that
everything is getting more manageable.

We are going with Hadoop in the cloud and/or in the datacenter. Another
challenge, especially with the cloud, is how you transfer the data, because
in bioinformatics the amount of data is usually very high.
Currently I am working on an open-source version of Amazon multipart upload,
which will be available in the next release of JClouds
(http://code.google.com/p/jclouds/wiki/BlobStore). Here are the starting ideas
(http://www.slideshare.net/jclouds/big-data-in-real-life-a-study-on-s3-multipart-uploads)
and also a sample client app
(https://github.com/jclouds/jclouds-examples/tree/master/blobstore-largeblob).
If you want to follow new results on Twitter
(http://twitter.com/#%21/tiborkisstibor), you are invited. I plan to release a
paper with results of the data transfer operations based on this open-source
approach.

Also, soon we are releasing the version of our cloud based service stack
which is fully based on Hadoop.

Tibor

On Mon, Mar 28, 2011 at 4:51 AM, Franco Nazareno
franco.nazar...@gmail.com wrote:

 Good day everyone!



 First, I want to congratulate the group for this wonderful project. It did
 open up new ideas and solutions in computing and technology-wise. I'm
 excited to learn more about it and discover possibilities using Hadoop and
 its components.



 Well I just want to ask this with regards to my study. Currently I'm
 studying my PhD course in Bioinformatics, and my question is that can you
 give me a (rough) idea if it's possible to use Hadoop cluster in achieving
 a
 DNA sequence alignment? My basic idea for this goes something like a string
 search out of huge data files stored in HDFS, and the application uses
 MapReduce in searching and computing. As the Hadoop paradigm implies, it
 doesn't serve well in interactive applications, and I think this kind of
 searching is a write-once, read-many application.



 I hope you don't mind my question. And it'll be great hearing your comments
 or suggestions about this.



 Thanks and more power!

 Franco




Re: Hadoop for Bioinformatics

2011-03-29 Thread Luca Pireddu
On March 28, 2011 04:51:14 Franco Nazareno wrote:
 Good day everyone!

And a good day to you Franco!

 First, I want to congratulate the group for this wonderful project. It did
 open up new ideas and solutions in computing and technology-wise. I'm
 excited to learn more about it and discover possibilities using Hadoop and
 its components.
 
 
 Well I just want to ask this with regards to my study. Currently I'm
 studying my PhD course in Bioinformatics, and my question is that can you
 give me a (rough) idea if it's possible to use Hadoop cluster in achieving
 a DNA sequence alignment? My basic idea for this goes something like a
 string search out of huge data files stored in HDFS, and the application
 uses MapReduce in searching and computing. As the Hadoop paradigm implies,
 it doesn't serve well in interactive applications, and I think this kind
 of searching is a write-once, read-many application.
 
 
 
 I hope you don't mind my question. And it'll be great hearing your comments
 or suggestions about this.
 
 
 
 Thanks and more power!
 
 Franco

The short answer is yes!  At CRS4 we are working on this very problem.  

We have implemented a Hadoop-based workflow to perform short read alignment to 
support DNA sequencing activities in our lab.  Its alignment operation is 
based on (and therefore equivalent to) BWA.  We have written a paper about it 
which will appear in the coming months, and we are working on an open source 
release, but alas we haven't completed that task yet.  

We have also implemented a Hadoop-based distributed blast alignment program, 
in case you're working with long fragments.  It's currently being used by our 
collaborators to align viral DNA segments.


If you're interested, we can let you have an advance release of either 
program so you can try them out.


-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452


Re: Hadoop for Bioinformatics

2011-03-29 Thread Luca Pireddu
On March 28, 2011 04:51:14 Franco Nazareno wrote:
 
 Well I just want to ask this with regards to my study. Currently I'm
 studying my PhD course in Bioinformatics, and my question is that can you
 give me a (rough) idea if it's possible to use Hadoop cluster in achieving
 a DNA sequence alignment? My basic idea for this goes something like a
 string search out of huge data files stored in HDFS, and the application
 uses MapReduce in searching and computing. As the Hadoop paradigm implies,
 it doesn't serve well in interactive applications, and I think this kind
 of searching is a write-once, read-many application.

I'll add some relevant citations:

An overview of the Hadoop/MapReduce/HBase framework and its current 
applications in bioinformatics
http://www.biomedcentral.com/1471-2105/11/S12/S1


Biodoop: Bioinformatics on Hadoop
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPPW.2009.37


CloudBurst: highly sensitive read mapping with MapReduce
http://bioinformatics.oxfordjournals.org/content/25/11/1363.short


CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources 
for Bioinformatics Applications
http://www.computer.org/portal/web/csdl/doi/10.1109/eScience.2008.62


-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452


how to get each task's process?

2011-03-29 Thread 朱韬
Hi guys,
   I want to monitor each task's IO usage; how can I get each task's process
id?
   I used jvmManager.getPid(tip.getTaskRunner()); but it didn't seem to be
the task's process.
   Thanks,

 zhutao



# of keys per reducer invocation (streaming api)

2011-03-29 Thread Dieter Plaetinck
Hi, I'm using the streaming API and I notice my reducer gets - in the same
invocation - a bunch of different keys, and I wonder why.
I would expect to get one key per reducer run, as with the normal
hadoop.

Is this to limit the amount of spawned processes, assuming creating and
destroying processes is usually expensive compared to the amount of
work they'll need to do (not much, if you have many keys with each a
handful of values)?

OTOH if you have a high number of values over a small number of keys, I
would rather stick to one-key-per-reducer-invocation, then I don't need
to worry about supporting (and allocating memory for) multiple input
keys.  Is there a config setting to enable such behavior?

Maybe I'm missing something, but this seems like a big difference in
comparison to the default way of working, and should maybe be added to
the FAQ at
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions

thanks,
Dieter


Re: how to get each task's process

2011-03-29 Thread Nan Zhu
TaskStatus.getprogress() is helpful

Best,
Nan

2011/3/29 朱韬 ryanzhu...@163.com

 hi,guys:
   I want to monitor each task's IO usage; how can I get each task's
  process id?
   I used  jvmManager.getPid(tip.getTaskRunner()); but it didn't seem to
 be the task's process.
   Thanks

 zhutao




-- 
Nan Zhu
School of Software,5501
Shanghai Jiao Tong University
800,Dongchuan Road,Shanghai,China
E-Mail: zhunans...@gmail.com


Re:Re: how to get each task's process

2011-03-29 Thread 朱韬
Nan:
 Thanks, I mean process, not progress




On 2011-03-29 23:05:29, Nan Zhu zhunans...@gmail.com wrote:

TaskStatus.getprogress() is helpful

Best,
Nan

2011/3/29 朱韬 ryanzhu...@163.com

 hi,guys:
   I want to monitor each task's IO usage; how can I get each task's
  process id?
   I used  jvmManager.getPid(tip.getTaskRunner()); but it didn't seem to
 be the task's process.
   Thanks

 zhutao




-- 
Nan Zhu
School of Software,5501
Shanghai Jiao Tong University
800,Dongchuan Road,Shanghai,China
E-Mail: zhunans...@gmail.com


RE: live/dead node problem

2011-03-29 Thread Michael Segel

Rita,

When the NameNode doesn't see a heartbeat for 10 minutes, it then recognizes 
that the node is down. 

Per the Hadoop online documentation:
Each DataNode sends a Heartbeat message to the NameNode periodically. A 
network partition can cause a 
subset of DataNodes to lose connectivity with the NameNode. The 
NameNode detects this condition by the 
absence of a Heartbeat message. The NameNode marks DataNodes without 
recent Heartbeats as dead and 
does not forward any new IO requests to them. Any data that was 
registered to a dead DataNode is not available to HDFS any more. 
DataNode death may cause the replication 
factor of some blocks to fall below their specified value. The NameNode 
constantly tracks which blocks need 
to be replicated and initiates replication whenever necessary. The 
necessity for re-replication may arise due 
to many reasons: a DataNode may become unavailable, a replica may 
become corrupted, a hard disk on a 
DataNode may fail, or the replication factor of a file may be 
increased. 


I was trying to find out if there's an hdfs-site parameter that could be set to 
decrease this time period, but wasn't successful.

HTH

-Mike



 Date: Tue, 29 Mar 2011 08:13:43 -0400
 Subject: live/dead node problem
 From: rmorgan...@gmail.com
 To: common-user@hadoop.apache.org

 Hello All,

 Is there a parameter or procedure to check more aggressively for a live/dead
 node? Despite me killing the hadoop process, I see the node active for more
 than 10+ minutes in the Live Nodes page. Fortunately, the last contact
 increments.


 Using, branch-0.21, 0985326

 --
 --- Get your facts first, then you can distort them as you please.--
  

too many map tasks, freezing jobTracker?

2011-03-29 Thread Brendan W.
Hi,

I have a 20-node hadoop cluster, processing large log files.  I've seen it
said that there's never any reason to make the inputSplitSize larger than a
single HDFS block (64M), because you give up data locality for no benefit if
you do.

But when I kick off a job against the whole dataset with that default
splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
each.  Typically I can get through about half of them, then the jobTracker
freezes with OOM errors.

I do realize that I could just up the HADOOP_HEAP_SIZE on the jobTracker
node.  But it also seems like we ought to have fewer map tasks, lasting more
like 1 or 1.5 minutes each, to reduce the overhead to the jobTracker of
managing so many tasks...also the overhead to the cluster nodes of starting
and cleaning up after so many child JVMs.

Is that not a compelling reason for upping the inputSplitSize?  Or am I
missing something?

Thanks


Re: too many map tasks, freezing jobTracker?

2011-03-29 Thread Harsh J
Hello Brendan W.,

On Tue, Mar 29, 2011 at 9:01 PM, Brendan W. bw8...@gmail.com wrote:
 Hi,

 I have a 20-node hadoop cluster, processing large log files.  I've seen it
 said that there's never any reason to make the inputSplitSize larger than a
 single HDFS block (64M), because you give up data locality for no benefit if
 you do.

This is true. You wouldn't always want your InputSplits to have chunk
sizes bigger than the input file's block-size on the HDFS.

 But when I kick off a job against the whole dataset with that default
 splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
 each.  Typically I can get through about half of them, then the jobTracker
 freezes with OOM errors.

That your input splits are 180,000 in number is a good indicator that you have:
a) Too many files (A small files problem? [1])
b) Too low block size of the input files [2])

[1] - http://www.cloudera.com/blog/2009/02/the-small-files-problem/
[2] - For file sizes in GBs, it does not make sense to have 64 MB
block sizes. Increasing block sizes for such files (it is a per-file
property after-all) directly reduces your number of tasks.

-- 
Harsh J
http://harshj.com


Re: DFSClient: Could not complete file

2011-03-29 Thread Chris Curtin
We are narrowing this down. The last few times it hung we found a 'du -sk'
process for each of our HDFS disks as the top users of CPU. They are also
taking a really long time.

Searching around I find one example of someone reporting a similar issue
with du -sk, but they tied it to XFS. We are using Ext3.

Anyone have any other ideas since it appears to be related to the 'du' not
coming back? Note that running the command directly finishes in a few
seconds.

Thanks,

Chris

On Wed, Mar 16, 2011 at 9:41 AM, Chris Curtin curtin.ch...@gmail.com wrote:

 Caught something today I missed before:

 11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
 11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block
 blk_-517003810449127046_10039793
 11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node:
 10.120.41.103:50010
 11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.net.SocketTimeoutException: 69000 millis timeout while waiting for
 channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
 local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
 11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block
 blk_2153189599588075377_10039793
 11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node:
 10.120.41.105:50010
 11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
 /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...



 On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin curtin.ch...@gmail.com wrote:

 Thanks. Spent a lot of time looking at logs and nothing on the reducers
 until they start complaining about 'could not complete'.

 Found this in the jobtracker log file:

 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
 DFSOutputStream ResponseProcessor exception  for block
 blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for block
 blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
 Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
 10.120.41.103:50010
 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
 Recovery for block blk_3829493505250917008_9959810 in pipeline
 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
 datanode 10.120.41.103:50010
 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
 complete file
 /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
 retrying...

 Looking at the logs from the various times this happens, the 'from
 datanode' in the first message is any of the data nodes (roughly equal in #
 of times it fails), so I don't think it is one specific node having
 problems.
 Any other ideas?

 Thanks,

 Chris
   On Sun, Mar 13, 2011 at 3:45 AM, icebergs hkm...@gmail.com wrote:

  You should check the bad reducers' logs carefully. There may be more
  information about it.

 2011/3/10 Chris Curtin curtin.ch...@gmail.com

  Hi,
 
  The last couple of days we have been seeing 10's of thousands of these
  errors in the logs:
 
   INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
 
 
 /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_03_0/4129371_172307245/part-3
  retrying...
  When this is going on the reducer in question is always the last
 reducer in
  a job.
 
  Sometimes the reducer recovers. Sometimes hadoop kills that reducer,
 runs
  another and it succeeds. Sometimes hadoop kills the reducer and the new
 one
  also fails, so it gets killed and the cluster goes into a loop of
  kill/launch/kill.
 
  At first we thought it was related to the size of the data being
 evaluated
  (4+GB), but we've seen it several times today on < 100 MB
 
  Searching here or online doesn't show a lot about what this error means
 and
  how to fix it.
 
  We are running 0.20.2, r911707
 
  Any suggestions?
 
 
  Thanks,
 
  Chris
 






Re: how to get each task's process

2011-03-29 Thread Harsh J
I think jvmManager.getPid(...) is correct. It should give you the
launched JVM's PID properly. In fact, the same is even used to kill
the JVMs.
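
Once you have the PID, one way to get per-task IO numbers on Linux is the
/proc interface (a rough sketch; it needs a kernel with per-process IO
accounting enabled, and the PID below is only a placeholder for whatever
jvmManager.getPid(...) returns):

    cat /proc/12345/io   # rchar/wchar and read_bytes/write_bytes are cumulative IO counters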

2011/3/29 朱韬 ryanzhu...@163.com:
 hi,guys:
        I want to monitor each task's IO usage; how can I get each task's
  process id?
       I used  jvmManager.getPid(tip.getTaskRunner()); but it didn't seem to 
 be the task's process.
       Thanks
                                                                               
                   zhutao



-- 
Harsh J
http://harshj.com


Re: Hadoop in Canada

2011-03-29 Thread Jean-Daniel Cryans
(moving to general@ since this is not a question regarding the usage
of the hadoop commons, which I BCC'd)

I moved from Montreal to SF a year and a half ago because I saw two
things 1) companies weren't interested (they are still trying to get
rid of COBOL or worse) or didn't have the data to use Hadoop (not
enough big companies) and 2) the universities were either uninterested
or just amused by this new comer. I know of one company that really
does cool stuff with Hadoop in Montreal and it's Hopper
(www.hopper.travel, they are still in closed alpha AFAIK) who also
organized hackreduce.org last weekend. This is what their CEO has to
say to the question "Is there something you would do differently now
if you would start it over?":

"Move to the Valley."

(see the rest here
http://nextmontreal.com/product-market-fit-hopper-travel-fred-lalonde/)

I'm sure there are a lot of other companies that are either
considering using or already using Hadoop to some extent in Canada
but, like anything else, only a portion of them are interested in
talking about it or even organizing an event.

I would actually love to see something getting organized and I'd be on
the first plane to Y**, but I'm afraid that to achieve any sort of
critical mass you'd have to fly in people from all the provinces. Air
Canada becomes a SPOF :P

Now that I think about it, there's probably enough Canucks around here
that use Hadoop that we could have our own little user group. If you
want to have a nice vacation and geek out with us, feel free to stop
by and say hi.

/rant

J-D

On Tue, Mar 29, 2011 at 6:21 AM, James Seigel ja...@tynt.com wrote:
 Hello,

 You might remember me from a couple of weeks back asking if there were any 
 Calgary people interested in a “meetup” about #bigdata or using hadoop.  
 Well, I’ve expanded my search a little to see if any of my Canadian brothers 
 and sisters are using the elephant for good or for evil.  It might be harder 
 to grab coffee, but it would be fun to see where everyone is.

 Shout out if you’d like or ping me, I think it’d be fun to chat!

 Cheers
 James Seigel
 Captain Hammer at Tynt.com


Re: Hadoop in Canada

2011-03-29 Thread James Seigel
I apologize for posting it to user.

Sorry. Forgot about General

Thanks J-D

James.
On 2011-03-29, at 11:33 AM, Jean-Daniel Cryans wrote:

 (moving to general@ since this is not a question regarding the usage
 of the hadoop commons, which I BCC'd)
 
 I moved from Montreal to SF a year and a half ago because I saw two
 things 1) companies weren't interested (they are still trying to get
 rid of COBOL or worse) or didn't have the data to use Hadoop (not
 enough big companies) and 2) the universities were either uninterested
 or just amused by this new comer. I know of one company that really
 does cool stuff with Hadoop in Montreal and it's Hopper
 (www.hopper.travel, they are still in closed alpha AFAIK) who also
 organized hackreduce.org last weekend. This is what their CEO has to
 say to the question "Is there something you would do differently now
 if you would start it over?":
 
 "Move to the Valley."
 
 (see the rest here
 http://nextmontreal.com/product-market-fit-hopper-travel-fred-lalonde/)
 
 I'm sure there are a lot of other companies that are either
 considering using or already using Hadoop to some extent in Canada
 but, like anything else, only a portion of them are interested in
 talking about it or even organizing an event.
 
 I would actually love to see something getting organized and I'd be on
 the first plane to Y**, but I'm afraid that to achieve any sort of
 critical mass you'd have to fly in people from all the provinces. Air
 Canada becomes a SPOF :P
 
 Now that I think about it, there's probably enough Canucks around here
 that use Hadoop that we could have our own little user group. If you
 want to have a nice vacation and geek out with us, feel free to stop
 by and say hi.
 
 /rant
 
 J-D
 
 On Tue, Mar 29, 2011 at 6:21 AM, James Seigel ja...@tynt.com wrote:
 Hello,
 
 You might remember me from a couple of weeks back asking if there were any 
 Calgary people interested in a “meetup” about #bigdata or using hadoop.  
 Well, I’ve expanded my search a little to see if any of my Canadian brothers 
 and sisters are using the elephant for good or for evil.  It might be harder 
 to grab coffee, but it would be fun to see where everyone is.
 
 Shout out if you’d like or ping me, I think it’d be fun to chat!
 
 Cheers
 James Seigel
 Captain Hammer at Tynt.com



Gzip compression thread safety

2011-03-29 Thread Jose Vinicius Pimenta Coletto
Can anyone tell me if gzip compression of Hadoop is thread safe?

Thank you.
-- 
Jose Vinicius Pimenta Coletto


Re: # of keys per reducer invocation (streaming api)

2011-03-29 Thread Harsh J
Hello,

On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck
dieter.plaeti...@intec.ugent.be wrote:
 Hi, I'm using the streaming API and I notice my reducer gets - in the same
 invocation - a bunch of different keys, and I wonder why.
 I would expect to get one key per reducer run, as with the normal
 hadoop.

 Is this to limit the amount of spawned processes, assuming creating and
 destroying processes is usually expensive compared to the amount of
 work they'll need to do (not much, if you have many keys with each a
 handful of values)?

 OTOH if you have a high number of values over a small number of keys, I
 would rather stick to one-key-per-reducer-invocation, then I don't need
 to worry about supporting (and allocating memory for) multiple input
 keys.  Is there a config setting to enable such behavior?

 Maybe I'm missing something, but this seems like a big difference in
 comparison to the default way of working, and should maybe be added to
 the FAQ at
 http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions

 thanks,
 Dieter


I think it would make more sense to think of streaming programs as
complete map/reduce 'tasks', instead of trying to apply the Map/Reduce
functional concept. Both programs need to be written from the reading
level onwards: in the map's case each line is a record input, and in the
reduce's case it is one uniquely grouped key and all the values
associated with it, for every key in the sorted input. One needs to
handle the reading loop oneself.
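
As a concrete illustration, a minimal streaming reducer that does its own
key grouping might look like this (an awk sketch that sums integer values
per key; tab-separated key/value lines, sorted by key, are the streaming
default):

    #!/usr/bin/awk -f
    # The same invocation sees many different keys, so emit a total
    # whenever the key changes, and once more at end of input.
    BEGIN { FS = OFS = "\t" }
    {
      if (NR > 1 && $1 != prev) { print prev, sum; sum = 0 }
      prev = $1
      sum += $2
    }
    END { if (NR > 0) print prev, sum }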

Some non-Java libraries that provide abstractions atop the
streaming/etc. layer allow for more fluent representations of the
map() and reduce() functions, hiding away the other fine details (like
the Java API). Dumbo[1] is such a library for Python Hadoop Map/Reduce
programs, for example.

A FAQ entry on this should do good too! You can file a ticket for an
addition of this observation to the streaming docs' FAQ.

[1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial

-- 
Harsh J
http://harshj.com


Re: Hadoop for Bioinformatics

2011-03-29 Thread Tsz Wo (Nicholas), Sze
Hi Franco,

I recall that there are some Hadoop-BLAST research projects.  For example, 
see


- http://www.cs.umd.edu/Grad/scholarlypapers/papers/MichaelSchatz.pdf
- http://salsahpc.indiana.edu/tutorial/hadoopblast.html

Nicholas




From: Franco Nazareno franco.nazar...@gmail.com
To: common-user@hadoop.apache.org
Sent: Sun, March 27, 2011 7:51:14 PM
Subject: Hadoop for Bioinformatics

Good day everyone!



First, I want to congratulate the group for this wonderful project. It did
open up new ideas and solutions in computing and technology-wise. I'm
excited to learn more about it and discover possibilities using Hadoop and
its components. 



Well I just want to ask this with regards to my study. Currently I'm
studying my PhD course in Bioinformatics, and my question is that can you
give me a (rough) idea if it's possible to use Hadoop cluster in achieving a
DNA sequence alignment? My basic idea for this goes something like a string
search out of huge data files stored in HDFS, and the application uses
MapReduce in searching and computing. As the Hadoop paradigm implies, it
doesn't serve well in interactive applications, and I think this kind of
searching is a write-once, read-many application.



I hope you don't mind my question. And it'll be great hearing your comments
or suggestions about this.



Thanks and more power!

Franco

How do I increase mapper granularity?

2011-03-29 Thread W.P. McNeill
I'm running a job whose mappers take a long time, which causes problems like
starving out other jobs that want to run on the same cluster.  Rewriting the
mapper algorithm is not currently an option, but I still need a way to
increase the number of mappers so that I will have greater granularity.
 What is the best way to do this?

Looking through the O'Reilly book and starting from this Wiki page
(http://wiki.apache.org/hadoop/HowManyMapsAndReduces),
I've come up with a couple of ideas:

   1. Set mapred.map.tasks to the value I want.
   2. Decrease the block size of my input files.

What are the gotchas with these approaches?  I know that (1) may not work
because this parameter is just a suggestion.  Is there a command line option
that accomplishes (2), or do I have to do a distcp with a non-default block
size?  (I think the answer is that I have to do a distcp, but I'm making
sure.)

Are there other approaches?  Are there other gotchas that come with trying
to increase mapper granularity?  I know this can be more of an art than a
science.

Thanks.


Re: How do I increase mapper granularity?

2011-03-29 Thread Harsh J
Hello,

On Tue, Mar 29, 2011 at 11:48 PM, W.P. McNeill bill...@gmail.com wrote:
   2. Decrease the block size of my input files.
 do I have to do a distcp with a non-default block
 size.  (I think the answer is that I have to do a distcp, but I'm making
 sure.)

A distcp or even a plain -cp with a proper -Ddfs.blocksize=size
parameter passed along should do the trick.
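
For example, something along these lines rewrites data with 256 MB blocks
(a hedged sketch: on 0.20.x the key is dfs.block.size, while later releases
accept dfs.blocksize, and the paths are placeholders):

    hadoop distcp -Ddfs.block.size=268435456 /logs/input /logs/input-256m
    # or, for a single file:
    hadoop fs -Ddfs.block.size=268435456 -cp /logs/big.log /logs/big-256m.log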

 Are there other approaches?

You can have a look at schedulers that guarantee resources to a
submitted job, perhaps?

 Are there other gotchas that come with trying
 to increase mapper granularity.

One thing that comes to mind is that the more splits you have
(a.k.a. # of tasks), the more meta info the JobTracker has to hold and
maintain in its memory. Second, your NameNode also needs to hold a
larger amount of metadata in memory for every such granular set of
files (since lowering block sizes leads to a LOT more block info
and replica locations to keep track of).

-- 
Harsh J
http://harshj.com


map JVMs do not cease to exist

2011-03-29 Thread Shrinivas Joshi
Noticed this on a TeraSort run - map JVM processes do not exit/cease to
exist even a long while after successful execution of all map tasks.
Resources consumed by these JVM processes do not seem to be relinquished
either, and that causes poor performance in the rest of the reduce phase,
which continues execution after the map phase is done.

Do you see this as an issue? If so, is it a known issue?

Thanks,
-Shrinivas


Re: Comparison between Gzip and LZO

2011-03-29 Thread Jose Vinicius Pimenta Coletto
During this month I refactored the code used for the tests and kept running
them against the same data set mentioned above (about 92,000 files with an
average size of 2 KB), but with a different procedure: I ran the compression
and decompression 50 times on eight different computers.
The results were no different from those previously reported: on average
gzip was two times faster than LZO.
As a last resort I did some profiling with JProfiler, but found nothing to
explain why gzip would be faster than LZO.

At this address, http://www.linux.ime.usp.br/~jvcoletto/compression/, I share
the table with the results obtained in the tests, the code used in the tests,
and the results obtained in JProfiler.
Anyone have any ideas to help me?

Thank you.
-- 
Jose Vinicius Pimenta Coletto


On 2 March 2011, at 16:32, José Vinícius Pimenta Coletto 
jvcole...@gmail.com wrote:

 Hi,

 I'm making a comparison between the following compression methods: gzip
 and lzo provided by Hadoop and gzip from package java.util.zip.
 The test consists of compression and decompression of approximately 92,000
 files with an average size of 2 KB; however, the decompression time of LZO is
 twice the decompression time of the gzip provided by Hadoop, which does not
 seem right.
 The results obtained in the test are:

  Method               |   Bytes   | Compr. total (w/ i/o) | Compr. time | Compr. speed      | Decompr. total (w/ i/o) | Decompr. time | Decompr. speed
  Gzip (Hadoop)        | 200876304 | 121.454s              | 43.167s     | 4,653,424.079 B/s | 332.305s                | 111.806s      | 1,796,635.326 B/s
  Lzo                  | 200876304 | 120.564s              | 54.072s     | 3,714,914.621 B/s | 509.371s                | 184.906s      | 1,086,368.904 B/s
  Gzip (java.util.zip) | 200876304 | 148.014s              | 63.414s     | 3,167,647.371 B/s | 483.148s                | 4.528s        | 44,360,682.244 B/s

 You can see the code I'm using for the test here:
 http://www.linux.ime.usp.br/~jvcoletto/compression/

 Can anyone explain me why am I getting these results?
 Thanks.



Re: too many map tasks, freezing jobTracker?

2011-03-29 Thread Brendan W.
Thanks Harsh...it's definitely (2) below, i.e., giant files.

But what would be the benefit of actually changing the DFS block size (to
say N*64 Mbytes), as opposed to just increasing the inputSplitSize to N
64-Mbyte blocks for my job?  Both will reduce my number of mappers by a
factor of N, right?  Any benefit to one over the other?

On Tue, Mar 29, 2011 at 12:39 PM, Harsh J qwertyman...@gmail.com wrote:

 Hello Brendan W.,

 On Tue, Mar 29, 2011 at 9:01 PM, Brendan W. bw8...@gmail.com wrote:
  Hi,
 
  I have a 20-node hadoop cluster, processing large log files.  I've seen
 it
  said that there's never any reason to make the inputSplitSize larger than
 a
  single HDFS block (64M), because you give up data locality for no benefit
 if
  you do.

 This is true. You wouldn't always want your InputSplits to have chunk
 sizes bigger than the input file's block-size on the HDFS.

  But when I kick off a job against the whole dataset with that default
  splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
  each.  Typically I can get through about half of them, then the
 jobTracker
  freezes with OOM errors.

 That your input splits are 180,000 in number is a good indicator that you
 have:
 a) Too many files (A small files problem? [1])
 b) Too low block size of the input files [2])

 [1] - http://www.cloudera.com/blog/2009/02/the-small-files-problem/
 [2] - For file sizes in GBs, it does not make sense to have 64 MB
 block sizes. Increasing block sizes for such files (it is a per-file
 property after-all) directly reduces your number of tasks.

 --
 Harsh J
 http://harshj.com



Re: live/dead node problem

2011-03-29 Thread Ravi Prakash
I set these parameters for quickly discovering live / dead nodes.

For 0.20: heartbeat.recheck.interval
For 0.22: dfs.namenode.heartbeat.recheck-interval and dfs.heartbeat.interval

Cheers,
Ravi
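
As a rough illustration, on 0.20 this could look like the following in the
namenode's hdfs-site.xml (the values are examples, not recommendations; the
recheck interval is in milliseconds, the heartbeat interval in seconds, and
a datanode is declared dead after roughly 2 x recheck + 10 x heartbeat,
about 10.5 minutes with the defaults):

    <property>
      <name>heartbeat.recheck.interval</name>
      <value>15000</value>
    </property>
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>
    </property>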

On 3/29/11 10:24 AM, Michael Segel michael_se...@hotmail.com wrote:



Rita,

When the NameNode doesn't see a heartbeat for 10 minutes, it then recognizes 
that the node is down.

Per the Hadoop online documentation:
Each DataNode sends a Heartbeat message to the NameNode periodically. A 
network partition can cause a
subset of DataNodes to lose connectivity with the NameNode. The 
NameNode detects this condition by the
absence of a Heartbeat message. The NameNode marks DataNodes without 
recent Heartbeats as dead and
does not forward any new IO requests to them. Any data that was
registered to a dead DataNode is not available to HDFS any more. 
DataNode death may cause the replication
factor of some blocks to fall below their specified value. The NameNode 
constantly tracks which blocks need
to be replicated and initiates replication whenever necessary. The 
necessity for re-replication may arise due
to many reasons: a DataNode may become unavailable, a replica may 
become corrupted, a hard disk on a
DataNode may fail, or the replication factor of a file may be increased.


I was trying to find out if there's an hdfs-site parameter that could be set to 
decrease this time period, but wasn't successful.

HTH

-Mike



 Date: Tue, 29 Mar 2011 08:13:43 -0400
 Subject: live/dead node problem
 From: rmorgan...@gmail.com
 To: common-user@hadoop.apache.org

 Hello All,

 Is there a parameter or procedure to check more aggressively for a live/dead
 node? Despite me killing the hadoop process, I see the node active for more
 than 10+ minutes in the Live Nodes page. Fortunately, the last contact
 increments.


 Using, branch-0.21, 0985326

 --
 --- Get your facts first, then you can distort them as you please.--




namenode wont start

2011-03-29 Thread Bill Brune

Hi,

I've been running hadoop 0.20.2 for a while now on 2 different clusters 
that I set up. Now on this new cluster I can't get the namenode to stay 
up.  It exits with an IOException (incomplete hdfs uri) and prints the URI:

hdfs://rmsi_combined.rmsi.com:54310  - which looks complete to me.

All the usual suspects are OK: DNS works both ways, passphraseless ssh 
works fine, 54310 is open, so I don't know what to tell the IT folks to 
fix.  In core-site.xml, fs.default.name is set to the IP address 
hdfs://10.2.50.235:54310 (same problem if it is set to the hostname).


The only way to get it to work is to set fs.default.name to 
hdfs://localhost:54310, but that isn't going to let me run a multi-node 
cluster.


The IOException boils up from the uri.getHost() call in 
DistributedFileSystem.initialize(), as shown in the attached log snippet.


The exception is thrown if the host returned by getHost() is null.
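
For what it's worth, java.net.URI returns a null host whenever the authority
does not parse as a syntactically valid hostname, and the underscore in
rmsi_combined is one thing that trips that parsing up. A standalone check
(plain Java, not Hadoop code):

    import java.net.URI;

    public class HostCheck {
      public static void main(String[] args) throws Exception {
        // Prints "null": '_' is not legal in a URI host, so parsing falls
        // back to a registry-based authority with no host component.
        System.out.println(new URI("hdfs://rmsi_combined.rmsi.com:54310").getHost());
        // Prints the hostname: a dash parses fine.
        System.out.println(new URI("hdfs://rmsi-combined.rmsi.com:54310").getHost());
      }
    }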

This has got to be some kind of permission problem somewhere.

Anyone have any ideas where to look?

Thanks!

-Bill


STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = rmsi_combined/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
/
2011-03-30 07:32:37,748 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=54310
2011-03-30 07:32:37,759 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: rmsi_combined.rmsi.com/10.2.50.235:54310
2011-03-30 07:32:37,763 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2011-03-30 07:32:37,765 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-03-30 07:32:37,845 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=dmc2,dmc2
2011-03-30 07:32:37,845 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-03-30 07:32:37,845 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=false
2011-03-30 07:32:37,858 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-03-30 07:32:37,861 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2011-03-30 07:32:37,912 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1
2011-03-30 07:32:37,919 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
2011-03-30 07:32:37,919 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 loaded in 0 seconds.
2011-03-30 07:32:37,920 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hdfs/name/current/edits of size 4 edits # 0 loaded in 0 seconds.
2011-03-30 07:32:37,931 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94 saved in 0 seconds.
2011-03-30 07:32:37,967 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 146 msecs
2011-03-30 07:32:37,969 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of blocks = 0
2011-03-30 07:32:37,969 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid blocks = 0
2011-03-30 07:32:37,969 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of under-replicated blocks = 0
2011-03-30 07:32:37,969 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of  over-replicated blocks = 0
2011-03-30 07:32:37,969 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 0 secs.
2011-03-30 07:32:37,970 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
2011-03-30 07:32:37,970 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2011-03-30 07:32:38,151 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2011-03-30 07:32:38,245 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50070
2011-03-30 07:32:38,247 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50070 webServer.getConnectors()[0].getLocalPort() returned 50070
2011-03-30 07:32:38,247 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50070
2011-03-30 07:32:38,247 INFO org.mortbay.log: jetty-6.1.14
2011-03-30 07:32:38,663 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50070
2011-03-30 07:32:38,663 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: 0.0.0.0:50070
2011-03-30 07:32:38,663 INFO org.apache.hadoop.ipc.Server: IPC 

Re: Comparison between Gzip and LZO

2011-03-29 Thread Greg Roelofs
 During this month I refactor the code used for the tests and kept doing
  them with the same base mentioned above (about 92 000 files with an average
 size of 2kb),

Those are _tiny_.  It seems likely to me that you're spending most of your
time on I/O related to metadata (disk seeks, directory traversal, file open/
close, codec setup/teardown, buffer-cache churn) and very little on real
compression or even real file I/O.  Is any of this happening on HDFS?  If
so, add network I/O and namenode overhead, too.

For Hadoop, your file sizes should start at megabytes or tens of megabytes,
and it will really hit its stride above that.

Also, are you compressing text or binaries?

 In this address http://www.linux.ime.usp.br/~jvcoletto/compression/ I share 
 the
 table with the results obtained in the tests, the code used in the tests and
  the results obtained in JProfiler.

In my own tests with (C) command-line tools on Linux (and I've now forgotten
whether the system used fast SCSI disks or regular SATA), lzop's decompression
speed averaged 18-21 compressed MB/sec for binaries and 5-8 cMB/sec for text.
gzip on the same corpus averaged 9-10 cMB/sec for binaries and 3.5-4.5 cMB/sec
for text.  (Text compresses better, so the same input size means more output
size = slower due to I/O.)

For compression, gzip ranged from 2.5-10 uncompressed MB/sec, depending on
data type and compression level.  lzop is basically two compressors; for
levels 1-6, it averaged 15-16.5 ucMB/sec regardless of input or level, while
levels 7-9 dropped from 3 to 1 ucMB/sec.  (IOW, don't use LZO levels above 6.)

Java interfaces will add some overhead, but since all of the codecs in question
are ultimately native C code, this should give you some idea of which numbers
are most suspect. But don't bother benchmarking anything much below a megabyte;
it's a waste of time.

Greg


Re: DFSClient: Could not complete file

2011-03-29 Thread Brian Bockelman
Hi Chris,

One thing we've found helpful on ext3 is examining your I/O scheduler.  Make 
sure it's set to deadline, not CFQ.  This will help prevent nodes from 
being overloaded; when the du -sk is performed and the node is already 
overloaded, things quickly roll downhill.
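
For reference, the scheduler can be checked and switched per device at
runtime, roughly like this (sdb is a placeholder for each data disk; most
distributions also accept elevator=deadline on the kernel command line):

    cat /sys/block/sdb/queue/scheduler             # e.g. noop anticipatory [cfq] deadline
    echo deadline > /sys/block/sdb/queue/scheduler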

Brian

On Mar 29, 2011, at 11:44 AM, Chris Curtin wrote:

 We are narrowing this down. The last few times it hung we found a 'du -sk'
 process for each of our HDFS disks as the top users of CPU. They are also
 taking a really long time.
 
 Searching around I find one example of someone reporting a similar issue
 with du -sk, but they tied it to XFS. We are using Ext3.
 
 Anyone have any other ideas since it appears to be related to the 'du' not
 coming back? Note that running the command directly finishes in a few
 seconds.
 
 Thanks,
 
 Chris
 
 On Wed, Mar 16, 2011 at 9:41 AM, Chris Curtin curtin.ch...@gmail.com wrote:
 
 Caught something today I missed before:
 
 11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
 11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block
 blk_-517003810449127046_10039793
 11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node:
 10.120.41.103:50010
 11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.net.SocketTimeoutException: 69000 millis timeout while waiting for
 channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
 local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
 11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block
 blk_2153189599588075377_10039793
 11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node:
 10.120.41.105:50010
 11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
 /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...
 
 
 
 On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin curtin.ch...@gmail.com wrote:
 
 Thanks. Spent a lot of time looking at logs and nothing on the reducers
 until they start complaining about 'could not complete'.
 
 Found this in the jobtracker log file:
 
 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
 DFSOutputStream ResponseProcessor exception  for block
 blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for block
 blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
 Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
 10.120.41.103:50010
 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
 Recovery for block blk_3829493505250917008_9959810 in pipeline
 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
 datanode 10.120.41.103:50010
 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
 complete file
 /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
 retrying...
 
 Looking at the logs from the various times this happens, the 'from
 datanode' in the first message is any of the data nodes (roughly equal in #
 of times it fails), so I don't think it is one specific node having
 problems.
 Any other ideas?
 
 Thanks,
 
 Chris
  On Sun, Mar 13, 2011 at 3:45 AM, icebergs hkm...@gmail.com wrote:
 
 You should check the bad reducers' logs carefully. There may be more
 information about it.
 
 2011/3/10 Chris Curtin curtin.ch...@gmail.com
 
 Hi,
 
 The last couple of days we have been seeing 10's of thousands of these
 errors in the logs:
 
 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
 
 
 /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_03_0/4129371_172307245/part-3
 retrying...
 When this is going on the reducer in question is always the last
 reducer in
 a job.
 
 Sometimes the reducer recovers. Sometimes hadoop kills that reducer,
 runs
 another and it succeeds. Sometimes hadoop kills the reducer and the new
 one
 also fails, so it gets killed and the cluster goes into a loop of
 kill/launch/kill.
 
 At first we thought it was related to the size of the data being
 evaluated
 (4+GB), but we've seen it several times today on < 100 MB
 
 Searching here or online doesn't show a lot about what this error means
 and
 how to fix it.
 
 We are running 0.20.2, r911707
 
 Any suggestions?
 
 
 Thanks,
 
 Chris
 
 
 
 
 





Re: live/dead node problem

2011-03-29 Thread Rita
What about for 0.21?

Also, where do you set this: in the datanode configuration or the namenode's?
It seems the default is set to 3 seconds.

On Tue, Mar 29, 2011 at 5:37 PM, Ravi Prakash ravip...@yahoo-inc.com wrote:

  I set these parameters for quickly discovering live / dead nodes.

 For 0.20 : heartbeat.recheck.interval
 For 0.22 : dfs.namenode.heartbeat.recheck-interval dfs.heartbeat.interval

 Cheers,
 Ravi


 On 3/29/11 10:24 AM, Michael Segel michael_se...@hotmail.com wrote:



 Rita,

 When the NameNode doesn't see a heartbeat for 10 minutes, it then
 recognizes that the node is down.

 Per the Hadoop online documentation:
 Each DataNode sends a Heartbeat message to the NameNode periodically. A
 network partition can cause a
 subset of DataNodes to lose connectivity with the NameNode. The
 NameNode detects this condition by the
 absence of a Heartbeat message. The NameNode marks DataNodes
 without recent Heartbeats as dead and
 does not forward any new IO requests to them. Any data that was
 registered to a dead DataNode is not available to HDFS any more.
 DataNode death may cause the replication
 factor of some blocks to fall below their specified value. The
 NameNode constantly tracks which blocks need
 to be replicated and initiates replication whenever necessary. The
 necessity for re-replication may arise due
 to many reasons: a DataNode may become unavailable, a replica may
 become corrupted, a hard disk on a
 DataNode may fail, or the replication factor of a file may be
 increased.
 

 I was trying to find out if there's an hdfs-site parameter that could be
 set to decrease this time period, but wasn't successful.

 HTH

 -Mike


 
  Date: Tue, 29 Mar 2011 08:13:43 -0400
  Subject: live/dead node problem
  From: rmorgan...@gmail.com
  To: common-user@hadoop.apache.org
 
  Hello All,
 
  Is there a parameter or procedure to check more aggressively for a
 live/dead
  node? Despite me killing the hadoop process, I see the node active for
 more
  than 10+ minutes in the Live Nodes page. Fortunately, the last contact
  increments.
 
 
  Using, branch-0.21, 0985326
 
  --
  --- Get your facts first, then you can distort them as you please.--





-- 
--- Get your facts first, then you can distort them as you please.--


Re: too many map tasks, freezing jobTracker?

2011-03-29 Thread Harsh J
Hello again,

On Wed, Mar 30, 2011 at 2:42 AM, Brendan W. bw8...@gmail.com wrote:
 But what would be the benefit of actually changing the DFS block size (to
 say N*64 Mbytes), as opposed to just increasing the inputSplitSize to N
 64-Mbyte blocks for my job?  Both will reduce my number of mappers by a
 factor of N, right?  Any benefit to one over the other?

You'll have a data locality benefit if you choose to change the block
size of the files themselves, instead of going for 'logical' splitting by
the framework. This should save you a great deal of network transfer
cost in your job.

-- 
Harsh J
http://harshj.com