Re: Distributed Clusters

2010-04-08 Thread Ravi Phulari
Hello James,

I am new to this group, and relatively new to hadoop.
Welcome to the group!!

I am looking at building a large cluster.  I was wondering if anyone has any 
best practices for a cluster in the hundreds of nodes?  As well, has anyone had 
experience with a cluster spanning multiple data centers.  Is this a bad 
practice? moderately bad practice?  insane?

You can find answers to most of the questions here - 
http://wiki.apache.org/hadoop/
I am not sure whether there are clusters spanning multiple data centers. Even if 
there are such clusters, I am very confident that Hadoop would work on a cluster 
spanning multiple data centers.

Is it better to build the 1000 node cluster in a single data center?  Do you 
back one of these things up to a second data center or a different 1000 node 
cluster?

If you are completely new to Hadoop then it's better to start with a 100-200 
node cluster and learn how it works. You can obviously scale to 1000 or 
more nodes later.

Regards,
Ravi
--
Hadoop @ Yahoo!


Berlin Buzzwords - early registration extended

2010-04-08 Thread Isabel Drost

Hello,

we would like to invite everyone interested in data storage, analysis and 
search 
to join us for two days on June 7/8th in Berlin for an in-depth, technical, 
developer-focused conference located in the heart of Europe. Presentations will 
range from beginner friendly introductions on the hot data analysis topics up 
to 
in-depth technical presentations of scalable architectures.

Our intention is to bring together users and developers of data storage, 
analysis and search projects. Meet members of the development team working on 
projects you use. Get in touch with other developers you may know only from 
mailing list discussions. Exchange ideas with those using your software and get 
their feedback while having a drink in one of Berlin's many bars.

Early bird registration has been extended until April 17th - so don't wait too 
long. Tickets are available at: http://berlinbuzzwords.de/content/tickets

If you would like to submit a talk yourself: Conference submission is open for 
little more than one week. More details are available online in the call for 
presentations:

http://berlinbuzzwords.de/content/call-presentations-open

Looking forward to meeting you in the beautiful, vibrant city of Berlin this 
summer for a conference packed with high profile speakers, awesome talks and 
lots of interesting discussions.

Isabel




Hadoop overhead

2010-04-08 Thread Aleksandar Stupar
Hi all,

As I understand it, Hadoop is mainly used for tasks that take a long
time to execute. I'm considering using Hadoop for a task
whose lower bound in distributed execution is around 5 to 10
seconds, and I am wondering what the overhead of
using Hadoop would be.

Does anyone have an idea? Any link where I can find this out?

Thanks,
Aleksandar. 


  

Re: Hadoop overhead

2010-04-08 Thread Jeff Zhang
By default, Hadoop will create a new JVM process for each task, which in my
opinion is the major cost. You can customize the configuration to let the
tasktracker reuse JVMs and eliminate that overhead to some extent.
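
For reference, a sketch of the relevant setting in Hadoop 0.20 (set per job or
in mapred-site.xml; the value is the number of tasks a JVM may run, -1 meaning
no limit):

<!-- reuse each task JVM for an unlimited number of tasks of the same job -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>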

On Thu, Apr 8, 2010 at 8:55 PM, Aleksandar Stupar 
stupar.aleksan...@yahoo.com wrote:

 Hi all,

 As I realize hadoop is mainly used for tasks that take long
 time to execute. I'm considering to use hadoop for task
 whose lower bound in distributed execution is like 5 to 10
 seconds. Am wondering what would the overhead be with
 using hadoop.

 Does anyone have an idea? Any link where I can find this out?

 Thanks,
 Aleksandar.







-- 
Best Regards

Jeff Zhang


Re: Hadoop overhead

2010-04-08 Thread Rajesh Balamohan
If you run too many short-duration jobs, you might want to keep an eye on the
JobTracker and tweak the number of heartbeats processed per second and the
out-of-band heartbeat option. The JobTracker might be bombarded with events
otherwise.
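
If your Hadoop version has it, the out-of-band heartbeat setting looks roughly
like this (the property name shown below is from later releases and may differ
in yours):

<!-- let tasktrackers send a heartbeat immediately on task completion -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>true</value>
</property>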



On Thu, Apr 8, 2010 at 8:07 PM, Jeff Zhang zjf...@gmail.com wrote:

 By default, for each task hadoop will create a new jvm process which will
 be
 the major cost in my opinion. You can customize configuration to let
 tasktracker reuse the jvm to eliminate the overhead to some extend.

 On Thu, Apr 8, 2010 at 8:55 PM, Aleksandar Stupar 
 stupar.aleksan...@yahoo.com wrote:

  Hi all,
 
  As I realize hadoop is mainly used for tasks that take long
  time to execute. I'm considering to use hadoop for task
  whose lower bound in distributed execution is like 5 to 10
  seconds. Am wondering what would the overhead be with
  using hadoop.
 
  Does anyone have an idea? Any link where I can find this out?
 
  Thanks,
  Aleksandar.
 
 
 




 --
 Best Regards

 Jeff Zhang




-- 
~Rajesh.B


Re: Hadoop overhead

2010-04-08 Thread Patrick Angeles
Packaging the job and config and sending them to the JobTracker and the various
nodes also adds a few seconds of overhead.

On Thu, Apr 8, 2010 at 10:37 AM, Jeff Zhang zjf...@gmail.com wrote:

 By default, for each task hadoop will create a new jvm process which will
 be
 the major cost in my opinion. You can customize configuration to let
 tasktracker reuse the jvm to eliminate the overhead to some extend.

 On Thu, Apr 8, 2010 at 8:55 PM, Aleksandar Stupar 
 stupar.aleksan...@yahoo.com wrote:

  Hi all,
 
  As I realize hadoop is mainly used for tasks that take long
  time to execute. I'm considering to use hadoop for task
  whose lower bound in distributed execution is like 5 to 10
  seconds. Am wondering what would the overhead be with
  using hadoop.
 
  Does anyone have an idea? Any link where I can find this out?
 
  Thanks,
  Aleksandar.
 
 
 




 --
 Best Regards

 Jeff Zhang



Re: Hadoop overhead

2010-04-08 Thread Edward Capriolo
On Thu, Apr 8, 2010 at 10:51 AM, Patrick Angeles patr...@cloudera.comwrote:

 Packaging the job and config and sending it to the JobTracker and various
 nodes also adds a few seconds overhead.

 On Thu, Apr 8, 2010 at 10:37 AM, Jeff Zhang zjf...@gmail.com wrote:

  By default, for each task hadoop will create a new jvm process which will
  be
  the major cost in my opinion. You can customize configuration to let
  tasktracker reuse the jvm to eliminate the overhead to some extend.
 
  On Thu, Apr 8, 2010 at 8:55 PM, Aleksandar Stupar 
  stupar.aleksan...@yahoo.com wrote:
 
   Hi all,
  
   As I realize hadoop is mainly used for tasks that take long
   time to execute. I'm considering to use hadoop for task
   whose lower bound in distributed execution is like 5 to 10
   seconds. Am wondering what would the overhead be with
   using hadoop.
  
   Does anyone have an idea? Any link where I can find this out?
  
   Thanks,
   Aleksandar.
  
  
  
 
 
 
 
  --
  Best Regards
 
  Jeff Zhang
 


All jobs make entries in a jobhistory directory on the task tracker. As of
now the jobhistory directory has some limitations: with ext3 you hit the maximum
number of files in a directory at 32k; with xfs or ext4 there is no
theoretical limit, but Hadoop itself will bog down if the directory gets too
large.

If you want to do this, enable JVM re-use as mentioned above to shorten job
start times. Also be prepared to write some shell scripts to handle some
cleanup tasks.
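
A rough sketch of the kind of cleanup script meant here (the history path is an
assumption; check your hadoop.job.history.location setting and adjust):

#!/bin/sh
# prune job history files older than 7 days so the directory never
# approaches the ext3 32k files-per-directory limit
HISTORY_DIR=/var/log/hadoop/history
find "$HISTORY_DIR" -type f -mtime +7 -print0 | xargs -0 -r rm -f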

Edward


Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel

2010-04-08 Thread stephen mulcahy

Hi,

I'm commissioning a new Hadoop cluster with the following spec.

45 x data nodes:
- 2 x Quad-Core AMD Opteron(tm) Processor 2378
- 16GB ram
- 4 x WDC WD1002FBYS 1TB SATA drives (configured as separate ext4 
filesystems)


3 x name nodes:
- 2 x Quad-Core AMD Opteron(tm) Processor 2378
- 32GB ram
- 2 x WDC WD1002FBYS 1TB SATA drives (in software RAID1 config and ext4 
filesystem)


All nodes are running Debian testing/squeeze.

I'm doing my benchmarking with TeraSort running as follows

hadoop jar hadoop-0.20.2-examples.jar teragen -Dmapred.map.tasks=8000 
100 /terasort/in


hadoop jar hadoop-0.20.2-examples.jar terasort -Dmapred.reduce.tasks=530 
/terasort/in /terasort/out


When I run this on the Debian 2.6.30 kernel - it runs to completion in 
about 23 minutes (occasionally running into the CPU soft lockup 
problems described in [1]). I assume that is a reasonable time for this 
benchmark to complete in?


When I run this on the Debian 2.6.32 kernel - over the course of the 
run, 1 or 2 datanodes of the cluster enter a state whereby they are no 
longer responsive to network traffic.


Logging into these nodes via the console reveals no messages in the 
log-files. Running ifdown eth0 followed by ifup eth0 brings these 
systems back online. The systems that become unresponsive vary from run 
to run suggesting this is not a h/w problem specific to certain nodes.


I have raised this issue with the Debian kernel team[2] and have tested
various system and switch changes in an attempt to identify the cause -
but without success.

Has anyone run into similar problems with their environments? I noticed 
that when the nodes become unresponsive, it is often when the 
TeraSort is at


map 100%, reduce 78%

Is there any significance to that?

Any feedback welcome (including comments on what distro/kernel 
combinations others are using).


Thanks,

-stephen

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556030
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=572201

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Re: Errors reading lzo-compressed files from Hadoop

2010-04-08 Thread Todd Lipcon
Doh, a couple more silly bugs in there. Don't use that version quite yet -
I'll put up a better patch later today. (Thanks to Kevin and Ted Yu for
pointing out the additional problems)

-Todd

On Wed, Apr 7, 2010 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:

 For Dmitriy and anyone else who has seen this error, I just committed a fix
 to my github repository:


 http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58

 The problem turned out to be an assumption that InputStream.read() would
 return all the bytes that were asked for. This turns out to almost always be
 true on local filesystems, but on HDFS it's not true if the read crosses a
 block boundary. So, every couple of TB of lzo compressed data one might see
 this error.
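
 The general shape of such a fix is a read loop rather than a single read()
 call; a minimal illustration (not the actual hadoop-lzo patch):

 import java.io.IOException;
 import java.io.InputStream;

 // Keep reading until 'len' bytes have arrived or EOF is reached. A single
 // in.read(buf, off, len) may legally return fewer bytes than requested,
 // e.g. when the read crosses an HDFS block boundary.
 static int readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
     int total = 0;
     while (total < len) {
         int n = in.read(buf, off + total, len - total);
         if (n < 0) break;   // EOF
         total += n;
     }
     return total;
 }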

 Big thanks to Alex Roetter who was able to provide a file that exhibited
 the bug!

 Thanks
 -Todd


 On Tue, Apr 6, 2010 at 10:35 AM, Todd Lipcon t...@cloudera.com wrote:

 Hi Alex,
 Unfortunately I wasn't able to reproduce, and the data Dmitriy is
 working with is sensitive.
 Do you have some data you could upload (or send me off list) that
 exhibits the issue?
 -Todd

 On Tue, Apr 6, 2010 at 9:50 AM, Alex Roetter aroet...@imageshack.net
 wrote:
 
  Todd Lipcon t...@... writes:
 
  
   Hey Dmitriy,
  
   This is very interesting (and worrisome in a way!) I'll try to take a
 look
   this afternoon.
  
   -Todd
  
 
  Hi Todd,
 
  I wanted to see if you made any progress on this front. I'm seeing a
 very
  similar error, trying to run a MR (Hadoop 0.20.1) over a bunch of
  LZOP compressed / indexed files (using Kevin Weil's package), and I have
 one
  map task that always fails in what looks like the same place as
 described in
  the previous post. I haven't yet done the experimentation mentioned
 above
  (isolating the input file corresponding to the failed map task,
 decompressing
  it / recompressing it, testing it out operating directly on local disk
  instead of HDFS, etc).
 
  However, since I am crashing in exactly the same place it seems likely
 this
  is related, and thought I'd check on your work in the meantime.
 
  FYI, my stack track is below:
 
  2010-04-05 18:15:16,895 FATAL org.apache.hadoop.mapred.TaskTracker: Error
  running child : java.lang.InternalError: lzo1x_decompress_safe returned:
      at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
      at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
      at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:104)
      at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
      at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
      at java.io.InputStream.read(InputStream.java:85)
      at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
      at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
      at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
      at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
      at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
      at org.apache.hadoop.mapred.Child.main(Child.java:170)
 
 
  Any update much appreciated,
  Alex
 
 
 
 
 



 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Errors reading lzo-compressed files from Hadoop

2010-04-08 Thread Todd Lipcon
OK, fixed, unit tests passing again. If anyone sees any more problems let
one of us know!

Thanks
-Todd

On Thu, Apr 8, 2010 at 10:39 AM, Todd Lipcon t...@cloudera.com wrote:

 Doh, a couple more silly bugs in there. Don't use that version quite yet -
 I'll put up a better patch later today. (Thanks to Kevin and Ted Yu for
 pointing out the additional problems)

 -Todd


 On Wed, Apr 7, 2010 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:

 For Dmitriy and anyone else who has seen this error, I just committed a
 fix to my github repository:


 http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58

 The problem turned out to be an assumption that InputStream.read() would
 return all the bytes that were asked for. This turns out to almost always be
 true on local filesystems, but on HDFS it's not true if the read crosses a
 block boundary. So, every couple of TB of lzo compressed data one might see
 this error.

 Big thanks to Alex Roetter who was able to provide a file that exhibited
 the bug!

 Thanks
 -Todd


 On Tue, Apr 6, 2010 at 10:35 AM, Todd Lipcon t...@cloudera.com wrote:

 Hi Alex,
 Unfortunately I wasn't able to reproduce, and the data Dmitriy is
 working with is sensitive.
 Do you have some data you could upload (or send me off list) that
 exhibits the issue?
 -Todd

 On Tue, Apr 6, 2010 at 9:50 AM, Alex Roetter aroet...@imageshack.net
 wrote:
 
  Todd Lipcon t...@... writes:
 
  
   Hey Dmitriy,
  
   This is very interesting (and worrisome in a way!) I'll try to take a
 look
   this afternoon.
  
   -Todd
  
 
  Hi Todd,
 
  I wanted to see if you made any progress on this front. I'm seeing a
 very
  similar error, trying to run a MR (Hadoop 0.20.1) over a bunch of
  LZOP compressed / indexed files (using Kevin Weil's package), and I
 have one
  map task that always fails in what looks like the same place as
 described in
  the previous post. I haven't yet done the experimentation mentioned
 above
  (isolating the input file corresponding to the failed map task,
 decompressing
  it / recompressing it, testing it out operating directly on local disk
  instead of HDFS, etc).
 
  However, since I am crashing in exactly the same place it seems likely
 this
  is related, and thought I'd check on your work in the meantime.
 
  FYI, my stack track is below:
 
  2010-04-05 18:15:16,895 FATAL org.apache.hadoop.mapred.TaskTracker:
 Error
  running child : java.lang.InternalError: lzo1x_decompress_safe
 returned:
 at
 com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect
  (Native Method)
 at com.hadoop.compression.lzo.LzoDecompressor.decompress
  (LzoDecompressor.java:303)
 at
  com.hadoop.compression.lzo.LzopDecompressor.decompress
  (LzopDecompressor.java:104)
 at com.hadoop.compression.lzo.LzopInputStream.decompress
  (LzopInputStream.java:223)
 at
  org.apache.hadoop.io.compress.DecompressorStream.read
  (DecompressorStream.java:74)
 at java.io.InputStream.read(InputStream.java:85)
 at
 org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
 at
 org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
 at
  com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue
  (LzoLineRecordReader.java:126)
 at
  org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue
  (MapTask.java:423)
 at
 org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
 at
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 
 
  Any update much appreciated,
  Alex
 
 
 
 
 



 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Distributed Clusters

2010-04-08 Thread Allen Wittenauer

On Apr 7, 2010, at 10:50 PM, James Seigel wrote:

 I am new to this group, and relatively new to hadoop. 

Welcome to the community, James. :)

 I am looking at building a large cluster.  I was wondering if anyone has any 
 best practices for a cluster in the hundreds of nodes?

Take a look at the 'Hadoop 24/7' presentation (on the hadoop wiki preso page) I 
did for ApacheCon EU last year.  It covers a lot of the "now that I have a 
grid, what do I do?" situations.

  As well, has anyone had experience with a cluster spanning multiple data 
 centers.  Is this a bad practice? moderately bad practice?  insane?

Right now, it generally falls into the insane category unless you have REALLY 
REALLY REALLY low latency and high bandwidth.  The heartbeats between nodes, 
issues with block placement, etc., make it highly likely to saturate the link 
and/or split the cluster into multiple pieces.

 Is it better to build the 1000 node cluster in a single data center?  Do you 
 back one of these things up to a second data center or a different 1000 node 
 cluster?

We're currently going with a 'multiple grids in one data center' strategy.  Our 
'Source of Truth' data is from another source, meaning we could (theoretically) 
rebuild the grid from that source if we were to get decimated by dinosaurs.  
[That source of truth has a much better backup/dr strategy.]

 Sorry, I am asking crazy questions...I am just wanting to learn the meta 
 issues and opportunities with making clusters.

These are pretty normal questions.  We should probably create a faq or 
something on the wiki.



HOD: JobTracker failed to initialise

2010-04-08 Thread Boyu Zhang
Dear All,

I am trying to install HOD on a cluster. When I tried to allocate a new
Hadoop cluster, I got the following error:

[2010-04-08 13:47:25,304] CRITICAL/50 hadoop:303 - Cluster could not be
allocated because of the following errors.
Hodring at n0 failed with following errors:
JobTracker failed to initialise

*The log file ringmaster.log has the following message:*

[2010-04-08 13:46:22,297] DEBUG/10 ringMaster:479 - getServiceAddr name:
hdfs
[2010-04-08 13:46:22,299] DEBUG/10 ringMaster:487 - getServiceAddr service:
hodlib.GridServices.hdfs.Hdfs instance at 0x2057b758
[2010-04-08 13:46:22,300] DEBUG/10 ringMaster:504 - getServiceAddr addr
hdfs: not found

*The log file hodring.log has the following message:*

[2010-04-08 13:46:31,749] DEBUG/10 hodRing:416 - hadoopThread still == None
...
[2010-04-08 13:46:31,750] DEBUG/10 hodRing:419 - hadoop input: None
[2010-04-08 13:46:31,752] DEBUG/10 hodRing:428 - isForground: False
[2010-04-08 13:46:31,753] DEBUG/10 hodRing:440 - hadoop run status: True
[2010-04-08 13:46:31,754] DEBUG/10 hodRing:657 - Waiting for jobtracker to
initialise
[2010-04-08 13:46:31,755] DEBUG/10 hodRing:659 - jobtracker version : 20
[2010-04-08 13:46:31,756] DEBUG/10 hodRing:664 - jobtracker rpc server :
n2:59664
[2010-04-08 13:46:31,757] DEBUG/10 hodRing:670 - Jobtracker jetty : n2:57775
[2010-04-08 13:46:32,042] DEBUG/10 hodRing:713 - Jetty gave a socket error.
Sleeping for 0.5
[2010-04-08 13:46:33,544] DEBUG/10 hodRing:713 - Jetty gave a socket error.
Sleeping for 1.0
[2010-04-08 13:46:35,545] DEBUG/10 hodRing:713 - Jetty gave a socket error.
Sleeping for 2.0
[2010-04-08 13:46:38,546] DEBUG/10 hodRing:713 - Jetty gave a socket error.
Sleeping for 4.0
[2010-04-08 13:46:43,547] DEBUG/10 hodRing:713 - Jetty gave a socket error.
Sleeping for 8.0
[2010-04-08 13:46:52,548] DEBUG/10 hodRing:713 - Jetty gave a socket error.
Sleeping for 16.0
4864033937778270/hdfs-nn/dfs-name']
[2010-04-08 13:47:08,552] CRITICAL/50 hodRing:723 - Jobtracker failed to
initialise.

*The log file hadoop.log in the actual compute node n0 has: *

2010-04-08 17:47:24,424 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/scratch/hod/mapredsys/zhang/mapredsystem/
85.geronimo.gcl.cis.udel.edu/jobtracker.info could only be replicated to 0
nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

--
It looks like the HDFS daemon failed to start, so the JobTracker has nothing to
communicate with, and then Jetty gave an error.

I am using Hadoop 0.20.2 on Scyld OS; the cluster uses 0-5 (n0-n5) to refer to
the back-end compute nodes. Has anyone had this problem before? Any help will be
appreciated.
P.S. I see temp files Jetty*** generated under /tmp on the compute nodes, even
though I set all the temp dirs to /home or /scratch. Any idea why?


Here is my hod conf file:

[hod]
stream  = True
java-home =/usr
cluster = geronimo
cluster-factor  = 1.8
xrs-port-range  = 32768-65536
debug   = 4
allocate-wait-time  = 3600
temp-dir= /home/zhang/hodtmp.$PBS_JOBID

[ringmaster]
register= True
stream  = False
temp-dir= /scratch/hod/ringmastertmp.$PBS_JOBID
http-port-range = 8000-9000
work-dirs   = /scratch/hod/tmp/1,/scratch/hod/tmp/2
xrs-port-range  = 32768-65536
debug   = 4

[hodring]
stream  = False
temp-dir= /scratch/hod/hodringtmp.$PBS_JOBID
register= True
java-home   = /usr
http-port-range = 8000-9000
xrs-port-range  = 32768-65536
debug   = 4
mapred-system-dir-root  = /scratch/hod/mapredsys

[resource_manager]
queue   = batch
batch-home  = /usr
id  = torque
env-vars=
HOD_PYTHON_HOME=/opt/python/2.5.1/bin/python

[gridservice-mapred]
external= False
pkgs= /home/zhang/hadoop-0.20.2
tracker_port= 8030
info_port   = 50080

[gridservice-hdfs]
external= False
pkgs= /home/zhang/hadoop-0.20.2
fs_port = 8020
info_port   = 50070




Thanks a lot!!

Boyu


Re: hadoop on demand setup: Failed to retrieve 'hdfs' service address

2010-04-08 Thread Boyu Zhang
Hi Kevin,

I am having the same error, but my critical error is:

[2010-04-08 13:47:25,304] CRITICAL/50 hadoop:303 - Cluster could not be
allocated because of the following errors.
Hodring at n0 failed with following errors:
JobTracker failed to initialise

Have you solved this? Thanks!

Boyu

On Tue, Apr 6, 2010 at 11:32 AM, Kevin Van Workum v...@sabalcore.comwrote:

 [sorry for the double posting (to general), but I think this list is
 the appropriate place for this message]

 Hello,

 I'm trying to setup hadoop on demand (HOD) on my cluster. I'm
 currently unable to allocate cluster. I'm starting hod with the
 following command:

 /usr/local/hadoop-0.20.2/hod/bin/hod -c
 /usr/local/hadoop-0.20.2/hod/conf/hodrc -t
 /b/01/vanw/hod/hadoop-0.20.2.tar.gz -o allocate ~/hod 3
 --ringmaster.log-dir=/tmp -b 4

 The job starts on the nodes and I see the ringmaster running on the
 MotherSuperior. The ringmaster-main.log file is created and contains:

 [2010-04-06 11:18:29,036] DEBUG/10 ringMaster:487 - getServiceAddr
 service: hodlib.GridServices.mapred.MapReduce instance at 0x12b42518
 [2010-04-06 11:18:29,038] DEBUG/10 ringMaster:504 - getServiceAddr
 addr mapred: not found
 [2010-04-06 10:47:43,183] DEBUG/10 ringMaster:479 - getServiceAddr name:
 hdfs
 [2010-04-06 10:47:43,184] DEBUG/10 ringMaster:487 - getServiceAddr
 service: hodlib.GridServices.hdfs.Hdfs instance at 0x122d24d0
 [2010-04-06 10:47:43,186] DEBUG/10 ringMaster:504 - getServiceAddr
 addr hdfs: not found

 I don't see any associated processes running on the other 2 nodes in
 the job.

 The critical errors are as follows:

 [2010-04-06 10:34:13,630] CRITICAL/50 hadoop:298 - Failed to retrieve
 'hdfs' service address.
 [2010-04-06 10:34:13,631] DEBUG/10 hadoop:631 - Cleaning up cluster id
 238366.jman, as cluster could not be allocated.
 [2010-04-06 10:34:13,632] DEBUG/10 hadoop:635 - Calling rm.stop()
 [2010-04-06 10:34:13,639] DEBUG/10 hadoop:637 - Returning from rm.stop()
 [2010-04-06 10:34:13,639] CRITICAL/50 hod:401 - Cannot allocate
 cluster /b/01/vanw/hod
 [2010-04-06 10:34:14,149] DEBUG/10 hod:597 - return code: 7

 The contents of the hodrc file is:

 [hod]
 stream  = True
 java-home   = /usr/local/jdk1.6.0_02
 cluster = orange
 cluster-factor  = 1.8
 xrs-port-range  = 32768-65536
 debug   = 4
 allocate-wait-time  = 3600
 temp-dir= /tmp/hod

 [ringmaster]
 register= True
 stream  = False
 temp-dir= /tmp/hod
 http-port-range = 8000-9000
 work-dirs   = /tmp/hod/1,/tmp/hod/2
 xrs-port-range  = 32768-65536
 debug   = 4

 [hodring]
 stream  = False
 temp-dir= /tmp/hod
 register= True
 java-home   = /usr/local/jdk1.6.0_02
 http-port-range = 8000-9000
 xrs-port-range  = 32768-65536
 debug   = 4

 [resource_manager]
 queue   = dque
 batch-home  = /usr/local/torque-2.3.7
 id  = torque
 env-vars   =
 HOD_PYTHON_HOME=/usr/local/python-2.5.5/bin/python

 [gridservice-mapred]
 external= False
 tracker_port= 8030
 info_port   = 50080

 [gridservice-hdfs]
 external= False
 fs_port = 8020
 info_port   = 50070


 Some other useful information:
 Linux 2.6.18-128.7.1.el5
 Python 2.5.5
 Twisted 10.0.0
 zope 3.3.0
 java version 1.6.0_02
 hadoop version 0.20.2



 --
 Kevin Van Workum, PhD
 Sabalcore Computing Inc.
 Run your code on 500 processors.
 Sign up for a free trial account.
 www.sabalcore.com
 877-492-8027 ext. 11



Re: Errors reading lzo-compressed files from Hadoop

2010-04-08 Thread Dmitriy Ryaboy
Both Kevin's and Todd's branches now pass my tests. Thanks again Todd.

-D

On Thu, Apr 8, 2010 at 10:46 AM, Todd Lipcon t...@cloudera.com wrote:
 OK, fixed, unit tests passing again. If anyone sees any more problems let
 one of us know!

 Thanks
 -Todd

 On Thu, Apr 8, 2010 at 10:39 AM, Todd Lipcon t...@cloudera.com wrote:

 Doh, a couple more silly bugs in there. Don't use that version quite yet -
 I'll put up a better patch later today. (Thanks to Kevin and Ted Yu for
 pointing out the additional problems)

 -Todd


 On Wed, Apr 7, 2010 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:

 For Dmitriy and anyone else who has seen this error, I just committed a
 fix to my github repository:


 http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58

 The problem turned out to be an assumption that InputStream.read() would
 return all the bytes that were asked for. This turns out to almost always be
 true on local filesystems, but on HDFS it's not true if the read crosses a
 block boundary. So, every couple of TB of lzo compressed data one might see
 this error.

 Big thanks to Alex Roetter who was able to provide a file that exhibited
 the bug!

 Thanks
 -Todd


 On Tue, Apr 6, 2010 at 10:35 AM, Todd Lipcon t...@cloudera.com wrote:

 Hi Alex,
 Unfortunately I wasn't able to reproduce, and the data Dmitriy is
 working with is sensitive.
 Do you have some data you could upload (or send me off list) that
 exhibits the issue?
 -Todd

 On Tue, Apr 6, 2010 at 9:50 AM, Alex Roetter aroet...@imageshack.net
 wrote:
 
  Todd Lipcon t...@... writes:
 
  
   Hey Dmitriy,
  
   This is very interesting (and worrisome in a way!) I'll try to take a
 look
   this afternoon.
  
   -Todd
  
 
  Hi Todd,
 
  I wanted to see if you made any progress on this front. I'm seeing a
 very
  similar error, trying to run a MR (Hadoop 0.20.1) over a bunch of
  LZOP compressed / indexed files (using Kevin Weil's package), and I
 have one
  map task that always fails in what looks like the same place as
 described in
  the previous post. I haven't yet done the experimentation mentioned
 above
  (isolating the input file corresponding to the failed map task,
 decompressing
  it / recompressing it, testing it out operating directly on local disk
  instead of HDFS, etc).
 
  However, since I am crashing in exactly the same place it seems likely
 this
  is related, and thought I'd check on your work in the meantime.
 
  FYI, my stack track is below:
 
  2010-04-05 18:15:16,895 FATAL org.apache.hadoop.mapred.TaskTracker:
 Error
  running child : java.lang.InternalError: lzo1x_decompress_safe
 returned:
         at
 com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect
  (Native Method)
         at com.hadoop.compression.lzo.LzoDecompressor.decompress
  (LzoDecompressor.java:303)
         at
  com.hadoop.compression.lzo.LzopDecompressor.decompress
  (LzopDecompressor.java:104)
         at com.hadoop.compression.lzo.LzopInputStream.decompress
  (LzopInputStream.java:223)
         at
  org.apache.hadoop.io.compress.DecompressorStream.read
  (DecompressorStream.java:74)
         at java.io.InputStream.read(InputStream.java:85)
         at
 org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
         at
 org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
         at
  com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue
  (LzoLineRecordReader.java:126)
         at
  org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue
  (MapTask.java:423)
         at
 org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
         at
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
         at org.apache.hadoop.mapred.Child.main(Child.java:170)
 
 
  Any update much appreciated,
  Alex
 
 
 
 
 



 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera




 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Reduce gets struck at 99%

2010-04-08 Thread Eric Arenas
Yes Raghava,

I have experienced that issue before, and the solution that you mentioned also 
solved my issue (adding a context.progress() or context.setStatus() call to tell 
the JobTracker that my jobs are still running).

regards
 Eric Arenas





From: Raghava Mutharaju m.vijayaragh...@gmail.com
To: common-user@hadoop.apache.org; mapreduce-u...@hadoop.apache.org
Sent: Thu, April 8, 2010 10:30:49 AM
Subject: Reduce gets struck at 99%

Hello all,

         I got the time out error as mentioned below -- after 600 seconds, that 
attempt was killed and the attempt would be deemed a failure. I searched around 
about this error, and one of the suggestions was to include progress statements 
in the reducer -- it might be taking longer than 600 seconds and so timing 
out. I added calls to context.progress() and context.setStatus(str) in the 
reducer. Now, it works fine -- there are no timeout errors.
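
For anyone hitting the same timeout, the calls look roughly like this inside the
reduce method (the class and key/value types here are placeholders, not my actual
job):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ProgressReporting {
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long seen = 0;
            for (Text v : values) {
                // ... long-running per-value work here ...
                if (++seen % 10000 == 0) {
                    context.setStatus("processed " + seen + " values for " + key);
                    context.progress();   // tells the framework this attempt is still alive
                }
            }
            // context.write(key, result);
        }
    }
}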

         But, for a few jobs, it takes an awfully long time to move from Map 
100%, Reduce 99% to Reduce 100%. For some jobs it's 15 mins and for some it was 
more than an hour. The reduce code is not complex -- a 2-level loop and a couple 
of if-else blocks. The input size is also not huge; for the job that gets stuck 
for an hour at reduce 99%, it would take in 130. Some of them are 1-3 MB in 
size and a couple of them are 16MB in size. 

 Has anyone encountered this problem before? Any pointers? I use Hadoop 
0.20.2 on a linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.


On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju m.vijayaragh...@gmail.com 
wrote:

Hi all,

   I am running a series of jobs one after another. While executing the 
 4th job, the job fails. It fails in the reducer --- the progress percentage 
 would be map 100%, reduce 99%. It gives out the following message

10/04/01 01:04:15 INFO mapred.JobClient: Task Id : 
attempt_201003240138_0110_r_18_1, Status : FAILED 
Task attempt_201003240138_0110_r_18_1 failed to report status for 602 
seconds. Killing!

It makes several attempts again to execute it but fails with similar message. 
I couldn't get anything from this error message and wanted to look at logs 
(located in the default dir of ${HADOOP_HOME/logs}). But I don't find any 
files which match the timestamp of the job. Also I did not find history and 
userlogs in the logs folder. Should I look at some other place for the logs? 
What could be the possible causes for the above error?

   I am using Hadoop 0.20.2 and I am running it on a cluster with 16 nodes.

Thank you.

Regards,
Raghava.



Re: Reduce gets struck at 99%

2010-04-08 Thread prashant ullegaddi
Dear Raghava,

I also faced this problem. It mostly happens when the computation for the data
that a reduce received takes more time
than the default time-out of 600s allows. You can also
increase the time-out to ensure that all reduces
complete by setting the property mapred.task.timeout.
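
The setting goes in the job configuration or mapred-site.xml; the value is in
milliseconds (600000 is the default, 0 disables the timeout):

<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>  <!-- 30 minutes -->
</property>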


On Thu, Apr 8, 2010 at 11:57 PM, Eric Arenas eare...@rocketmail.com wrote:

 Yes Raghava,

 I have experience that issue before, and the solution that you mentioned
 also solved my issue (adding a context.progress or setcontext to tell the JT
 that my jobs are still running)

 regards
  Eric Arenas




 
 From: Raghava Mutharaju m.vijayaragh...@gmail.com
 To: common-user@hadoop.apache.org; mapreduce-u...@hadoop.apache.org
 Sent: Thu, April 8, 2010 10:30:49 AM
 Subject: Reduce gets struck at 99%

 Hello all,

 I got the time out error as mentioned below -- after 600 seconds,
 that attempt was killed and the attempt would be deemed a failure. I
 searched around about this error, and one of the suggestions to include
 progress statements in the reducer -- it might be taking longer than 600
 seconds and so is timing out. I added calls to context.progress() and
 context.setStatus(str) in the reducer. Now, it works fine -- there are no
 timeout errors.

 But, for a few jobs, it takes awfully long time to move from Map
 100%, Reduce 99% to Reduce 100%. For some jobs its 15mins and for some it
 was more than an hour. The reduce code is not complex -- 2 level loop and
 couple of if-else blocks. The input size is also not huge, for the job that
 gets struck for an hour at reduce 99%, it would take in 130. Some of them
 are 1-3 MB in size and couple of them are 16MB in size.

 Has anyone encountered this problem before? Any pointers? I use
 Hadoop 0.20.2 on a linux cluster of 16 nodes.

 Thank you.

 Regards,
 Raghava.


 On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju 
 m.vijayaragh...@gmail.com wrote:

 Hi all,
 
I am running a series of jobs one after another. While executing
 the 4th job, the job fails. It fails in the reducer --- the progress
 percentage would be map 100%, reduce 99%. It gives out the following message
 
 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
 attempt_201003240138_0110_r_18_1, Status : FAILED
 Task attempt_201003240138_0110_r_18_1 failed to report status for 602
 seconds. Killing!
 
 It makes several attempts again to execute it but fails with similar
 message. I couldn't get anything from this error message and wanted to look
 at logs (located in the default dir of ${HADOOP_HOME/logs}). But I don't
 find any files which match the timestamp of the job. Also I did not find
 history and userlogs in the logs folder. Should I look at some other place
 for the logs? What could be the possible causes for the above error?
 
I am using Hadoop 0.20.2 and I am running it on a cluster with 16
 nodes.
 
 Thank you.
 
 Regards,
 Raghava.
 




-- 
Thanks and Regards,
Prashant Ullegaddi,
Search and Information Extraction Lab,
IIIT-Hyderabad, India.


Job report on JobTracker

2010-04-08 Thread Sanel Zukan
Hi all,

I'm working on a larger application that utilizes Hadoop for some
crunching tasks, and that utilization is
done via the new job API (Job/Configuration). I've noticed that
running/completed jobs are not visible in
the JobTracker web view, nor are they displayed via 'hadoop job -list all', when
they are started this way
(MR jobs are started from within the application, roughly as sketched below).
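
A minimal sketch of that submission path (identity map/reduce used only to keep
the example self-contained; my real job classes differ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up *-site.xml from the classpath, if present
        Job job = new Job(conf, "crunch-task");
        job.setJarByClass(SubmitSketch.class);
        job.setMapperClass(Mapper.class);         // identity mapper
        job.setReducerClass(Reducer.class);       // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}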

On the other hand, jobs run via the hadoop command (e.g. 'hadoop jar
myjob.jar') are correctly
visible with their current state.

Am I missing something? Or is it possible to somehow notify the JobTracker
about a started job?

Thanks :)

Best,
Sanel


Re: Reduce gets struck at 99%

2010-04-08 Thread Raghava Mutharaju
Hi,

 Thank you Eric, Prashant and Greg. Although the timeout problem was
resolved, reduce is getting stuck at 99%. As of now, it has been stuck there
for about 3 hrs. That is too high a wait time for my task. Do you guys see
any reason for this?

  Speculative execution is on by default right? Or should I enable it?

Regards,
Raghava.

On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence gr...@yahoo-inc.comwrote:

  Hi,

 I have also experienced this problem. Have you tried speculative execution?
 Also, I have had jobs that took a long time for one mapper / reducer because
 of a record that was significantly larger than those contained in the other
 filesplits. Do you know if it always slows down for the same filesplit?

 Regards,
 Greg Lawrence


 On 4/8/10 10:30 AM, Raghava Mutharaju m.vijayaragh...@gmail.com wrote:

 Hello all,

  I got the time out error as mentioned below -- after 600 seconds,
 that attempt was killed and the attempt would be deemed a failure. I
 searched around about this error, and one of the suggestions to include
 progress statements in the reducer -- it might be taking longer than 600
 seconds and so is timing out. I added calls to context.progress() and
 context.setStatus(str) in the reducer. Now, it works fine -- there are no
 timeout errors.

  But, for a few jobs, it takes awfully long time to move from Map
 100%, Reduce 99% to Reduce 100%. For some jobs its 15mins and for some it
 was more than an hour. The reduce code is not complex -- 2 level loop and
 couple of if-else blocks. The input size is also not huge, for the job that
 gets struck for an hour at reduce 99%, it would take in 130. Some of them
 are 1-3 MB in size and couple of them are 16MB in size.

  Has anyone encountered this problem before? Any pointers? I use
 Hadoop 0.20.2 on a linux cluster of 16 nodes.

 Thank you.

 Regards,
 Raghava.

 On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju 
 m.vijayaragh...@gmail.com wrote:

 Hi all,

I am running a series of jobs one after another. While executing the
 4th job, the job fails. It fails in the reducer --- the progress percentage
 would be map 100%, reduce 99%. It gives out the following message

 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
 attempt_201003240138_0110_r_18_1, Status : FAILED
 Task attempt_201003240138_0110_r_18_1 failed to report status for 602
 seconds. Killing!

 It makes several attempts again to execute it but fails with similar
 message. I couldn't get anything from this error message and wanted to look
 at logs (located in the default dir of ${HADOOP_HOME/logs}). But I don't
 find any files which match the timestamp of the job. Also I did not find
 history and userlogs in the logs folder. Should I look at some other place
 for the logs? What could be the possible causes for the above error?

I am using Hadoop 0.20.2 and I am running it on a cluster with 16
 nodes.

 Thank you.

 Regards,
 Raghava.






Re: Reduce gets struck at 99%

2010-04-08 Thread Gregory Lawrence
Hi,

I have also experienced this problem. Have you tried speculative execution? 
Also, I have had jobs that took a long time for one mapper / reducer because of 
a record that was significantly larger than those contained in the other 
filesplits. Do you know if it always slows down for the same filesplit?

Regards,
Greg Lawrence

On 4/8/10 10:30 AM, Raghava Mutharaju m.vijayaragh...@gmail.com wrote:

Hello all,

 I got the time out error as mentioned below -- after 600 seconds, that 
attempt was killed and the attempt would be deemed a failure. I searched around 
about this error, and one of the suggestions to include progress statements 
in the reducer -- it might be taking longer than 600 seconds and so is timing 
out. I added calls to context.progress() and context.setStatus(str) in the 
reducer. Now, it works fine -- there are no timeout errors.

 But, for a few jobs, it takes awfully long time to move from Map 
100%, Reduce 99% to Reduce 100%. For some jobs its 15mins and for some it was 
more than an hour. The reduce code is not complex -- 2 level loop and couple of 
if-else blocks. The input size is also not huge, for the job that gets struck 
for an hour at reduce 99%, it would take in 130. Some of them are 1-3 MB in 
size and couple of them are 16MB in size.

 Has anyone encountered this problem before? Any pointers? I use Hadoop 
0.20.2 on a linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.

On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju m.vijayaragh...@gmail.com 
wrote:
Hi all,

   I am running a series of jobs one after another. While executing the 4th 
job, the job fails. It fails in the reducer --- the progress percentage would 
be map 100%, reduce 99%. It gives out the following message

10/04/01 01:04:15 INFO mapred.JobClient: Task Id : 
attempt_201003240138_0110_r_18_1, Status : FAILED
Task attempt_201003240138_0110_r_18_1 failed to report status for 602 
seconds. Killing!

It makes several attempts again to execute it but fails with similar message. I 
couldn't get anything from this error message and wanted to look at logs 
(located in the default dir of ${HADOOP_HOME/logs}). But I don't find any files 
which match the timestamp of the job. Also I did not find history and userlogs 
in the logs folder. Should I look at some other place for the logs? What could 
be the possible causes for the above error?

   I am using Hadoop 0.20.2 and I am running it on a cluster with 16 nodes.

Thank you.

Regards,
Raghava.




RE: Distributed Clusters

2010-04-08 Thread Michael Segel


  
  Is it better to build the 1000 node cluster in a single data center?  
 
 yes.
 
 Do you back one of these things up to a second data center or a different 
 1000 node cluster?
 

If you're building your cluster on the West Coast, yes, you had best concern 
yourself with Earthquakes, Rolling Blackouts and of course the ever present 
volcanic activity. ;-) In the Midwest? Not so much. Just some potential 
revolutionary, right wing conspiracy nut cases in Michigan and Northern 
Indiana. ;-)  (Ok, we do have tornadoes, floods, and in Chicago plagues of 
tourists. :-)  So do what Google and Microsoft are doing. Building out data 
centers at 'undisclosed' locations around Chicago. :-)

Ok... on a more serious note...

I think the question of building out two clusters in two data centers only makes 
sense if you are worried about disaster recovery. Then yes, two clusters in 
different locations make sense.

Your two clusters are then independent, and you have to work out how to keep them 
in sync.
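
One common way to do that sync is a scheduled distcp run between the clusters; a
sketch, with namenode hostnames and paths as placeholders:

# copy (and update) /data from the primary cluster to the DR cluster
hadoop distcp -update hdfs://nn-dc1:8020/data hdfs://nn-dc2:8020/data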

If you were thinking of having one cloud span 2 data centers? Not really a good 
idea.

HTH

-Mike



  
_
Hotmail has tools for the New Busy. Search, chat and e-mail from your inbox.
http://www.windowslive.com/campaign/thenewbusy?ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_1

Re: hadoop on demand setup: Failed to retrieve 'hdfs' service address

2010-04-08 Thread Kevin Van Workum
On Thu, Apr 8, 2010 at 2:23 PM, Boyu Zhang boyuzhan...@gmail.com wrote:
 Hi Kevin,

 I am having the same error, but my critical error is:

 [2010-04-08 13:47:25,304] CRITICAL/50 hadoop:303 - Cluster could not be
 allocated because of the following errors.
 Hodring at n0 failed with following errors:
 JobTracker failed to initialise

 Have you solved this? Thanks!

Yes, I was about to post my solution. In my case the issue was that
the default log-dir is the log directory under the HOD
installation. Since I didn't have permission to write to this
directory, HDFS couldn't initialize. Setting log-dir = logs for
[hod], [ringmaster], [hodring], [gridservice-mapred], and
[gridservice-hdfs] in hodrc fixed the problem by writing the logs to
the logs directory under the CWD.
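
In hodrc terms, the change is just a line like the following in each of those
sections (a relative 'logs' resolves under the current working directory):

[hod]
log-dir  = logs

[ringmaster]
log-dir  = logs

[hodring]
log-dir  = logs

[gridservice-mapred]
log-dir  = logs

[gridservice-hdfs]
log-dir  = logs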

Also, I have managed to get HOD to use the hod.cluster setting from
hodrc to set the node properties for the qsub command. I'm going to
clean up my modifications and post it in the next day or two.

Kevin


 Boyu

 On Tue, Apr 6, 2010 at 11:32 AM, Kevin Van Workum v...@sabalcore.comwrote:

 [sorry for the double posting (to general), but I think this list is
 the appropriate place for this message]

 Hello,

 I'm trying to setup hadoop on demand (HOD) on my cluster. I'm
 currently unable to allocate cluster. I'm starting hod with the
 following command:

 /usr/local/hadoop-0.20.2/hod/bin/hod -c
 /usr/local/hadoop-0.20.2/hod/conf/hodrc -t
 /b/01/vanw/hod/hadoop-0.20.2.tar.gz -o allocate ~/hod 3
 --ringmaster.log-dir=/tmp -b 4

 The job starts on the nodes and I see the ringmaster running on the
 MotherSuperior. The ringmaster-main.log file is created and contains:

 [2010-04-06 11:18:29,036] DEBUG/10 ringMaster:487 - getServiceAddr
 service: hodlib.GridServices.mapred.MapReduce instance at 0x12b42518
 [2010-04-06 11:18:29,038] DEBUG/10 ringMaster:504 - getServiceAddr
 addr mapred: not found
 [2010-04-06 10:47:43,183] DEBUG/10 ringMaster:479 - getServiceAddr name:
 hdfs
 [2010-04-06 10:47:43,184] DEBUG/10 ringMaster:487 - getServiceAddr
 service: hodlib.GridServices.hdfs.Hdfs instance at 0x122d24d0
 [2010-04-06 10:47:43,186] DEBUG/10 ringMaster:504 - getServiceAddr
 addr hdfs: not found

 I don't see any associated processes running on the other 2 nodes in
 the job.

 The critical errors are as follows:

 [2010-04-06 10:34:13,630] CRITICAL/50 hadoop:298 - Failed to retrieve
 'hdfs' service address.
 [2010-04-06 10:34:13,631] DEBUG/10 hadoop:631 - Cleaning up cluster id
 238366.jman, as cluster could not be allocated.
 [2010-04-06 10:34:13,632] DEBUG/10 hadoop:635 - Calling rm.stop()
 [2010-04-06 10:34:13,639] DEBUG/10 hadoop:637 - Returning from rm.stop()
 [2010-04-06 10:34:13,639] CRITICAL/50 hod:401 - Cannot allocate
 cluster /b/01/vanw/hod
 [2010-04-06 10:34:14,149] DEBUG/10 hod:597 - return code: 7

 The contents of the hodrc file is:

 [hod]
 stream                          = True
 java-home                       = /usr/local/jdk1.6.0_02
 cluster                         = orange
 cluster-factor                  = 1.8
 xrs-port-range                  = 32768-65536
 debug                           = 4
 allocate-wait-time              = 3600
 temp-dir                        = /tmp/hod

 [ringmaster]
 register                        = True
 stream                          = False
 temp-dir                        = /tmp/hod
 http-port-range                 = 8000-9000
 work-dirs                       = /tmp/hod/1,/tmp/hod/2
 xrs-port-range                  = 32768-65536
 debug                           = 4

 [hodring]
 stream                          = False
 temp-dir                        = /tmp/hod
 register                        = True
 java-home                       = /usr/local/jdk1.6.0_02
 http-port-range                 = 8000-9000
 xrs-port-range                  = 32768-65536
 debug                           = 4

 [resource_manager]
 queue                           = dque
 batch-home                      = /usr/local/torque-2.3.7
 id                              = torque
 env-vars                       =
 HOD_PYTHON_HOME=/usr/local/python-2.5.5/bin/python

 [gridservice-mapred]
 external                        = False
 tracker_port                    = 8030
 info_port                       = 50080

 [gridservice-hdfs]
 external                        = False
 fs_port                         = 8020
 info_port                       = 50070


 Some other useful information:
 Linux 2.6.18-128.7.1.el5
 Python 2.5.5
 Twisted 10.0.0
 zope 3.3.0
 java version 1.6.0_02
 hadoop version 0.20.2



 --
 Kevin Van Workum, PhD
 Sabalcore Computing Inc.
 Run your code on 500 processors.
 Sign up for a free trial account.
 www.sabalcore.com
 877-492-8027 ext. 11





-- 
Kevin Van Workum, PhD
Sabalcore Computing Inc.
Run your code on 500 processors.
Sign up for a free trial account.
www.sabalcore.com
877-492-8027 ext. 11


Re: hadoop on demand setup: Failed to retrieve 'hdfs' service address

2010-04-08 Thread Boyu Zhang
Thanks for the reply. I checked my logs some more and found that
sometimes the hdfs address is the correct one.

But in the jobtracker log, there is an error:

file /data/mapredsys/zhang~~~/.info can only be replicated on 0 nodes
instead of 1
...
DFS is not ready...


And when I check the file, the whole dir is not there. Also, do you know how to
check the namenode/datanode logs? I can't find them anywhere. Thanks a lot!

Boyu

On Thu, Apr 8, 2010 at 4:58 PM, Kevin Van Workum v...@sabalcore.com wrote:

 On Thu, Apr 8, 2010 at 2:23 PM, Boyu Zhang boyuzhan...@gmail.com wrote:
  Hi Kevin,
 
  I am having the same error, but my critical error is:
 
  [2010-04-08 13:47:25,304] CRITICAL/50 hadoop:303 - Cluster could not be
  allocated because of the following errors.
  Hodring at n0 failed with following errors:
  JobTracker failed to initialise
 
  Have you solved this? Thanks!

 Yes, I was about to post my solution. In my case the issue was that
 the default log-dir is to use the log directory under the HOD
 installation. Since I didn't have permissions to write to this
 directory, the hdfs couldn't initailize. Setting log-dir = logs for
 [hod], [ringmaster], [hodring], [gridservice-mapred], and
 [gridservice-hdfs] in hodrc fixed the problem by writing the logs to
 the logs directory under the CWD.

 Also, I have managed to get HOD to use the hod.cluster setting from
 hodrc to set the node properties for the qsub command. I'm going to
 clean up my modifications and post it in the next day or two.

 Kevin




Re: Reduce gets struck at 99%

2010-04-08 Thread Raghava Mutharaju
Hi Ted,

Thank you for all the suggestions. I went through the job tracker
logs and I have attached the exceptions found in the logs. I found two
exceptions

1) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
complete write to file(DFS Client)

2) org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_14_0/part-r-00014
File does not exist. Holder DFSClient_attempt_201004060646_0057_r_14_0
does not have any open files.


The exception occurs at the point of writing out the K,V pairs in the reducer,
and it occurs only in certain task attempts. I am not using any custom
output format or record writers, but I do use a custom input reader.

What could have gone wrong here?

Thank you.

Regards,
Raghava.


On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu yuzhih...@gmail.com wrote:

 Raghava:
 Are you able to share the last segment of reducer log ?
 You can get them from web UI:

 http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_03_0start=-8193

 Adding more log in your reducer task would help pinpoint where the issue
 is.
 Also look in job tracker log.

 Cheers

 On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju 
 m.vijayaragh...@gmail.com
  wrote:

  Hi Ted,
 
   Thank you for the suggestion. I enabled it using the Configuration
  class because I cannot change hadoop-site.xml file (I am not an admin).
 The
  situation is still the same --- it gets stuck at reduce 99% and does not
  move further.
 
  Regards,
  Raghava.
 
  On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu yuzhih...@gmail.com wrote:
 
    You need to turn it on yourself (hadoop-site.xml):

    <property>
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>true</value>
    </property>

    <property>
      <name>mapred.map.tasks.speculative.execution</name>
      <value>true</value>
    </property>
  
  
   On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju 
   m.vijayaragh...@gmail.com
wrote:
  
Hi,
   
Thank you Eric, Prashant and Greg. Although the timeout problem
 was
resolved, reduce is getting stuck at 99%. As of now, it has been
 stuck
there
for about 3 hrs. That is too high a wait time for my task. Do you
 guys
   see
any reason for this?
   
 Speculative execution is on by default right? Or should I
 enable
   it?
   
Regards,
Raghava.
   
On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence 
 gr...@yahoo-inc.com
wrote:
   
  Hi,

 I have also experienced this problem. Have you tried speculative
execution?
 Also, I have had jobs that took a long time for one mapper /
 reducer
because
 of a record that was significantly larger than those contained in
 the
other
 filesplits. Do you know if it always slows down for the same
  filesplit?

 Regards,
 Greg Lawrence


 On 4/8/10 10:30 AM, Raghava Mutharaju m.vijayaragh...@gmail.com
 
wrote:

 Hello all,

  I got the time out error as mentioned below -- after 600
seconds,
 that attempt was killed and the attempt would be deemed a failure.
 I
 searched around about this error, and one of the suggestions to
  include
 progress statements in the reducer -- it might be taking longer
  than
600
 seconds and so is timing out. I added calls to context.progress()
 and
 context.setStatus(str) in the reducer. Now, it works fine -- there
  are
   no
 timeout errors.

  But, for a few jobs, it takes awfully long time to move
 from
Map
 100%, Reduce 99% to Reduce 100%. For some jobs its 15mins and for
  some
it
 was more than an hour. The reduce code is not complex -- 2 level
 loop
   and
 couple of if-else blocks. The input size is also not huge, for the
  job
that
 gets struck for an hour at reduce 99%, it would take in 130. Some
 of
   them
 are 1-3 MB in size and couple of them are 16MB in size.

  Has anyone encountered this problem before? Any pointers?
 I
   use
 Hadoop 0.20.2 on a linux cluster of 16 nodes.

 Thank you.

 Regards,
 Raghava.

 On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju 
 m.vijayaragh...@gmail.com wrote:

 Hi all,

I am running a series of jobs one after another. While
  executing
the
 4th job, the job fails. It fails in the reducer --- the progress
percentage
 would be map 100%, reduce 99%. It gives out the following message

 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
 attempt_201003240138_0110_r_18_1, Status : FAILED
 Task attempt_201003240138_0110_r_18_1 failed to report status
 for
   602
 seconds. Killing!

 It makes several attempts again to execute it but fails with
 similar
 message. I couldn't get anything from this error message and wanted
  to
look