Re: Hadoop overhead

2010-04-09 Thread Aleksandar Stupar
Thank you very much for all the answers. 

I will definitely try using hadoop. I hope that
the results will be good. 

 Kind regards,
Aleksandar Stupar.






From: Edward Capriolo edlinuxg...@gmail.com
To: common-user@hadoop.apache.org
Sent: Thu, April 8, 2010 5:28:00 PM
Subject: Re: Hadoop overhead

On Thu, Apr 8, 2010 at 10:51 AM, Patrick Angeles patr...@cloudera.com wrote:

 Packaging the job and config and sending it to the JobTracker and various
 nodes also adds a few seconds overhead.

 On Thu, Apr 8, 2010 at 10:37 AM, Jeff Zhang zjf...@gmail.com wrote:

  By default, for each task hadoop will create a new jvm process, which will
  be the major cost in my opinion. You can customize the configuration to let
  the tasktracker reuse the jvm to eliminate the overhead to some extent.
 
  On Thu, Apr 8, 2010 at 8:55 PM, Aleksandar Stupar 
  stupar.aleksan...@yahoo.com wrote:
 
   Hi all,
  
   As I realize, hadoop is mainly used for tasks that take a long
   time to execute. I'm considering using hadoop for tasks
   whose lower bound in distributed execution is around 5 to 10
   seconds. I am wondering what the overhead would be with
   using hadoop.
  
   Does anyone have an idea? Any link where I can find this out?
  
   Thanks,
   Aleksandar.
  
  
  
 
 
 
 
  --
  Best Regards
 
  Jeff Zhang
 


All jobs make entries in a jobhistory directory on the task tracker. As of
now the jobhistory directory has some limitations: with ext3 you hit the
maximum number of files in a directory at 32k; with xfs or ext4 there is no
theoretical limit, but hadoop itself will bog down if the directory gets too
large.

If you want to do this, enable JVM re-use as mentioned above to shorten job
start times. Also be prepared to write some shell scripts to handle some
cleanup tasks.
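
A minimal sketch of a 0.20-style job driver with JVM re-use turned on (the
class name and input/output paths are only placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ShortJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ShortJobDriver.class);
    conf.setJobName("short-job");

    // -1 = reuse each task JVM for any number of tasks of this job
    // (mapred.job.reuse.jvm.num.tasks), avoiding the per-task JVM
    // startup cost that dominates very short jobs.
    conf.setNumTasksToExecutePerJvm(-1);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The same property (mapred.job.reuse.jvm.num.tasks) can also be given a
cluster-wide default in mapred-site.xml.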

Edward



  

RE: Hadoop and BDB Java edition

2010-04-09 Thread Sagar Shukla
Hi Lamchith,
 There are a couple of direct solutions available for Voldemort and Hadoop 
integration, e.g. 
http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop/ . These do not 
require the BDB Java edition.

Does this help with your project?

Thanks,
Sagar

-Original Message-
From: lamchith.chathuku...@wipro.com [mailto:lamchith.chathuku...@wipro.com]
Sent: Friday, April 09, 2010 10:40 AM
To: common-user@hadoop.apache.org
Subject: Hadoop and BDB Java edition

Is it advisable to create the BDB file of the BDB Java edition using Hadoop?
I know that the read-only store data and index files for Voldemort can be
generated using Hadoop. As we are using BDB rather than the read-only store
for Voldemort, I have this requirement.



Regards,

Lamchith






RE: Hadoop and BDB Java edition

2010-04-09 Thread lamchith.chathukutty
Hi Sagar,

Thank you for your reply.

I have seen that and it is not suited for me. It talks about creating a 
read-only store using hadoop. What I want to know is whether the same can be 
done for BDB JE, i.e. creating the .BDB file using Hadoop.

I am aware of the following issue if you are going to use the BDB JE API:

com.sleepycat.je.Environment needs a java.io.File as input, so that the
location of the database can be given. But in HDFS an
org.apache.hadoop.fs.Path is used to give the location of the .index and
.data files for the read-only store, as in the link in the mail quoted below.
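
One possible workaround (only a sketch, not something I have tried; the paths
and store name below are placeholders) would be to build the environment on
the task's local disk and copy the finished files into HDFS afterwards:

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalBdbBuild {
  public static void main(String[] args) throws Exception {
    // BDB JE insists on a local java.io.File for the environment home.
    File envHome = new File("/tmp/bdb-build");
    envHome.mkdirs();

    EnvironmentConfig envConf = new EnvironmentConfig();
    envConf.setAllowCreate(true);
    Environment env = new Environment(envHome, envConf);

    DatabaseConfig dbConf = new DatabaseConfig();
    dbConf.setAllowCreate(true);
    Database db = env.openDatabase(null, "store", dbConf);
    // ... write the records here ...
    db.close();
    env.close();

    // HDFS only speaks Path, so copy the finished environment up afterwards.
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyFromLocalFile(new Path(envHome.getPath()),
                         new Path("/voldemort/store"));
  }
}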

Regards,

Lamchith



From: Sagar Shukla [mailto:sagar_shu...@persistent.co.in]
Sent: Fri 4/9/2010 12:14 PM
To: common-user@hadoop.apache.org
Subject: RE: Hadoop and BDB Java edition



Hi Lamchith,
 There are a couple of direct solutions available for Voldemort and Hadoop 
integration, e.g. 
http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop/ . These do not 
require the BDB Java edition.

Does this help with your project?

Thanks,
Sagar

-Original Message-
From: lamchith.chathuku...@wipro.com [mailto:lamchith.chathuku...@wipro.com]
Sent: Friday, April 09, 2010 10:40 AM
To: common-user@hadoop.apache.org
Subject: Hadoop and BDB Java edition

Is it advisable to create the BDB file of the BDB Java edition using Hadoop?
I know that the read-only store data and index files for Voldemort can be
generated using Hadoop. As we are using BDB rather than the read-only store
for Voldemort, I have this requirement.



Regards,

Lamchith






What means PacketResponder ...terminating ?

2010-04-09 Thread Al Lias
While searching for an HBase problem I came across these log messages:

...
box00:
/var/log/hadoop/hadoop-hadoop-datanode-box00.log.2010-04-08:2010-04-08
16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 0 for block blk_991235084167234271_101356 terminating
box05:
/var/log/hadoop/hadoop-hadoop-datanode-box05.log.2010-04-08:2010-04-08
16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 1 for block blk_991235084167234271_101356 terminating
box13:
/var/log/hadoop/hadoop-hadoop-datanode-box13.log.2010-04-08:2010-04-08
16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
PacketResponder 2 for block blk_991235084167234271_101356 terminating

As they seem to precede some HBase problem, I would like to understand
what they mean.

Thx for any help,

Al


distributed cache

2010-04-09 Thread janani venkat
Hi,
I am quite new to using hadoop. I set up a single node and tried running
the sample map-reduce programs on it. They worked fine.
1) I want to run the distributed cache code (single node or 2-node cluster)
and view the output, but I don't understand how to specify the input files,
how to set up the path in JobConf.java, or where to add the functions
specified in the instructions.
2) I also want to view the output files (logs).
3) The instructions talk about speculative execution, which is set to true
by default in JobConf. But where exactly can the actual logic of speculative
execution be found in the hadoop installation? I mean the specific code
which gets executed when it is called.


Waiting for guidance..

regards
KulliKarot


Re: Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel

2010-04-09 Thread stephen mulcahy

Allen Wittenauer wrote:

On Apr 8, 2010, at 9:37 AM, stephen mulcahy wrote:

When I run this on the Debian 2.6.32 kernel - over the course of the run, 1 or 
2 datanodes of the cluster enter a state whereby they are no longer responsive 
to network traffic.


How much free memory do you have?


Lots, a few GB



How many tasks per node do you have?


I left this at the default.



What are the service times, etc, on your IO system?  


Can you clarify this query?




Has anyone run into similar problems with their environments? I noticed that 
when the nodes become unresponsive, it often happens when the TeraSort is at


I've always seen Linux nodes go unresponsive when they get memory starved to 
the point that the OOM can't function because it can't allocate enough mem.


Sure, but I can log in to the unresponsive nodes via the console - it's 
just the network that has become unresponsive. To be clear here, I don't 
suspect Hadoop is the root cause of the problem - I suspect either a 
kernel bug or some other operating system level bug. I was wondering if 
others had run into similar problems.


I was also wondering in general what kernel versions and distros people 
are using, especially for larger production clusters.


Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Re: What means PacketResponder ...terminating ?

2010-04-09 Thread Todd Lipcon
Hi Al,

It just means that the write pipeline is tearing itself down. Please see my
response on the hbase list for further explanation of your particular issue.

-Todd

On Fri, Apr 9, 2010 at 12:15 AM, Al Lias al.l...@gmx.de wrote:

 While searching for an HBase problem I came across these log messages:

 ...
 box00:
 /var/log/hadoop/hadoop-hadoop-datanode-box00.log.2010-04-08:2010-04-08
 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
 PacketResponder 0 for block blk_991235084167234271_101356 terminating
 box05:
 /var/log/hadoop/hadoop-hadoop-datanode-box05.log.2010-04-08:2010-04-08
 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
 PacketResponder 1 for block blk_991235084167234271_101356 terminating
 box13:
 /var/log/hadoop/hadoop-hadoop-datanode-box13.log.2010-04-08:2010-04-08
 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
 PacketResponder 2 for block blk_991235084167234271_101356 terminating

 As they seem to precede some HBase problem, I would like to understand
 what they mean.

 Thx for any help,

Al




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel

2010-04-09 Thread Todd Lipcon
On Fri, Apr 9, 2010 at 8:18 AM, stephen mulcahy stephen.mulc...@deri.org wrote:

 Allen Wittenauer wrote:

 On Apr 8, 2010, at 9:37 AM, stephen mulcahy wrote:

 When I run this on the Debian 2.6.32 kernel - over the course of the run,
 1 or 2 datanodes of the cluster enter a state whereby they are no longer
 responsive to network traffic.


 How much free memory do you have?


 Lots, a few GB



 How many tasks per node do you have?


 I left this at the default.



 What are the service times, etc, on your IO system?


 Can you clarify this query?



  Has anyone run into similar problems with their environments? I noticed
 that when the nodes become unresponsive, it often happens when the
 TeraSort is at


 I've always seen Linux nodes go unresponsive when they get memory starved
 to the point that the OOM can't function because it can't allocate enough
 mem.


 Sure, but I can log in to the unresponsive nodes via the console - it's just
 the network that has become unresponsive. To be clear here, I don't suspect
 Hadoop is the root cause of the problem - I suspect either a kernel bug or
 some other operating system level bug. I was wondering if others had run
 into similar problems.


Most likely a kernel bug. In previous versions of Debian there was a buggy
forcedeth driver, for example, that caused machines to drop off the network
under high load. Who knows what new bugs are in 2.6.32, which is brand
spanking new.



 I was also wondering in general what kernel versions and distros people are
 using, especially for larger production clusters.


The overwhelming majority of production clusters run on RHEL 5.3 or RHEL 5.4
in my experience (I'm lumping CentOS 5.3/5.4 in with RHEL here). I know one
or two production clusters running Debian Lenny, but none running something
as new as what you're talking about. Hadoop doesn't exercise the new
features in very recent kernels, so there's no sense accepting instability -
just go with something old that works!

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: distributed cache

2010-04-09 Thread Raghava Mutharaju
Hi,

I can answer the 2nd question.

 2)I also want to view the output files(logs).
 Check the following link. It contains URLs to view the logs on the Web
UI.

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29#Hadoop_Web_Interfaces
.

If that is not possible (the Web UI is the preferred way, at least for me),
then the logs would be in ${HADOOP_LOG_DIR}. The default location is
${HADOOP_HOME}/logs. The relevant logs would be in the userlogs folder. These
2 environment variables are generally set in hadoop-env.sh, so you can check
the values there.
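
For the 1st question, the rough shape with the old JobConf API is something
like the sketch below (the paths are just placeholders); the cache file is
registered in the driver:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheExample.class);

    // The cached file must already be in HDFS; each tasktracker pulls a
    // local copy before the tasks start.
    DistributedCache.addCacheFile(new URI("/user/janani/lookup.txt"), conf);

    FileInputFormat.setInputPaths(conf, new Path("/user/janani/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/janani/output"));
    JobClient.runJob(conf);
  }
}

Inside the mapper's configure(JobConf) you can then call
DistributedCache.getLocalCacheFiles(conf) to get the local copies as a Path[].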

For the 3rd question, are you planning to change the code related to
speculative execution or do you just want to have a look at it?


Regards,
Raghava.

On Fri, Apr 9, 2010 at 6:18 AM, janani venkat jan...@gmail.com wrote:

 Hi,
 I am quite new to using hadoop. I set up a single node and tried running
 the sample map-reduce programs on it. They worked fine.
 1) I want to run the distributed cache code (single node or 2-node cluster)
 and view the output, but I don't understand how to specify the input files,
 how to set up the path in JobConf.java, or where to add the functions
 specified in the instructions.
 2) I also want to view the output files (logs).
 3) The instructions talk about speculative execution, which is set to true
 by default in JobConf. But where exactly can the actual logic of speculative
 execution be found in the hadoop installation? I mean the specific code
 which gets executed when it is called.


 Waiting for guidance..

 regards
 KulliKarot



Install shared library?

2010-04-09 Thread Keith Wiley
My C++ pipes program needs to use a shared library.  What are my options?  Can 
I install this on the cluster in a way that permits HDFS to access it from 
each node as needed?  Can I put it in the distributed cache such that attempts 
to link to the library find it in the cache?  Other options?

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive compulsive and debilitatingly slow.
  -- Keith Wiley






Re: Install shared library?

2010-04-09 Thread Allen Wittenauer

On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote:

 My C++ pipes program needs to use a shared library.  What are my options?  
 Can I install this on the cluster in a way that permits HDFS to access it 
 from each node as needed?  Can I put it in the distributed cache such that 
 attempts to link to the library find it in the cache?  Other options?

Distributed Cache is the way to go.
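
Roughly, the library is added with a #name fragment so a symlink to it shows
up in each task's working directory; a driver-side sketch (the HDFS path and
library name are placeholders):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class ShipNativeLib {
  public static void addLibrary(JobConf conf) throws Exception {
    // Ask the framework to create symlinks in the task working directory.
    DistributedCache.createSymlink(conf);
    // The part after '#' becomes the symlink name, so the pipes binary can
    // load ./libfoo.so at run time.
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode:9000/libs/libfoo.so#libfoo.so"), conf);
  }
}

The -files generic option is essentially the command-line form of the same
thing.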




Re: Install shared library?

2010-04-09 Thread Keith Wiley

On Apr 9, 2010, at 13:43 , Allen Wittenauer wrote:

 
 On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote:
 
 My C++ pipes program needs to use a shared library.  What are my options?  
 Can I install this on the cluster in a way that permits HDFS to access it 
 from each node as needed?  Can I put it in the distributed cache such that 
 attempts to link to the library find it in the cache?  Other options?
 
 Distributed Cache is the way to go.

Okay, I saw some docs on that but I thought they were kinda Javaish.  I wasn't 
sure if it would jive for pipes.  I'll follow up on that.

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

Luminous beings are we, not this crude matter.
  -- Yoda






Re: Install shared library?

2010-04-09 Thread Keith Wiley

On Apr 9, 2010, at 13:43 , Allen Wittenauer wrote:

 
 On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote:
 
 My C++ pipes program needs to use a shared library.  What are my options?  
 Can I install this on the cluster in a way that permits HDFS to access it 
 from each node as needed?  Can I put it in the distributed cache such that 
 attempts to link to the library find it in the cache?  Other options?
 
 Distributed Cache is the way to go.

Suppose the shared library is quite large (or there are numerous required shared 
libraries) and it is therefore costly and tedious to send it (them) to the 
distributed cache for every job.  Is there any way to install them on HDFS 
permanently such that they are found when executing C++ pipes programs?


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
  -- Homer Simpson






Re: Install shared library?

2010-04-09 Thread Keith Wiley

On Apr 9, 2010, at 13:43 , Allen Wittenauer wrote:

 
 On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote:
 
 My C++ pipes program needs to use a shared library.  What are my options?  
 Can I install this on the cluster in a way that permits HDFS to access it 
 from each node as needed?  Can I put it in the distributed cache such that 
 attempts to link to the library find it in the cache?  Other options?
 
 Distributed Cache is the way to go.

Is there any way to simply install all the necessary shared libraries on every 
node of the cluster so they're already there, ready, waiting...and properly 
linkable from an HDFS pipes job, so they don't have to be copied to the 
distributed cache and sent node-to-node around the cluster on every run?


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
  -- Keith Wiley







-files flag question

2010-04-09 Thread Keith Wiley
I'm a little confused about how the -files flag works.  My understanding is that it 
takes two arguments: a file URI (could be local or on HDFS, assumed local if no 
URI scheme is provided) and a short tag representing the file on the 
distributed cache, usually just the name of the file without the long path that 
precedes it in the URI.

But -files can also pass multiple files to the distributed cache, so how does 
this all fit together?  Are odd arguments all URIs and even arguments all 
cache-tags?  Is it that simple?  I'm not really sure how to fit it all together 
if I need to send several files to the distributed cache (several shared 
libraries for example).

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive compulsive and debilitatingly slow.
  -- Keith Wiley