Re: Hadoop overhead
Thank you very much for all the answers. I will definitely try using Hadoop; I hope the results will be good. Kind regards, Aleksandar Stupar.

From: Edward Capriolo edlinuxg...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, April 8, 2010 5:28:00 PM Subject: Re: Hadoop overhead

On Thu, Apr 8, 2010 at 10:51 AM, Patrick Angeles patr...@cloudera.com wrote: Packaging the job and config and sending it to the JobTracker and the various nodes also adds a few seconds of overhead.

On Thu, Apr 8, 2010 at 10:37 AM, Jeff Zhang zjf...@gmail.com wrote: By default, Hadoop creates a new JVM process for each task, which in my opinion is the major cost. You can customize the configuration to let the tasktracker reuse JVMs, which eliminates that overhead to some extent.

On Thu, Apr 8, 2010 at 8:55 PM, Aleksandar Stupar stupar.aleksan...@yahoo.com wrote: Hi all, As I understand it, Hadoop is mainly used for tasks that take a long time to execute. I'm considering using Hadoop for a task whose lower bound in distributed execution is around 5 to 10 seconds, and I'm wondering what the overhead of using Hadoop would be. Does anyone have an idea, or a link where I can find this out? Thanks, Aleksandar. -- Best Regards, Jeff Zhang

All jobs make entries in a jobhistory directory on the task tracker. As of now the jobhistory directory has some limitations: with ext3 you hit the maximum number of files per directory at 32K; with xfs or ext4 there is no theoretical limit, but Hadoop itself will bog down if the directory gets too large. If you want to do this, enable JVM reuse as mentioned above to shorten job start times, and be prepared to write some shell scripts to handle cleanup tasks. Edward
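For reference, JVM reuse can be enabled per job as well as cluster-wide (the property is mapred.job.reuse.jvm.num.tasks, which can also be set in mapred-site.xml). A minimal, untested sketch using the 0.20-era JobConf API -- the class name here is just a placeholder:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // -1 lets one JVM run an unlimited number of tasks from the same job;
            // any value greater than 1 caps how many tasks each JVM runs before exiting.
            conf.setNumTasksToExecutePerJvm(-1); // sets mapred.job.reuse.jvm.num.tasks
            System.out.println(conf.get("mapred.job.reuse.jvm.num.tasks"));
        }
    }

Setting the same property in mapred-site.xml would make it the default for every job on the cluster.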
RE: Hadoop and BDB Java edition
Hi Lamchith, There are a couple of direct solutions available for Voldemort and Hadoop integration, e.g. http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop/ . This does not require BDB Java Edition. Does this help with your project? Thanks, Sagar

-Original Message- From: lamchith.chathuku...@wipro.com [mailto:lamchith.chathuku...@wipro.com] Sent: Friday, April 09, 2010 10:40 AM To: common-user@hadoop.apache.org Subject: Hadoop and BDB Java edition

Is it advisable to create the BDB file of BDB Java Edition using Hadoop? I know that the data and index files for a Voldemort read-only store can be generated using Hadoop, but as we are using BDB rather than the read-only store for Voldemort, I have this requirement. Regards, Lamchith
RE: Hadoop and BDB Java edition
Hi Sagar, Thank you for your reply. I have seen that, but it is not suited to my case: it talks about creating a read-only store using Hadoop. What I want to know is whether the same can be done for BDB JE, i.e. creating the .bdb file using Hadoop. I am aware of the following issue with using the BDB JE API: com.sleepycat.je.Environment needs a java.io.File as input to specify the location of the database, whereas on HDFS org.apache.hadoop.fs.Path is used to give the location of the .index and .data files for the read-only store, as described in the link mentioned below. Regards, Lamchith

From: Sagar Shukla [mailto:sagar_shu...@persistent.co.in] Sent: Fri 4/9/2010 12:14 PM To: common-user@hadoop.apache.org Subject: RE: Hadoop and BDB Java edition

Hi Lamchith, There are a couple of direct solutions available for Voldemort and Hadoop integration, e.g. http://project-voldemort.com/blog/2009/06/voldemort-and-hadoop/ . This does not require BDB Java Edition. Does this help with your project? Thanks, Sagar

-Original Message- From: lamchith.chathuku...@wipro.com [mailto:lamchith.chathuku...@wipro.com] Sent: Friday, April 09, 2010 10:40 AM To: common-user@hadoop.apache.org Subject: Hadoop and BDB Java edition

Is it advisable to create the BDB file of BDB Java Edition using Hadoop? I know that the data and index files for a Voldemort read-only store can be generated using Hadoop, but as we are using BDB rather than the read-only store for Voldemort, I have this requirement. Regards, Lamchith
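One possible workaround -- purely an untested sketch, with all paths and names as placeholders -- is to let each task build its JE environment on the local filesystem (since Environment insists on a java.io.File) and then copy the finished .jdb log files into HDFS with the FileSystem API:

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;

    public class LocalBdbToHdfs {
        public static void main(String[] args) throws Exception {
            // Build the BDB JE environment on local disk (inside a reducer you
            // would use the task's local working directory instead of /tmp).
            File envHome = new File("/tmp/bdb-env");   // placeholder path
            envHome.mkdirs();
            EnvironmentConfig envConf = new EnvironmentConfig();
            envConf.setAllowCreate(true);
            Environment env = new Environment(envHome, envConf);

            DatabaseConfig dbConf = new DatabaseConfig();
            dbConf.setAllowCreate(true);
            Database db = env.openDatabase(null, "mystore", dbConf); // placeholder name
            db.put(null, new DatabaseEntry("key".getBytes("UTF-8")),
                         new DatabaseEntry("value".getBytes("UTF-8")));
            db.close();
            env.close(); // flushes the .jdb log files to envHome

            // Copy the finished environment directory into HDFS.
            FileSystem fs = FileSystem.get(new Configuration());
            fs.copyFromLocalFile(new Path(envHome.getAbsolutePath()),
                                 new Path("/user/lamchith/bdb-env")); // placeholder HDFS path
        }
    }

Note that JE itself cannot open an environment directly from HDFS; anything that wants to read it would have to copy it back to local disk first, which is essentially what Voldemort avoids by using its HDFS-friendly read-only store format.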
What does "PacketResponder ... terminating" mean?
While searching for an HBase problem I came across these log messages:

... box00: /var/log/hadoop/hadoop-hadoop-datanode-box00.log.2010-04-08:2010-04-08 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_991235084167234271_101356 terminating

box05: /var/log/hadoop/hadoop-hadoop-datanode-box05.log.2010-04-08:2010-04-08 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_991235084167234271_101356 terminating

box13: /var/log/hadoop/hadoop-hadoop-datanode-box13.log.2010-04-08:2010-04-08 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_991235084167234271_101356 terminating

As they seem to precede the HBase problem, I would like to understand what they mean. Thanks for any help, Al
distributed cache
Hi, I am quite new to Hadoop. I set up a single-node cluster and tried running the sample MapReduce programs, and they worked fine.

1) I want to run the distributed cache code (on a single node or a 2-node cluster) and view the output, but I don't understand how to specify the input files, set up the path in JobConf, or where to add the functions specified in the instructions.
2) I also want to view the output files (logs).
3) The documentation talks about speculative execution, which is set to true by default in JobConf. Where exactly can the actual logic of speculative execution be found in the Hadoop installation? I mean the specific code that gets executed when it is invoked.

Waiting for guidance. Regards, KulliKarot
Re: Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel
Allen Wittenauer wrote:

On Apr 8, 2010, at 9:37 AM, stephen mulcahy wrote: When I run this on the Debian 2.6.32 kernel, over the course of the run 1 or 2 datanodes of the cluster enter a state whereby they are no longer responsive to network traffic.

How much free memory do you have?

Lots, a few GB.

How many tasks per node do you have?

I left this at the default.

What are the service times, etc, on your IO system?

Can you clarify this query?

Has anyone run into similar problems with their environments? I noticed that when the nodes become unresponsive, it often happens when the TeraSort is at

I've always seen Linux nodes go unresponsive when they get memory starved to the point that the OOM killer can't function because it can't allocate enough memory.

Sure, but I can log in to the unresponsive nodes via the console - it's just the network that has become unresponsive. To be clear, I don't suspect Hadoop is the root cause of the problem - I suspect either a kernel bug or some other operating-system-level bug. I was wondering if others had run into similar problems. I was also wondering in general what kernel versions and distros people are using, especially for larger production clusters.

Thanks, -stephen -- Stephen Mulcahy, DI2, Digital Enterprise Research Institute, NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland http://di2.deri.ie http://webstar.deri.ie http://sindice.com
Re: What does "PacketResponder ... terminating" mean?
Hi Al, It just means that the write pipeline is tearing itself down. Please see my response on the hbase list for further explanation of your particular issue. -Todd

On Fri, Apr 9, 2010 at 12:15 AM, Al Lias al.l...@gmx.de wrote: While searching for an HBase problem I came across these log messages:

... box00: /var/log/hadoop/hadoop-hadoop-datanode-box00.log.2010-04-08:2010-04-08 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_991235084167234271_101356 terminating

box05: /var/log/hadoop/hadoop-hadoop-datanode-box05.log.2010-04-08:2010-04-08 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_991235084167234271_101356 terminating

box13: /var/log/hadoop/hadoop-hadoop-datanode-box13.log.2010-04-08:2010-04-08 16:39:29,200 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_991235084167234271_101356 terminating

As they seem to precede the HBase problem, I would like to understand what they mean. Thanks for any help, Al

-- Todd Lipcon Software Engineer, Cloudera
Re: Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel
On Fri, Apr 9, 2010 at 8:18 AM, stephen mulcahy stephen.mulc...@deri.org wrote:

Allen Wittenauer wrote: On Apr 8, 2010, at 9:37 AM, stephen mulcahy wrote: When I run this on the Debian 2.6.32 kernel, over the course of the run 1 or 2 datanodes of the cluster enter a state whereby they are no longer responsive to network traffic.

How much free memory do you have?

Lots, a few GB.

How many tasks per node do you have?

I left this at the default.

What are the service times, etc, on your IO system?

Can you clarify this query?

Has anyone run into similar problems with their environments? I noticed that when the nodes become unresponsive, it often happens when the TeraSort is at

I've always seen Linux nodes go unresponsive when they get memory starved to the point that the OOM killer can't function because it can't allocate enough memory.

Sure, but I can log in to the unresponsive nodes via the console - it's just the network that has become unresponsive. To be clear, I don't suspect Hadoop is the root cause of the problem - I suspect either a kernel bug or some other operating-system-level bug. I was wondering if others had run into similar problems.

Most likely a kernel bug. In previous versions of Debian there was, for example, a buggy forcedeth driver that caused machines to drop off the network under high load. Who knows what new bugs are in 2.6.32, which is brand spanking new.

I was also wondering in general what kernel versions and distros people are using, especially for larger production clusters.

The overwhelming majority of production clusters run on RHEL 5.3 or RHEL 5.4 in my experience (I'm lumping CentOS 5.3/5.4 in with RHEL here). I know one or two production clusters running Debian Lenny, but none running something as new as what you're talking about. Hadoop doesn't exercise the new features in very recent kernels, so there's no sense accepting instability - just go with something old that works!

-Todd -- Todd Lipcon Software Engineer, Cloudera
Re: distributed cache
Hi, I can answer the 2nd question.

2) I also want to view the output files (logs).

Check the following link; it contains the URLs for viewing the logs in the web UI: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29#Hadoop_Web_Interfaces . If that is not possible (the web UI is the preferred way, at least for me), then the logs would be in ${HADOOP_LOG_DIR}; the default location is ${HADOOP_HOME}/logs, and the relevant logs would be in the userlogs folder. These two environment variables are generally set in hadoop-env.sh, so you can check the values there.

For the 3rd question, are you planning to change the code related to speculative execution or do you just want to have a look at it?

Regards, Raghava.

On Fri, Apr 9, 2010 at 6:18 AM, janani venkat jan...@gmail.com wrote: Hi, I am quite new to Hadoop. I set up a single-node cluster and tried running the sample MapReduce programs, and they worked fine. 1) I want to run the distributed cache code (on a single node or a 2-node cluster) and view the output, but I don't understand how to specify the input files, set up the path in JobConf, or where to add the functions specified in the instructions. 2) I also want to view the output files (logs). 3) The documentation talks about speculative execution, which is set to true by default in JobConf. Where exactly can the actual logic of speculative execution be found in the Hadoop installation? I mean the specific code that gets executed when it is invoked. Waiting for guidance. Regards, KulliKarot
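Regarding the 1st question (registering a file with the distributed cache from the job driver), a minimal, untested sketch using the old JobConf-based API might look like the following; the class name and paths are placeholders, and the mapper would locate its local copy of the cached file in its configure() method via DistributedCache.getLocalCacheFiles(conf):

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheExample.class);
            conf.setJobName("cache-example");

            // Ordinary job input and output, as in the bundled WordCount example.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // The cached file must already be in HDFS; every task then gets a
            // local copy, found via DistributedCache.getLocalCacheFiles(conf).
            DistributedCache.addCacheFile(new Path("/user/kulli/lookup.txt").toUri(), conf); // placeholder

            JobClient.runJob(conf);
        }
    }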
Install shared library?
My C++ pipes program needs to use a shared library. What are my options? Can I install this on the cluster in a way that permits it to be accessed via HDFS from each node as needed? Can I put it in the distributed cache such that attempts to link against the library find it in the cache? Other options? Thanks. Keith Wiley kwi...@keithwiley.com www.keithwiley.com It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive compulsive and debilitatingly slow. -- Keith Wiley
Re: Install shared library?
On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote: My C++ pipes program needs to use a shared library. What are my options? Can I install this on the cluster in a way that permits it to be accessed via HDFS from each node as needed? Can I put it in the distributed cache such that attempts to link against the library find it in the cache? Other options?

Distributed Cache is the way to go.
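For a pipes job specifically, one approach -- an untested sketch, with the library name and HDFS path as placeholders -- is to put the .so in HDFS, register it as a cache file with a '#' fragment so it gets symlinked into each task's working directory, and make sure the pipes binary can actually find it there (e.g. by linking with an rpath of $ORIGIN or otherwise putting the working directory on the library search path, since the dynamic linker does not search the current directory by default):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class ShipSharedLib {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            // The '#libmylib.so' fragment names the symlink created in the
            // task's working directory, next to the pipes executable.
            DistributedCache.addCacheFile(
                new URI("hdfs:///user/keith/libs/libmylib.so#libmylib.so"), conf); // placeholder
            DistributedCache.createSymlink(conf); // ask the framework to create the symlinks
        }
    }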
Re: Install shared library?
On Apr 9, 2010, at 13:43 , Allen Wittenauer wrote: On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote: My C++ pipes program needs to use a shared library. What are my options? Can I install this on the cluster in a way that permits it to be accessed via HDFS from each node as needed? Can I put it in the distributed cache such that attempts to link against the library find it in the cache? Other options? Distributed Cache is the way to go.

Okay, I saw some docs on that but I thought they were kinda Java-ish. I wasn't sure if it would work for pipes. I'll follow up on that. Thanks.

Keith Wiley kwi...@keithwiley.com www.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
Re: Install shared library?
On Apr 9, 2010, at 13:43 , Allen Wittenauer wrote: On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote: My C++ pipes program needs to use a shared library. What are my options? Can I install this on the cluster in a way that permits it to be accessed via HDFS from each node as needed? Can I put it in the distributed cache such that attempts to link against the library find it in the cache? Other options? Distributed Cache is the way to go.

Suppose the shared library is quite large (or there are numerous required shared libraries) and it is therefore costly and tedious to send it (or them) to the distributed cache for every job. Is there any way to install them on HDFS permanently such that they are found when executing C++ pipes programs?

Keith Wiley kwi...@keithwiley.com www.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
Re: Install shared library?
On Apr 9, 2010, at 13:43 , Allen Wittenauer wrote: On Apr 9, 2010, at 1:22 PM, Keith Wiley wrote: My C++ pipes program needs to use a shared library. What are my options? Can I install this on the cluster in a way that permits it to be accessed via HDFS from each node as needed? Can I put it in the distributed cache such that attempts to link against the library find it in the cache? Other options? Distributed Cache is the way to go.

Is there any way to simply install all the necessary shared libraries on every node of the cluster so they're already there, ready and waiting... and properly linkable from an HDFS pipes job, so they don't have to be copied to the distributed cache and sent node-to-node around the cluster on every run?

Keith Wiley kwi...@keithwiley.com www.keithwiley.com What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered. -- Keith Wiley
-files flag question
I'm a little confused about how the -files flag works. My understanding is that it takes two arguments: a file URI (which could be local or on HDFS, and is assumed local if no URI scheme is provided) and a short tag representing the file in the distributed cache, usually just the name of the file without the long path that precedes it in the URI. But -files can also pass multiple files to the distributed cache, so how does this all fit together? Are odd arguments all URIs and even arguments all cache tags? Is it that simple? I'm not really sure how to fit it all together if I need to send several files to the distributed cache (several shared libraries, for example). Thanks. Keith Wiley kwi...@keithwiley.com www.keithwiley.com It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive compulsive and debilitatingly slow. -- Keith Wiley
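For readers of the archive: as far as I understand it, -files actually takes a single argument that is a comma-separated list of file URIs rather than alternating URI/tag pairs. Each file is copied to the distributed cache and symlinked into the task's working directory under its own base name; if a different link name is wanted, a '#name' fragment can be appended to the URI. So shipping several shared libraries would look something like the following, where the jar, driver class, and library paths are placeholders:

    hadoop jar myjob.jar MyDriver -files /local/libs/libfoo.so,/local/libs/libbar.so input output

Note that -files is handled by the generic option parser, so it must appear before the job's own arguments.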