Re: When applying a patch, which attachment should I use?
Dear Adarsh,

My situation is somewhat different from yours, as I am only running Hadoop and HBase (as opposed to Hadoop/Hive/HBase), but I hope my experience can be of some help. I applied the hdfs-630-0.20-append.patch to every single Hadoop node (including master and slaves). Then I followed exactly what they told me to do on http://hbase.apache.org/docs/current/api/overview-summary.html#overview_description . I didn't get a single error message and successfully started HBase in fully distributed mode. I am not using Hive, so I can't tell what caused the MasterNotRunningException, but the patch above is meant to allow DFSClients to pass the NameNode lists of known dead Datanodes. I doubt that the patch has anything to do with the MasterNotRunningException. Hope this helps.

Regards, Ed

2011/1/13 Adarsh Sharma adarsh.sha...@orkash.com: I am also facing some issues and I think applying hdfs-630-0.20-append.patch https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch would solve my problem. I am trying to run the Hadoop/Hive/HBase integration in fully distributed mode, but I am facing the MasterNotRunningException mentioned in http://wiki.apache.org/hadoop/Hive/HBaseIntegration. My Hadoop version is 0.20.2, Hive 0.6.0, HBase 0.20.6. What do you think, Edward? Thanks, Adarsh

edward choi wrote: I am not familiar with this whole svn and patch stuff, so please bear with my asking. I was going to apply hdfs-630-0.20-append.patch https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch only because I wanted to install HBase and the installation guide told me to. The append branch you mentioned, does that include hdfs-630-0.20-append.patch as well? Is it like the latest patch with all the good stuff packed in one? Regards, Ed

2011/1/12 Ted Dunning tdunn...@maprtech.com: You may also be interested in the append branch: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/

On Tue, Jan 11, 2011 at 3:12 AM, edward choi mp2...@gmail.com wrote: Thanks for the info. I am currently using Hadoop 0.20.2, so I guess I only need to apply hdfs-630-0.20-append.patch. I wasn't familiar with the term trunk; I guess it means the latest development line. Thanks again. Best Regards, Ed

2011/1/11 Konstantin Boudnik c...@apache.org: Yeah, that's pretty crazy all right. In your case it looks like the 3 patches at the top are the latest for the 0.20-append branch, the 0.21 branch, and trunk (which is perhaps the 0.22 branch at the moment). It doesn't look like you need to apply all of them - just try the latest for your particular branch. The mess is caused by the fact that people use different names for successive patches (as in file.1.patch, file.2.patch, etc.). This is _very_ confusing indeed, especially when different contributors work on the same fix/feature. -- Take care, Konstantin (Cos) Boudnik

On Mon, Jan 10, 2011 at 01:10, edward choi mp2...@gmail.com wrote: Hi, for the first time I am about to apply a patch to HDFS. https://issues.apache.org/jira/browse/HDFS-630 is the one that I am trying to apply. But there are like 15 patches and I don't know which one to use. Could anyone tell me if I need to apply them all or just the one at the top? The whole patching process is just so confusing :-( Ed
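For anyone new to the patch workflow, a minimal sketch of applying a single JIRA attachment to a 0.20.2 source tree might look like this. The paths and the -p strip level are assumptions - patches of this era were usually generated from the source root with svn diff, hence -p0:

  # download the attachment from the JIRA issue
  wget https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch
  cd hadoop-0.20.2
  # dry-run first to confirm the patch applies cleanly to this branch
  patch -p0 --dry-run < ../hdfs-630-0.20-append.patch
  patch -p0 < ../hdfs-630-0.20-append.patch
  # rebuild, then deploy the resulting jar to every node (master and slaves)
  ant jar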
Why does Hadoop use HTTP for file transmission between Map and Reduce?
Hi all, I have a question about the file transmission between the Map and Reduce stages. In the current implementation, the Reducers get the results generated by the Mappers through HTTP GET. I don't understand why HTTP was selected - why not FTP, or a self-developed protocol? Is it just because HTTP is simple? Thanks, Nan
Re: Why does Hadoop use HTTP for file transmission between Map and Reduce?
That is also my concern. Is HTTP efficient for data transmission?

On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu zhunans...@gmail.com wrote: Hi all, I have a question about the file transmission between the Map and Reduce stages. In the current implementation, the Reducers get the results generated by the Mappers through HTTP GET. I don't understand why HTTP was selected - why not FTP, or a self-developed protocol? Is it just because HTTP is simple? Thanks, Nan

-- Li Ping
Re: TeraSort question.
On 11/01/11 16:40, Raj V wrote: Ted, thanks. I have all the graphs I need, including the map/reduce timeline and system activity for all the nodes while the sort was running. I will publish them once I have them in some presentable format. For legal reasons, I really don't want to send the complete job history files. My question is still this: when running TeraSort, would the CPU, disk and network utilization of all the nodes be more or less similar, or completely different?

They can be different. The JT pushes out work to machines when they report in; some may get more work than others, and so generate more local data. This will have follow-on consequences. In a live system things are different, as the work tends to follow the data, so machines with (or near) the data you need get the work. It's a really hard thing to say whether a cluster is working right; when bringing one up, everyone is really guessing about expected performance.

-Steve
Re: Why does Hadoop use HTTP for file transmission between Map and Reduce?
On 13/01/11 08:34, li ping wrote: That is also my concern. Is HTTP efficient for data transmission?

It's long-lived TCP connections, reasonably efficient for bulk data transfer, has all the throttling of TCP built in, and comes with some excellently debugged client and server code in the form of Jetty and HttpClient. In maintenance costs alone, those libraries justify HTTP unless you have a vastly superior option *and are willing to maintain it forever*. FTP's limits are well known (security); NFS's limits are well known (security, and the UDP version doesn't throttle); self-developed protocols will have whatever problems you want.

There are better protocols for long-haul data transfer over fat pipes, such as GridFTP and PhEDEx ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP channels in parallel to reduce the impact of a single lost packet, but within a datacentre you shouldn't have to worry about this. If you do find lots of packets get lost, raise the issue with the networking team.

-Steve
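To make the point concrete: the shuffle really is a plain HTTP GET against the TaskTracker's embedded Jetty server. A sketch of what a reducer's fetch looks like in 0.20, assuming the default TaskTracker HTTP port (50060) and its mapOutput servlet - the host name and task IDs below are hypothetical:

  # fetch one map output segment the way a reducer would (illustrative only)
  curl -s -o segment.out \
    'http://tasktracker01:50060/mapOutput?job=job_201101130001_0001&map=attempt_201101130001_0001_m_000000_0&reduce=0'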
Re: Why does Hadoop use HTTP for file transmission between Map and Reduce?
Actually, PhEDEx uses GridFTP for its data transfers.

On Thu, Jan 13, 2011 at 5:34 AM, Steve Loughran ste...@apache.org wrote: There are better protocols for long-haul data transfer over fat pipes, such as GridFTP and PhEDEx ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP channels in parallel to reduce the impact of a single lost packet, but within a datacentre you shouldn't have to worry about this.
About hadoop-..-examples.jar
Hi guys, does anyone know where I can get the package hadoop-..-examples.jar? I want to use TeraSort from it. It seems this package is not included in the Hadoop source code, and I also failed to find a download link on its homepage. -- Best Regards! Sincerely, Bo Sang
Re: About hadoop-..-examples.jar
The examples package is in the MapReduce trunk. Note that it is under a different source directory - src/examples, not src/java. See also http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/examples/org/apache/hadoop/examples/terasort/

Nicholas

From: Bo Sang sampl...@gmail.com To: Hadoop user mail list common-user@hadoop.apache.org Sent: Thu, January 13, 2011 11:23:44 AM Subject: About hadoop-..-examples.jar

Hi guys, does anyone know where I can get the package hadoop-..-examples.jar? I want to use TeraSort from it. It seems this package is not included in the Hadoop source code, and I also failed to find a download link on its homepage. -- Best Regards! Sincerely, Bo Sang
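For what it's worth, the examples jar also ships in the root of the release tarball (hadoop-0.20.2-examples.jar for 0.20.2), so you may not need to build from trunk at all. Running TeraSort is then a two-step job; the row count and paths below are just an illustration:

  # generate 1,000,000 rows of input, then sort them
  hadoop jar hadoop-0.20.2-examples.jar teragen 1000000 /user/bo/tera-in
  hadoop jar hadoop-0.20.2-examples.jar terasort /user/bo/tera-in /user/bo/tera-out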
Re: Why does Hadoop use HTTP for file transmission between Map and Reduce?
At some point we'll replace Jetty in the shuffle, because it imposes too much overhead, and go to Netty or some other lower-level library. I don't think that using HTTP itself adds that much overhead, although it would be interesting to measure. -- Owen
Re: MultipleOutputs Performance?
On 12/10/2010 02:16 PM, Harsh J wrote: Hi, On Thu, Dec 2, 2010 at 10:40 PM, Matt Tanquary matt.tanqu...@gmail.com wrote: I am using MultipleOutputs to split a mapper input into about 20 different files. Adding this split has had an extremely adverse effect on performance. Is MultipleOutputs known to perform slowly? There was a bug in MultipleOutputs which could've led to this. It has been fixed in MAPREDUCE-1853. It should be in the next 0.21 maintenance release as well as 0.22 (and in the next CDH3, if you are using that).

Is there any workaround to this issue for those of us who are still running 0.20? I have a job that very much lends itself to using the MultipleOutputs functionality, but this bug is absolutely crushing the job's performance. Are there any ways to fix or work around this issue without having to a) upgrade our cluster to 0.21, or b) completely re-write my job? Thanks, DR
Re: When applying a patch, which attachment should I use?
Dear Adarsh,

I have a single machine running the Namenode/JobTracker/HBase Master. There are 17 machines running Datanode/TaskTracker. Among those 17 machines, 14 are running HBase Regionservers and the other 3 are running Zookeeper.

About Zookeeper: HBase comes with its own Zookeeper, so you don't need to install a separate one (except in one special case, which I'll explain later). I assigned 14 machines as regionservers using $HBASE_HOME/conf/regionservers. I assigned 3 machines as Zookeepers using the hbase.zookeeper.quorum property in $HBASE_HOME/conf/hbase-site.xml. Don't forget to set export HBASE_MANAGES_ZK=true in $HBASE_HOME/conf/hbase-env.sh (this is where you announce that you will be using the Zookeeper that comes with HBase). This way, when you execute $HBASE_HOME/bin/start-hbase.sh, HBase will automatically start Zookeeper first, then start the HBase daemons.

Alternatively, you can install your own Zookeeper and tell HBase to use it instead of its own. I read on the internet that the Zookeeper that comes with HBase does not work properly on Windows 7 64-bit ( http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/ ), so in that case you need to install your own Zookeeper, set it up properly, and tell HBase to use it. All you need to do is configure zoo.cfg and add it to the HBase CLASSPATH, and don't forget to set export HBASE_MANAGES_ZK=false in $HBASE_HOME/conf/hbase-env.sh. This way, HBase will not start Zookeeper automatically.

About separating Zookeepers from regionservers: yes, it is recommended, but it won't be necessary unless your cluster is very heavily loaded. They also suggest that you give Zookeeper its own hard disk, but I haven't done that myself yet (hard disks cost money, you know). So I'd say your cluster seems fine, but when you want to expand it you'd need some changes. I suggest you take a look at Hadoop: The Definitive Guide.

Regards, Edward

2011/1/13 Adarsh Sharma adarsh.sha...@orkash.com: Thanks Edward, can you describe the architecture used in your configuration? For example, I have a cluster of 10 servers: 1 node acts as ( Namenode, Jobtracker, Hmaster ) and the remaining 9 nodes act as ( Slaves, Datanodes, Tasktracker, Hregionservers ). Among these 9 nodes I also set 3 nodes in the hbase.zookeeper.quorum property. I want to know whether it is necessary to configure Zookeeper separately with the zookeeper-3.2.2 package, or whether it is enough to have some IPs listed in hbase.zookeeper.quorum and HBase takes care of it. Can we specify IPs of Hregionservers as Zookeeper servers ( HQuorumPeer ), or do we need separate servers for it? My problem arises in running Zookeeper. My HBase is up and running in fully distributed mode too. With Best Regards, Adarsh Sharma
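For reference, the three configuration pieces described above, as a minimal sketch - the host names are hypothetical:

  # $HBASE_HOME/conf/regionservers -- one regionserver host per line
  rs01
  rs02
  # ... through rs14

  # $HBASE_HOME/conf/hbase-site.xml -- the ZooKeeper quorum hosts
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk01,zk02,zk03</value>
  </property>

  # $HBASE_HOME/conf/hbase-env.sh -- let HBase manage its own ZooKeeper
  export HBASE_MANAGES_ZK=true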
cannot connect from slaves
Hi, my file-listing command hadoop fs -ls hdfs://master-url/ works locally on the master, but I cannot connect from any of the slaves. What should I check for? Thank you, Mark
found an inconsistent entry in 0.21 API
I searched for the MultipleOutputs class on Google and found a 0.21 API documentation page that describes the class in the new version of Hadoop, but the downloaded jar file doesn't include this class. There are also a few errors in the example on the MultipleOutputs API documentation page.
Re: cannot connect from slaves
Can you connect to the other slaves with ssh or ping?

On Fri, Jan 14, 2011 at 9:02 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, my file-listing command hadoop fs -ls hdfs://master-url/ works locally on the master, but I cannot connect from any of the slaves. What should I check for? Thank you, Mark
Re: cannot connect from slaves
I did this, and tried both ways:

hadoop fs -ls /
11/01/14 02:45:25 INFO ipc.Client: Retrying connect to server: /10.113.118.244:8020. Already tried 0 time(s).
11/01/14 02:45:29 INFO ipc.Client: Retrying connect to server: /10.113.118.244:8020. Already tried 1 time(s).

and

hadoop fs -ls hdfs://10.113.118.244/
11/01/14 02:48:56 INFO ipc.Client: Retrying connect to server: /10.113.118.244:8020. Already tried 0 time(s).

I suspect port 8020 - how do I test it outside of Hadoop? Thank you, Mark

On Thu, Jan 13, 2011 at 8:29 PM, George Datskos george.dats...@jp.fujitsu.com wrote: Hello, on 2011/01/14 10:02, Mark Kerzner wrote: hadoop fs -ls hdfs://master-url/ works locally on the master, but cannot connect from any of the slaves.

Make sure to replicate conf/core-site.xml to each of the slaves. The fs.default.name property should point to the master node; that way the slaves know how to reach the NameNode:

<name>fs.default.name</name>
<value>hdfs://master:8020</value>

(adjust the host name and port as necessary) George
Re: cannot connect from slaves
Hello, it can be a firewall issue. Try telnet from the master and from the slaves and compare:

(master)# telnet master 8020
(slave)# telnet master 8020

On Fri, Jan 14, 2011 at 8:19 AM, Mark Kerzner markkerz...@gmail.com wrote: I suspect port 8020 - how do I test it outside of Hadoop?

-- Regards, Rishi Pathak, National PARAM Supercomputing Facility, Center for Development of Advanced Computing (C-DAC), Pune University Campus, Ganesh Khind Road, Pune, Maharashtra
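If telnet times out from the slaves but connects on the master, it is also worth checking which interface the NameNode is bound to - a NameNode listening only on 127.0.0.1 is a common cause. On the master (Linux):

  # show which address port 8020 is bound to
  netstat -tlnp | grep 8020
  # 0.0.0.0:8020 or the master's LAN IP is reachable from the slaves;
  # 127.0.0.1:8020 is not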
Re: cannot connect from slaves
Probably that's what it was, since after the network change it finally connects.

On Thu, Jan 13, 2011 at 10:44 PM, rishi pathak mailmaverick...@gmail.com wrote: Hello, it can be a firewall issue. Try telnet from the master and from the slaves and compare.