doubt about reduce tasks and block writes
Hey there, I have a doubt about reduce tasks and block writes. Does a reduce task always write its output to HDFS first on the node where it runs (with those blocks then replicated to other nodes)? If so, say I have a cluster of 5 nodes, 4 of them running a DN and a TT and one (node A) running just a DN. When running MR jobs, would map tasks never read from node A? The reasoning: maps have data locality, and if reduce tasks write first to the node where they live, one replica of each output block would always be on a node that has a TT. Node A would only hold blocks created by the framework's replication, since no reduce task would write there directly. Is this correct? Thanks in advance
Re: doubt about reduce tasks and block writes
Marc, see my inline comments. On Fri, Aug 24, 2012 at 4:09 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

> Does a reduce task always write its output to HDFS first on the node where it runs (with those blocks then replicated to other nodes)?

Yes, if there is a DN running on that server (it's possible to run a TT without a DN).

> Would map tasks never read from node A, since node A would only contain blocks created by replication and no reduce task writes there directly?

I believe it's possible for a map task to read from node A's DN. Yes, the JobTracker tries to schedule map tasks on nodes where the data would be local, but it can't always do so. If there's a node with a free map slot, but that node doesn't have the data blocks locally, the JobTracker will assign the map task to that free slot anyway. Some work done (albeit slower than the ideal case because of the increased network IO) is better than no work done.
Re: doubt about reduce tasks and block writes
Assuming that node A only contains replicas, there is no guarantee that its data would never be read. First, you might lose a replica; the copy on node A could then be used to re-create the missing replica. Second, data locality is best-effort. If all the map slots are occupied except one on a node without a replica of the data, then node A is as likely as any other node to be chosen as the source. Regards, Bertrand Dechoux
Re: doubt about reduce tasks and block writes
But since node A has no TT running, it will not run map or reduce tasks. When the reducer node writes the output file, the first block replica is written on the local node, never on node A. So, to answer the question: node A will contain copies of blocks of all output files, but it won't contain the first copy (replica 0) of any output file. I am reasonably sure about this, but there could be corner cases around node failure and the like; I would need to look into the code. Raj
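A quick way to check this theory on a real cluster is to ask the NameNode where the replicas of a job's output actually landed, and compare the listed hosts against the node that ran each reduce task. A minimal sketch (the default path below is hypothetical; pass your own output file as the first argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path(args.length > 0 ? args[0] : "/user/marc/output/part-00000");
        FileStatus status = fs.getFileStatus(p);
        // Ask the NameNode which hosts hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + ", length " + block.getLength());
            for (String host : block.getHosts()) {
                System.out.println("  replica on " + host);
            }
        }
        fs.close();
    }
}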
Re: Reading multiple lines from a microsoft doc in hadoop
And that would help you with performance too. Were you originally planning to have one file per Word document? What is the average size of your Word documents? It probably isn't much, and in that case I am afraid your map startup time won't be negligible. Regards, Bertrand On Fri, Aug 24, 2012 at 8:07 AM, Håvard Wahl Kongsgård haavard.kongsga...@gmail.com wrote: It's much easier if you convert the documents to text first; use http://tika.apache.org/ or some other doc parser. -Håvard On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari siddharth.tiw...@live.com wrote: Hi, I have files in MS Word doc and docx format. They contain entries separated by empty lines. Is it possible for me to read these groups of lines, delimited by the empty lines, one at a time? Also, which InputFormat should I use to read doc and docx? Please help. Cheers, Siddharth Tiwari -- Håvard Wahl Kongsgård Faculty of Medicine Department of Mathematical Sciences NTNU http://havard.security-review.net/ -- Bertrand Dechoux
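As a concrete sketch of the convert-to-text suggestion (assuming the tika-parsers jar and its dependencies are on the classpath), the Tika facade turns a .doc/.docx into plain text in a couple of lines:

import java.io.File;
import org.apache.tika.Tika;

public class DocToText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika auto-detects the format (doc, docx, pdf, ...) and extracts plain text;
        // the empty lines between entries survive the conversion.
        String text = tika.parseToString(new File(args[0]));
        System.out.println(text);
    }
}

Running this over the documents before loading them into HDFS keeps the MapReduce side down to ordinary text processing.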
Re: namenode not starting
Did you run the command bin/hadoop namenode -format before starting the namenode? On Fri, Aug 24, 2012 at 12:58 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: Hello, I had a running hadoop cluster. I restarted it and after that the namenode is unable to start. I am getting an error saying that it's not formatted. :( Is it possible to recover the data on HDFS?

2012-08-24 03:17:55,378 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: NameNode is not formatted.
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:270)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:433)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:421)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)
2012-08-24 03:17:55,380 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: NameNode is not formatted.

Regards, Abhay -- Nitin Pawar
RE: Reading multiple lines from a microsoft doc in hadoop
Hi, thank you for the suggestion. I was actually using POI to extract the text, but since I now have so many documents, I thought I would use Hadoop directly to parse them as well. The average size of each document is around 120 KB. Also, I want to read multiple lines from the text until I find a blank line. I have no idea about how to design a custom InputFormat and RecordReader; please help with a tutorial, code, or some other resource around it. I am struggling with the issue and would be highly grateful. Thank you so much once again. Cheers, Siddharth Tiwari
Re: namenode not starting
Hi, have you run the command bin/hadoop namenode -format? -- Thanks and Regards, VIVEK KOUL
Re: namenode not starting
Hi Abhay, what is the value of hadoop.tmp.dir or dfs.name.dir? If it is set under /tmp, the contents will be deleted on an OS restart. You need to change this location before you start your NN. Regards, Bejoy KS Sent from handheld, please excuse typos.
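For reference, a minimal hdfs-site.xml sketch that keeps the NameNode metadata off /tmp. The paths are hypothetical examples; dfs.name.dir accepts a comma-separated list (e.g. a local disk plus an NFS mount), and HDFS writes a redundant copy of the metadata to each directory:

<property>
  <!-- Where the NameNode stores fsimage and edits; never leave this under /tmp. -->
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>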
Re: namenode not starting
Hello, I had been using the cluster for a long time and never formatted the namenode; I only ran the bin/stop-all.sh and bin/start-all.sh scripts. I am using NFS for dfs.name.dir, hadoop.tmp.dir is a /tmp directory, and I have not restarted the OS. Any way to recover the data? Thanks, Abhay
Re: Reading multiple lines from a microsoft doc in hadoop
Hello Siddharth, you can tweak NLineInputFormat as per your requirement and use it; unlike TextInputFormat, it lets each map task consume a specified number of lines. There is a good post by Boris and Michael on writing a custom record reader. Also, I would suggest you combine similar files together into one bigger file if feasible, as your files are very small. Regards, Mohammad Tariq
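For what it's worth, a minimal sketch of wiring in NLineInputFormat (new-API class names; in older releases the equivalent lives in org.apache.hadoop.mapred.lib, so check your version). Note that it controls how many lines each map task receives as its split, while each map() call still sees one line at a time, so a custom record reader is still needed for true paragraph records:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "nline-demo");
        job.setJarByClass(NLineJobSetup.class);
        // Each split (and hence each map task) covers 10 input lines.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper/reducer/output settings as usual, then job.waitForCompletion(true) ...
    }
}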
Re: About many user accounts in hadoop platform
Hi Sonal, thanks for the information. Some users want to modify the Hadoop source code, so they want to install their own Hadoop version on the same cluster. After they modify their version, they want to compile and install it without affecting the other users. Does your method cover this? Thanks a lot, May --- Hi, do your users want different versions of Hadoop, or can they share the same Hadoop cluster and schedule their jobs? If the latter, Hadoop can be configured to run for multiple users, and each user can submit their data and jobs to the same cluster. Hence you can maintain a single cluster and utilize your resources more efficiently. You can read more here: http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html http://www.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/ Best Regards, Sonal Crux: Reporting for HBase https://github.com/sonalgoyal/crux Nube Technologies http://www.nubetech.co On Fri, Aug 24, 2012 at 9:13 AM, Li Shengmei lisheng...@ict.ac.cn wrote: Hi all, there are many users on our Hadoop platform. Can they each install their own Hadoop version on the same cluster? I tried to do this but failed. One user account already existed and had Hadoop installed; I created another account and installed Hadoop for it. The logs show "ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to hadoop01/10.3.1.91:9000 : Address already in use". So I changed the port number to 8000, but it still failed. When I run start-all.sh, the namenode can't start; the logs show "ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:lismhadoop cause:java.net.BindException: Address already in use". Can anyone give some suggestions? Thanks, May
Re: About many user accounts in hadoop platform
Hi Li, the approach of everyone having their own version of Hadoop on the same cluster is far too complicated. A better approach would be for each of them to test the patches they are compiling in pseudo-distributed mode on their personal laptops/desktops, and then merge the features or patches into the single version deployed on the cluster. It is never recommended to run multiple versions of Hadoop on a single hardware cluster. Thanks, Nitin -- Nitin Pawar
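The "Address already in use" errors in May's logs are what happens when a second instance binds the same default ports: changing only the NameNode RPC port (9000 to 8000) is not enough, because the HTTP and DataNode ports still collide. If two instances on the same machines really are unavoidable, every daemon of the second instance needs distinct ports and directories. A hypothetical fragment for the second user's configuration (all ports and paths are made-up examples):

<!-- core-site.xml of the second instance -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop01:8020</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/user2/hadoop-tmp</value>
</property>

<!-- hdfs-site.xml of the second instance: web UI and DataNode ports must differ too -->
<property>
  <name>dfs.http.address</name>
  <value>hadoop01:51070</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:51010</value>
</property>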
Re: Hadoop on EC2 Managing Internal/External IPs
Hi, a VPN, or simply first uploading the files to an EC2 node, is the best option, but an alternative is to use the external interface/IP instead of the internal one in the Hadoop config. I assume this will be slower and more costly... -Håvard On Fri, Aug 24, 2012 at 4:54 AM, igor Finkelshteyn iefin...@gmail.com wrote: I've seen a bunch of people with this exact same question all over Google with no answers. I know people run successful non-temporary clusters in EC2. Is there really no one who has needed to deal with having EC2 expose external addresses instead of internal addresses before? This seems like it should be a common thing. On Aug 23, 2012, at 12:34 PM, igor Finkelshteyn wrote: Hi, I'm currently setting up a Hadoop cluster on EC2. Everything works just fine when accessing the cluster from inside EC2, but as soon as I try to do something like upload a file from an external client, I get timeout errors like:

12/08/23 12:06:16 ERROR hdfs.DFSClient: Failed to close file /user/some_file._COPYING_
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.123.x.x:50010]

What's clearly happening is that my NameNode resolves my DataNodes' IPs to their internal EC2 values instead of their external values, and then sends the internal IP to my external client, which is obviously unable to reach it. I'm thinking this must be a common problem. How do other people deal with it? Is there a way to force my NameNode to send along my DataNodes' hostnames instead of IPs, so that the hostnames can be resolved properly from whatever box is sending files? Eli -- Håvard Wahl Kongsgård Faculty of Medicine Department of Mathematical Sciences NTNU http://havard.security-review.net/
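One config knob that may help here (an assumption on my part; it exists in the 1.x-era codebase, but verify it in your release) is slave.host.name, which pins the hostname a DataNode registers with the NameNode. If each DataNode registers under its EC2 public DNS name, the NameNode hands clients a name rather than an internal IP, and EC2 public DNS names resolve to the internal IP from inside EC2 and to the external IP from outside. A hypothetical fragment for one DataNode's hdfs-site.xml (the hostname below is made up):

<property>
  <!-- Hostname this DataNode reports to the NameNode; assumed knob, check your version. -->
  <name>slave.host.name</name>
  <value>ec2-203-0-113-7.compute-1.amazonaws.com</value>
</property>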
Re: namenode not starting
You should start with a reboot of the system. A lesson to everyone: this is exactly why you should have a secondary namenode (http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F) and run the namenode on a mirrored RAID-5/10 disk. -Håvard
hadoop download path missing
All the links at: http://www.apache.org/dyn/closer.cgi/hadoop/common/ are returning 404s, even the backup site at: http://www.us.apache.org/dist/hadoop/common/. However, the eu site: http://www.eu.apache.org/dist/hadoop/common/ does work. -Steven Willis
Re: hadoop download path missing
I just tried and could get to http://apache.techartifact.com/mirror/hadoop/common/hadoop-2.0.1-alpha/ Is this still happening for you? Best Regards, Sonal Crux: Reporting for HBase https://github.com/sonalgoyal/crux Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal
RE: namenode not starting
Hi Abhay, I fully agree with Bejoy. Can you paste your mapred-site.xml and hdfs-site.xml content here? Cheers, Siddharth Tiwari

From: lle...@ddn.com To: user@hadoop.apache.org Subject: RE: namenode not starting Date: Fri, 24 Aug 2012 16:38:01 +0000

Abhay, it sounds like your namenode cannot find the metadata it needs to start (the path/current | image | checkpoints etc.). Basically, if you cannot locate that data locally or on your NFS server, your cluster is busted. But let's be optimistic about this: there is a chance that your NFS server is down or that the mounted path was lost. Since it is NFS mounted (as you suggested), check that your host still has that path mounted from the proper NFS server ([shell] mount will tell you). Obviously, if you originally mounted foo:/mydata and now mount bar:/mydata, you'll need to do some digging to find which NFS server it was writing to before. If you fail to locate your namenode metadata (locally or on any of your NFS servers), either because the NFS server decided to become a black hole or because someone or something removed it, and you don't have a backup of your namenode (tape or secondary namenode), I think you are in a world of hurt. In theory you can read the blocks on the DNs and try to recover some of your data (assuming it is not compressed with a codec). Hmm... anyone know about recovery services? (^^)
RE: How do we view the blocks of a file in HDFS
Hi Abhishek, you can use fsck for this purpose: hadoop fsck <HDFS path> -files -blocks -locations displays what you want. Cheers, Siddharth Tiwari From: abhisheksgum...@gmail.com Date: Fri, 24 Aug 2012 22:10:37 +0530 Subject: How do we view the blocks of a file in HDFS To: user@hadoop.apache.org Hi, if I push a file into HDFS running on a 4-node cluster with 1 namenode and 3 datanodes, how can I view where on the datanodes the blocks of this file are? I would like to view the blocks and their replicas individually. How can I do this? The answer is very critical for my current task, which is halted :) A detailed answer will be highly appreciated. Thank you! With Regards, Abhishek S
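For example, for a hypothetical file /user/abhishek/data.txt:

hadoop fsck /user/abhishek/data.txt -files -blocks -locations

-files lists the file, -blocks adds each block's ID and length, and -locations appends the datanode addresses holding every replica (add -racks to see the rack placement as well).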
easy mv or heavy mv
Hi there, I just want to know: for hadoop dfs -mv, does 'mv' only change the metadata, or does it really copy data around on HDFS? Thank you very much!
Re: questions about CDH Version 4.0.1
Please email the CDH lists. On Aug 24, 2012, at 2:34 AM, jing wang wrote: Hi, I'm curious about what the release notes say: http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0+91.releasenotes.html Is CDH4 based on Release 2.0.0-alpha (http://hadoop.apache.org/common/releases.html#23+May%2C+2012%3A+Release+2.0.0-alpha+available)? Are there limitations on the CDH packages? Any advice will be appreciated! Thanks, Best Regards, Jing Wang -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
RE: Reading multiple lines from a microsoft doc in hadoop
Any help on the below would be really appreciated; I am stuck with it. Cheers, Siddharth Tiwari

Hi Team, thanks a lot for the many good suggestions. I wrote a custom input format for reading one paragraph at a time, but when I use it I still get single lines. Can you please suggest what changes I must make to read one paragraph at a time, separated by blank lines? Below is the code I wrote:

import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

/**
 * @author 460615
 */
// FileInputFormat is the base class for all file-based InputFormats
public class ParaInputFormat extends FileInputFormat<LongWritable, Text> {

    private String nullRegex = "^\\s*$";
    public String StrLine = null;

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        context.setStatus(genericSplit.toString());
        return new LineRecordReader();
    }

    public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(conf);
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, conf);
                Text line = new Text();
                long begin = 0;
                long length = 0;
                int num = -1;
                String boolTest = null;
                Pattern p = Pattern.compile(nullRegex);
                while ((boolTest = in.readLine()) != null && (num = lr.readLine(line)) > 0 && !(in.readLine().isEmpty())) {
                    length += num;
                    splits.add(new FileSplit(fileName, begin, length, new String[]{}));
                }
                begin = length;
            } finally {
                if (lr != null) {
                    lr.close();
                }
            }
        }
        return splits.toArray(new FileSplit[splits.size()]);
    }
}

Date: Fri, 24 Aug 2012 09:54:10 +0200 From: haavard.kongsga...@gmail.com To: user@hadoop.apache.org Hi, maybe you should check out the old Nutch project http://nutch.apache.org/ (Hadoop was developed for Nutch). It's a web crawler and indexer, but its mailing lists hold much info on doc/pdf parsing, which also relates to Hadoop. I have never parsed many docx or doc files, but it should be straightforward. Generally, for text analysis, preprocessing is the KEY! For example, replacing double line breaks \r\n\r\n (or \n\n) is a simple trick. -Håvard
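One way to get paragraph records without hand-rolling the split logic is to keep the stock TextInputFormat machinery and change the record delimiter to a blank line. This is a sketch under an assumption: the textinputformat.record.delimiter property is honoured by LineRecordReader only in sufficiently recent Hadoop versions, so verify it exists in yours. With it, each map() call receives one whole paragraph as its value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ParagraphJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Records end at a blank line instead of a single newline,
        // so each mapper input value is a whole paragraph.
        // Use "\r\n\r\n" if the extracted text has Windows line endings.
        conf.set("textinputformat.record.delimiter", "\n\n");
        Job job = new Job(conf, "paragraph-demo");
        job.setJarByClass(ParagraphJobSetup.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper/reducer/output settings as usual ...
    }
}

The key problems with the ParaInputFormat above are that input splits are meant to be coarse byte ranges handed to map tasks, not one split per record, and that the returned LineRecordReader still breaks records on single newlines, which is why single lines come back.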