Re: hadoop knowledge gaining
On 07/10/11 15:25, Jignesh Patel wrote: Guys, I am able to deploy the first program, word count, using hadoop. I am interested in exploring more about hadoop and HBase and don't know the best way to grasp both of them. I have Hadoop in Action, but it covers the older API. -- Actually, the API covered in the 2nd edition is pretty much the one in widest use. The newer API is better, but is only complete in hadoop 0.21 and later, which aren't yet in wide use. -- I also have the HBase Definitive Guide, which I have not started exploring. -- Think of a problem, get some data, go through the books. Learning more about statistics and data mining is what you really need, more than just the hadoop APIs. -steve
Re: ways to expand hadoop.tmp.dir capacity?
2011/10/9 Harsh J ha...@cloudera.com: Hello Meng, On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao meng...@gmail.com wrote: Currently, we've got defined:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/hadoop-metadata/cache/</value>
</property>
In our experiments with SOLR, the intermediate files are so large that they tend to blow out disk space and fail (and annoyingly leave behind their huge failed attempts). We've had issues with it in the past, but we're having real problems with SOLR if we can't comfortably get more space out of hadoop.tmp.dir somehow. 1) It seems we never set mapred.system.dir to anything special, so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir has ${user.name} in it, which ours doesn't. -- {mapred.system.dir} is an HDFS location, and you shouldn't really be worried about it as much. -- 1b) The doc says mapred.system.dir is the in-HDFS path to shared MapReduce system files. To me, that means there must be a single path for mapred.system.dir, which sort of forces hadoop.tmp.dir to be one path. Otherwise, one might imagine that you could specify multiple paths to store hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct interpretation -- that hadoop.tmp.dir could live on multiple paths/disks if there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir? -- {hadoop.tmp.dir} is indeed reused for {mapred.system.dir}, although the latter is on HDFS, which is confusing, but there should be just one mapred.system.dir, yes. Also, the config {hadoop.tmp.dir} doesn't support more than one path. What you need here is a proper {mapred.local.dir} configuration. -- 2) IIRC, there's a -D switch for supplying config name/value pairs to individual jobs. Does such a switch exist? Googling for single letters is fruitless. If we had a path on our workers with more space (in our case, another hard disk), could we simply pass that path in as hadoop.tmp.dir for our SOLR jobs, without incurring any consistency issues on future jobs that might use the SOLR output on HDFS? -- Only a few parameters of a job are user-configurable. Settings like hadoop.tmp.dir and mapred.local.dir are not override-able by user-set parameters, as they are server-side (static) configurations. -- Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the expanded capacity we're looking for be as easily accomplished as defining mapred.local.dir to span multiple disks? (Setting aside the issue of temp files so big that they could still fill a whole disk.) -- 1. You can set mapred.local.dir independent of hadoop.tmp.dir. 2. mapred.local.dir can have comma-separated values in it, spanning multiple disks. 3. Intermediate outputs may spread across these disks, but a single task shall not consume more than one disk at a time. So if your largest configured disk is 500 GB while the total set of them may be 2 TB, then your intermediate output size can't really exceed 500 GB, because only one disk is consumed by one task -- the multiple disks are for better I/O parallelism between tasks. Know that hadoop.tmp.dir is a convenience property, for quickly starting up dev clusters and such. For a proper configuration, you need to remove the dependency on it (almost nothing uses hadoop.tmp.dir on the server side once the right properties are configured - e.g. dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.) -- Harsh J -- Here is an excellent explanation of how to install Apache Hadoop manually; Lars explains this very well:
http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/ Regards -- Marcos Luis Ortíz Valmaseda Linux Infrastructure Engineer Linux User # 418229 http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186
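For reference, a minimal sketch of the mapred-site.xml entry matching Harsh's description of mapred.local.dir with comma-separated values spanning multiple disks; the /disk1../disk3 mount points are placeholders, not paths from this thread:
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
</property>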
Developing MapReduce
I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn still the best way to develop mapreduce programs in hadoop? Just want to make sure before I go down this path. Or should I just add the hadoop jars to my Eclipse classpath and create my own MapReduce programs? Thanks
Re: Developing MapReduce
When you download hadoop, there is a related plugin in its dist directory (I don't remember the exact name). Go and get it from there. On Oct 10, 2011, at 10:34 AM, Mohit Anchlia wrote: I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn still the best way to develop mapreduce programs in hadoop? Just want to make sure before I go down this path. Or should I just add the hadoop jars to my Eclipse classpath and create my own MapReduce programs? Thanks
How to iterate over a hdfs folder with hadoop
Hi, I'm wondering how I can browse an hdfs folder using the classes in the org.apache.hadoop.fs package. The operation that I'm looking for is 'hadoop dfs -ls'. The standard file system equivalent would be:
File f = new File(outputPath);
if (f.isDirectory()) {
    String files[] = f.list();
    for (String file : files) {
        // Do your logic
    }
}
Thanks in advance, Raimon Bosch.
Re: How to iterate over a hdfs folder with hadoop
FileStatus[] files = fs.listStatus(new Path(path));
for (FileStatus fileStatus : files) {
    // ...do stuff here
}
On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.com wrote: Hi, I'm wondering how I can browse an hdfs folder using the classes in the org.apache.hadoop.fs package. The operation that I'm looking for is 'hadoop dfs -ls'. The standard file system equivalent would be:
File f = new File(outputPath);
if (f.isDirectory()) {
    String files[] = f.list();
    for (String file : files) {
        // Do your logic
    }
}
Thanks in advance, Raimon Bosch. -- Thanks, John C
Re: hadoop input buffer size
I think the post below can give you more info about it. http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/ Nice explanation by Owen here. Regards, Uma - Original Message - From: Yang Xiaoliang yangxiaoliang2...@gmail.com Date: Wednesday, October 5, 2011 4:27 pm Subject: Re: hadoop input buffer size To: common-user@hadoop.apache.org Hi, Hadoop neither reads one line each time, nor fetches dfs.block.size worth of lines into a buffer. Actually, for the TextInputFormat, it reads io.file.buffer.size bytes of text into a buffer each time; this can be seen in the hadoop source file LineReader.java. 2011/10/5 Mark question markq2...@gmail.com Hello, Correct me if I'm wrong, but when a program opens n files at the same time to read from, and starts reading from each file one line at a time, isn't hadoop actually fetching dfs.block.size worth of lines into a buffer, and not actually one line? If this is correct: I set my dfs.block.size = 3MB and each line takes only about 650 bytes, so I would assume the performance for reading 1-4000 lines would be the same, but it isn't! Do you know a way to find the number of lines to be read at once? Thank you, Mark
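For reference, the buffer in question is controlled from core-site.xml; a minimal sketch follows (the 64 KB value is only an illustration, not a recommendation from this thread):
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
</property>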
Custom InputFormat for Multiline Input File Hive/Hadoop
Hi all, Sending this to core-u...@hadoop.apache.org and d...@hive.apache.org. I am trying to process Omniture's data log files with Hadoop/Hive. The file format is tab delimited and, while being pretty simple for the most part, it does allow you to have multiple newlines and tabs within a field, escaped by a backslash (\\n and \\t). As a result I've opted to create my own InputFormat to handle the multiple newlines and convert those tabs to spaces when Hive is going to try to do a split on the tabs. I've found a fairly good reference for doing this using the newer InputFormat API at http://blog.rguha.net/?p=293 but unfortunately my version of Hive (0.7.0) still uses the old InputFormat API. I haven't been able to find many tutorials on writing a custom InputFormat using the older API, so I'm looking to see if I can get some guidance as to what may be wrong with the following two classes: https://gist.github.com/3141e9d27d4e07f5f9ed https://gist.github.com/79fdab227950a0776616 The SELECT statements within hive currently return nothing, and my other variations returned nothing but NULL values. This issue is also available on StackOverflow at http://stackoverflow.com/questions/7692994/custom-inputformat-with-hive. If there's a resource someone can point me to that'd also be great. Many thanks in advance, Mike
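For anyone searching the archives, a minimal sketch of the shape of an old-API (org.apache.hadoop.mapred) InputFormat pair. The class names are hypothetical and the escape handling is a simplification of what Mike's gists attempt; it only illustrates the getRecordReader plus RecordReader structure the old API requires, not a tested solution for Omniture logs:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class EscapedTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new EscapedLineRecordReader(job, (FileSplit) split);
    }
}

class EscapedLineRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader; // reuse Hadoop's plain line reader
    private final LongWritable lineKey = new LongWritable();
    private final Text lineValue = new Text();

    public EscapedLineRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false; // end of split
        }
        // Keep appending physical lines while the current one ends with a
        // backslash, i.e. an escaped newline in the Omniture format.
        StringBuilder record = new StringBuilder(lineValue.toString());
        while (record.length() > 0
                && record.charAt(record.length() - 1) == '\\'
                && lineReader.next(lineKey, lineValue)) {
            record.setLength(record.length() - 1); // drop the escape character
            record.append(' ').append(lineValue.toString());
        }
        key.set(lineKey.get());
        // Replace escaped tabs with spaces so Hive's split on real tabs stays intact.
        value.set(record.toString().replace("\\t", " "));
        return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return lineReader.getPos(); }
    public float getProgress() throws IOException { return lineReader.getProgress(); }
    public void close() throws IOException { lineReader.close(); }
}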
Re: How to iterate over a hdfs folder with hadoop
Yes, the FileStatus class would be the equivalent of list. FileStatus has the APIs isDir and getPath; both of these should satisfy your further usage. :-) I think one small difference would be that FileStatus will ensure sorted order. Regards, Uma - Original Message - From: John Conwell j...@iamjohn.me Date: Monday, October 10, 2011 8:40 pm Subject: Re: How to iterate over a hdfs folder with hadoop To: common-user@hadoop.apache.org
FileStatus[] files = fs.listStatus(new Path(path));
for (FileStatus fileStatus : files) {
    // ...do stuff here
}
On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.com wrote: Hi, I'm wondering how I can browse an hdfs folder using the classes in the org.apache.hadoop.fs package. The operation that I'm looking for is 'hadoop dfs -ls'. The standard file system equivalent would be:
File f = new File(outputPath);
if (f.isDirectory()) {
    String files[] = f.list();
    for (String file : files) {
        // Do your logic
    }
}
Thanks in advance, Raimon Bosch. -- Thanks, John C
Re: How to iterate over a hdfs folder with hadoop
Thanks John! Here is the complete solution:
Configuration jc = new Configuration();
Object files[] = null;
List<String> files_in_hdfs = new ArrayList<String>();
FileSystem fs = FileSystem.get(jc);
FileStatus[] file_status = fs.listStatus(new Path(outputPath));
for (FileStatus fileStatus : file_status) {
    files_in_hdfs.add(fileStatus.getPath().getName());
}
files = files_in_hdfs.toArray();
2011/10/10 John Conwell j...@iamjohn.me
FileStatus[] files = fs.listStatus(new Path(path));
for (FileStatus fileStatus : files) {
    // ...do stuff here
}
On Mon, Oct 10, 2011 at 8:03 AM, Raimon Bosch raimon.bo...@gmail.com wrote: Hi, I'm wondering how I can browse an hdfs folder using the classes in the org.apache.hadoop.fs package. The operation that I'm looking for is 'hadoop dfs -ls'. The standard file system equivalent would be:
File f = new File(outputPath);
if (f.isDirectory()) {
    String files[] = f.list();
    for (String file : files) {
        // Do your logic
    }
}
Thanks in advance, Raimon Bosch. -- Thanks, John C
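Since Uma notes earlier in the thread that FileStatus also exposes isDir and getPath, a sketch of the closer equivalent of the File#isDirectory check from the original question could look like this (outputPath is assumed to already hold the directory path, as in Raimon's snippet):
FileSystem fs = FileSystem.get(new Configuration());
for (FileStatus status : fs.listStatus(new Path(outputPath))) {
    if (status.isDir()) {
        // a sub-directory: recurse or skip it here
    } else {
        System.out.println(status.getPath().getName());
    }
}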
Re: hdfs directory location
Jignesh, you are creating a dir in hdfs with that command. The dir won't be in your local file system but in hdfs. Issue a command like: hadoop fs -ls /user/hadoop-user/citation/ You can see the dir you created in hdfs. If you want to create a dir on local unix, use a simple linux command: mkdir /user/hadoop-user/citation/input --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: hdfs directory location Sent: Oct 10, 2011 23:45 I am using the following command to create a directory in a Unix (i.e. mac) system: bin/hadoop fs -mkdir /user/hadoop-user/citation/input While it creates the directory I need, I am struggling to figure out the exact location of the folder on my local box. Regards Bejoy K S
Re: hdfs directory location
Bejoy, if I create a directory in a unix box, then how can I link it with the HDFS directory structure? -Jignesh On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote: Jignesh, you are creating a dir in hdfs with that command. The dir won't be in your local file system but in hdfs. Issue a command like: hadoop fs -ls /user/hadoop-user/citation/ You can see the dir you created in hdfs. If you want to create a dir on local unix, use a simple linux command: mkdir /user/hadoop-user/citation/input --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: hdfs directory location Sent: Oct 10, 2011 23:45 I am using the following command to create a directory in a Unix (i.e. mac) system: bin/hadoop fs -mkdir /user/hadoop-user/citation/input While it creates the directory I need, I am struggling to figure out the exact location of the folder on my local box. Regards Bejoy K S
Re: Developing MapReduce
Hi Mohit, I'm really not sure how many map reduce developers use the map reduce eclipse plugin; AFAIK the majority don't. As Jignesh mentioned, you can get it from the hadoop distribution folder as soon as you unzip the same. My suggested approach would be: if you are on Windows OS, you can test run your map reduce code in two ways. - Set up cygwin on Windows, atop which you can set up hadoop and related tools. It is a little messy. - Use a linux VM image. I'd recommend the Cloudera test VM, as it comes pre-configured with the whole hadoop technology stack. It really shields the developer from the hassles of installing the hadoop tools and getting them up and running. On Linux or Mac you can just add the hadoop jars to your class path and run the driver class just as you would run a java class within eclipse (here hadoop would be in standalone mode). Hope it helps!... --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Re: Developing MapReduce Sent: Oct 10, 2011 20:31 When you download hadoop, there is a related plugin in its dist directory (I don't remember the exact name). Go and get it from there. On Oct 10, 2011, at 10:34 AM, Mohit Anchlia wrote: I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn still the best way to develop mapreduce programs in hadoop? Just want to make sure before I go down this path. Or should I just add the hadoop jars to my Eclipse classpath and create my own MapReduce programs? Thanks Regards Bejoy K S
Re: hdfs directory location
Jignesh, sorry, I didn't get your query 'how can I link it with the HDFS directory structure?' You mean putting your unix dir contents into hdfs? If so, use: hadoop fs -copyFromLocal <src> <destn> --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org To: bejoy.had...@gmail.com Subject: Re: hdfs directory location Sent: Oct 11, 2011 01:18 Bejoy, if I create a directory in a unix box, then how can I link it with the HDFS directory structure? -Jignesh On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote: Jignesh, you are creating a dir in hdfs with that command. The dir won't be in your local file system but in hdfs. Issue a command like: hadoop fs -ls /user/hadoop-user/citation/ You can see the dir you created in hdfs. If you want to create a dir on local unix, use a simple linux command: mkdir /user/hadoop-user/citation/input --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: hdfs directory location Sent: Oct 10, 2011 23:45 I am using the following command to create a directory in a Unix (i.e. mac) system: bin/hadoop fs -mkdir /user/hadoop-user/citation/input While it creates the directory I need, I am struggling to figure out the exact location of the folder on my local box. Regards Bejoy K S Regards Bejoy K S
Re: hdfs directory location
Bejoy, copyToLocal makes sense; it worked. But I am still wondering: if HDFS has a directory created on the local box, it should exist physically somewhere, but I couldn't locate it. Is the HDFS directory structure a virtual structure that doesn't exist physically? -Jignesh On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote: Jignesh, sorry, I didn't get your query 'how can I link it with the HDFS directory structure?' You mean putting your unix dir contents into hdfs? If so, use: hadoop fs -copyFromLocal <src> <destn> --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org To: bejoy.had...@gmail.com Subject: Re: hdfs directory location Sent: Oct 11, 2011 01:18 Bejoy, if I create a directory in a unix box, then how can I link it with the HDFS directory structure? -Jignesh On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote: Jignesh, you are creating a dir in hdfs with that command. The dir won't be in your local file system but in hdfs. Issue a command like: hadoop fs -ls /user/hadoop-user/citation/ You can see the dir you created in hdfs. If you want to create a dir on local unix, use a simple linux command: mkdir /user/hadoop-user/citation/input --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: hdfs directory location Sent: Oct 10, 2011 23:45 I am using the following command to create a directory in a Unix (i.e. mac) system: bin/hadoop fs -mkdir /user/hadoop-user/citation/input While it creates the directory I need, I am struggling to figure out the exact location of the folder on my local box. Regards Bejoy K S Regards Bejoy K S
Re: hdfs directory location
Jignesh, you are absolutely right. In hdfs the directory doesn't exist physically; it is just metadata on the name node. I don't think such a dir structure would be there in the name node's local file system either, as it is just metadata, and hence no physical dir structure is created. Regards Bejoy K S -Original Message- From: Jignesh Patel jign...@websoft.com Date: Mon, 10 Oct 2011 16:02:53 To: bejoy.had...@gmail.com Cc: common-user@hadoop.apache.org Subject: Re: hdfs directory location Bejoy, copyToLocal makes sense; it worked. But I am still wondering: if HDFS has a directory created on the local box, it should exist physically somewhere, but I couldn't locate it. Is the HDFS directory structure a virtual structure that doesn't exist physically? -Jignesh On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote: Jignesh, sorry, I didn't get your query 'how can I link it with the HDFS directory structure?' You mean putting your unix dir contents into hdfs? If so, use: hadoop fs -copyFromLocal <src> <destn> --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org To: bejoy.had...@gmail.com Subject: Re: hdfs directory location Sent: Oct 11, 2011 01:18 Bejoy, if I create a directory in a unix box, then how can I link it with the HDFS directory structure? -Jignesh On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote: Jignesh, you are creating a dir in hdfs with that command. The dir won't be in your local file system but in hdfs. Issue a command like: hadoop fs -ls /user/hadoop-user/citation/ You can see the dir you created in hdfs. If you want to create a dir on local unix, use a simple linux command: mkdir /user/hadoop-user/citation/input --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: hdfs directory location Sent: Oct 10, 2011 23:45 I am using the following command to create a directory in a Unix (i.e. mac) system: bin/hadoop fs -mkdir /user/hadoop-user/citation/input While it creates the directory I need, I am struggling to figure out the exact location of the folder on my local box. Regards Bejoy K S Regards Bejoy K S
Re: hdfs directory location
Hi, I guess what you want is to see your HDFS directory through normal file system commands like ls etc., or by browsing your directory structure. This is not possible, as neither your commands nor Finder (on Mac) have the ability to read / write HDFS, so they cannot show HDFS directories. Hence, the HDFS directory structure must be viewed using the HDFS tools and not the operating system FS commands. Hope this helps! Warm regards Arko On Mon, Oct 10, 2011 at 3:08 PM, bejoy.had...@gmail.com wrote: Jignesh, you are absolutely right. In hdfs the directory doesn't exist physically; it is just metadata on the name node. I don't think such a dir structure would be there in the name node's local file system either, as it is just metadata, and hence no physical dir structure is created. Regards Bejoy K S -Original Message- From: Jignesh Patel jign...@websoft.com Date: Mon, 10 Oct 2011 16:02:53 To: bejoy.had...@gmail.com Cc: common-user@hadoop.apache.org Subject: Re: hdfs directory location Bejoy, copyToLocal makes sense; it worked. But I am still wondering: if HDFS has a directory created on the local box, it should exist physically somewhere, but I couldn't locate it. Is the HDFS directory structure a virtual structure that doesn't exist physically? -Jignesh On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote: Jignesh, sorry, I didn't get your query 'how can I link it with the HDFS directory structure?' You mean putting your unix dir contents into hdfs? If so, use: hadoop fs -copyFromLocal <src> <destn> --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org To: bejoy.had...@gmail.com Subject: Re: hdfs directory location Sent: Oct 11, 2011 01:18 Bejoy, if I create a directory in a unix box, then how can I link it with the HDFS directory structure? -Jignesh On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote: Jignesh, you are creating a dir in hdfs with that command. The dir won't be in your local file system but in hdfs. Issue a command like: hadoop fs -ls /user/hadoop-user/citation/ You can see the dir you created in hdfs. If you want to create a dir on local unix, use a simple linux command: mkdir /user/hadoop-user/citation/input --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: hdfs directory location Sent: Oct 10, 2011 23:45 I am using the following command to create a directory in a Unix (i.e. mac) system: bin/hadoop fs -mkdir /user/hadoop-user/citation/input While it creates the directory I need, I am struggling to figure out the exact location of the folder on my local box. Regards Bejoy K S Regards Bejoy K S
ssh setup stop working
I have created a private key setup on my local box, and till this weekend everything was working great. But when I tried jps today, I found none of the services running, and when I tried to ssh localhost it started asking for a password. When I tried ssh-keygen -t rsa, the message appeared: /Users/hadoop-user/.ssh/id_rsa already exists. What went wrong? Do I need to recreate the key? -Jignesh
Re: ssh setup stop working
Nope, it still works. I have a mac system. On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote: Has your user account's password expired? Best regards, IO On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com wrote: I have created a private key setup on my local box, and till this weekend everything was working great. But when I tried jps today, I found none of the services running, and when I tried to ssh localhost it started asking for a password. When I tried ssh-keygen -t rsa, the message appeared: /Users/hadoop-user/.ssh/id_rsa already exists. What went wrong? Do I need to recreate the key? -Jignesh
Re: Secondary namenode fsimage concept
hey patrick, i wanted to configure my cluster to write namenode metadata to multiple directories as well:
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/var/name,/mnt/hadoop/var/name</value>
</property>
in my case, /hadoop/var/name is a local directory and /mnt/hadoop/var/name is an NFS volume. i took down the cluster first, then copied over the files from /hadoop/var/name to /mnt/hadoop/var/name, and then tried to start up the cluster. but the cluster won't start up properly... here's the namenode log: http://pastebin.com/gmu0B7yd any ideas why it wouldn't start up? thx On Thu, Oct 6, 2011 at 6:58 PM, patrick sang silvianhad...@gmail.com wrote: I would say have your namenode write metadata to the local fs (where your secondary namenode will pull files from) and to the NFS mount:
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/name,/hadoop/nfs_server_name</value>
</property>
my 0.02$ P On Thu, Oct 6, 2011 at 12:04 AM, shanmuganathan.r shanmuganatha...@zohocorp.com wrote: Hi Kai, There is no data stored in the secondary namenode related to the Hadoop cluster. Am I correct? If that is correct: if we run the secondary namenode on a separate machine, then the fetching, merging and transferring time increases when the cluster has a lot of data in the namenode fsimage file. If a failover occurs at that point, how can we recover the nearly one hour of changes to the HDFS files? (the default checkpoint interval is one hour) Thanks R.Shanmuganathan On Thu, 06 Oct 2011 12:20:28 +0530 Kai Voigt <k...@123.org> wrote: Hi, the secondary namenode only fetches the two files when a checkpoint is needed. Kai On 06.10.2011 at 08:45, shanmuganathan.r wrote: > Hi Kai, > In the second part I meant: does the secondary namenode also contain the FSImage file, or are the two files (FSImage and EditLog) transferred from the namenode at checkpoint time? > Thanks > Shanmuganathan On Thu, 06 Oct 2011 11:37:50 +0530 Kai Voigt <k...@123.org> wrote: > Hi, > you're correct when saying the namenode hosts the fsimage file and the edits log file. > The fsimage file contains a snapshot of the HDFS metadata (a filename to blocks list mapping). Whenever there is a change to HDFS, it will be appended to the edits file. Think of it as a database transaction log, where changes will not be applied to the datafile, but appended to a log. > To prevent the edits file growing infinitely, the secondary namenode periodically pulls these two files, and the namenode starts writing changes to a new edits file. Then, the secondary namenode merges the changes from the edits file with the old snapshot from the fsimage file and creates an updated fsimage file. This updated fsimage file is then copied to the namenode. > Then, the entire cycle starts again. To answer your question: the namenode has both files, even if the secondary namenode is running on a different machine. > Kai On 06.10.2011 at 07:57, shanmuganathan.r wrote: >> Hi All, >> I have a doubt about the hadoop secondary namenode concept. Please correct me if the following statements are wrong. >> The namenode hosts the fsimage and edit log files. The secondary namenode hosts the fsimage file only. At checkpoint time the edit log file is transferred to the secondary namenode and the two files are merged; then the updated fsimage file is transferred to the namenode. Is that correct? >> If we run the secondary namenode on a separate machine, then both machines contain the fsimage file, and the namenode alone contains the edit log file. Is that true? >> Thanks R.Shanmuganathan -- Kai Voigt k...@123.org
Re: ssh setup stop working
In fact I have created the passphraseless key again and it still asks me for a password. On Oct 10, 2011, at 4:51 PM, Jignesh Patel wrote: Nope, it still works. I have a mac system. On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote: Has your user account's password expired? Best regards, IO On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com wrote: I have created a private key setup on my local box, and till this weekend everything was working great. But when I tried jps today, I found none of the services running, and when I tried to ssh localhost it started asking for a password. When I tried ssh-keygen -t rsa, the message appeared: /Users/hadoop-user/.ssh/id_rsa already exists. What went wrong? Do I need to recreate the key? -Jignesh
Re: ssh setup stop working
The key requires specific permissions: 700 for the .ssh directory and 600 for the authorized_keys file; anything more open and it won't work. However, you said it worked before. I usually experience this problem when the password ages; the key then doesn't work until the password is reset. Anyhow, it might be a little different in your case. Best regards, On Mon, Oct 10, 2011 at 4:10 PM, Jignesh Patel jign...@websoft.com wrote: In fact I have created the passphraseless key again and it still asks me for a password. On Oct 10, 2011, at 4:51 PM, Jignesh Patel wrote: Nope, it still works. I have a mac system. On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote: Has your user account's password expired? Best regards, IO On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com wrote: I have created a private key setup on my local box, and till this weekend everything was working great. But when I tried jps today, I found none of the services running, and when I tried to ssh localhost it started asking for a password. When I tried ssh-keygen -t rsa, the message appeared: /Users/hadoop-user/.ssh/id_rsa already exists. What went wrong? Do I need to recreate the key? -Jignesh
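For reference, a quick sketch of the permission fix Ilker describes, assuming the usual ~/.ssh layout on the local box:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys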
Subscribe to list
Hi, I would like to subscribe to this list. Many thanks :)
problem in running program
I'm trying to run the attached program. My input directory structure is /user/hadoop-user/input/cite65_77.txt. But it doesn't do anything: it doesn't read the file and doesn't create the output directory.
Re: ssh setup stop working
You are right, I had a problem with the access rights. Now it works. On Oct 10, 2011, at 5:36 PM, Ilker Ozkaymak wrote: The key requires specific permissions: 700 for the .ssh directory and 600 for the authorized_keys file; anything more open and it won't work. However, you said it worked before. I usually experience this problem when the password ages; the key then doesn't work until the password is reset. Anyhow, it might be a little different in your case. Best regards, On Mon, Oct 10, 2011 at 4:10 PM, Jignesh Patel jign...@websoft.com wrote: In fact I have created the passphraseless key again and it still asks me for a password. On Oct 10, 2011, at 4:51 PM, Jignesh Patel wrote: Nope, it still works. I have a mac system. On Oct 10, 2011, at 4:40 PM, Ilker Ozkaymak wrote: Has your user account's password expired? Best regards, IO On Mon, Oct 10, 2011 at 3:35 PM, Jignesh Patel jign...@websoft.com wrote: I have created a private key setup on my local box, and till this weekend everything was working great. But when I tried jps today, I found none of the services running, and when I tried to ssh localhost it started asking for a password. When I tried ssh-keygen -t rsa, the message appeared: /Users/hadoop-user/.ssh/id_rsa already exists. What went wrong? Do I need to recreate the key? -Jignesh
Re: ways to expand hadoop.tmp.dir capacity?
So the only way we can expand to multiple mapred.local.dir paths is to configure our site.xml and restart the DataNode? On Mon, Oct 10, 2011 at 9:36 AM, Marcos Luis Ortiz Valmaseda marcosluis2...@googlemail.com wrote: 2011/10/9 Harsh J ha...@cloudera.com: Hello Meng, On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao meng...@gmail.com wrote: Currently, we've got defined:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/hadoop-metadata/cache/</value>
</property>
In our experiments with SOLR, the intermediate files are so large that they tend to blow out disk space and fail (and annoyingly leave behind their huge failed attempts). We've had issues with it in the past, but we're having real problems with SOLR if we can't comfortably get more space out of hadoop.tmp.dir somehow. 1) It seems we never set mapred.system.dir to anything special, so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir has ${user.name} in it, which ours doesn't. -- {mapred.system.dir} is an HDFS location, and you shouldn't really be worried about it as much. -- 1b) The doc says mapred.system.dir is the in-HDFS path to shared MapReduce system files. To me, that means there must be a single path for mapred.system.dir, which sort of forces hadoop.tmp.dir to be one path. Otherwise, one might imagine that you could specify multiple paths to store hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct interpretation -- that hadoop.tmp.dir could live on multiple paths/disks if there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir? -- {hadoop.tmp.dir} is indeed reused for {mapred.system.dir}, although the latter is on HDFS, which is confusing, but there should be just one mapred.system.dir, yes. Also, the config {hadoop.tmp.dir} doesn't support more than one path. What you need here is a proper {mapred.local.dir} configuration. -- 2) IIRC, there's a -D switch for supplying config name/value pairs to individual jobs. Does such a switch exist? Googling for single letters is fruitless. If we had a path on our workers with more space (in our case, another hard disk), could we simply pass that path in as hadoop.tmp.dir for our SOLR jobs, without incurring any consistency issues on future jobs that might use the SOLR output on HDFS? -- Only a few parameters of a job are user-configurable. Settings like hadoop.tmp.dir and mapred.local.dir are not override-able by user-set parameters, as they are server-side (static) configurations. -- Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the expanded capacity we're looking for be as easily accomplished as defining mapred.local.dir to span multiple disks? (Setting aside the issue of temp files so big that they could still fill a whole disk.) -- 1. You can set mapred.local.dir independent of hadoop.tmp.dir. 2. mapred.local.dir can have comma-separated values in it, spanning multiple disks. 3. Intermediate outputs may spread across these disks, but a single task shall not consume more than one disk at a time. So if your largest configured disk is 500 GB while the total set of them may be 2 TB, then your intermediate output size can't really exceed 500 GB, because only one disk is consumed by one task -- the multiple disks are for better I/O parallelism between tasks. Know that hadoop.tmp.dir is a convenience property, for quickly starting up dev clusters and such. For a proper configuration, you need to remove the dependency on it (almost nothing uses hadoop.tmp.dir on the server side once the right properties are configured - e.g. dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.) -- Harsh J -- Here is an excellent explanation of how to install Apache Hadoop manually; Lars explains this very well: http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/ Regards -- Marcos Luis Ortíz Valmaseda Linux Infrastructure Engineer Linux User # 418229 http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186
Re: hdfs directory location
Jignesh, it can be done. Use the fuse-dfs feature of HDFS to have your DFS as a 'physical' mount point on Linux. Instructions may be found here: http://wiki.apache.org/hadoop/MountableHDFS and in other resources across the web (search around for fuse hdfs). On Tue, Oct 11, 2011 at 1:32 AM, Jignesh Patel jign...@websoft.com wrote: Bejoy, copyToLocal makes sense; it worked. But I am still wondering: if HDFS has a directory created on the local box, it should exist physically somewhere, but I couldn't locate it. Is the HDFS directory structure a virtual structure that doesn't exist physically? -Jignesh On Oct 10, 2011, at 3:53 PM, bejoy.had...@gmail.com wrote: Jignesh, sorry, I didn't get your query 'how can I link it with the HDFS directory structure?' You mean putting your unix dir contents into hdfs? If so, use: hadoop fs -copyFromLocal <src> <destn> --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org To: bejoy.had...@gmail.com Subject: Re: hdfs directory location Sent: Oct 11, 2011 01:18 Bejoy, if I create a directory in a unix box, then how can I link it with the HDFS directory structure? -Jignesh On Oct 10, 2011, at 2:59 PM, bejoy.had...@gmail.com wrote: Jignesh, you are creating a dir in hdfs with that command. The dir won't be in your local file system but in hdfs. Issue a command like: hadoop fs -ls /user/hadoop-user/citation/ You can see the dir you created in hdfs. If you want to create a dir on local unix, use a simple linux command: mkdir /user/hadoop-user/citation/input --Original Message-- From: Jignesh Patel To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: hdfs directory location Sent: Oct 10, 2011 23:45 I am using the following command to create a directory in a Unix (i.e. mac) system: bin/hadoop fs -mkdir /user/hadoop-user/citation/input While it creates the directory I need, I am struggling to figure out the exact location of the folder on my local box. Regards Bejoy K S Regards Bejoy K S -- Harsh J
Re: problem in running program
Jignesh, please do not attach files to the mailing list. They are stripped away and the community will never receive them. Instead, if it's small enough, paste it along in the mail, or paste it at a service like pastebin.com and pass along the public links. On Tue, Oct 11, 2011 at 3:35 AM, Jignesh Patel jign...@websoft.com wrote: I'm trying to run the attached program. My input directory structure is /user/hadoop-user/input/cite65_77.txt. But it doesn't do anything: it doesn't read the file and doesn't create the output directory. -- Harsh J
Re: ways to expand hadoop.tmp.dir capacity?
Meng, yes, configure mapred-site.xml (mapred.local.dir) to add the property and roll-restart your TaskTrackers. If you'd like to expand your DataNodes to multiple disks as well (it helps HDFS I/O greatly), do the same with hdfs-site.xml (dfs.data.dir) and perform the same rolling restart of the DataNodes. Ensure that for each service, the directories you create are owned by the same user as the one running the process; this will help avoid permission nightmares. On Tue, Oct 11, 2011 at 3:58 AM, Meng Mao meng...@gmail.com wrote: So the only way we can expand to multiple mapred.local.dir paths is to configure our site.xml and restart the DataNode? On Mon, Oct 10, 2011 at 9:36 AM, Marcos Luis Ortiz Valmaseda marcosluis2...@googlemail.com wrote: 2011/10/9 Harsh J ha...@cloudera.com: Hello Meng, On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao meng...@gmail.com wrote: Currently, we've got defined:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/hadoop-metadata/cache/</value>
</property>
In our experiments with SOLR, the intermediate files are so large that they tend to blow out disk space and fail (and annoyingly leave behind their huge failed attempts). We've had issues with it in the past, but we're having real problems with SOLR if we can't comfortably get more space out of hadoop.tmp.dir somehow. 1) It seems we never set mapred.system.dir to anything special, so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir has ${user.name} in it, which ours doesn't. -- {mapred.system.dir} is an HDFS location, and you shouldn't really be worried about it as much. -- 1b) The doc says mapred.system.dir is the in-HDFS path to shared MapReduce system files. To me, that means there must be a single path for mapred.system.dir, which sort of forces hadoop.tmp.dir to be one path. Otherwise, one might imagine that you could specify multiple paths to store hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct interpretation -- that hadoop.tmp.dir could live on multiple paths/disks if there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir? -- {hadoop.tmp.dir} is indeed reused for {mapred.system.dir}, although the latter is on HDFS, which is confusing, but there should be just one mapred.system.dir, yes. Also, the config {hadoop.tmp.dir} doesn't support more than one path. What you need here is a proper {mapred.local.dir} configuration. -- 2) IIRC, there's a -D switch for supplying config name/value pairs to individual jobs. Does such a switch exist? Googling for single letters is fruitless. If we had a path on our workers with more space (in our case, another hard disk), could we simply pass that path in as hadoop.tmp.dir for our SOLR jobs, without incurring any consistency issues on future jobs that might use the SOLR output on HDFS? -- Only a few parameters of a job are user-configurable. Settings like hadoop.tmp.dir and mapred.local.dir are not override-able by user-set parameters, as they are server-side (static) configurations. -- Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the expanded capacity we're looking for be as easily accomplished as defining mapred.local.dir to span multiple disks? (Setting aside the issue of temp files so big that they could still fill a whole disk.) -- 1. You can set mapred.local.dir independent of hadoop.tmp.dir. 2. mapred.local.dir can have comma-separated values in it, spanning multiple disks. 3. Intermediate outputs may spread across these disks, but a single task shall not consume more than one disk at a time. So if your largest configured disk is 500 GB while the total set of them may be 2 TB, then your intermediate output size can't really exceed 500 GB, because only one disk is consumed by one task -- the multiple disks are for better I/O parallelism between tasks. Know that hadoop.tmp.dir is a convenience property, for quickly starting up dev clusters and such. For a proper configuration, you need to remove the dependency on it (almost nothing uses hadoop.tmp.dir on the server side once the right properties are configured - e.g. dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.) -- Harsh J -- Here is an excellent explanation of how to install Apache Hadoop manually; Lars explains this very well: http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/ Regards -- Marcos Luis Ortíz Valmaseda Linux Infrastructure Engineer Linux User # 418229 http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 -- Harsh J
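A sketch of the hdfs-site.xml counterpart Harsh mentions, again with placeholder mount points rather than paths from this thread:
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data</value>
</property>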
Re: Secondary namenode fsimage concept
Hi, It looks to me that the problem is with your NFS: it is not supporting locks. Which version of NFS are you using? Please check your NFS locking support by writing a simple program that locks a file. I think NFSv4 supports locking (I have not tried it). From http://nfs.sourceforge.net/ A6 (What are the main new features in version 4 of the NFS protocol?): NFS Versions 2 and 3 are stateless protocols, but NFS Version 4 introduces state. An NFS Version 4 client uses state to notify an NFS Version 4 server of its intentions on a file: locking, reading, writing, and so on. An NFS Version 4 server can return information to a client about what other clients have intentions on a file, to allow a client to cache file data more aggressively via delegation. To help keep state consistent, more sophisticated client and server reboot recovery mechanisms are built in to the NFS Version 4 protocol. NFS Version 4 introduces support for byte-range locking and share reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must maintain contact with an NFS Version 4 server to continue extending its open and lock leases. Regards, Uma - Original Message - From: Shouguo Li the1plum...@gmail.com Date: Tuesday, October 11, 2011 2:31 am Subject: Re: Secondary namenode fsimage concept To: common-user@hadoop.apache.org hey patrick, i wanted to configure my cluster to write namenode metadata to multiple directories as well:
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/var/name,/mnt/hadoop/var/name</value>
</property>
in my case, /hadoop/var/name is a local directory and /mnt/hadoop/var/name is an NFS volume. i took down the cluster first, then copied over the files from /hadoop/var/name to /mnt/hadoop/var/name, and then tried to start up the cluster. but the cluster won't start up properly... here's the namenode log: http://pastebin.com/gmu0B7yd any ideas why it wouldn't start up? thx On Thu, Oct 6, 2011 at 6:58 PM, patrick sang silvianhad...@gmail.com wrote: I would say have your namenode write metadata to the local fs (where your secondary namenode will pull files from) and to the NFS mount:
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/name,/hadoop/nfs_server_name</value>
</property>
my 0.02$ P On Thu, Oct 6, 2011 at 12:04 AM, shanmuganathan.r shanmuganatha...@zohocorp.com wrote: Hi Kai, There is no data stored in the secondary namenode related to the Hadoop cluster. Am I correct? If that is correct: if we run the secondary namenode on a separate machine, then the fetching, merging and transferring time increases when the cluster has a lot of data in the namenode fsimage file. If a failover occurs at that point, how can we recover the nearly one hour of changes to the HDFS files? (the default checkpoint interval is one hour) Thanks R.Shanmuganathan On Thu, 06 Oct 2011 12:20:28 +0530 Kai Voigt <k...@123.org> wrote: Hi, the secondary namenode only fetches the two files when a checkpointing is needed. Kai On 06.10.2011 at 08:45, shanmuganathan.r wrote: > Hi Kai, > In the second part I meant: does the secondary namenode also contain the FSImage file, or are the two files (FSImage and EditLog) transferred from the namenode at checkpoint time? > Thanks > Shanmuganathan On Thu, 06 Oct 2011 11:37:50 +0530 Kai Voigt <k...@123.org> wrote: > Hi, > you're correct when saying the namenode hosts the fsimage file and the edits log file. > The fsimage file contains a snapshot of the HDFS metadata (a filename to blocks list mapping). Whenever there is a change to HDFS, it will be appended to the edits file. Think of it as a database transaction log, where changes will not be applied to the datafile, but appended to a log. > To prevent the edits file growing infinitely, the secondary namenode periodically pulls these two files, and the namenode starts writing changes to a new edits file. Then, the secondary namenode merges the changes from the edits file with the old snapshot from the fsimage file and creates an updated fsimage file. This updated fsimage file is then copied to the namenode. > Then, the entire cycle starts again. To answer your question: the namenode has both files, even if the secondary namenode is running on a different machine. > Kai On 06.10.2011 at 07:57, shanmuganathan.r wrote: >> Hi All, >> I have a doubt about the hadoop secondary namenode concept. Please correct me if the following statements are wrong. >> The namenode hosts the fsimage and edit log files.
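Uma's suggestion to check NFS locking support with a simple file-locking program could look like the following sketch; the class name is hypothetical, and the path argument would point at a file on the NFS mount under test (e.g. somewhere under /mnt/hadoop/var/name):
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class NfsLockCheck {
    public static void main(String[] args) throws Exception {
        // Open (creating if needed) a test file on the NFS mount.
        RandomAccessFile raf = new RandomAccessFile(new File(args[0]), "rw");
        // On a mount without locking support this typically fails with an IOException.
        FileLock lock = raf.getChannel().tryLock();
        if (lock != null) {
            System.out.println("lock acquired - NFS locking appears to work");
            lock.release();
        } else {
            System.out.println("could not acquire lock - another process holds it");
        }
        raf.close();
    }
}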
Re: Secondary namenode fsimage concept
Generally you just gotta ensure that your rpc.lockd service is up and running on both ends, to allow for locking over NFS. On Tue, Oct 11, 2011 at 8:16 AM, Uma Maheswara Rao G 72686 mahesw...@huawei.com wrote: Hi, It looks to me that the problem is with your NFS: it is not supporting locks. Which version of NFS are you using? Please check your NFS locking support by writing a simple program that locks a file. I think NFSv4 supports locking (I have not tried it). From http://nfs.sourceforge.net/ A6 (What are the main new features in version 4 of the NFS protocol?): NFS Versions 2 and 3 are stateless protocols, but NFS Version 4 introduces state. An NFS Version 4 client uses state to notify an NFS Version 4 server of its intentions on a file: locking, reading, writing, and so on. An NFS Version 4 server can return information to a client about what other clients have intentions on a file, to allow a client to cache file data more aggressively via delegation. To help keep state consistent, more sophisticated client and server reboot recovery mechanisms are built in to the NFS Version 4 protocol. NFS Version 4 introduces support for byte-range locking and share reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must maintain contact with an NFS Version 4 server to continue extending its open and lock leases. Regards, Uma - Original Message - From: Shouguo Li the1plum...@gmail.com Date: Tuesday, October 11, 2011 2:31 am Subject: Re: Secondary namenode fsimage concept To: common-user@hadoop.apache.org hey patrick, i wanted to configure my cluster to write namenode metadata to multiple directories as well:
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/var/name,/mnt/hadoop/var/name</value>
</property>
in my case, /hadoop/var/name is a local directory and /mnt/hadoop/var/name is an NFS volume. i took down the cluster first, then copied over the files from /hadoop/var/name to /mnt/hadoop/var/name, and then tried to start up the cluster. but the cluster won't start up properly... here's the namenode log: http://pastebin.com/gmu0B7yd any ideas why it wouldn't start up? thx On Thu, Oct 6, 2011 at 6:58 PM, patrick sang silvianhad...@gmail.com wrote: I would say have your namenode write metadata to the local fs (where your secondary namenode will pull files from) and to the NFS mount:
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/name,/hadoop/nfs_server_name</value>
</property>
my 0.02$ P On Thu, Oct 6, 2011 at 12:04 AM, shanmuganathan.r shanmuganatha...@zohocorp.com wrote: Hi Kai, There is no data stored in the secondary namenode related to the Hadoop cluster. Am I correct? If that is correct: if we run the secondary namenode on a separate machine, then the fetching, merging and transferring time increases when the cluster has a lot of data in the namenode fsimage file. If a failover occurs at that point, how can we recover the nearly one hour of changes to the HDFS files? (the default checkpoint interval is one hour) Thanks R.Shanmuganathan On Thu, 06 Oct 2011 12:20:28 +0530 Kai Voigt <k...@123.org> wrote: Hi, the secondary namenode only fetches the two files when a checkpointing is needed. Kai On 06.10.2011 at 08:45, shanmuganathan.r wrote: > Hi Kai, > In the second part I meant: does the secondary namenode also contain the FSImage file, or are the two files (FSImage and EditLog) transferred from the namenode at checkpoint time? > Thanks > Shanmuganathan On Thu, 06 Oct 2011 11:37:50 +0530 Kai Voigt <k...@123.org> wrote: > Hi, > you're correct when saying the namenode hosts the fsimage file and the edits log file. > The fsimage file contains a snapshot of the HDFS metadata (a filename to blocks list mapping). Whenever there is a change to HDFS, it will be appended to the edits file. Think of it as a database transaction log, where changes will not be applied to the datafile, but appended to a log. > To prevent the edits file growing infinitely, the secondary namenode periodically pulls these two files, and the namenode starts writing changes to a new edits file. Then, the secondary namenode merges the changes from the edits file with the old snapshot from the fsimage file and creates an updated fsimage file. This updated fsimage file is then copied to the namenode. > Then, the entire cycle starts again. To answer your question: the namenode has both files, even if the secondary namenode is running on a different machine. > Kai On 06.10.2011 at 07:57, shanmuganathan.r wrote: >> Hi All, >> I have a
Re: Is it possible to run multiple MapReduce against the same HDFS?
Thanks, Robert. I will look into hod. When the MapReduce framework accesses data stored in HDFS, which account is used: the account the MapReduce daemons (e.g. the job tracker) run as, or the account of the user who submits the job? If the HDFS and MapReduce clusters are run under different accounts, will the MapReduce cluster be able to access HDFS directories and files (if authentication in HDFS is enabled)? Thanks! Gerald On Mon, Oct 10, 2011 at 12:36 PM, Robert Evans ev...@yahoo-inc.com wrote: It should be possible to use multiple map/reduce clusters sharing the same HDFS; you can look at hod, where it launches a JT on demand. The only chance of collision that I can think of would be if by some odd chance both Job Trackers were started at exactly the same millisecond. The JT uses the time it was started as part of the job id for all jobs. Those job ids are assumed to be unique and are used to create files/directories in HDFS to store data for that job. --Bobby Evans On 10/7/11 12:09 PM, Zhenhua (Gerald) Guo jen...@gmail.com wrote: I plan to deploy a HDFS cluster which will be shared by multiple MapReduce clusters. I wonder whether this is possible. Will it incur any conflicts among the MapReduce clusters (e.g. different MapReduce clusters trying to use the same temp directory in HDFS)? If it is possible, how should the security parameters be set up (e.g. user identity, file permissions)? Thanks, Gerald
Re: hadoop input buffer size
Thanks for the clarifications guys :) Mark On Mon, Oct 10, 2011 at 8:27 AM, Uma Maheswara Rao G 72686 mahesw...@huawei.com wrote: I think the post below can give you more info about it. http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/ Nice explanation by Owen here. Regards, Uma - Original Message - From: Yang Xiaoliang yangxiaoliang2...@gmail.com Date: Wednesday, October 5, 2011 4:27 pm Subject: Re: hadoop input buffer size To: common-user@hadoop.apache.org Hi, Hadoop neither reads one line each time, nor fetches dfs.block.size worth of lines into a buffer. Actually, for the TextInputFormat, it reads io.file.buffer.size bytes of text into a buffer each time; this can be seen in the hadoop source file LineReader.java. 2011/10/5 Mark question markq2...@gmail.com Hello, Correct me if I'm wrong, but when a program opens n files at the same time to read from, and starts reading from each file one line at a time, isn't hadoop actually fetching dfs.block.size worth of lines into a buffer, and not actually one line? If this is correct: I set my dfs.block.size = 3MB and each line takes only about 650 bytes, so I would assume the performance for reading 1-4000 lines would be the same, but it isn't! Do you know a way to find the number of lines to be read at once? Thank you, Mark
Re: Error using hadoop distcp
Distcp will run as a mapreduce job. Here the tasktrackers require the hostname mappings to contact the other nodes. Please configure the mapping correctly on both machines and try again. Regards, Uma - Original Message - From: trang van anh anh...@vtc.vn Date: Wednesday, October 5, 2011 1:41 pm Subject: Re: Error using hadoop distcp To: common-user@hadoop.apache.org Which host runs the task that throws the exception? Ensure that each data node knows the other data nodes in the hadoop cluster - add a ub16 entry in /etc/hosts on the node where the task is running. On 10/5/2011 12:15 PM, praveenesh kumar wrote: I am trying to use distcp to copy a file from one HDFS to another. But while copying I am getting the following exception: hadoop distcp hdfs://ub13:54310/user/hadoop/weblog hdfs://ub16:54310/user/hadoop/weblog
11/10/05 10:41:01 INFO mapred.JobClient: Task Id : attempt_201110031447_0005_m_07_0, Status : FAILED
java.net.UnknownHostException: unknown host: ub16
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:850)
    at org.apache.hadoop.ipc.Client.call(Client.java:720)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:113)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:215)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:48)
    at org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:124)
    at org.apache.hadoop.mapred.Task.runJobSetupTask(Task.java:835)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:296)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
It's saying it's not finding ub16, but the entry is there in the /etc/hosts files. I am able to ssh into both machines. Do I need passwordless ssh between these two NNs? What can be the issue? Anything I am missing before using distcp? Thanks, Praveenesh
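A sketch of the /etc/hosts mapping being suggested, which would need to be present on every node of both clusters; the addresses below are placeholders for the real IPs of ub13 and ub16:
192.168.0.13    ub13
192.168.0.16    ub16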