Re: CUDA on Hadoop
If you want to use Python, one of the Py+CUDA projects generates CUDA C from the Python byte-codes, so you don't have to write any C (I don't remember which project it is). This lets you debug the CUDA code in isolation, then run it from Hadoop streaming mode.

On 2/9/11, Adarsh Sharma adarsh.sha...@orkash.com wrote: He Chen wrote: Hi Sharma, I shared our slides about CUDA performance on Hadoop clusters. Feel free to modify it; please mention the copyright! Chen

On Wed, Feb 9, 2011 at 11:13 AM, He Chen airb...@gmail.com wrote: Hi Sharma, I have some experience working with hybrid Hadoop and GPUs. Our group has tested CUDA performance on Hadoop clusters. We obtained a 20x speedup and saved up to 95% power consumption in some computation-intensive test cases. You can parallelize your Java code by using JCUDA, which is an API that helps you call CUDA from your Java code. Chen

On Wed, Feb 9, 2011 at 8:45 AM, Steve Loughran ste...@apache.org wrote: On 09/02/11 13:58, Harsh J wrote: You can check out this project, which did some work for Hama+CUDA: http://code.google.com/p/mrcl/ Amazon lets you bring up a Hadoop cluster on machines with GPUs you can code against, but I haven't heard of anyone using it. The big issue is bandwidth; it just doesn't make sense for a classic "scan through the logs" kind of problem, as the disk:GPU bandwidth ratio is even worse than disk:CPU. That said, if you were doing something that involved a lot of compute on a block of data (e.g. rendering tiles in a map), this could work.

Thanks Chen, I am looking for some white papers on the mentioned topic or related concerns. I think no one has written any white paper on this topic, or I'm wrong. However, your presentation is very nice. Thanks once again. Adarsh

-- Lance Norskog goks...@gmail.com
Is there any smart way to give arguments to mappers/reducers from a main job?
Hi all, in my job I want to pass some arguments to mappers and reducers from the main job. I googled some references that do this using a Configuration, but it's not working.

Code:

Job:
Configuration conf = new Configuration();
conf.set("test", "value");

Mapper:
doMap() extends Mapper... {
  System.out.println(context.getConfiguration().get("test")); // -- this printed out null
}

How can I make this work? -- Junyoung Kim (juneng...@gmail.com)
Re: why is it invalid to have non-alphabet characters as a result of MultipleOutputs?
OK, thanks for your replies. I decided to use '00' as a delimiter. :( Junyoung Kim (juneng...@gmail.com)

On 02/09/2011 01:46 AM, David Rosenstrauch wrote: On 02/08/2011 05:01 AM, Jun Young Kim wrote: Hi, MultipleOutputs supports named outputs as a result of a Hadoop job, but it has an inconvenient restriction: only alphanumeric characters are valid in a named output (A-Z, a-z, 0-9 are the only characters we can use). I believe that if I could use other characters like '.' or '_', it would be more convenient for me. There's already a bug report open for this: https://issues.apache.org/jira/browse/MAPREDUCE-2293 DR
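A short, hedged sketch of the restriction being discussed, using the old org.apache.hadoop.mapred.lib.MultipleOutputs API (the driver plumbing around it is omitted and the names here are made up for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class NamedOutputSetup {
  // Called from a hypothetical driver while it builds its JobConf.
  public static void addSideOutput(JobConf conf) {
    // Legal: named-output names may only contain A-Z, a-z and 0-9, e.g. "side00".
    MultipleOutputs.addNamedOutput(conf, "side00",
        TextOutputFormat.class, Text.class, Text.class);
    // A name like "side_00" or "side.00" would be rejected by the name check
    // until MAPREDUCE-2293 relaxes it.
  }
}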
Re: Is there any smart way to give arguments to mappers/reducers from a main job?
Your 'Job' must reference this Configuration object for it to take those values. If it does not know about it, it would not work, logically :-) For example, create your Configuration and set things into it, and only then do new Job(ConfigurationObj) to make the job use your configured object.

On Thu, Feb 10, 2011 at 3:19 PM, Jun Young Kim juneng...@gmail.com wrote: Hi all, in my job I want to pass some arguments to mappers and reducers from the main job. I googled some references that do this using a Configuration, but it's not working. Code: Job: Configuration conf = new Configuration(); conf.set("test", "value"); Mapper: doMap() extends Mapper... { System.out.println(context.getConfiguration().get("test")); // -- this printed out null } How can I make this work? -- Junyoung Kim (juneng...@gmail.com)

-- Harsh J www.harshj.com
Re: Is there any smart way to give arguments to mappers/reducers from a main job?
Correct. Just like this:

Configuration conf = new Configuration();
conf.setStrings("test", "test");
Job job = new Job(conf, "job name");

On Thu, Feb 10, 2011 at 6:42 PM, Harsh J qwertyman...@gmail.com wrote: Your 'Job' must reference this Configuration object for it to take those values. If it does not know about it, it would not work, logically :-) For example, create your Configuration and set things into it, and only then do new Job(ConfigurationObj) to make the job use your configured object. On Thu, Feb 10, 2011 at 3:19 PM, Jun Young Kim juneng...@gmail.com wrote: Hi all, in my job I want to pass some arguments to mappers and reducers from the main job. I googled some references that do this using a Configuration, but it's not working. Code: Job: Configuration conf = new Configuration(); conf.set("test", "value"); Mapper: doMap() extends Mapper... { System.out.println(context.getConfiguration().get("test")); // -- this printed out null } How can I make this work? -- Junyoung Kim (juneng...@gmail.com) -- Harsh J www.harshj.com

-- -李平
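Putting the two replies together, a minimal end-to-end sketch of the pattern (new-style mapreduce API as used in the thread; the class name, paths and the pass-through mapper are placeholders of mine, not from the original mails):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassArgsDriver {

  public static class PassArgsMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void setup(Context context) {
      // Prints "value" (not null) because the Job below was built from the
      // same Configuration object that had the value set on it.
      System.out.println(context.getConfiguration().get("test"));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("test", "value");              // set BEFORE constructing the Job
    Job job = new Job(conf, "pass-args");   // the Job copies this Configuration
    job.setJarByClass(PassArgsDriver.class);
    job.setMapperClass(PassArgsMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Anything set on conf after the Job has been constructed is not seen by the tasks, which is the usual cause of the null described above.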
Re: CUDA on Hadoop
On 09/02/11 17:31, He Chen wrote: Hi Sharma, I shared our slides about CUDA performance on Hadoop clusters. Feel free to modify it; please mention the copyright! This is nice. If you stick it up online you should link to it from the Hadoop wiki pages - maybe start a Hadoop+CUDA page and refer to it.
Re: CUDA on Hadoop
Steve Loughran wrote: On 09/02/11 17:31, He Chen wrote: Hi Sharma, I shared our slides about CUDA performance on Hadoop clusters. Feel free to modify it; please mention the copyright! This is nice. If you stick it up online you should link to it from the Hadoop wiki pages - maybe start a Hadoop+CUDA page and refer to it. Yes, this will be very helpful for others too. But this much information is not sufficient; more is needed. Best Regards, Adarsh Sharma
some doubts Hadoop MR
Hi all, I had some doubts regarding the functioning of Hadoop MapReduce:

1) I understand that every MapReduce job is parameterized using an XML file (with all the job configurations). So whenever I set certain parameters using my MR code (say I set the split size to be 32 KB) it does get reflected in the job (number of mappers). How exactly does that happen? Do the parameters coded in the MR module override the default parameters set in the configuration XML? And how does the JobTracker ensure that the configuration is followed by all the TaskTrackers? What is the mechanism followed?

2) Assume I am running cascading (chained) MR modules. In this case I feel there is a huge overhead when the output of MR1 is written back to HDFS and then read from there as the input of MR2. Can this be avoided? (Maybe store it in some memory without hitting HDFS and the NameNode?) Please let me know if there is some means of exercising this, because it will increase the efficiency of chained MR to a great extent.

Matthew
Re: some doubts Hadoop MR
Hello,

On Thu, Feb 10, 2011 at 5:16 PM, Matthew John tmatthewjohn1...@gmail.com wrote: Hi all, I had some doubts regarding the functioning of Hadoop MapReduce: 1) I understand that every MapReduce job is parameterized using an XML file (with all the job configurations). So whenever I set certain parameters using my MR code (say I set the split size to be 32 KB) it does get reflected in the job (number of mappers). How exactly does that happen? Do the parameters coded in the MR module override the default parameters set in the configuration XML? And how does the JobTracker ensure that the configuration is followed by all the TaskTrackers? What is the mechanism followed?

Yes, your configurations are applied over the defaults that are loaded from Hadoop's core/etc jars. A job is represented by its job file + jars/files, where the job file is the 'job.xml' produced by the configuration-saving mechanism, performed upon submission of a Job. This file is distributed to all workers to read and utilize by the JobTracker as part of its submission and localization process. I suggest reading Hadoop's source code from the submit call upwards.

2) Assume I am running cascading (chained) MR modules. In this case I feel there is a huge overhead when the output of MR1 is written back to HDFS and then read from there as the input of MR2. Can this be avoided? (Maybe store it in some memory without hitting HDFS and the NameNode?) Please let me know if there is some means of exercising this, because it will increase the efficiency of chained MR to a great extent.

It is not possible to pipeline in Apache Hadoop. Have a look at HOP (the Hadoop On-line project), which has some of what you seek.

-- Harsh J www.harshj.com
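As a concrete illustration of the override order described above (a hedged sketch; the property name is the 0.20-era one used by the new-API FileInputFormat, and the rest of the job wiring is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    // Loads *-default.xml and any *-site.xml resources first...
    Configuration conf = new Configuration();
    // ...then an in-code value overrides them (unless a site file marks the property final).
    conf.setLong("mapred.max.split.size", 32 * 1024); // cap input splits at 32 KB
    // At submit time this Configuration is serialized into job.xml, which the
    // JobTracker localizes to every TaskTracker, so all tasks see the same value.
    Job job = new Job(conf, "split-size-demo");
    // ... set mapper/reducer/input/output here, then call job.waitForCompletion(true)
  }
}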
Re: Hadoop Multi user - Cluster Setup
Hey Amit, please try HOD, the Hadoop On Demand tool. It should suffice for your need of supporting multiple users on your cluster. -Piyush

On Thu, Feb 10, 2011 at 12:42 AM, Kumar, Amit H. ahku...@odu.edu wrote: Dear All, I am trying to set up Hadoop for multiple users in a class, on our cluster. For some reason I don't seem to get it right. If only one user is running, it works great. I would want to have all of the users submit Hadoop jobs to the existing DataNode on the cluster; not sure if this is right. Do I need to start a DataNode for every user? If so, I was not able to, because I ran into issues of the port already being used. Please advise. Below are a few of the config files. Also, I have tried searching for other documents that tell us to create a user hadoop and a group hadoop and then start the daemons as the hadoop user. This didn't work for me either. I am sure I am doing something wrong. Could anyone please throw in some more ideas?

= List of env changed in hadoop-env.sh:
export HADOOP_LOG_DIR=/scratch/$USER/hadoop-logs
export HADOOP_PID_DIR=/scratch/$USER/.var/hadoop/pids

# cat core-site.xml
<configuration>
  <property><name>fs.default.name</name><value>hdfs://frontend:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/scratch/${user.name}/hadoop-FS</value><description>A base for other temporary directories.</description></property>
</configuration>

# cat hdfs-site.xml
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.name.dir</name><value>/scratch/${user.name}/.hadoop/.transaction/.edits</value></property>
</configuration>

# cat mapred-site.xml
<configuration>
  <property><name>mapred.job.tracker</name><value>frontend:9001</value></property>
  <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>2</value></property>
  <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
</configuration>

Thank you, Amit
Re: Could not add a new data node without rebooting Hadoop system
Dear Harsh, your advice gave me insight, and I finally solved my problem. I'm not sure this is the correct way, but anyway it worked in my situation. I hope it will be helpful to someone else who has a similar problem:

1. hadoop/conf: update slaves
2. hadoop/conf: update the *.xml files
3. hadoop/bin: start-dfs.sh
4. hadoop/bin: start-mapred.sh

-- Regards, Henny (ahneui...@gmail.com)

2011/2/7 Harsh J qwertyman...@gmail.com: On Mon, Feb 7, 2011 at 5:16 PM, ahn ahneui...@gmail.com wrote: Hello everybody, 1. configure conf/slaves and *.xml files on the master machine 2. configure conf/masters and *.xml files on the slave machine. The 'slaves' and 'masters' files are generally only required on the master machine, and only if you are using the start-* scripts supplied with Hadoop for use with SSH (the FAQ has an entry on this) from the master. 3. run ${HADOOP}/bin/hadoop datanode. But when I ran the commands on the master node, the master node was recognized as a data node. Step 3 wasn't a valid command in this case; start-dfs.sh. When I ran the commands on the data node which I want to add, the data node was not properly added (the number of total data nodes didn't show any change). What do the logs say for the DataNode on the slave? Does it start successfully? If fs.default.name is set properly in the slave's core-site.xml it should be able to communicate properly if started (and if the version is not mismatched). -- Harsh J www.harshj.com
hadoop 0.20 append - some clarifications
Hi All, I have run the hadoop 0.20 append branch. Can someone please clarify the following behavior? A writer is writing a file but has not flushed the data and has not closed the file. Can a parallel reader read this partial file? For example: 1. a writer is writing a 10MB file (block size 2 MB) 2. it wrote the file up to 5MB (2 finalized blocks + 1 blockBeingWritten); note that the writer is not calling FsDataOutputStream sync() at all 3. now a reader tries to read the above partially written file. I can see that the reader is able to see the partially written 5MB of data, but I feel the reader should be able to see the data only after the writer calls the sync() API. Is this the correct behavior, or is my understanding wrong? Thanks, Gokul
Re: hadoop 0.20 append - some clarifications
Correct is a strong word here. There is actually an HDFS unit test that checks to see if partially written and unflushed data is visible. The basic rule of thumb is that you need to synchronize readers and writers outside of HDFS. There is no guarantee that data is visible or invisible after writing, but there is a guarantee that it will become visible after sync or close. On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote: Is this the correct behavior or my understanding is wrong?
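A minimal sketch of that guarantee in terms of the 0.20-append client API (the FileSystem, Path and byte arrays are placeholders):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VisibilitySketch {
  // Data written before sync()/close() is guaranteed to become visible to readers;
  // data written after the last sync() may or may not be visible until close().
  static void writeWithVisibility(FileSystem fs, Path path,
                                  byte[] first, byte[] second) throws IOException {
    FSDataOutputStream out = fs.create(path);
    out.write(first);   // a concurrent reader may or may not see this yet
    out.sync();         // after sync() returns, 'first' is guaranteed visible
    out.write(second);  // again undefined until the next sync() or close()
    out.close();        // close() guarantees everything written is visible
  }
}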
Re: MRUnit and Herriot
Hi, I took a look around on the Internet, but I didn't find any docs about MiniDFS and MiniMRCluster. Are there docs about them? It reminds me of this phrase I got from the Herriot [1] page: "As always your best source of information and knowledge about any software system is its source code" :) Do you think it is possible to have just one tool to cover all kinds of tests? Another question: do you know if it is possible to evaluate an MR program, e.g. sort, with Herriot considering several test data sets? Thanks in Advance -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/

On Mon, Feb 7, 2011 at 10:29 PM, Konstantin Boudnik c...@apache.org wrote: On Mon, Feb 7, 2011 at 04:20, Edson Ramiro erlfi...@gmail.com wrote: Well, I'm studying the Hadoop test tools to evaluate some deficiencies (if there are any), also trying to compare these tools to see what one covers that the other doesn't and what is possible to do with each one. There's also a simulated test cluster infrastructure called MiniDFS and MiniMRCluster to allow you to develop functional tests without actual cluster deployment. As far as I know we have just Herriot and MRUnit for tests, and they do different things as you told me :) I'm very interested in your initial version, is there a link? Not at the moment, but I will send it here as soon as an initial version is pushed out. Thanks in advance -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/

On Fri, Feb 4, 2011 at 3:40 AM, Konstantin Boudnik c...@apache.org wrote: Yes, Herriot can be used for integration tests of MR. A unit test is a very different thing and normally is done against a 'unit of compilation', e.g. a class, etc. Typically you won't expect to do unit tests against a deployed cluster. There is a fault injection framework which works at the level of functional tests (with mini-clusters). Shortly we'll be opening an initial version of a smoke and integration test framework (Maven and JUnit based). It'd be easier to provide you with a hint if you care to explain what you're trying to solve. Cos

On Thu, Feb 03, 2011 at 10:25AM, Edson Ramiro wrote: Thank you a lot Konstantin, you cleared my mind. So, Herriot is a framework designed to test Hadoop as a whole, and (IMHO) is a tool to help Hadoop developers rather than those who are developing MR programs, but can we use Herriot to do unit, integration or other tests on our MR jobs? Do you know another test tool or test framework for Hadoop? Thanks in Advance -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/

On Wed, Feb 2, 2011 at 4:58 PM, Konstantin Boudnik c...@apache.org wrote: (Moving to common-user where this belongs) Herriot is a system test framework which runs against a real physical cluster deployed with a specially crafted build of Hadoop. That instrumented build provides extra APIs not available in Hadoop otherwise. These APIs are created to facilitate cluster software testability. Herriot isn't limited to MR but also covers (although to a somewhat lesser extent) the HDFS side of Hadoop. MRUnit is for MR job unit testing, as in making sure that your MR job is OK and/or allowing you to debug it locally before scale deployment. So, long story short - they are very different ;) Herriot can do intricate fault injection and can work closely with a deployed cluster (say, control Hadoop nodes and daemons); MRUnit is focused on MR job testing. Hope it helps.
-- Take care, Konstantin (Cos) Boudnik

On Wed, Feb 2, 2011 at 05:44, Edson Ramiro erlfi...@gmail.com wrote: Hi all, please, could you explain to me the difference between MRUnit and Herriot? I've read the documentation of both and they seem very similar to me. Is Herriot an evolution of MRUnit? What can Herriot do that MRUnit can't? Thanks in Advance -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/
multiple namenode directories
This should be a straightforward question, but better safe than sorry. I wanted to add a second name node directory (on an NFS as a backup), so now my hdfs-site.xml contains:

<property><name>dfs.name.dir</name><value>/mnt/hadoop/name</value></property>
<property><name>dfs.name.dir</name><value>/public/hadoop/name</value></property>

When I go to start DFS I'm getting the exception:

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /public/hadoop/name is in an inconsistent state: storage directory does not exist or is not accessible.

After googling a bit, it seems like I want to do bin/hadoop namenode -format. Is this right? As long as I shut down DFS before issuing the command I shouldn't lose any data? Thanks in advance, Mike
Re: CUDA on Hadoop
Thank you Steve Loughran. I just created a new page on the Hadoop wiki; however, how can I create a new document page on the Hadoop wiki? Best wishes, Chen

On Thu, Feb 10, 2011 at 5:38 AM, Steve Loughran ste...@apache.org wrote: On 09/02/11 17:31, He Chen wrote: Hi Sharma, I shared our slides about CUDA performance on Hadoop clusters. Feel free to modify it; please mention the copyright! This is nice. If you stick it up online you should link to it from the Hadoop wiki pages - maybe start a Hadoop+CUDA page and refer to it.
Re: multiple namenode directories
DO NOT format your NameNode. Formatting a NameNode is equivalent to formatting a FS -- you're bound to lose it all. And while messing with the NameNode, after bringing it down safely, ALWAYS take a backup of the existing dfs.name.dir contents, and preferably the SNN checkpoint directory contents too (if you're running it).

The RIGHT way to add new directories to the NameNode's dfs.name.dir is by comma-separating them in the same value and NOT by adding two properties - that is not how Hadoop's configuration operates. In your case, bring the NN down and edit the conf as:

<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hadoop/name,/public/hadoop/name</value>
</property>

Create the new directory by copying the existing one. Both must have the SAME files and structure in them, like mirror copies of one another. Ensure that this new location, apart from being symmetric in content, is also symmetric in permissions. The NameNode will require WRITE permissions via its user on all locations configured. Having configured properly and ensured that both storage directories mirror one another, launch your NameNode back up again (feel a little paranoid and do check the NameNode logs for any issues -- in which case your backup would be very essential as a requirement for recovery!).

P.S. Hold on for a bit for a possible comment from another user before getting into action. I've added extra directories this way, but I do not know if this is the genuine way to do so - although it feels right to me.

On Thu, Feb 10, 2011 at 10:27 PM, mike anderson saidthero...@gmail.com wrote: This should be a straightforward question, but better safe than sorry. I wanted to add a second name node directory (on an NFS as a backup), so now my hdfs-site.xml contains: <property><name>dfs.name.dir</name><value>/mnt/hadoop/name</value></property> <property><name>dfs.name.dir</name><value>/public/hadoop/name</value></property> When I go to start DFS I'm getting the exception: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /public/hadoop/name is in an inconsistent state: storage directory does not exist or is not accessible. After googling a bit, it seems like I want to do bin/hadoop namenode -format. Is this right? As long as I shut down DFS before issuing the command I shouldn't lose any data? Thanks in advance, Mike

-- Harsh J www.harshj.com
Map reduce streaming unable to partition
Hi, I'm trying to get partitioning working from a streaming map/reduce job. I'm using hadoop r0.20.2. Consider the following files, both in the same hdfs directory:

f1:
01:01:01TABa,a,a,a,a,1
01:01:02TABa,a,a,a,a,2
01:02:01TABa,a,a,a,a,3
01:02:02TABa,a,a,a,a,4
02:01:01TABa,a,a,a,a,5
02:01:02TABa,a,a,a,a,6
02:02:01TABa,a,a,a,a,7
02:02:02TABa,a,a,a,a,8

f2:
01:01:01TABb,b,b,b,b,1
01:01:02TABb,b,b,b,b,2
01:02:01TABb,b,b,b,b,3
01:02:02TABb,b,b,b,b,4
02:01:01TABb,b,b,b,b,5
02:01:02TABb,b,b,b,b,6
02:02:01TABb,b,b,b,b,7
02:02:02TABb,b,b,b,b,8

I execute the following command:

hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
 -D stream.map.output.field.separator=: \
 -D stream.num.map.output.key.fields=3 \
 -D map.output.key.field.separator=: \
 -D mapred.text.key.partitioner.options=-k1,1 \
 -input /tmp/krb/part \
 -output /tmp/krb/mp \
 -mapper /bin/cat \
 -reducer /bin/cat \
 -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

(actually I've executed about a zillion permutations of various -D arguments...)

I end up with a single file sorted by the entire key, exactly what I expect if no partitioning at all is going on. What I'm hoping to end up with is two output files, each file has the first component of the key in common:

01:01:01TABa,a,a,a,a,1
01:01:01TABb,b,b,b,b,1
01:01:02TABa,a,a,a,a,2
01:01:02TABb,b,b,b,b,2
01:02:01TABa,a,a,a,a,3
01:02:01TABb,b,b,b,b,3
01:02:02TABa,a,a,a,a,4
01:02:02TABb,b,b,b,b,4

Can anyone suggest a command that may partition files as I describe? Also, it seems that the API has changed considerably from my version 0.20.x to the latest version r0.21. Is 0.20 expected to work? Or are there some fatal issues that forced major work resulting in release 0.21? Thanks, -Kelly
RE: Hadoop Multi user - Cluster Setup
Li Ping: Disabling dfs.permissions did the trick! I have the following questions, if you can help me understand this better:
1. I am not sure what the consequences are of disabling it, or even of doing chmod o+w on the entire filesystem (/).
2. Is there any need to have the permissions in place, other than securing users from each other's work?
3. Is it still possible to have the HDFS permissions enabled and yet be able to have multiple users submitting jobs to a common pool of resources?
Thank you so much for your help! Amit

-----Original Message----- From: li ping [mailto:li.j...@gmail.com] Sent: Wednesday, February 09, 2011 9:00 PM To: common-user@hadoop.apache.org Subject: Re: Hadoop Multi user - Cluster Setup

You can check this property in hdfs-site.xml:

<property>
  <name>dfs.permissions</name>
  <value>true</value>
  <description>If true, enable permission checking in HDFS. If false, permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.</description>
</property>

You can disable this option. The second way is running this command in Hadoop:

hadoop fs -chmod o+w /

It has the same effect as the first one.

On Thu, Feb 10, 2011 at 3:12 AM, Kumar, Amit H. ahku...@odu.edu wrote: Dear All, I am trying to set up Hadoop for multiple users in a class, on our cluster. For some reason I don't seem to get it right. If only one user is running, it works great. I would want to have all of the users submit Hadoop jobs to the existing DataNode on the cluster; not sure if this is right. Do I need to start a DataNode for every user? If so, I was not able to, because I ran into issues of the port already being used. Please advise. Below are a few of the config files. Also, I have tried searching for other documents that tell us to create a user hadoop and a group hadoop and then start the daemons as the hadoop user. This didn't work for me either. I am sure I am doing something wrong. Could anyone please throw in some more ideas?

= List of env changed in hadoop-env.sh:
export HADOOP_LOG_DIR=/scratch/$USER/hadoop-logs
export HADOOP_PID_DIR=/scratch/$USER/.var/hadoop/pids

# cat core-site.xml
<configuration> <property><name>fs.default.name</name><value>hdfs://frontend:9000</value></property> <property><name>hadoop.tmp.dir</name><value>/scratch/${user.name}/hadoop-FS</value><description>A base for other temporary directories.</description></property> </configuration>

# cat hdfs-site.xml
<configuration> <property><name>dfs.replication</name><value>1</value></property> <property><name>dfs.name.dir</name><value>/scratch/${user.name}/.hadoop/.transaction/.edits</value></property> </configuration>

# cat mapred-site.xml
<configuration> <property><name>mapred.job.tracker</name><value>frontend:9001</value></property> <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>2</value></property> <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>2</value></property> </configuration>

Thank you, Amit

-- -李平
Re: multiple namenode directories
The links appeared outdated; I've updated those to reflect the current release 0.21's configurations. The configuration descriptions properly describe the right way to set them. For 0.20 releases, only the configuration names change: dfs.name.dir instead of dfs.namenode.name.dir, and dfs.data.dir instead of dfs.datanode.data.dir. The value formatting remains the same.

On Thu, Feb 10, 2011 at 11:18 PM, mike anderson saidthero...@gmail.com wrote: Whew, glad I asked. It might be useful for someone to update the wiki: http://wiki.apache.org/hadoop/FAQ#How_do_I_set_up_a_hadoop_node_to_use_multiple_volumes.3F -Mike

On Thu, Feb 10, 2011 at 12:43 PM, Harsh J qwertyman...@gmail.com wrote: DO NOT format your NameNode. Formatting a NameNode is equivalent to formatting a FS -- you're bound to lose it all. And while messing with the NameNode, after bringing it down safely, ALWAYS take a backup of the existing dfs.name.dir contents, and preferably the SNN checkpoint directory contents too (if you're running it). The RIGHT way to add new directories to the NameNode's dfs.name.dir is by comma-separating them in the same value and NOT by adding two properties - that is not how Hadoop's configuration operates. In your case, bring the NN down and edit the conf as: <property><name>dfs.name.dir</name><value>/mnt/hadoop/name,/public/hadoop/name</value></property> Create the new directory by copying the existing one. Both must have the SAME files and structure in them, like mirror copies of one another. Ensure that this new location, apart from being symmetric in content, is also symmetric in permissions. The NameNode will require WRITE permissions via its user on all locations configured. Having configured properly and ensured that both storage directories mirror one another, launch your NameNode back up again (feel a little paranoid and do check the NameNode logs for any issues -- in which case your backup would be very essential as a requirement for recovery!). P.s. Hold on for a bit for a possible comment from another user before getting into action. I've added extra directories this way, but I do not know if this is the genuine way to do so - although it feels right to me.

On Thu, Feb 10, 2011 at 10:27 PM, mike anderson saidthero...@gmail.com wrote: This should be a straightforward question, but better safe than sorry. I wanted to add a second name node directory (on an NFS as a backup), so now my hdfs-site.xml contains: <property><name>dfs.name.dir</name><value>/mnt/hadoop/name</value></property> <property><name>dfs.name.dir</name><value>/public/hadoop/name</value></property> When I go to start DFS I'm getting the exception: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /public/hadoop/name is in an inconsistent state: storage directory does not exist or is not accessible. After googling a bit, it seems like I want to do bin/hadoop namenode -format. Is this right? As long as I shut down DFS before issuing the command I shouldn't lose any data? Thanks in advance, Mike -- Harsh J www.harshj.com

-- Harsh J www.harshj.com
Re: Hadoop Multi user - Cluster Setup
Please read the HDFS Permissions Guide, which explains the understanding required to have a working permissions model on the DFS: http://hadoop.apache.org/hdfs/docs/current/hdfs_permissions_guide.html

On Thu, Feb 10, 2011 at 11:15 PM, Kumar, Amit H. ahku...@odu.edu wrote: Li Ping: Disabling dfs.permissions did the trick! I have the following questions, if you can help me understand this better: 1. I am not sure what the consequences are of disabling it, or even of doing chmod o+w on the entire filesystem (/). 2. Is there any need to have the permissions in place, other than securing users from each other's work? 3. Is it still possible to have the HDFS permissions enabled and yet be able to have multiple users submitting jobs to a common pool of resources? Thank you so much for your help! Amit

-----Original Message----- From: li ping [mailto:li.j...@gmail.com] Sent: Wednesday, February 09, 2011 9:00 PM To: common-user@hadoop.apache.org Subject: Re: Hadoop Multi user - Cluster Setup You can check this property in hdfs-site.xml: <property><name>dfs.permissions</name><value>true</value><description>If true, enable permission checking in HDFS. If false, permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.</description></property> You can disable this option. The second way is running this command in Hadoop: hadoop fs -chmod o+w / It has the same effect as the first one.

On Thu, Feb 10, 2011 at 3:12 AM, Kumar, Amit H. ahku...@odu.edu wrote: Dear All, I am trying to set up Hadoop for multiple users in a class, on our cluster. For some reason I don't seem to get it right. If only one user is running, it works great. I would want to have all of the users submit Hadoop jobs to the existing DataNode on the cluster; not sure if this is right. Do I need to start a DataNode for every user? If so, I was not able to, because I ran into issues of the port already being used. Please advise. Below are a few of the config files. Also, I have tried searching for other documents that tell us to create a user hadoop and a group hadoop and then start the daemons as the hadoop user. This didn't work for me either. I am sure I am doing something wrong. Could anyone please throw in some more ideas?

= List of env changed in hadoop-env.sh:
export HADOOP_LOG_DIR=/scratch/$USER/hadoop-logs
export HADOOP_PID_DIR=/scratch/$USER/.var/hadoop/pids

# cat core-site.xml
<configuration> <property><name>fs.default.name</name><value>hdfs://frontend:9000</value></property> <property><name>hadoop.tmp.dir</name><value>/scratch/${user.name}/hadoop-FS</value><description>A base for other temporary directories.</description></property> </configuration>

# cat hdfs-site.xml
<configuration> <property><name>dfs.replication</name><value>1</value></property> <property><name>dfs.name.dir</name><value>/scratch/${user.name}/.hadoop/.transaction/.edits</value></property> </configuration>

# cat mapred-site.xml
<configuration> <property><name>mapred.job.tracker</name><value>frontend:9001</value></property> <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>2</value></property> <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>2</value></property> </configuration>

Thank you, Amit

-- -李平

-- Harsh J www.harshj.com
recommendation on HDDs
What would be a good hard drive for a 7-node cluster which is targeted to run a mix of IO- and CPU-intensive Hadoop workloads? We are looking for around 1 TB of storage on each node distributed amongst 4 or 5 disks, so either 250GB * 4 disks or 160GB * 5 disks. Also, it should be less than $100 each ;) I looked at HDD benchmark comparisons on tomshardware, storagereview etc. and got overwhelmed with the number of benchmarks and different aspects of HDD performance. Appreciate your help on this. -Shrinivas
Re: recommendation on HDDs
Get bigger disks. Data only grows and having extra is always good. You can get 2TB drives for $100 and 1TB for $75. As far as transfer rates are concerned, any 3Gb/s SATA drive is going to be about the same (ish). Seek times will vary a bit with rotation speed, but with Hadoop you will be doing long reads and writes. Your controller and backplane will have a MUCH bigger vote in getting acceptable performance. With only 4 or 5 drives you don't have to worry about a super-duper backplane, but you can still kill performance with a lousy controller.

On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi jshrini...@gmail.com wrote: What would be a good hard drive for a 7-node cluster which is targeted to run a mix of IO- and CPU-intensive Hadoop workloads? We are looking for around 1 TB of storage on each node distributed amongst 4 or 5 disks, so either 250GB * 4 disks or 160GB * 5 disks. Also, it should be less than $100 each ;) I looked at HDD benchmark comparisons on tomshardware, storagereview etc. and got overwhelmed with the number of benchmarks and different aspects of HDD performance. Appreciate your help on this. -Shrinivas
Re: Map reduce streaming unable to partition
OK, I think I stumbled upon the correct incantation:

time hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
 -D map.output.key.field.separator=: \
 -D mapred.text.key.partitioner.options=-k1,1 \
 -D mapred.reduce.tasks=16 \
 -input /tmp/krb/part \
 -output /tmp/krb/mp \
 -mapper /bin/cat \
 -reducer /bin/cat \
 -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

This will partition and sort the files as I expect, leaving me with 16 output files, 14 of which are empty and 2 non-empty. If I increase the number of partitions in the data so they exceed the number of reduce tasks, multiple partitions will be written to some or all of the output files. I believe I can deal with that now that I understand it, but it would be nice if the number of output files was equal to the number of partitions in the data. -K

On Thu, Feb 10, 2011 at 11:45 AM, Kelly Burkhart kelly.burkh...@gmail.com wrote: Hi, I'm trying to get partitioning working from a streaming map/reduce job. I'm using hadoop r0.20.2. Consider the following files, both in the same hdfs directory: f1: 01:01:01TABa,a,a,a,a,1 01:01:02TABa,a,a,a,a,2 01:02:01TABa,a,a,a,a,3 01:02:02TABa,a,a,a,a,4 02:01:01TABa,a,a,a,a,5 02:01:02TABa,a,a,a,a,6 02:02:01TABa,a,a,a,a,7 02:02:02TABa,a,a,a,a,8 f2: 01:01:01TABb,b,b,b,b,1 01:01:02TABb,b,b,b,b,2 01:02:01TABb,b,b,b,b,3 01:02:02TABb,b,b,b,b,4 02:01:01TABb,b,b,b,b,5 02:01:02TABb,b,b,b,b,6 02:02:01TABb,b,b,b,b,7 02:02:02TABb,b,b,b,b,8 I execute the following command: hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \ -D stream.map.output.field.separator=: \ -D stream.num.map.output.key.fields=3 \ -D map.output.key.field.separator=: \ -D mapred.text.key.partitioner.options=-k1,1 \ -input /tmp/krb/part \ -output /tmp/krb/mp \ -mapper /bin/cat \ -reducer /bin/cat \ -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner (actually I've executed about a zillion permutations of various -D arguments...) I end up with a single file sorted by the entire key, exactly what I expect if no partitioning at all is going on. What I'm hoping to end up with is two output files, each file has the first component of the key in common: 01:01:01TABa,a,a,a,a,1 01:01:01TABb,b,b,b,b,1 01:01:02TABa,a,a,a,a,2 01:01:02TABb,b,b,b,b,2 01:02:01TABa,a,a,a,a,3 01:02:01TABb,b,b,b,b,3 01:02:02TABa,a,a,a,a,4 01:02:02TABb,b,b,b,b,4 Can anyone suggest a command that may partition files as I describe? Also, it seems that the API has changed considerably from my version 0.20.x to the latest version r0.21. Is 0.20 expected to work? Or are there some fatal issues that forced major work resulting in release 0.21? Thanks, -Kelly
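The behaviour above (16 reducers but only 2 non-empty files, and several key groups sharing a file once there are more groups than reducers) is what hash partitioning produces. Roughly, and only as an illustrative sketch of what KeyFieldBasedPartitioner does with -k1,1, not its exact code:

// Each record's first key field is hashed onto one of the reducers, so the
// number of output files always equals mapred.reduce.tasks, distinct first
// fields can land on the same reducer, and some reducers may receive nothing.
public class FirstFieldPartitionSketch {
  static int partitionFor(String firstKeyField, int numReduceTasks) {
    return (firstKeyField.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    System.out.println(partitionFor("01", 16)); // reducer chosen for the "01" group
    System.out.println(partitionFor("02", 16)); // reducer chosen for the "02" group
  }
}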
Re: MRUnit and Herriot
On Thu, Feb 10, 2011 at 08:39, Edson Ramiro erlfi...@gmail.com wrote: Hi, I took a look around on the Internet, but I didn't find any docs about MiniDFS and MiniMRCluster. Are there docs about them? It reminds me of this phrase I got from the Herriot [1] page: "As always your best source of information and knowledge about any software system is its source code" :)

Yes, this still holds ;) Source code is your best friend for a number of reasons:
- it is _the_ best documentation for the code and shows what an application does
- it is always up-to-date
- developers can focus on their development/testing rather than writing end-user documents about internals (which no one but other developers will ever need)

Do you think it is possible to have just one tool to cover all kinds of tests?

Sure, why not? I am also a big believer that a single OS would do just fine.

Another question: do you know if it is possible to evaluate an MR program, e.g. sort, with Herriot considering several test data sets?

Absolutely... Herriot does run workloads against a physical cluster, so I don't see why it would be impossible. Would it be the most effective use of your time? Perhaps not, because Herriot requires a specially tailored (instrumented) cluster to be executed against. What you need, I think, is a simple way to get a jar file containing some tests, drop it onto a cluster's gateway machine and run them. That looks like what we are trying to achieve in the iTest I mentioned earlier. Cos

Thanks in Advance -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/

On Mon, Feb 7, 2011 at 10:29 PM, Konstantin Boudnik c...@apache.org wrote: On Mon, Feb 7, 2011 at 04:20, Edson Ramiro erlfi...@gmail.com wrote: Well, I'm studying the Hadoop test tools to evaluate some deficiencies (if there are any), also trying to compare these tools to see what one covers that the other doesn't and what is possible to do with each one. There's also a simulated test cluster infrastructure called MiniDFS and MiniMRCluster to allow you to develop functional tests without actual cluster deployment. As far as I know we have just Herriot and MRUnit for tests, and they do different things as you told me :) I'm very interested in your initial version, is there a link? Not at the moment, but I will send it here as soon as an initial version is pushed out. Thanks in advance -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/

On Fri, Feb 4, 2011 at 3:40 AM, Konstantin Boudnik c...@apache.org wrote: Yes, Herriot can be used for integration tests of MR. A unit test is a very different thing and normally is done against a 'unit of compilation', e.g. a class, etc. Typically you won't expect to do unit tests against a deployed cluster. There is a fault injection framework which works at the level of functional tests (with mini-clusters). Shortly we'll be opening an initial version of a smoke and integration test framework (Maven and JUnit based). It'd be easier to provide you with a hint if you care to explain what you're trying to solve. Cos

On Thu, Feb 03, 2011 at 10:25AM, Edson Ramiro wrote: Thank you a lot Konstantin, you cleared my mind. So, Herriot is a framework designed to test Hadoop as a whole, and (IMHO) is a tool to help Hadoop developers rather than those who are developing MR programs, but can we use Herriot to do unit, integration or other tests on our MR jobs? Do you know another test tool or test framework for Hadoop?
Thanks in Advance -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/

On Wed, Feb 2, 2011 at 4:58 PM, Konstantin Boudnik c...@apache.org wrote: (Moving to common-user where this belongs) Herriot is a system test framework which runs against a real physical cluster deployed with a specially crafted build of Hadoop. That instrumented build provides extra APIs not available in Hadoop otherwise. These APIs are created to facilitate cluster software testability. Herriot isn't limited to MR but also covers (although to a somewhat lesser extent) the HDFS side of Hadoop. MRUnit is for MR job unit testing, as in making sure that your MR job is OK and/or allowing you to debug it locally before scale deployment. So, long story short - they are very different ;) Herriot can do intricate fault injection and can work closely with a deployed cluster (say, control Hadoop nodes and daemons); MRUnit is focused on MR job testing. Hope it helps. -- Take care, Konstantin (Cos) Boudnik

On Wed, Feb 2, 2011 at 05:44, Edson Ramiro erlfi...@gmail.com wrote: Hi all, please, could you explain to me the difference between MRUnit and Herriot? I've read the documentation of both and
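For the MiniDFSCluster/MiniMRCluster question raised earlier in this thread, a hedged sketch of how those in-process clusters are typically spun up inside a test (0.20-era constructors from the Hadoop test jar; verify the exact signatures against your version's source, as the thread advises):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public class MiniClusterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // In-process HDFS with 2 datanodes, formatting a fresh namespace.
    MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);
    FileSystem fs = dfs.getFileSystem();
    // In-process MR cluster with 2 tasktrackers pointed at that HDFS.
    MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1);
    try {
      JobConf job = mr.createJobConf();
      // ... configure and submit the MR job under test against 'job' here ...
    } finally {
      mr.shutdown();
      dfs.shutdown();
    }
  }
}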
Re: hadoop 0.20 append - some clarifications
You might also want to check the append design doc published at HDFS-265. -- Take care, Konstantin (Cos) Boudnik

On Thu, Feb 10, 2011 at 07:11, Gokulakannan M gok...@huawei.com wrote: Hi All, I have run the hadoop 0.20 append branch. Can someone please clarify the following behavior? A writer is writing a file but has not flushed the data and has not closed the file. Can a parallel reader read this partial file? For example: 1. a writer is writing a 10MB file (block size 2 MB) 2. it wrote the file up to 5MB (2 finalized blocks + 1 blockBeingWritten); note that the writer is not calling FsDataOutputStream sync() at all 3. now a reader tries to read the above partially written file. I can see that the reader is able to see the partially written 5MB of data, but I feel the reader should be able to see the data only after the writer calls the sync() API. Is this the correct behavior, or is my understanding wrong? Thanks, Gokul
Re: some doubts Hadoop MR
2) Assume I am running cascading (chained) MR modules. In this case I feel there is a huge overhead when the output of MR1 is written back to HDFS and then read from there as the input of MR2. Can this be avoided? (Maybe store it in some memory without hitting HDFS and the NameNode?) Please let me know if there is some means of exercising this, because it will increase the efficiency of chained MR to a great extent.

It is not possible to pipeline in Apache Hadoop. Have a look at HOP (the Hadoop On-line project), which has some of what you seek.

It is possible under some circumstances. With ChainMapper and ChainReducer, if the key/value signatures of the inputs and outputs of all mappers and reducers are the same, then the only disk I/O is at the endpoints. Note that there is _no_ buffering at all, however (just a single-element queue between each pair), so all maps and reduces in each ChainMapper or ChainReducer chain have to reside in memory simultaneously. I haven't ever used them, by the way, so I don't know how useful or efficient they are; I just came across them while working on another feature that turns out to be fundamentally incompatible with them... Greg
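A hedged sketch of how such a chain is typically wired up with the old org.apache.hadoop.mapred API (FirstMapper, SecondMapper and FinalReducer are hypothetical classes of mine, and all key/value types are kept identical as described above; the chained stages hand records to each other in memory rather than through HDFS):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainSketch.class);
    job.setJobName("chained-example");

    // First mapper in the chain: raw text input -> (Text, Text)
    ChainMapper.addMapper(job, FirstMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // Second mapper consumes the first one's output directly, no HDFS round trip.
    ChainMapper.addMapper(job, SecondMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // Single reducer at the end of the chain.
    ChainReducer.setReducer(job, FinalReducer.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}

ChainReducer.addMapper can append further map stages after the reducer in the same fashion if the chain needs them.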
File name which includes defined keyword
File name which includes defined keyword. Dear all, I get an error when I copy a local src with the Hadoop fs commands, e.g.: hadoop/bin hadoop fs -copyFromLocal abcdef:abcdef.exm /test I can't copy a local src which includes ':' to the dst. Does anybody know what I could do? Regards, Henny Ahn (ahneui...@gmail.com)
Re: File name which includes defined keyword
There appears to be a bug filed about this; check out its JIRA here: https://issues.apache.org/jira/browse/HDFS-13

On Fri, Feb 11, 2011 at 6:09 AM, 안의건 ahneui...@gmail.com wrote: File name which includes defined keyword. Dear all, I get an error when I copy a local src with the Hadoop fs commands, e.g.: hadoop/bin hadoop fs -copyFromLocal abcdef:abcdef.exm /test I can't copy a local src which includes ':' to the dst. Does anybody know what I could do? Regards, Henny Ahn (ahneui...@gmail.com)

-- Harsh J www.harshj.com
How do I insert a new node while MapReduce is running in Hadoop?
Hi, I started using Hadoop recently and I'm doing some tests on a cluster of three machines. I want to insert a new node after MapReduce has started. Is this possible? How do I do it?
Re: How do I insert a new node while MapReduce is running in Hadoop?
Of course you can. What is the node type: DataNode? JobTracker? TaskTracker? Let's say you are trying to add a DataNode. You can modify the XML files so the new node points to the NameNode and the JobTracker:

<property>
  <name>fs.default.name</name>
  <value>hdfs://:9000/</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>ip:port</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

In most cases, the TaskTracker and DataNode run on the same machine (to get the best performance). After doing this, you can start HDFS with the command start-dfs.sh.

On Fri, Feb 11, 2011 at 11:13 AM, Sandro Simas sandro.csi...@gmail.com wrote: Hi, I started using Hadoop recently and I'm doing some tests on a cluster of three machines. I want to insert a new node after MapReduce has started. Is this possible? How do I do it?

-- -李平
Re: hadoop 0.20 append - some clarifications
It is a bit confusing. SequenceFile.Writer#sync isn't really sync. There is SequenceFile.Writer#syncFs, which is more what you might expect sync to be. Then there is HADOOP-6313, which specifies hflush and hsync. Generally, if you want portable code, you have to reflect a bit to figure out what can be done.

On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote: Thanks Ted for clarifying. So the *sync* is just to flush the current buffers to the datanode and persist the block info in the namenode once per block, isn't it? Regarding a reader being able to see unflushed data, I faced an issue in the following scenario: 1. a writer is writing a *10MB* file (block size 2 MB) 2. it wrote the file up to 4MB (2 finalized blocks in *current* and nothing in the *blocksBeingWritten* directory in the DN), so 2 blocks are written 3. the client calls addBlock for the 3rd block on the namenode and has not yet created an output stream to the DN (or written anything to the DN). At this point of time, the namenode knows about the 3rd block but the datanode doesn't. 4. at point 3, a reader tries to read the file and gets an exception, and is not able to read the file, as the datanode's getBlockInfo returns null to the client (of course the DN doesn't know about the 3rd block yet). In this situation the reader cannot see the file, but when the block writing is in progress, the read is successful. *Is this a bug that needs to be handled in the append branch?*

-----Original Message----- From: Konstantin Boudnik [mailto:c...@boudnik.org] Sent: Friday, February 11, 2011 4:09 AM To: common-user@hadoop.apache.org Subject: Re: hadoop 0.20 append - some clarifications You might also want to check the append design doc published at HDFS-265. I was asking about the hadoop 0.20 append branch; I suppose HDFS-265's design doc won't apply to it.

*From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* Thursday, February 10, 2011 9:29 PM *To:* common-user@hadoop.apache.org; gok...@huawei.com *Cc:* hdfs-u...@hadoop.apache.org *Subject:* Re: hadoop 0.20 append - some clarifications "Correct" is a strong word here. There is actually an HDFS unit test that checks to see if partially written and unflushed data is visible. The basic rule of thumb is that you need to synchronize readers and writers outside of HDFS. There is no guarantee that data is visible or invisible after writing, but there is a guarantee that it will become visible after sync or close.

On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote: Is this the correct behavior or is my understanding wrong?
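On the "reflect a bit" suggestion, a hedged illustration of the kind of runtime probing meant (the method names are the ones mentioned in the thread and in HADOOP-6313; which of them exists depends on the Hadoop build at hand, hence the reflective lookup):

import java.lang.reflect.Method;

public final class PortableFlush {
  // Tries hflush() (HADOOP-6313), then syncFs() (SequenceFile.Writer on
  // append-capable builds), then sync(), on whatever stream or writer object
  // is passed in, and invokes the first one found.
  public static void bestEffortFlush(Object streamOrWriter) throws Exception {
    for (String name : new String[] {"hflush", "syncFs", "sync"}) {
      try {
        Method m = streamOrWriter.getClass().getMethod(name);
        m.invoke(streamOrWriter);
        return;
      } catch (NoSuchMethodException e) {
        // not available in this version; try the next candidate
      }
    }
    throw new IllegalStateException("no flush/sync method available");
  }
}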