Re: MapReduce code location
Hi Y. Dong,

Here are answers to your questions:

1. Will Hadoop instantiate multiple instances of this class and then transmit them to every remote machine?

ANSWER: Each TaskTracker in the Hadoop cluster creates its own instance of your Map class; moving the data around is handled by other parts of the framework. Each TaskTracker starts a JVM, that JVM creates an object of your Map class, and the framework feeds key-value pairs of your input data to your map method. The shuffle phase then passes the map output on to the reduce method.

2. On a remote machine, will the map(...) method be able to access List A and List B locally from its own memory?

ANSWER: Because each TaskTracker node has its own Map object, List A and List B exist only in that node's local memory.

I hope the above answers help you.

yours,
Kun Ling

On Tue, Aug 20, 2013 at 6:06 PM, Y. Dong tq00...@gmail.com wrote:

Hi All,

I'm a MapReduce newbie. What I want to know is: say I have a mapper class:

public class Map implements Mapper {
  public List A;
  public static List B;

  public Map() { // class constructor
    System.out.println("I'm initializing");
  }

  @Override
  protected void map(………) {
    System.out.println("I'm inside a mapper");
    …….
  }
}

When I run this mapper on a multi-machine Hadoop configuration, will Hadoop instantiate multiple instances of this class and then transmit them to every remote machine? So on a remote machine, will the map(...) method be able to access List A and List B locally from its own memory? If yes, and I call System.out.println in the map method, will the printed message be shown only on the remote machine, and not on the machine where I start the whole MapReduce job?

Thanks.
Eason

--
http://www.lingcc.com
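To make the per-task behaviour concrete, here is a minimal sketch (not from the thread above; the new org.apache.hadoop.mapreduce API and the field names listA/listB are my own assumptions) of a mapper whose instance fields live only in the JVM of the task that created them:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map task JVM constructs its own MyMapper object, so listA and listB
// live only in that task's local heap.
public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private List<String> listA;                                   // per-task instance field
  private static List<String> listB = new ArrayList<String>();  // per-JVM, not cluster-wide

  @Override
  protected void setup(Context context) {
    // Runs once per task attempt, on the node that executes this task.
    listA = new ArrayList<String>();
    System.out.println("initializing in this task's JVM");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Both lists are read from this JVM's memory; other nodes never see them.
    listA.add(value.toString());
    context.write(value, key);
  }
}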
Re: Is there any possible way to use hostname variable in mapred-site.xml file
Hi Binglin,

Thanks for your kind help. Your advice works well for me.

yours,
Kun Ling

On Thu, Aug 15, 2013 at 5:51 PM, Binglin Chang decst...@gmail.com wrote:

How about adding -Dhost.name=`hostname` to HADOOP_OPTS and reading this variable in the config file as ${host.name}? I have not tried this; you can try it.

On Thu, Aug 15, 2013 at 5:26 PM, Kun Ling lkun.e...@gmail.com wrote:

Hi all,

I have a Hadoop MapReduce cluster in which I want to adjust mapred.local.dir so that each TaskTracker writes to a mapred.local.dir with a different name, while keeping the conf file identical on every node to make deployment easier.

Currently, my plan is that each TaskTracker has its hostname in its mapred.local.dir configuration, so the entry in mapred-site.xml looks like this:

<property>
  <name>mapred.local.dir</name>
  <value>/var/mapred_local/*HOSTNAME*/</value>
</property>

The problem is how to make the TaskTracker pick up the HOSTNAME automatically. I have looked through all the .xml files in conf/ and the jar files, but only found the variable ${user.name}, which indicates the current Hadoop username.

Thanks very much.

yours,
Kun Ling

--
http://www.lingcc.com

--
http://www.lingcc.com
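As a side note on why Binglin's suggestion works, here is a small sketch (my own illustration, not from the thread; the value "worker-01" is made up): org.apache.hadoop.conf.Configuration expands ${...} placeholders from JVM system properties, so a -Dhost.name=`hostname` passed to the TaskTracker through HADOOP_OPTS shows up inside mapred.local.dir.

import org.apache.hadoop.conf.Configuration;

public class HostNameExpansion {
  public static void main(String[] args) {
    // Stands in for starting the JVM with -Dhost.name=`hostname`.
    System.setProperty("host.name", "worker-01");
    Configuration conf = new Configuration(false);
    conf.set("mapred.local.dir", "/var/mapred_local/${host.name}");
    // Prints /var/mapred_local/worker-01: the placeholder is resolved
    // against the JVM system property when the value is read.
    System.out.println(conf.get("mapred.local.dir"));
  }
}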
Is there any possible way to use hostname variable in mapred-site.xml file
Hi all,

I have a Hadoop MapReduce cluster in which I want to adjust mapred.local.dir so that each TaskTracker writes to a mapred.local.dir with a different name, while keeping the conf file identical on every node to make deployment easier.

Currently, my plan is that each TaskTracker has its hostname in its mapred.local.dir configuration, so the entry in mapred-site.xml looks like this:

<property>
  <name>mapred.local.dir</name>
  <value>/var/mapred_local/*HOSTNAME*/</value>
</property>

The problem is how to make the TaskTracker pick up the HOSTNAME automatically. I have looked through all the .xml files in conf/ and the jar files, but only found the variable ${user.name}, which indicates the current Hadoop username.

Thanks very much.

yours,
Kun Ling

--
http://www.lingcc.com
Re: MapReduce on Local FileSystem
Hi Agarwal,

I once had similar questions and have done some experiments. Here is my experience:

1. For some applications on top of MR, like HBase and Hive, which do not need to submit additional files to HDFS, file:/// works well without any problem (according to my tests).

2. For plain MR applications, like TeraSort, simply using file:/// causes problems: MR keeps some of its control files on the shared FileSystem and some on the local filesystem, in a single list, and looks files up in that list. With file:/// the shared FS looks the same as the local filesystem, while in fact they are two different kinds of filesystem with different path conversion rules.

For the 2nd issue, you can create a new shared filesystem class by deriving from the existing org.apache.hadoop.fs.FileSystem. I have created a repository with an example filesystem class implementation (https://github.com/Lingcc/hadoop-lingccfs); I hope it is helpful to you.

yours,
Ling Kun.

On Fri, May 31, 2013 at 2:37 PM, Agarwal, Nikhil nikhil.agar...@netapp.com wrote:

Hi,

Is it possible to run MapReduce on multiple nodes using the local file system (file:///)? I am able to run it in a single-node setup, but in a multi-node setup the "slave" nodes are not able to access the "jobtoken" file, which is present in hadoop.tmp.dir on the "master" node.

Please let me know if it is possible to do this.

Thanks & Regards,
Nikhil

--
http://www.lingcc.com
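To see why the 2nd issue arises, here is a small sketch (my own illustration, not part of the linked repository): with fs.default.name set to file:///, the "shared" filesystem and the task-local filesystem resolve to the same implementation, which is exactly the ambiguity described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SharedFsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");       // use the local FS as the "shared" FS
    FileSystem shared = FileSystem.get(conf);      // resolves to LocalFileSystem
    FileSystem local = FileSystem.getLocal(conf);  // the task-side local FS
    // Both print file:///, so the framework cannot tell the shared space
    // apart from each node's local disk.
    System.out.println(shared.getUri() + " vs " + local.getUri());
  }
}

A custom FileSystem subclass with its own URI scheme, as in the hadoop-lingccfs example, keeps the two namespaces distinct.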
Re: How is sharing done in HDFS ?
Hi, Agarwal,

Hadoop just puts the jobtoken, _partition.lst, and the other files that need to be shared into a directory located under hdfs://namenode:port/tmp//. All the TaskTrackers then access these files from that shared tmp directory, just like the way they share the input files in HDFS.

yours,
Ling Kun

On Wed, May 22, 2013 at 4:29 PM, Agarwal, Nikhil nikhil.agar...@netapp.com wrote:

Hi,

Can anyone guide me to some pointers or explain how HDFS shares the information put in the temporary directories (hadoop.tmp.dir, mapred.tmp.dir, etc.) with all other nodes?

I suppose that during execution of a MapReduce job, the JobTracker prepares a file called jobtoken and puts it in the temporary directories, which needs to be read by all TaskTrackers. So how does HDFS share the contents? Does it use an NFS mount or ….?

Thanks & Regards,
Nikhil

--
http://www.lingcc.com
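For illustration only (the path below is hypothetical, and the exact staging layout differs between versions): a TaskTracker reads those shared job files with the ordinary HDFS client API, the same way it reads input data, so no NFS mount is involved.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadSharedJobFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // fs.default.name points at the NameNode
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical location of a job's shared token file.
    Path jobToken = new Path("/tmp/mapred/system/job_201305221629_0001/jobToken");
    FSDataInputStream in = fs.open(jobToken);   // a plain HDFS read
    byte[] buf = new byte[128];
    int n = in.read(buf);
    System.out.println("read " + n + " bytes of shared job data");
    in.close();
  }
}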
Re: Shuffle phase replication factor
Hi John,

1. For the number of simultaneous connections: you can configure this with the mapred.reduce.parallel.copies flag; the default is 5.

2. As for the implication that one side aggressively disconnects, I am afraid the impact is small. Normally, each reducer connects to each mapper task and asks for its partition of the map output file. Because there are only about 5 simultaneous connections fetching map output for each reducer, even on a large MR cluster with 1000 nodes running a huge MR job with 1000 mappers and 1000 reducers, each node sees only about 5 fetch connections, so the impact is small.

3. What happens to pending/failing connections? The short answer is: they just try to reconnect. There is a list that maintains all the map outputs that still need to be copied, and an element is removed only when its map output has been copied successfully. A loop keeps looking into the list and fetching the corresponding map output.

All of the above is based on the Hadoop 1.0.4 source code, especially the ReduceTask.java file.

yours,
Ling Kun

On Wed, May 22, 2013 at 10:57 PM, John Lilley john.lil...@redpoint.net wrote:

Um, is that also the limit for the number of simultaneous connections? In general, one does not need a 1:1 map between threads and connections. If this is the connection limit, does it imply that the client or server side aggressively disconnects after a transfer? What happens to the pending/failing connection attempts that exceed the limit?

Thanks!
john

From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
Sent: Wednesday, May 22, 2013 8:52 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

There are properties/configuration settings to control the number of copier threads, e.g. tasktracker.http.threads=40.

Thanks,
Rahul

On Wed, May 22, 2013 at 8:16 PM, John Lilley john.lil...@redpoint.net wrote:

This brings up another nagging question I've had for some time. Between HDFS and shuffle, there seems to be the potential for "every node connecting to every other node" via TCP. Are there explicit mechanisms in place to manage or limit simultaneous connections? Is the protocol simply robust enough to allow the server side to disconnect at any time to free up slots, with the client side retrying the request?

Thanks
john

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: Wednesday, May 22, 2013 8:38 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

As mentioned by Bertrand, Hadoop: The Definitive Guide is, well... a really definitive :) place to start. It is pretty thorough for a start, and once you have gone through it, the code will start making more sense too.

Regards,
Shahab

On Wed, May 22, 2013 at 10:33 AM, John Lilley john.lil...@redpoint.net wrote:

Oh I see. Does this mean there is another service and TCP listen port for this purpose? Thanks for your indulgence… I would really like to read more about this without bothering the group, but I am not sure where to start to learn these internals other than the code.

john

From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to its local disk; the reduce tasks will pull the data through HTTP for further processing.
On 21.05.2013, at 19:57, John Lilley john.lil...@redpoint.net wrote:

When MapReduce enters "shuffle" to partition the tuples, I am assuming that it writes intermediate data to HDFS. What replication factor is used for those temporary files?

john

--
Kai Voigt
k...@123.org

--
http://www.lingcc.com
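For reference, a minimal sketch (Hadoop 1.x MR1; the values are arbitrary examples, not recommendations) of where the knob discussed above is set:

import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Number of parallel fetch threads each reducer uses to copy map output
    // (default 5, as noted above). This is a per-job setting.
    conf.setInt("mapred.reduce.parallel.copies", 10);
    System.out.println("parallel copies = "
        + conf.getInt("mapred.reduce.parallel.copies", 5));
    // tasktracker.http.threads (mentioned by Rahul) is a cluster-side setting:
    // it belongs in the TaskTracker's mapred-site.xml, not in a per-job conf.
  }
}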
Re: Project ideas
Hi Anshuman,

Since MR works like this: split the input, map the splits to different nodes, run them in parallel, and combine the results, I would suggest you look into applications of divide-and-conquer algorithms and port or rewrite one of them in Hadoop MapReduce.

yours,
Ling Kun

On Tue, May 21, 2013 at 9:35 PM, Anshuman Mathur ans...@gmail.com wrote:

Hello fellow users,

We are a group of students studying at the National University of Singapore. As part of our course curriculum we need to develop an application using Hadoop and map-reduce. Can you please suggest some innovative ideas for our project?

Thanks in advance.
Anshuman

--
http://www.lingcc.com
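In case a concrete shape helps, the classic word count (my own sketch, not something proposed in the thread) shows the split/map-in-parallel/combine pattern the suggestion above refers to:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // "Divide": each mapper sees only its own split of the input.
      for (String word : value.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          context.write(new Text(word), ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // "Combine": partial counts for the same word are merged here.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}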
Re: cloudera4.2 source code ant
Hi dylan,

I have not built the CDH source code using ant, but I have met a similar "dependencies resolve failed" problem. In my experience this is usually a package download / network issue. You may try to remove the .ivy2 and .m2 directories in your home directory, then run ant clean; ant to try again.

Hope it is helpful to you.

yours,
Kun Ling

On Fri, May 17, 2013 at 4:42 PM, dylan dwld0...@gmail.com wrote:

hello,

There is a problem I can't resolve. I want to connect remotely to Hadoop (Cloudera CDH4.2.0) via the Eclipse plugin. There is no hadoop-eclipse-plugin.jar, so I downloaded the CDH4.2.0 tarball, and when I compile, the error is below:

ivy-resolve-common:
[ivy:resolve] :: resolving dependencies :: org.apache.hadoop#eclipse-plugin;working@master
[ivy:resolve]   confs: [common]
[ivy:resolve]   found commons-logging#commons-logging;1.1.1 in maven2
[ivy:resolve] :: resolution report :: resolve 5475ms :: artifacts dl 2ms
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      common      |   2   |   0   |   0   |   0   ||   1   |   0   |
        ---------------------------------------------------------------------
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve] WARNINGS
[ivy:resolve]   ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]   ::          UNRESOLVED DEPENDENCIES         ::
[ivy:resolve]   ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]   :: log4j#log4j;1.2.16: several problems occurred while resolving dependency: log4j#log4j;1.2.16 {common=[master]}:
[ivy:resolve]   reactor-repo: unable to get resource for log4j#log4j;1.2.16: res=${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.pom: java.net.MalformedURLException: no protocol: ${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.pom
[ivy:resolve]   reactor-repo: unable to get resource for log4j#log4j;1.2.16: res=${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.jar: java.net.MalformedURLException: no protocol: ${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.jar
[ivy:resolve]   ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

BUILD FAILED
/home/paramiao/hadoop-2.0.0-mr1-cdh4.2.0/src/contrib/build-contrib.xml:440: impossible to resolve dependencies: resolve failed - see output for details

So could someone tell me where I am wrong and how I could make it succeed?

Best regards!

--
http://www.lingcc.com
Re: recursive list in java without block
Hi Ankit,

Following Harsh's advice, I have found that although neither FileSystem.java nor DistributedFileSystem.java supports a recursive listStatus(), FsShell.java does have an ls() method that implements hadoop commands like lsr (the equivalent of ls -R on Linux).

yours,
Kun Ling

On Fri, May 17, 2013 at 6:59 AM, Harsh J ha...@cloudera.com wrote:

The FileSystem API doesn't provide a utility to do recursive listing yet, so you'd have to build it on your own. MR and the FsShell both do seem to have inbuilt support for such a utility though.

On Fri, May 17, 2013 at 3:25 AM, Ankit Bhatnagar ankit_impress...@yahoo.com wrote:

Hi folks,

How can I get a recursive listing of files using Java code from HDFS (hadoop 0.23.7*), i.e. the equivalent of ls -R?

Ankit

--
Harsh J

--
http://www.lingcc.com
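Following Harsh's "build it on your own" advice, here is a minimal sketch (my own, using the plain FileSystem API; isDir() is the method available in the 0.23/1.x line) that walks a directory tree much like hadoop fs -lsr does:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveLs {
  // Print every path under dir, descending into subdirectories.
  public static void listRecursive(FileSystem fs, Path dir) throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath());
      if (status.isDir()) {
        listRecursive(fs, status.getPath());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    listRecursive(fs, new Path(args[0]));
  }
}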