Re: MapReduce code location

2013-08-20 Thread Kun Ling
Hi Y. Dong,
Here are answers to your questions:

1. will hadoop instantiate multiple instances of this class then transmit
them to every remote machine?
ANSWER: Each TaskTracker in the Hadoop cluster will create an instance of
your Map class; the transmission of the data is handled by other parts of
the framework.

Each TaskTracker starts a JVM, which creates an object of your Map class
and feeds the key-value pairs of your input data to your map method. The
shuffle phase then passes the map output data to the reduce method.

2.  in a remote machine will the map(…) method be able to access List A and
List B locally from its own memory?

ANSWER: Because each TaskTracker node has its own Map object, List A and
List B exist only in that node's local memory.
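
For illustration, here is a minimal sketch of that behaviour using the
new-API Mapper (the class and field names are made up for the example, they
are not from your code):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Built once per map task, inside that task's own JVM on the TaskTracker;
  // nothing here is shared across machines.
  private final List<String> listA = new ArrayList<String>();

  @Override
  protected void setup(Context context) {
    listA.add("initialized by " + context.getTaskAttemptID());
    // This println goes to the task's stdout log on the node running the
    // task (visible through the JobTracker web UI), not to the console of
    // the machine that submitted the job.
    System.out.println("I'm initializing");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    System.out.println("I'm inside a mapper; listA has " + listA.size()
        + " entries");
    context.write(value, key);
  }
}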



Hoping the above answers help you.


yours,

Kun Ling



On Tue, Aug 20, 2013 at 6:06 PM, Y. Dong tq00...@gmail.com wrote:

 Hi All,

 I'm a Mapreduce newbie, what I want to know is that,  say I have a mapper
 class:

 public class Map implements Mapper {

 public List A;
 public static List B;

 public Map(){   //class constructor
 System.out.println("I'm initializing");
 }

 @Override
 protected void map(………){
 System.out.println("I'm inside a mapper");
 …….
 }

 }

 when I run this mapper on a multi-machine hadoop configuration, will
 hadoop instantiate
 multiple instances of this class then transmit them to every remote
 machine?  So in a remote
 machine will the map(…) method be able to access List A and List B locally
 from its own memory?
 If yes, in the map method, what if I run System.out.println, will printed
 message be only shown on
 the remote machine but not the machine I start the whole map reduce job?

 Thanks.

 Eason




-- 
http://www.lingcc.com


Re: Is there any possible way to use hostname variable in mapred-site.xml file

2013-08-19 Thread Kun Ling
Hi Binglin,
   Thanks for your kind help.

   Your advice works well for me.


yours,
Kun Ling


On Thu, Aug 15, 2013 at 5:51 PM, Binglin Chang decst...@gmail.com wrote:

 How about adding -Dhost.name=`hostname` to HADOOP_OPTS
 and then referencing the variable in the config file as ${host.name}?
 I have not tried this myself, but you can give it a try.
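
 An untested sketch of that idea (it assumes the standard conf/ layout; the
 host.name property name is just the one suggested above):

   # In conf/hadoop-env.sh on every node (the file stays identical everywhere):
   export HADOOP_OPTS="$HADOOP_OPTS -Dhost.name=`hostname`"

   # Then in conf/mapred-site.xml:
   <property>
     <name>mapred.local.dir</name>
     <value>/var/mapred_local/${host.name}/</value>
   </property>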


 On Thu, Aug 15, 2013 at 5:26 PM, Kun Ling lkun.e...@gmail.com wrote:

 Hi all,
   I have a Hadoop MapReduce cluster in which I want to adjust
 mapred.local.dir so that each TaskTracker writes to a mapred.local.dir
 with a different name, while keeping the conf file identical on every node
 to make deployment easier.

 Currently, my plan is for each TaskTracker to have its own hostname in its
 mapred.local.dir configuration, so the configuration in mapred-site.xml
 looks like this:

   <property>
     <name>mapred.local.dir</name>
     <value>/var/mapred_local/HOSTNAME/</value>
   </property>

  The problem is how to make the TaskTracker fill in HOSTNAME automatically.
 I have looked through all the .xml files in conf/ and the jar files, but
 only found the variable ${user.name}, which can be used to indicate the
 current hadoop username.


 Thanks very much .


 yours,
 Kun Ling


 --
 http://www.lingcc.com





-- 
http://www.lingcc.com


Is there any possible way to use hostname variable in mapred-site.xml file

2013-08-15 Thread Kun Ling
Hi all,
   I have a Hadoop MapReduce cluster in which I want to adjust
mapred.local.dir so that each TaskTracker writes to a mapred.local.dir
with a different name, while keeping the conf file identical on every node
to make deployment easier.

Currently, my plan is for each TaskTracker to have its own hostname in its
mapred.local.dir configuration, so the configuration in mapred-site.xml
looks like this:

  <property>
    <name>mapred.local.dir</name>
    <value>/var/mapred_local/HOSTNAME/</value>
  </property>

 The problem is how to make the TaskTracker fill in HOSTNAME automatically.
I have looked through all the .xml files in conf/ and the jar files, but
only found the variable ${user.name}, which can be used to indicate the
current hadoop username.


Thanks very much .


yours,
Kun Ling


-- 
http://www.lingcc.com


Re: MapReduce on Local FileSystem

2013-06-04 Thread Kun Ling
Hi Agarwal,
   I once had similar questions and did some experiments. Here is my
experience:

1. For applications on top of MR, like HBase and Hive, which do not need to
submit additional files to HDFS, file:/// works well without any problem
(according to my tests).

2. For plain MR applications, like TeraSort, there are some problems with
simply using file:///. MR keeps the control files it needs on both the
shared filesystem and the local filesystem, tracked in a single list that
it looks paths up in; using file:/// makes the shared FS look identical to
the local filesystem, even though they are really two different kinds of
filesystem with different path-conversion rules.

For the 2nd issue, you can create a new shared-filesystem class by deriving
from the existing org.apache.hadoop.fs.FileSystem. I have created such a
repository with an example filesystem implementation
(https://github.com/Lingcc/hadoop-lingccfs), hoping it is helpful to you.
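
For reference, here is a very rough sketch of the "derive a new filesystem"
idea; the class name and scheme are illustrative only and are not the
actual hadoop-lingccfs code:

import java.net.URI;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class SharedLocalFileSystem extends RawLocalFileSystem {

  // Expose a scheme other than file:// so Hadoop's path-resolution rules
  // keep the "shared" filesystem distinct from each node's local one,
  // while the actual I/O is still delegated to the local-FS code.
  @Override
  public URI getUri() {
    return URI.create("sharedlocal:///");
  }
}

You would then point Hadoop at the class in core-site.xml (for example an
fs.sharedlocal.impl property naming it) and use sharedlocal:// paths for the
shared storage. A real implementation, like the one in the repository above,
needs more than this, but the deriving pattern is the same.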


yours,
Ling Kun.




On Fri, May 31, 2013 at 2:37 PM, Agarwal, Nikhil
nikhil.agar...@netapp.comwrote:

  Hi, 

 Is it possible to run MapReduce on *multiple nodes* using the local file
 system (file:///)?

 I am able to run it in a single-node setup, but in a multi-node setup the
 “slave” nodes are not able to access the “jobtoken” file, which is present
 in hadoop.tmp.dir on the “master” node.

 Please let me know if it is possible to do this.

 Thanks & Regards,

 Nikhil




-- 
http://www.lingcc.com


Re: How is sharing done in HDFS ?

2013-05-22 Thread Kun Ling
Hi, Agarwal,
Hadoop just puts the jobtoken, _partition.lst, and any other files that
need to be shared into a per-job directory under hdfs://namenode:port/tmp/.

   All the TaskTrackers then access these files from that shared tmp
directory, just the way they share the input files in HDFS.
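
For illustration, any node can read one of these shared files with the
ordinary FileSystem API, which is exactly the mechanism the TaskTrackers
rely on (the staging path below is made up for the example):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadSharedJobFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Resolves to the cluster's default hdfs://namenode:port filesystem.
    FileSystem fs = FileSystem.get(conf);
    // Illustrative path only; the real per-job directory name differs.
    Path shared = new Path("/tmp/hadoop/mapred/staging/job_x/job.xml");
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(shared)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    } finally {
      in.close();
    }
  }
}

No nfs mount is involved: the files live in HDFS, and every node simply
reads them through the HDFS client, the same way it reads job input.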



yours,
Ling Kun


On Wed, May 22, 2013 at 4:29 PM, Agarwal, Nikhil
nikhil.agar...@netapp.comwrote:

  Hi,


 Can anyone guide me to some pointers or explain how HDFS shares the
 information put in the temporary directories (hadoop.tmp.dir,
 mapred.tmp.dir, etc.) with all the other nodes?

 I suppose that during execution of a MapReduce job, the JobTracker
 prepares a file called jobtoken and puts it in the temporary directories,
 which needs to be read by all TaskTrackers. So, how does HDFS share the
 contents? Does it use an nfs mount or ….?

 Thanks & Regards,

 Nikhil





-- 
http://www.lingcc.com


Re: Shuffle phase replication factor

2013-05-22 Thread Kun Ling
Hi John,


   1. On the limit on the number of simultaneous connections: you can
configure this with the mapred.reduce.parallel.copies property; the default
is 5.

   2. As for the implication that one side aggressively disconnects: the
impact is only small. Normally each reducer will connect to each mapper
task and ask for its partitions of the map output file, but each reducer
uses only about 5 simultaneous connections to fetch map output. So even for
a large MR cluster with 1000 nodes running a huge MR job with 1000 mappers
and 1000 reducers, each node sees only about 5 fetch connections at a time.


  3. As for what happens to pending/failing connections, the short answer
is: the reducer just tries to reconnect. There is a list that maintains all
the map outputs that still need to be copied, and an element is removed
only when its map output has been copied successfully; a loop keeps going
over the list and fetching the corresponding map outputs.


  All of the above is based on the Hadoop 1.0.4 source code, especially
the ReduceTask.java file.
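
For reference, here is a sketch of the two knobs mentioned above as they
would appear in mapred-site.xml (the values shown are the usual 1.x
defaults; treat this as a sketch, not tuning advice):

  <property>
    <!-- parallel fetch connections per reduce task -->
    <name>mapred.reduce.parallel.copies</name>
    <value>5</value>
  </property>
  <property>
    <!-- HTTP worker threads serving map output on each TaskTracker -->
    <name>tasktracker.http.threads</name>
    <value>40</value>
  </property>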

yours,
Ling Kun


On Wed, May 22, 2013 at 10:57 PM, John Lilley john.lil...@redpoint.netwrote:

  Um, is that also the limit for the number of simultaneous
 connections?  In general, one does not need a 1:1 map between threads and
 connections.

 If this is the connection limit, does it imply that the client or server
 side aggressively disconnects after a transfer?

 What happens to the pending/failing connection attempts that exceed the
 limit?

 Thanks!

 john


 *From:* Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
 *Sent:* Wednesday, May 22, 2013 8:52 AM

 *To:* user@hadoop.apache.org
 *Subject:* Re: Shuffle phase replication factor


 There are configuration properties to control the number of copier threads
 used for the copy, e.g.:
 tasktracker.http.threads=40
 Thanks,
 Rahul


 On Wed, May 22, 2013 at 8:16 PM, John Lilley john.lil...@redpoint.net
 wrote:

 This brings up another nagging question I’ve had for some time.  Between
 HDFS and shuffle, there seems to be the potential for “every node
 connecting to every other node” via TCP.  Are there explicit mechanisms in
 place to manage or limit simultaneous connections?  Is the protocol simply
 robust enough to allow a server-side to disconnect at any time to free up
 slots and the client-side will retry the request?

 Thanks

 john

  

 *From:* Shahab Yunus [mailto:shahab.yu...@gmail.com]
 *Sent:* Wednesday, May 22, 2013 8:38 AM


 *To:* user@hadoop.apache.org
 *Subject:* Re: Shuffle phase replication factor

  

  As mentioned by Bertrand, Hadoop: The Definitive Guide is, well... a really
 definitive :) place to start. It is pretty thorough for starters, and once
 you have gone through it, the code will start making more sense too.

  

 Regards,

 Shahab

  

 On Wed, May 22, 2013 at 10:33 AM, John Lilley john.lil...@redpoint.net
 wrote:

 Oh I see.  Does this mean there is another service and TCP listen port for
 this purpose?

 Thanks for your indulgence… I would really like to read more about this
 without bothering the group but not sure where to start to learn these
 internals other than the code.

 john

  

 *From:* Kai Voigt [mailto:k...@123.org]
 *Sent:* Tuesday, May 21, 2013 12:59 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Shuffle phase replication factor

  

 The map output doesn't get written to HDFS. The map task writes its output
 to its local disk, the reduce tasks will pull the data through HTTP for
 further processing.

  

 Am 21.05.2013 um 19:57 schrieb John Lilley john.lil...@redpoint.net:

  

 When MapReduce enters “shuffle” to partition the tuples, I am assuming
 that it writes intermediate data to HDFS.  What replication factor is used
 for those temporary files?

 john

  

  

 -- 

 Kai Voigt

 k...@123.org

  

  

  

  





-- 
http://www.lingcc.com


Re: Project ideas

2013-05-21 Thread Kun Ling
Hi Anshuman,
   Since MR works by splitting the input, mapping the pieces to different
nodes, running them in parallel, and combining the results, I would suggest
you look into applications of divide-and-conquer algorithms and port or
rewrite one of them in Hadoop MapReduce.
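
To make that shape concrete, here is the classic word count against the new
(mapreduce) API; it is only an illustration of the split / map-in-parallel
/ combine pattern, not a project suggestion in itself:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // "Divide": each mapper processes one split of the input independently.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // "Combine": the framework groups by key and the reducer merges the
  // partial results produced on different nodes.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}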

yours,
Ling Kun


On Tue, May 21, 2013 at 9:35 PM, Anshuman Mathur ans...@gmail.com wrote:

 Hello fellow users,

 We are a group of students studying in National University of Singapore.
 As part of our course curriculum we need to develop an application using
 Hadoop and  map-reduce. Can you please suggest some innovative ideas for
 our project?

 Thanks in advance.

 Anshuman




-- 
http://www.lingcc.com


Re: cloudera4.2 source code ant

2013-05-17 Thread Kun Ling
Hi dylan,

 I have not built the CDH source code with ant; however, I have run into a
similar "unresolved dependencies" problem.

In my experience, this usually turns out to be a network/package
download issue.

You may try removing the .ivy2 and .m2 directories in your home
directory, then running "ant clean" and "ant" again.

   Hope it is helpful to you.


yours,
Kun Ling


On Fri, May 17, 2013 at 4:42 PM, dylan dwld0...@gmail.com wrote:

 hello,

  there is a problem I can't resolve: I want to connect remotely to the
 hadoop (cloudera cdh4.2.0) cluster via the eclipse plugin. There is no
 hadoop-eclipse-plugin.jar, so I downloaded the hadoop cdh4.2.0 tarball, and
 when I compile it, the error is below:

  

 ivy-resolve-common:

 [ivy:resolve] :: resolving dependencies ::
 org.apache.hadoop#eclipse-plugin;working@master

 [ivy:resolve]    confs: [common]

 [ivy:resolve]    found commons-logging#commons-logging;1.1.1 in maven2

 [ivy:resolve] :: resolution report :: resolve 5475ms :: artifacts dl 2ms

     ---------------------------------------------------------------------
     |          |            modules            ||       artifacts       |
     |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
     ---------------------------------------------------------------------
     |  common  |   2   |   0   |   0   |   0   ||   1   |   0   |
     ---------------------------------------------------------------------

 [ivy:resolve]
 [ivy:resolve] :: problems summary ::
 [ivy:resolve]    WARNINGS
 [ivy:resolve]        ::
 [ivy:resolve]        ::  UNRESOLVED DEPENDENCIES  ::
 [ivy:resolve]        ::
 [ivy:resolve]        :: log4j#log4j;1.2.16: several problems occurred
 while resolving dependency: log4j#log4j;1.2.16 {common=[master]}:
 [ivy:resolve]    reactor-repo: unable to get resource for
 log4j#log4j;1.2.16:
 res=${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.pom:
 java.net.MalformedURLException: no protocol:
 ${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.pom
 [ivy:resolve]    reactor-repo: unable to get resource for
 log4j#log4j;1.2.16:
 res=${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.jar:
 java.net.MalformedURLException: no protocol:
 ${reactor.repo}/log4j/log4j/1.2.16/log4j-1.2.16.jar
 [ivy:resolve]        ::
 [ivy:resolve]
 [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 BUILD FAILED
 /home/paramiao/hadoop-2.0.0-mr1-cdh4.2.0/src/contrib/build-contrib.xml:440:
 impossible to resolve dependencies:
     resolve failed - see output for details

 so could someone tell me where I am wrong and how I could make it succeed?


 best regards!





-- 
http://www.lingcc.com


Re: recursive list in java without block

2013-05-17 Thread Kun Ling
Hi Ankit,

   Following Harsh's advice, I found that although neither FileSystem.java
nor DistributedFileSystem.java supports a recursive listStatus(),
FsShell.java does have an ls() method that is used to implement the hadoop
lsr command (that is, ls -R in Linux).
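
If you do want to build the recursive listing yourself on top of the plain
FileSystem API (as Harsh suggests below), a minimal sketch could look like
this (the class name is made up; isDir() is the 0.23/1.x-era method, it
becomes isDirectory() in newer releases):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveLs {

  // Print every file and directory under 'dir', depth-first, like ls -R.
  static void lsr(FileSystem fs, Path dir) throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath());
      if (status.isDir()) {
        lsr(fs, status.getPath());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    lsr(fs, new Path(args[0]));
  }
}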


yours,

Kun Ling


On Fri, May 17, 2013 at 6:59 AM, Harsh J ha...@cloudera.com wrote:

 The FileSystem API doesn't provide a utility to do recursive listing
 yet, so you'd have to build it on your own.

 MR and the FsShell both seem to have inbuilt support for such a
 utility, though.

 On Fri, May 17, 2013 at 3:25 AM, Ankit Bhatnagar
 ankit_impress...@yahoo.com wrote:
  Hi folks,
 
  How can I get a recursive listing of files using java code from HDFS
  (hadoop 0.23.7*)?
  I mean the equivalent of ls -R?
 
  Ankit



 --
 Harsh J




-- 
http://www.lingcc.com