Running Hadoop client as a different user
I have been running Hadoop on a cluster set to not check permissions. I run a Java client on my local machine, and it runs as the local user on the cluster. My code does something like:

String connectString = "hdfs://" + host + ":" + port + "/";
Configuration config = new Configuration();
config.set("fs.default.name", connectString);
FileSystem fs = FileSystem.get(config);

The above code works. I am now trying to port to a cluster where permissions are checked. I have an account, but I need to set a user and password to avoid AccessControlExceptions. How do I do this? And if I can only access certain directories, how do I do that? Also, are there directories my code MUST be able to access outside those for my user only? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
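A minimal sketch of one way to do this on a cluster that uses simple (non-Kerberos) authentication. The account name "slewis" and the address namenode:9000 are placeholders, not anything from the original message. With simple authentication there is no password; the NameNode checks permissions against whatever user name the client presents, so wrapping the calls in a UserGroupInformation.doAs block is usually enough:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class HdfsClientAsUser {
    public static void main(String[] args) throws Exception {
        final Configuration config = new Configuration();
        config.set("fs.default.name", "hdfs://namenode:9000/"); // placeholder host:port

        // Act as the cluster account instead of the local OS user (assumed account name).
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("slewis");
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                FileSystem fs = FileSystem.get(config);
                // Directories under /user/<account> are normally writable by the account;
                // anything else depends on how the cluster admins have set permissions.
                System.out.println(fs.exists(new Path("/user/slewis")));
                fs.close();
                return null;
            }
        });
    }
}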
Re: Hadoop noob question
@Tariq, can you point me to some resource that shows how distcp is used to upload files from local to HDFS? Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's file system? Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.com wrote: You're welcome :) Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq donta...@gmail.com wrote: @Rahul : Yes, distcp can do that. And the bigger the files, the less metadata there is, hence less memory consumption. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need a capable NN. Can distcp be used to copy local to HDFS? Thanks, Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar nitinpawar...@gmail.com wrote: Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq donta...@gmail.com wrote: Sorry for barging in, guys. I think Nitin is talking about this: every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in memory. Actually, memory is the most important metric when it comes to the NN. Am I correct, @Nitin? @Thoihen : As Nitin has said, when you talk about that much data you don't actually just do a put. You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own data aggregation tool, called Scribe, for this purpose. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar nitinpawar...@gmail.com wrote: The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will be on the NN, so it is always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: @Nitin, parallel DFS writes to HDFS are great, but I could not understand the meaning of a "capable NN". As I know, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the DFS client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored. Thanks, Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar nitinpawar...@gmail.com wrote: Is it safe? There is no direct yes-or-no answer. When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture: 1) Is the machine in the same network as your Hadoop cluster? 2) Is there a guarantee that the network will not go down? And most importantly, I assume that you have a capable Hadoop cluster, by which I mean a capable NameNode. I would definitely not write files sequentially to HDFS. I would prefer to write files in parallel to HDFS, to utilize the DFS write features and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.

On Sat, May 11, 2013 at 4:38 PM, maisnam ns maisnam...@gmail.com wrote: @Nitin Pawar, thanks for clearing my doubts. But I have one more question: say I have 10 TB of data in the pipeline. Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size using the Hadoop command line? Can the hadoop put command line work with huge data? Thanks in advance.

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar nitinpawar...@gmail.com wrote: First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where the data is written to HDFS on a regular frequency, retained on HDFS for whatever duration is needed, and from there sent to archival or deleted. For data management products, you can look at Falcon, which was open-sourced by InMobi along with Hortonworks. In any case, if you want to write files to HDFS there are a few options available to you (a small sketch of the first option follows this message): 1) Write your own DFS client which writes to DFS 2) Use the HDFS proxy 3) Use WebHDFS 4) Use the hdfs command line 5) Data collection tools such as Flume come with support for writing to HDFS

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam thoihen...@gmail.com wrote: Hi All, Can anyone help me know how does
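A minimal sketch of option (1) above: a small client that pushes a local file into HDFS through the FileSystem API. The NameNode address and the paths are placeholders, and for a large backlog you would run several copies of this in parallel rather than one sequential loop, as Nitin suggests:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020/"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        // Copy one local file into HDFS; launch several of these processes side by side
        // to get parallel writes.
        fs.copyFromLocalFile(new Path("file:///data/incoming/part-0001.log"),
                             new Path("/user/thoihen/incoming/part-0001.log"));
        fs.close();
    }
}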
Re: Hadoop noob question
You can do that using file:///. Example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: @Tariq, can you point me to some resource that shows how distcp is used to upload files from local to HDFS? Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's file system? Rahul
Re: Need help about task slots
@Rahul : I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy: *bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/* And yes, when you use distcp from local to HDFS you can't take advantage of parallelism, as the data is stored in a non-distributed fashion. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.com wrote: Hello guys, my 2 cents: the number of mappers is primarily governed by the number of InputSplits created by the InputFormat you are using, and the number of reducers by the number of partitions you get after the map phase. Having said that, you should also keep the number of slots available per slave in mind, along with the available memory. But as a general rule you could use this approach: take the number of virtual CPUs * 0.75, and that's the number of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirement, you could choose how many mappers and reducers you want to use. With 18 MR slots you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split is OK for you. I don't know if it makes much sense, but it works pretty decently for me. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I am also new to the Hadoop world; here is my take on your question, and if something is missing then others will surely correct it. Pre-YARN, the slots are fixed and computed based on the crunching capacity of the datanode hardware; once the slots per datanode are ascertained, they are divided into map and reduce slots, and that goes into the config files and remains fixed until changed. In YARN, it is decided at runtime based on the requirements of the particular task. It is quite possible that at a certain point in time one datanode is running 10 tasks while another similar datanode is only running 4 tasks. Coming to your question: based on the data set size, the DFS block size, and the input format, the number of map tasks is decided. Generally, for file-based input formats it is one mapper per data block, although there are ways to change this via configuration settings. The number of reduce tasks is set in the job configuration. A general rule I have read in various documents is that mappers should run for at least a minute, so you can run a sample to find a good data block size that makes your mappers run for more than a minute. It then also depends on your SLA; if you are not looking for a very tight SLA, you can choose to run fewer mappers at the expense of a higher runtime. But again, this is all theory; I am not sure how these things are handled in actual production clusters. HTH, Thanks, Rahul

On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi Users, I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots would be required for a job? Is there any empirical formula, or on what basis should I set the number of task slots? Thanks in advance
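A small worked version of the rule of thumb above; the numbers and the even map/reduce split are only illustrative assumptions, not anything Hadoop computes for you:

public class SlotEstimate {
    // Tariq's heuristic: roughly 0.75 slots per virtual core.
    static int estimateSlots(int virtualCores) {
        return (int) (virtualCores * 0.75);
    }

    public static void main(String[] args) {
        int slots = estimateSlots(24);        // 24 virtual cores -> 18 slots
        int maps = slots / 2;                 // one possible split: 9 map slots
        int reduces = slots - maps;           // and 9 reduce slots
        System.out.println(slots + " slots: " + maps + " map + " + reduces + " reduce");
    }
}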
Re: Hadoop noob question
@Rahul : I'm sorry, I answered this on the wrong thread by mistake. You could do it as Nitin has shown. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can do that using file:/// example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
Re: Need help about task slots
Sorry for the blunder, guys. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.com wrote: @Rahul : I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy: *bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/* And yes, when you use distcp from local to HDFS you can't take advantage of parallelism, as the data is stored in a non-distributed fashion.
Re: Hadoop noob question
Thanks to both of you! Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can do that using file:/// example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
Re: Need help about task slots
Hi, the concept of task slots is used in MRv1. The newer version of Hadoop, MRv2, uses YARN instead of slots. You can read about it in Hadoop: The Definitive Guide, 3rd edition. Sent from my iPhone

On 2013-5-12, at 20:11, Mohammad Tariq donta...@gmail.com wrote: Sorry for the blunder guys. Warm Regards, Tariq cloudfront.blogspot.com
Re: Need help about task slots
Oh! I thought distcp worked on complete files rather than one mapper per data block, so I guessed parallelism would still be there if there are multiple files. Please correct me if anything is wrong. Thanks, Rahul

On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.com wrote: @Rahul : I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy: *bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/* And yes, when you use distcp from local to HDFS you can't take advantage of parallelism, as the data is stored in a non-distributed fashion.
Re: Need help about task slots
Sorry for my blunder as well; my previous post was meant for Tariq but went to the wrong thread. Thanks, Rahul

On Sun, May 12, 2013 at 6:03 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Oh! I thought distcp worked on complete files rather than one mapper per data block, so I guessed parallelism would still be there if there are multiple files. Please correct me if anything is wrong. Thanks, Rahul
Re: Need help about task slots
Hahaha.. I think we could continue this over there. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Sorry for my blunder as well; my previous post was meant for Tariq but went to the wrong thread. Thanks. Rahul
Re: Hadoop noob question
No. distcp is actually a MapReduce job under the hood. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks to both of you! Rahul
Re: Hadoop noob question
I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism, as the entire file is present on a single machine. So no multiple TTs. Please comment if you think I am wrong somewhere. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Yes, it's an MR job under the hood. My question was about your statement that with distcp you lose the benefits of Hadoop's parallel processing. I think the distcp MR job divides files into individual map tasks based on the total size of the transfer, so multiple mappers would still be spawned if the transfer is huge, and they would work in parallel. Correct me if there is anything wrong! Thanks, Rahul

On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq donta...@gmail.com wrote: No. distcp is actually a MapReduce job under the hood.
Re: Hadoop noob question
Yeah, you are right, I misread your earlier post. Thanks, Rahul

On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq donta...@gmail.com wrote: I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism, as the entire file is present on a single machine. So no multiple TTs. Please comment if you think I am wrong somewhere.
Re: Hadoop noob question
This is what I would say : The number of maps is decided as follows. Since it’s a good idea to get each map to copy a reasonable amount of data to minimize overheads in task setup, each map copies at least 256 MB (unless the total size of the input is less, in which case one map handles it all). For example, 1 GB of files will be given four map tasks. When the data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization. By default, the maximum number of maps is 20 per (tasktracker) cluster node. For example, copying 1,000 GB of files to a 100-node cluster will allocate 2,000 maps (20 per node), so each will copy 512 MB on average. This can be reduced by specifying the-m argument to * distcp*. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB on average. HTH Warm Regards, Tariq cloudfront.blogspot.com On Sun, May 12, 2013 at 6:35 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Soon after replying I realized something else related to this. Say we have a single file in HDFS (hdfs configured for default block size 64 MB) and the size of the file is 1 GB. Now if we use distcp to move it from the current hdfs to another one , then whether there would be any parallelism or just a single map task would be fired? As per what I have read , a mapper is launcher for a complete file or a set of files. It doesn't operate at block level.So no parallelism even if the file resides in HDFS. Thanks, Rahul On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: yeah you are right I mis read your earlier post. Thanks, Rahul On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq donta...@gmail.comwrote: I had said that if you use distcp to copy data *from localFS to HDFS*then you won't be able to exploit parallelism as entire file is present on a single machine. So no multiple TTs. Please comment if you think I am wring somewhere. Warm Regards, Tariq cloudfront.blogspot.com On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Yes , it's a MR job under the hood . my question was that you wrote that using distcp you loose the benefits of parallel processing of Hadoop. I think the MR job of distcp divides files into individual map tasks based on the total size of the transfer , so multiple mappers would still be spawned if the size of transfer is huge and they would work in parallel. Correct me if there is anything wrong! Thanks, Rahul On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq donta...@gmail.comwrote: No. distcp is actually a mapreduce job under the hood. Warm Regards, Tariq cloudfront.blogspot.com On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks to both of you! Rahul On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can do that using file:/// example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/ On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: @Tariq can you point me to some resource which shows how distcp is used to upload files from local to hdfs. isn't distcp a MR job ? wouldn't it need the data to be already present in the hadoop's fs? Rahul On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.com wrote: You'r welcome :) Warm Regards, Tariq cloudfront.blogspot.com On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks Tariq! 
Re: Submitting a hadoop job in large clusters.
On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar nitinpawar...@gmail.com wrote: normally if you want to copy the jar then hadoop admins setu Submit your job to the JobTracker; it will distribute it throughout the tasktrackers. *Thanks Regards* ∞ Shashwat Shriparv
Re: Permissions
The user through which you are trying to run the task should have permission on HDFS. Just verify that. *Thanks Regards* ∞ Shashwat Shriparv On Sat, May 11, 2013 at 1:02 AM, Amal G Jose amalg...@gmail.com wrote: After starting the hdfs, i.e. NN, SN and DN, create an hdfs directory structure in the form /hadoop.tmp.dir/mapred/staging. Then give 777 permission to staging. After that change the ownership of the mapred directory to the mapred user. After doing this start the jobtracker; it will start. Otherwise, it will not start. The reason for not showing any datanodes may be a firewall. Check whether the necessary ports are open. On Tue, Apr 30, 2013 at 2:28 AM, rkevinbur...@charter.net wrote: I looked in the namenode log and I get the following errors: 2013-04-29 15:25:11,646 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Permission denied: *user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x 2013-04-29 15:25:11,646 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 172.16.26.68:45044: error: org.apache.hadoop.security.AccessControlException: Permission denied: *user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x org.apache.hadoop.security.AccessControlException: Permission denied: *user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:205) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:186) When I create the file system I have the user hdfs on the root folder (/). I am not sure how to have both the user mapred and hdfs have access to the root (which it seems these errors are indicating). I get a page from 50070, but when I try to browse the filesystem from the web UI I get an error that there are no nodes listening (I have 3 datanodes and 1 namenode). The browser indicates that there is nothing listening to port 50030, so it seems that the JobTracker is not up.
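A minimal sketch of the staging-directory setup Amal describes, assuming the HDFS superuser is named hdfs and hadoop.tmp.dir is /app/hadoop/tmp (both are placeholders; substitute the values from your own configuration):

sudo -u hdfs hadoop fs -mkdir /app/hadoop/tmp/mapred/staging
sudo -u hdfs hadoop fs -chmod 777 /app/hadoop/tmp/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /app/hadoop/tmp/mapred

This gives the mapred user a place it can write to, which is what the "Permission denied: user=mapred" errors above point at.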
Re: issues with decrease the default.block.size
The block size is for allocation, not storage on the disk. *Thanks Regards* ∞ Shashwat Shriparv On Fri, May 10, 2013 at 8:54 PM, Harsh J ha...@cloudera.com wrote: Thanks. I failed to add: It should be okay to do if those cases are true and the cluster seems under-utilized right now. On Fri, May 10, 2013 at 8:29 PM, yypvsxf19870706 yypvsxf19870...@gmail.com wrote: Hi Harsh, Yep. Regards Sent from my iPhone On 2013-5-10, 13:27, Harsh J ha...@cloudera.com wrote: Are you looking to decrease it to get more parallel map tasks out of the small files? Are you currently CPU bound on processing these small files? On Thu, May 9, 2013 at 9:12 PM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi all, I am going to set up a new hadoop environment. Because there are lots of small files, I would like to change the default.block.size to 16 MB rather than merging the files into large enough ones (e.g. using SequenceFiles). I want to ask: are there any bad influences or issues? Regards -- Harsh J -- Harsh J
Re: hadoop map-reduce errors
Your connection settings to MySQL may not be correct; check that. *Thanks Regards* ∞ Shashwat Shriparv On Fri, May 10, 2013 at 6:12 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Have you checked your connection settings to the MySQL DB? Where and how are you passing the connection properties for the database? Is it accessible from the machine you are running this on? Is the db up? On Thu, May 9, 2013 at 9:32 PM, 丙子 woyaof...@gmail.com wrote: When I run a hadoop job, there are some errors like this: 13/05/10 08:20:59 ERROR manager.SqlManager: Error executing statement: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 28,484 milliseconds ago. The last packet sent successfully to the server was 1 milliseconds ago. com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 28,484 milliseconds ago. The last packet sent successfully to the server was 1 milliseconds ago. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) …… …… at org.apache.sqoop.Sqoop.main(Sqoop.java:238) at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57) Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3039) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3489) ... 24 more How can I resolve it?
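Before digging into the Hadoop side, it can help to confirm the connection details from the same machine with the plain MySQL client and a lightweight Sqoop call; the host, database and user below are placeholders:

mysql -h dbhost -P 3306 -u myuser -p mydb
sqoop list-tables --connect jdbc:mysql://dbhost:3306/mydb --username myuser -P

If either of these hangs or fails, the problem is likely the network path or the MySQL configuration (e.g. bind-address, firewall) rather than the job itself.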
Re: Problem while running simple WordCount program(hadoop-1.0.4) on eclipse.
The user through which you are running your Hadoop: set permissions on the tmp dir for that user. *Thanks Regards* ∞ Shashwat Shriparv On Fri, May 10, 2013 at 5:24 PM, Nitin Pawar nitinpawar...@gmail.com wrote: What are the permissions of your /tmp/ folder? On May 10, 2013 5:03 PM, Khaleel Khalid khale...@suntecgroup.com wrote: Hi all, I am facing the following error when I run a simple WordCount program using hadoop-1.0.4 on Eclipse (Galileo). The map/reduce plugin version I use is 1.0.4 as well. It would be really helpful if someone gives me a solution for the problem. ERROR: 13/05/10 16:53:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 13/05/10 16:53:51 ERROR security.UserGroupInformation: PriviledgedActionException as:khaleelk cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700 Exception in thread main java.io.IOException: Failed to set permissions of path: \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700 at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) at org.apache.hadoop.mapreduce.Job.submit(Job.java:500) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530) at WordCount.main(WordCount.java:65) Thank you in advance.
Re: Problem while running simple WordCount program(hadoop-1.0.4) on eclipse.
It's a /tmp/ folder, so I guess all the users will need access to it. Better to make it a routine Linux-like /tmp folder. On Sun, May 12, 2013 at 11:12 PM, shashwat shriparv dwivedishash...@gmail.com wrote: The user through which you are running your Hadoop: set permissions on the tmp dir for that user. *Thanks Regards* ∞ Shashwat Shriparv -- Nitin Pawar
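On a Linux machine the "routine" /tmp setup Nitin refers to is world-writable with the sticky bit set (mode 1777):

sudo chmod 1777 /tmp
ls -ld /tmp    # should show drwxrwxrwt

Note that the backslash-separated paths in the error above suggest the job was run on Windows, where the local-filesystem permission check in hadoop-1.0.4 is a separate, known problem.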
Re: Submitting a hadoop job in large clusters.
@shashwat shriparv Can a hadoop job be submitted to any datanode in the cluster and not to the JobTracker? Correct me if I am wrong, I was told that a hadoop job can be submitted to a datanode also, apart from the JobTracker. Is it correct? Thanks in advance On Sun, May 12, 2013 at 11:02 PM, shashwat shriparv dwivedishash...@gmail.com wrote: On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar nitinpawar...@gmail.com wrote: normally if you want to copy the jar then hadoop admins setu Submit your job to the JobTracker; it will distribute it throughout the tasktrackers. *Thanks Regards* ∞ Shashwat Shriparv
Re: Submitting a hadoop job in large clusters.
Nope, in MRv1 only the jobtracker can accept jobs. You cannot trigger a job on any other process in hadoop other than the jobtracker. On Sun, May 12, 2013 at 11:25 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: @shashwat shriparv Can a hadoop job be submitted to any datanode in the cluster and not to the JobTracker? Correct me if I am wrong, I was told that a hadoop job can be submitted to a datanode also, apart from the JobTracker. Is it correct? Thanks in advance -- Nitin Pawar
Re: Submitting a hadoop job in large clusters.
As Nitin said, it is the responsibility of the JobTracker to distribute the job as tasks to the tasktrackers, so you need to submit the job to the JobTracker. *Thanks Regards* ∞ Shashwat Shriparv On Sun, May 12, 2013 at 11:26 PM, Nitin Pawar nitinpawar...@gmail.com wrote: Nope, in MRv1 only the jobtracker can accept jobs. You cannot trigger a job on any other process in hadoop other than the jobtracker. -- Nitin Pawar
Eclipse Plugin: HADOOP 2.0.3
Hi, Is there a method to build the ECLIPSE plugin using HADOOP 2.0.3? I am looking at the details in http://wiki.apache.org/hadoop/EclipsePlugIn, but I am not able to find any eclipse-plugin folder in the src. Thanks and Regards, Gourav
Re: Submitting a hadoop job in large clusters.
Which doesn't imply that you should log in to the physical machine where the JobTracker is hosted. It only implies that the hadoop client must be able to reach the JobTracker. It could be from any of the physical machines hosting the slaves (DataNode, TaskTracker), but that is rarely the case. Often, jobs are submitted from a machine which doesn't belong to the cluster but can reach every machine of it. Regards Bertrand On Sun, May 12, 2013 at 7:59 PM, shashwat shriparv dwivedishash...@gmail.com wrote: As Nitin said, it is the responsibility of the JobTracker to distribute the job as tasks to the tasktrackers, so you need to submit the job to the JobTracker. *Thanks Regards* ∞ Shashwat Shriparv
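A rough sketch of what client-side submission looks like in MRv1 terms; the hostnames, ports and class names are placeholders, and passing properties with -D only works when the driver goes through ToolRunner/GenericOptionsParser (otherwise put the same properties in the client's core-site.xml and mapred-site.xml):

hadoop jar my-job.jar com.example.MyDriver \
  -D fs.default.name=hdfs://nnhost:8020 \
  -D mapred.job.tracker=jthost:9001 \
  /input /output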
Re: Wrapping around BitSet with the Writable interface
You can perhaps consider using the experimental JavaSerialization [1] enhancement to skip transforming to Writables/other-serialization-formats. It may be slower but looks like you are looking for a way to avoid transforming objects. Enable by adding the class org.apache.hadoop.io.serializer.JavaSerialization to the list of io.serializations like so in your client configuration:

<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.hadoop.io.serializer.JavaSerialization</value>
</property>

And you should then be able to rely on Java's inbuilt serialization to directly serialize your BitSet object? [1] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/serializer/JavaSerialization.html On Sun, May 12, 2013 at 11:54 PM, Jim Twensky jim.twen...@gmail.com wrote: I have large java.util.BitSet objects that I want to bitwise-OR using a MapReduce job. I decided to wrap around each object using the Writable interface. Right now I convert each BitSet to a byte array and serialize the byte array on disk. Converting them to byte arrays is a bit inefficient but I could not find a work around to write them directly to the DataOutput. Is there a way to skip this and serialize the object directly? Here is what my current implementation looks like:

public class BitSetWritable implements Writable {

  private BitSet bs;

  public BitSetWritable() {
    this.bs = new BitSet();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream(bs.size()/8);
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(bs);
    byte[] bytes = bos.toByteArray();
    oos.close();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int len = in.readInt();
    byte[] bytes = new byte[len];
    in.readFully(bytes);
    ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
    ObjectInputStream ois = new ObjectInputStream(bis);
    try {
      bs = (BitSet) ois.readObject();
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
    ois.close();
  }
}

-- Harsh J
Re: Wrapping around BitSet with the Writable interface
In order to make the code more readable, you could start by using the methods toByteArray() and valueOf(bytes): http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#toByteArray%28%29 http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#valueOf%28byte[]%29 Regards Bertrand -- Bertrand Dechoux
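A minimal sketch of what that change looks like, assuming Java 7+ (toByteArray() and valueOf() were added in 1.7); the get() and or() helpers are only there to illustrate the bitwise-OR use case from the original question:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.BitSet;

import org.apache.hadoop.io.Writable;

public class BitSetWritable implements Writable {

  private BitSet bs = new BitSet();

  @Override
  public void write(DataOutput out) throws IOException {
    // toByteArray() gives the byte image of the set directly, so no
    // ObjectOutputStream or intermediate ByteArrayOutputStream is needed.
    byte[] bytes = bs.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    bs = BitSet.valueOf(bytes);
  }

  public BitSet get() {
    return bs;
  }

  // OR-combine another bitmap into this one, e.g. inside a reducer.
  public void or(BitSetWritable other) {
    bs.or(other.bs);
  }
}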
Re: issues with decrease the default.block.size
The block size controls lots of things in Hadoop. It affects read parallelism, scalability, block allocation and other aspects of operations, either directly or indirectly. On Sun, May 12, 2013 at 10:38 AM, shashwat shriparv dwivedishash...@gmail.com wrote: The block size is for allocation, not storage on the disk. *Thanks Regards* ∞ Shashwat Shriparv
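For completeness, in 1.x the cluster-wide default lives in hdfs-site.xml under dfs.block.size and only affects files written after the change; existing files keep the block size they were created with. A 16 MB setting as discussed above would look like:

<property>
  <name>dfs.block.size</name>
  <value>16777216</value>
</property>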
Re: Wrapping around BitSet with the Writable interface
Another interesting alternative is the EWAH implementation of java bitsets, which allows efficient compressed bitsets with very fast OR operations. https://github.com/lemire/javaewah See also https://code.google.com/p/sparsebitmap/ by the same authors.
Re: Wrapping around BitSet with the Writable interface
You can disregard my links as they are only valid for Java 1.7+. The JavaSerialization might clean up your code but shouldn't bring a significant boost in performance. The EWAH implementation has, at least, the methods you are looking for: serialize / deserialize. Regards Bertrand Note to myself: I have to remember this one. On Sun, May 12, 2013 at 10:27 PM, Ted Dunning tdunn...@maprtech.com wrote: Another interesting alternative is the EWAH implementation of java bitsets, which allows efficient compressed bitsets with very fast OR operations. https://github.com/lemire/javaewah See also https://code.google.com/p/sparsebitmap/ by the same authors. -- Bertrand Dechoux
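A sketch of the EWAH-based variant; the wrapper class and its or() helper are hypothetical, and the exact javaewah method signatures (serialize/deserialize/or, as mentioned above) should be checked against the library version in use:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

import com.googlecode.javaewah.EWAHCompressedBitmap;

public class EwahBitmapWritable implements Writable {

  private EWAHCompressedBitmap bitmap = new EWAHCompressedBitmap();

  @Override
  public void write(DataOutput out) throws IOException {
    // EWAH bitmaps write themselves to a DataOutput, so no intermediate
    // byte[] copy is required.
    bitmap.serialize(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    bitmap = new EWAHCompressedBitmap();
    bitmap.deserialize(in);
  }

  // or() returns a new compressed bitmap holding the union of the two inputs.
  public void or(EwahBitmapWritable other) {
    bitmap = bitmap.or(other.bitmap);
  }
}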
The minimum memory requirements to datanode and namenode?
Hi, I set up a cluster with 3 nodes, and after that I did not submit any job on it. But after a few days, I found the cluster is unhealthy: - No result returned after issuing the command 'hadoop dfs -ls /' or 'hadoop dfsadmin -report' for a while - The page at 'http://namenode:50070' could not be opened as expected... - ... I did not find any useful info in the logs, but found the available memory of the cluster nodes was very low at that time: - node1 (NN, JT, DN, TT): 158 MB mem is available - node2 (DN, TT): 75 MB mem is available - node3 (DN, TT): 174 MB mem is available I guess the issue of my cluster is caused by lack of memory, and my questions are: - Without running jobs, what are the minimum memory requirements for the datanode and namenode? - How to define the minimum memory for the datanode and namenode? Thanks! Sam Liu
Re: The minimum memory requirements to datanode and namenode?
Do you get any error when trying to connect to the cluster, something like 'tried n times' or 'replicated 0 times'?
Re: The minimum memory requirements to datanode and namenode?
Got some exceptions on node3: 1. datanode log: 2013-04-17 11:13:44,719 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_2478755809192724446_1477 received exception java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371 remote=/ 9.50.102.79:50010] 2013-04-17 11:13:44,721 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 9.50.102.80:50010, storageID=DS-2038715921-9.50.102.80-50010-1366091297051, infoPort=50075, ipcPort=50020):DataXceiver java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371 remote=/ 9.50.102.79:50010] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116) at java.io.DataInputStream.readShort(DataInputStream.java:306) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:359) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112) at java.lang.Thread.run(Thread.java:738) 2013-04-17 11:13:44,818 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_8413378381769505032_1477 src: /9.50.102.81:35279 dest: / 9.50.102.80:50010 2. tasktracker log: 2013-04-23 11:48:26,783 INFO org.apache.hadoop.mapred.UserLogCleaner: Deleting user log path job_201304152248_0011 2013-04-30 14:48:15,506 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to node1/9.50.102.81:9001 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:1144) at org.apache.hadoop.ipc.Client.call(Client.java:1112) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) at org.apache.hadoop.mapred.$Proxy2.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:2008) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1802) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2654) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3909) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:210) at sun.nio.ch.IOUtil.read(IOUtil.java:183) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:257) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:127) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:361) at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) at java.io.BufferedInputStream.read(BufferedInputStream.java:248) at java.io.DataInputStream.readInt(DataInputStream.java:381) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:841) at 
org.apache.hadoop.ipc.Client$Connection.run(Client.java:786) 2013-04-30 14:48:15,517 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'node1' with reponseId '-12904 2013-04-30 14:48:16,404 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
Re: The minimum memory requirements to datanode and namenode?
For node3, the memory is:
                     total     used     free   shared  buffers   cached
Mem:                  3834     3666      167        0      187     1136
-/+ buffers/cache:             2342     1491
Swap:                 8196        0     8196
For a 3-node cluster like mine, what's the required minimum free/available memory for the datanode process and tasktracker process, without running any map/reduce task? Any formula to determine it? 2013/5/13 Rishi Yadav ri...@infoobjects.com Can you tell the specs of node3? Even on a test/demo cluster, anything below 4 GB RAM makes the node almost inaccessible, as per my experience.