Running Hadoop client as a different user

2013-05-12 Thread Steve Lewis
I have been running Hadoop on a cluster set not to check permissions. I
would run a Java client on my local machine and it would run as the local
user on the cluster.

I say:

    String connectString = "hdfs://" + host + ":" + port + "/";
    Configuration config = new Configuration();
    config.set("fs.default.name", connectString);
    FileSystem fs = FileSystem.get(config);

The above code works.
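Is something along these lines the right direction? This is only a guess on
my part - "remoteUser" is a placeholder and I am not sure UserGroupInformation
is even the intended API for this:

    final Configuration config = new Configuration();
    config.set("fs.default.name", connectString);
    // Guessing at org.apache.hadoop.security.UserGroupInformation and
    // java.security.PrivilegedExceptionAction; "remoteUser" is a placeholder.
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser("remoteUser");
    FileSystem fs = ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
        public FileSystem run() throws Exception {
            return FileSystem.get(config); // FileSystem calls now act as "remoteUser"
        }
    });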
I am trying to port this to a cluster where permissions are checked. I have
an account but need to set a user and password to avoid access exceptions.

How do I do this? And if I can only access certain directories, how do I do
that?

Also, are there some directories my code MUST be able to access beyond those
belonging to my user only?

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: Hadoop noob question

2013-05-12 Thread Rahul Bhattacharjee
@Tariq, can you point me to some resource which shows how distcp is used to
upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to already be present in
Hadoop's FS?

Rahul


On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.com wrote:

 You're welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq donta...@gmail.comwrote:

 @Rahul : Yes, distcp can do that.

 And the bigger the files, the less metadata and hence less memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO, I think the statement about the NN with regard to block metadata is
 more of a general statement. Even if you put lots of small files of
 combined size 10 TB, you need to have a capable NN.

 Can distcp be used to copy local to HDFS?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar 
 nitinpawar...@gmail.comwrote:

 absolutely right, Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq donta...@gmail.comwrote:

 Sorry for barging in, guys. I think Nitin is talking about this :

 Every file and block in HDFS is treated as an object, and for each
 object around 200 B of metadata gets created. So the NN should be powerful
 enough to handle that much metadata, since it is going to be in-memory.
 Actually, memory is the most important metric when it comes to the NN.
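 (Back-of-the-envelope, assuming roughly 200 B per object: 10 TB stored as
 128 MB files is on the order of 80,000 files plus 80,000 blocks, i.e. a few
 tens of MB of NN heap, whereas the same 10 TB as 1 MB files is roughly 10
 million files plus 10 million blocks, i.e. several GB of heap. The numbers
 are illustrative only.)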

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much data you
 don't actually just do a put. You could use something like distcp for
 parallel copying. A better approach would be to use a data aggregation tool
 like Flume or Chukwa, as Nitin has already pointed out. Facebook uses their
 own data aggregation tool, called Scribe, for this purpose.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar nitinpawar...@gmail.com
  wrote:

 The NN would still be in the picture because it will be writing a lot of
 metadata for each individual file. So you will need an NN capable enough
 to store the metadata for your entire dataset. Data will never go to the
 NN, but a lot of metadata about the data will be on the NN, so it is always
 a good idea to have a strong NN.


 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Nitin, parallel DFS writes to HDFS are great, but I could not
 understand the meaning of "capable NN". As I understand it, the NN is not a
 part of the actual data write pipeline, meaning that the data does not
 travel through the NN; the DFS client contacts the NN from time to time to
 get the locations of the DNs where the data blocks should be stored.

 Thanks,
 Rahul



 On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 Is it safe? There is no direct yes-or-no answer.

 When you say you have files worth 10 TB that you want to upload to HDFS,
 several factors come into the picture:

 1) Is the machine in the same network as your hadoop cluster?
 2) Is there a guarantee that the network will not go down?

 And most importantly, I assume that you have a capable hadoop
 cluster. By that I mean you have a capable namenode.

 I would definitely not write files sequentially to HDFS. I would
 prefer to write files in parallel to HDFS to utilize the DFS write features
 and speed up the process.
 You can run the hdfs put command in parallel, and in my experience
 it has not failed when we write a lot of data.


 On Sat, May 11, 2013 at 4:38 PM, maisnam ns 
 maisnam...@gmail.comwrote:

 @Nitin Pawar, thanks for clearing my doubts.

 But I have one more question: say I have 10 TB of data in the
 pipeline.

 Is it perfectly OK to use the hadoop fs put command to upload these
 files of size 10 TB, and is there any limit to the file size when using the
 hadoop command line? Can the hadoop put command line work with huge data?

 Thanks in advance


 On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 First of all, most of the companies do not get 100 PB of data
 in one go. It is an accumulating process, and most of the companies do have
 a data pipeline in place where the data is written to hdfs on a frequency
 basis, then retained on hdfs for some duration as needed, and
 from there sent to archival storage or deleted.

 For data management products, you can look at Falcon, which is
 open sourced by InMobi along with Hortonworks.

 In any case, if you want to write files to hdfs there are a few
 options available to you:
 1) Write your own dfs client which writes to dfs
 2) use the hdfs proxy
 3) there is webhdfs
 4) command line hdfs (see the example below)
 5) data collection tools that come with support to write to hdfs, like
 flume etc.
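 For (4), for instance - the paths here are just placeholders:

     hadoop fs -put /local/path/to/file /user/myuser/

 The same put can be launched from several shells against different files,
 which is the "parallel put" idea mentioned earlier in the thread.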


 On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam 
 thoihen...@gmail.com wrote:

 Hi All,

 Can anyone help me know how does 

Re: Hadoop noob question

2013-05-12 Thread Nitin Pawar
you can do that using file:///

example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
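For the local-to-HDFS direction that was asked about, the arguments would
simply be flipped (the paths are placeholders):

hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/target/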



On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 @Tariq can you point me to some resource which shows how distcp is used to
 upload files from local to hdfs.

 isn't distcp a MR job ? wouldn't it need the data to be already present in
 the hadoop's fs?

 Rahul


 On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.comwrote:

 You'r welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq donta...@gmail.comwrote:

 @Rahul : Yes. distcp can do that.

 And, bigger the files lesser the metadata hence lesser memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO,I think the statement about NN with regard to block metadata is
 more like a general statement. Even if you put lots of small files of
 combined size 10 TB , you need to have a capable NN.

 can disct cp be used to copy local - to - hdfs ?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar 
 nitinpawar...@gmail.comwrote:

 absolutely rite Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq 
 donta...@gmail.comwrote:

 Sorry for barging in guys. I think Nitin is talking about this :

 Every file and block in HDFS is treated as an object and for each
 object around 200B of metadata get created. So the NN should be powerful
 enough to handle that much metadata, since it is going to be in-memory.
 Actually memory is the most important metric when it comes to NN.

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much data you
 don't actually just do a put. You could use something like distcp 
 for
 parallel copying. A better approach would be to use a data aggregation 
 tool
 like Flume or Chukwa, as Nitin has already pointed. Facebook uses their 
 own
 data aggregation tool, called Scribe for this purpose.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 NN would still be in picture because it will be writing a lot of
 meta data for each individual file. so you will need a NN capable 
 enough
 which can store the metadata for your entire dataset. Data will never 
 go to
 NN but lot of metadata about data will be on NN so its always good 
 idea to
 have a strong NN.


 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Nitin , parallel dfs to write to hdfs is great , but could not
 understand the meaning of capable NN. As I know , the NN would not be 
 a
 part of the actual data write pipeline , means that the data would not
 travel through the NN , the dfs would contact the NN from time to 
 time to
 get locations of DN as where to store the data blocks.

 Thanks,
 Rahul



 On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 is it safe? .. there is no direct answer yes or no

 when you say , you have files worth 10TB files and you want to
 upload  to HDFS, several factors come into picture

 1) Is the machine in the same network as your hadoop cluster?
 2) If there guarantee that network will not go down?

 and Most importantly I assume that you have a capable hadoop
 cluster. By that I mean you have a capable namenode.

 I would definitely not write files sequentially in HDFS. I would
 prefer to write files in parallel to hdfs to utilize the DFS write 
 features
 to speed up the process.
 you can hdfs put command in parallel manner and in my experience
 it has not failed when we write a lot of data.


 On Sat, May 11, 2013 at 4:38 PM, maisnam ns maisnam...@gmail.com
  wrote:

 @Nitin Pawar , thanks for clearing my doubts .

 But I have one more question , say I have 10 TB data in the
 pipeline .

 Is it perfectly OK to use hadopo fs put command to upload these
 files of size 10 TB and is there any limit to the file size  using 
 hadoop
 command line . Can hadoop put command line work with huge data.

 Thanks in advance


 On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 first of all .. most of the companies do not get 100 PB of data
 in one go. Its an accumulating process and most of the companies 
 do have a
 data pipeline in place where the data is written to hdfs on a 
 frequency
 basis and  then its retained on hdfs for some duration as per 
 needed and
 from there its sent to archivers or deleted.

 For data management products, you can look at falcon which is
 open sourced by inmobi along with hortonworks.

 In any case, if you want to write files to hdfs there are few
 options available to you
 1) Write your dfs client which writes to dfs
 2) use hdfs proxy
 3) there is webhdfs
 4) command 

Re: Need help about task slots

2013-05-12 Thread Mohammad Tariq
@Rahul : I'm sorry, I am not aware of any such document. But you could
use distcp for a local-to-HDFS copy:

bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/

And yes, when you use distcp from local to HDFS, you can't take advantage of
parallelism, as the data is stored in a non-distributed fashion.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.com wrote:

 Hello guys,

  My 2 cents :

 Actually no. of mappers is primarily governed by the no. of InputSplits
 created by the InputFormat you are using and the no. of reducers by the no.
 of partitions you get after the map phase. Having said that, you should
 also keep the no of slots, available per slave, in mind, along with the
 available memory. But as a general rule you could use this approach :

 Take the no. of virtual CPUs * 0.75 and that's the no. of slots you can
 configure. For example, if you have 12 physical cores (or 24 virtual
 cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirements
 you could choose how many mappers and reducers you want to use. With 18 MR
 slots, you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers,
 or whatever you think is OK for you.
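 For concreteness - purely as an illustration of where such a split would
 normally live in MRv1, using the 9/9 numbers above - the per-tasktracker slot
 counts go into mapred-site.xml along these lines:

     <property>
       <name>mapred.tasktracker.map.tasks.maximum</name>
       <value>9</value>
     </property>
     <property>
       <name>mapred.tasktracker.reduce.tasks.maximum</name>
       <value>9</value>
     </property>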

 I don't know if it makes much sense, but it helps me pretty decently.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Hi,

 I am also new to Hadoop world , here is my take on your question , if
 there is something missing then others would surely correct that.

 For pre-YARN, the slots are fixed and computed based on the crunching
 capacity of the datanode hardware. Once the slots per datanode are
 ascertained, they are divided into map and reduce slots, and that goes
 into the config files and remains fixed until changed. In YARN, it is
 decided at runtime based on the requirements of the particular task. It is
 very much possible that at a certain point in time one datanode is running 10
 tasks and another similar datanode is only running 4 tasks.

 Coming to your question: based on the data set size, the block size of dfs
 and the input format, the number of map tasks is decided; generally for
 file-based input formats it is one mapper per data block, however there are
 ways to change this using configuration settings. Reduce tasks are set using
 the job configuration.

 The general rule, as I have read in various documents, is that mappers should
 run at least a minute, so you can run a sample to find out a good size of
 data block which would make your mapper run for more than a minute. Then it
 again depends on your SLA; in case you are not looking for a very small SLA
 you can choose to run fewer mappers at the expense of a higher runtime.

 But again it's all theory; I am not sure how these things are handled in
 actual prod clusters.

 HTH,



 Thanks,
 Rahul


 On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi Users,

 I am new to Hadoop and confused about task slots in a cluster. How would
 I know how many task slots would be required for a job. Is there any
 empirical formula or on what basis should I set the number of task slots.

 Advanced Thanks






Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
@Rahul : I'm sorry, I answered this on the wrong thread by mistake. You could
do that as Nitin has shown.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 you can do that using file:///

 example:

 hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/



 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Tariq can you point me to some resource which shows how distcp is used
 to upload files from local to hdfs.

 isn't distcp a MR job ? wouldn't it need the data to be already present
 in the hadoop's fs?

  Rahul


 On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.comwrote:

 You'r welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq donta...@gmail.comwrote:

 @Rahul : Yes. distcp can do that.

 And, bigger the files lesser the metadata hence lesser memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO,I think the statement about NN with regard to block metadata is
 more like a general statement. Even if you put lots of small files of
 combined size 10 TB , you need to have a capable NN.

 can disct cp be used to copy local - to - hdfs ?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar nitinpawar...@gmail.com
  wrote:

 absolutely rite Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq 
 donta...@gmail.comwrote:

 Sorry for barging in guys. I think Nitin is talking about this :

 Every file and block in HDFS is treated as an object and for each
 object around 200B of metadata get created. So the NN should be 
 powerful
 enough to handle that much metadata, since it is going to be in-memory.
 Actually memory is the most important metric when it comes to NN.

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much data
 you don't actually just do a put. You could use something like 
 distcp
 for parallel copying. A better approach would be to use a data 
 aggregation
 tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
 their own data aggregation tool, called Scribe for this purpose.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 NN would still be in picture because it will be writing a lot of
 meta data for each individual file. so you will need a NN capable 
 enough
 which can store the metadata for your entire dataset. Data will never 
 go to
 NN but lot of metadata about data will be on NN so its always good 
 idea to
 have a strong NN.


 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Nitin , parallel dfs to write to hdfs is great , but could not
 understand the meaning of capable NN. As I know , the NN would not 
 be a
 part of the actual data write pipeline , means that the data would 
 not
 travel through the NN , the dfs would contact the NN from time to 
 time to
 get locations of DN as where to store the data blocks.

 Thanks,
 Rahul



 On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 is it safe? .. there is no direct answer yes or no

 when you say , you have files worth 10TB files and you want to
 upload  to HDFS, several factors come into picture

 1) Is the machine in the same network as your hadoop cluster?
 2) If there guarantee that network will not go down?

 and Most importantly I assume that you have a capable hadoop
 cluster. By that I mean you have a capable namenode.

 I would definitely not write files sequentially in HDFS. I would
 prefer to write files in parallel to hdfs to utilize the DFS write 
 features
 to speed up the process.
 you can hdfs put command in parallel manner and in my experience
 it has not failed when we write a lot of data.


 On Sat, May 11, 2013 at 4:38 PM, maisnam ns 
 maisnam...@gmail.com wrote:

 @Nitin Pawar , thanks for clearing my doubts .

 But I have one more question , say I have 10 TB data in the
 pipeline .

 Is it perfectly OK to use hadopo fs put command to upload these
 files of size 10 TB and is there any limit to the file size  using 
 hadoop
 command line . Can hadoop put command line work with huge data.

 Thanks in advance


 On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 first of all .. most of the companies do not get 100 PB of
 data in one go. Its an accumulating process and most of the 
 companies do
 have a data pipeline in place where the data is written to hdfs 
 on a
 frequency basis and  then its retained on hdfs for some duration 
 as per
 needed and from there its sent to archivers or deleted.

 For data management products, you can look at falcon which is
 

Re: Need help about task slots

2013-05-12 Thread Mohammad Tariq
Sorry for the blunder guys.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.com wrote:

 @Rahul : I'm sorry as I am not aware of any such document. But you could
 use distcp for local to HDFS copy :
 *bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/*
 *
 *
 And yes. When you use distcp from local to HDFS, you can't take the
 pleasure of parallelism as the data is stored in a non distributed fashion.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.comwrote:

 Hello guys,

  My 2 cents :

 Actually no. of mappers is primarily governed by the no. of InputSplits
 created by the InputFormat you are using and the no. of reducers by the no.
 of partitions you get after the map phase. Having said that, you should
 also keep the no of slots, available per slave, in mind, along with the
 available memory. But as a general rule you could use this approach :

 Take the no. of virtual CPUs*.75 and that's the no. of slots you can
 configure. For example, if you have 12 physical cores (or 24 virtual
 cores), you would have (24*.75)=18 slots. Now, based on your requirement
 you could choose how many mappers and reducers you want to use. With 18 MR
 slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers
 or whatever you think is OK with you.

 I don't know if it ,makes much sense, but it helps me pretty decently.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Hi,

 I am also new to Hadoop world , here is my take on your question , if
 there is something missing then others would surely correct that.

 For per-YARN , the slots are fixed and computed based on the crunching
 capacity of the datanode hardware , once the slots per data node is
 ascertained , they are divided into Map and reducer slots and that goes
 into the config files and remain fixed , until changed.In YARN , its
 decided at runtime based on the kind of requirement of particular task.Its
 very much possible that a datanode at certain point of time running  10
 tasks and another similar datanode is only running 4 tasks.

 Coming to your question. Based of the data set size , block size of dfs
 and input formater , the number of map tasks are decided , generally for
 file based inputformats its one mapper per data block , however there are
 way to change this using configuration settings.Reduce tasks are set using
 job configuration.

 General rule as I have read from various documents is that Mappers
 should run atleast a minute , so you can run a sample to find out a good
 size of data block which would make you mapper run more than a minute. Now
 it again depends on your SLA , in case you are not looking for a very small
 SLA you can choose to run less mappers at the expense of higher runtime.

 But again its all theory , not sure how these things are handled in
 actual prod clusters.

 HTH,



 Thanks,
 Rahul


 On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi Users,

 I am new to Hadoop and confused about task slots in a cluster. How
 would I know how many task slots would be required for a job. Is there any
 empirical formula or on what basis should I set the number of task slots.

 Advanced Thanks







Re: Hadoop noob question

2013-05-12 Thread Rahul Bhattacharjee
Thanks to both of you!

Rahul


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 you can do that using file:///

 example:

 hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/




 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Tariq can you point me to some resource which shows how distcp is used
 to upload files from local to hdfs.

 isn't distcp a MR job ? wouldn't it need the data to be already present
 in the hadoop's fs?

  Rahul


 On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.comwrote:

 You'r welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq donta...@gmail.comwrote:

 @Rahul : Yes. distcp can do that.

 And, bigger the files lesser the metadata hence lesser memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO,I think the statement about NN with regard to block metadata is
 more like a general statement. Even if you put lots of small files of
 combined size 10 TB , you need to have a capable NN.

 can disct cp be used to copy local - to - hdfs ?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar nitinpawar...@gmail.com
  wrote:

 absolutely rite Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq 
 donta...@gmail.comwrote:

 Sorry for barging in guys. I think Nitin is talking about this :

 Every file and block in HDFS is treated as an object and for each
 object around 200B of metadata get created. So the NN should be 
 powerful
 enough to handle that much metadata, since it is going to be in-memory.
 Actually memory is the most important metric when it comes to NN.

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much data
 you don't actually just do a put. You could use something like 
 distcp
 for parallel copying. A better approach would be to use a data 
 aggregation
 tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
 their own data aggregation tool, called Scribe for this purpose.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 NN would still be in picture because it will be writing a lot of
 meta data for each individual file. so you will need a NN capable 
 enough
 which can store the metadata for your entire dataset. Data will never 
 go to
 NN but lot of metadata about data will be on NN so its always good 
 idea to
 have a strong NN.


 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Nitin , parallel dfs to write to hdfs is great , but could not
 understand the meaning of capable NN. As I know , the NN would not 
 be a
 part of the actual data write pipeline , means that the data would 
 not
 travel through the NN , the dfs would contact the NN from time to 
 time to
 get locations of DN as where to store the data blocks.

 Thanks,
 Rahul



 On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 is it safe? .. there is no direct answer yes or no

 when you say , you have files worth 10TB files and you want to
 upload  to HDFS, several factors come into picture

 1) Is the machine in the same network as your hadoop cluster?
 2) If there guarantee that network will not go down?

 and Most importantly I assume that you have a capable hadoop
 cluster. By that I mean you have a capable namenode.

 I would definitely not write files sequentially in HDFS. I would
 prefer to write files in parallel to hdfs to utilize the DFS write 
 features
 to speed up the process.
 you can hdfs put command in parallel manner and in my experience
 it has not failed when we write a lot of data.


 On Sat, May 11, 2013 at 4:38 PM, maisnam ns 
 maisnam...@gmail.com wrote:

 @Nitin Pawar , thanks for clearing my doubts .

 But I have one more question , say I have 10 TB data in the
 pipeline .

 Is it perfectly OK to use hadopo fs put command to upload these
 files of size 10 TB and is there any limit to the file size  using 
 hadoop
 command line . Can hadoop put command line work with huge data.

 Thanks in advance


 On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 first of all .. most of the companies do not get 100 PB of
 data in one go. Its an accumulating process and most of the 
 companies do
 have a data pipeline in place where the data is written to hdfs 
 on a
 frequency basis and  then its retained on hdfs for some duration 
 as per
 needed and from there its sent to archivers or deleted.

 For data management products, you can look at falcon which is
 open sourced by inmobi along with hortonworks.

 In any case, if you want to write files to hdfs there are few
 

Re: Need help about task slots

2013-05-12 Thread yypvsxf19870706
Hi

The concept of task slots is used in MRv1.
In the new version of Hadoop, MRv2 uses YARN instead of slots.
You can read about it in Hadoop: The Definitive Guide, 3rd edition.




Sent from my iPhone

On 2013-5-12, at 20:11, Mohammad Tariq donta...@gmail.com wrote:

 Sorry for the blunder guys.
 
 Warm Regards,
 Tariq
 cloudfront.blogspot.com
 
 
 On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.com wrote:
 @Rahul : I'm sorry as I am not aware of any such document. But you could use 
 distcp for local to HDFS copy :
 bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/
 
 And yes. When you use distcp from local to HDFS, you can't take the pleasure 
 of parallelism as the data is stored in a non distributed fashion.
 
 Warm Regards,
 Tariq
 cloudfront.blogspot.com
 
 
 On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.com wrote:
 Hello guys, 
 
 My 2 cents : 
 
 Actually no. of mappers is primarily governed by the no. of InputSplits 
 created by the InputFormat you are using and the no. of reducers by the no. 
 of partitions you get after the map phase. Having said that, you should 
 also keep the no of slots, available per slave, in mind, along with the 
 available memory. But as a general rule you could use this approach :
 Take the no. of virtual CPUs*.75 and that's the no. of slots you can 
 configure. For example, if you have 12 physical cores (or 24 virtual 
 cores), you would have (24*.75)=18 slots. Now, based on your requirement 
 you could choose how many mappers and reducers you want to use. With 18 MR 
 slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers 
 or whatever you think is OK with you. 
 
 I don't know if it ,makes much sense, but it helps me pretty decently.
 
 
 Warm Regards,
 Tariq
 cloudfront.blogspot.com
 
 
 On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:
 Hi,
 
 I am also new to Hadoop world , here is my take on your question , if 
 there is something missing then others would surely correct that.
 
 For per-YARN , the slots are fixed and computed based on the crunching 
 capacity of the datanode hardware , once the slots per data node is 
 ascertained , they are divided into Map and reducer slots and that goes 
 into the config files and remain fixed , until changed.In YARN , its 
 decided at runtime based on the kind of requirement of particular task.Its 
 very much possible that a datanode at certain point of time running  10 
 tasks and another similar datanode is only running 4 tasks.
 
 Coming to your question. Based of the data set size , block size of dfs 
 and input formater , the number of map tasks are decided , generally for 
 file based inputformats its one mapper per data block , however there are 
 way to change this using configuration settings.Reduce tasks are set using 
 job configuration.
 
 General rule as I have read from various documents is that Mappers should 
 run atleast a minute , so you can run a sample to find out a good size of 
 data block which would make you mapper run more than a minute. Now it 
 again depends on your SLA , in case you are not looking for a very small 
 SLA you can choose to run less mappers at the expense of higher runtime.
 
 But again its all theory , not sure how these things are handled in actual 
 prod clusters.
 
 HTH,
 
 
 
 Thanks,
 Rahul
 
 
 On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:
 Hi Users,
 
 I am new to Hadoop and confused about task slots in a cluster. How would 
 I know how many task slots would be required for a job. Is there any 
 empirical formula or on what basis should I set the number of task slots.
 
 Advanced Thanks
 


Re: Need help about task slots

2013-05-12 Thread Rahul Bhattacharjee
Oh! I thought distcp works on complete files rather than mappers per
data block.
So I guess parallelism would still be there if there are multiple files.
Please correct me if there is anything wrong.

Thank,
Rahul


On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.com wrote:

 @Rahul : I'm sorry as I am not aware of any such document. But you could
 use distcp for local to HDFS copy :
 *bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/*
 *
 *
 And yes. When you use distcp from local to HDFS, you can't take the
 pleasure of parallelism as the data is stored in a non distributed fashion.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.comwrote:

 Hello guys,

  My 2 cents :

 Actually no. of mappers is primarily governed by the no. of InputSplits
 created by the InputFormat you are using and the no. of reducers by the no.
 of partitions you get after the map phase. Having said that, you should
 also keep the no of slots, available per slave, in mind, along with the
 available memory. But as a general rule you could use this approach :

 Take the no. of virtual CPUs*.75 and that's the no. of slots you can
 configure. For example, if you have 12 physical cores (or 24 virtual
 cores), you would have (24*.75)=18 slots. Now, based on your requirement
 you could choose how many mappers and reducers you want to use. With 18 MR
 slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers
 or whatever you think is OK with you.

 I don't know if it ,makes much sense, but it helps me pretty decently.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Hi,

 I am also new to Hadoop world , here is my take on your question , if
 there is something missing then others would surely correct that.

 For per-YARN , the slots are fixed and computed based on the crunching
 capacity of the datanode hardware , once the slots per data node is
 ascertained , they are divided into Map and reducer slots and that goes
 into the config files and remain fixed , until changed.In YARN , its
 decided at runtime based on the kind of requirement of particular task.Its
 very much possible that a datanode at certain point of time running  10
 tasks and another similar datanode is only running 4 tasks.

 Coming to your question. Based of the data set size , block size of dfs
 and input formater , the number of map tasks are decided , generally for
 file based inputformats its one mapper per data block , however there are
 way to change this using configuration settings.Reduce tasks are set using
 job configuration.

 General rule as I have read from various documents is that Mappers
 should run atleast a minute , so you can run a sample to find out a good
 size of data block which would make you mapper run more than a minute. Now
 it again depends on your SLA , in case you are not looking for a very small
 SLA you can choose to run less mappers at the expense of higher runtime.

 But again its all theory , not sure how these things are handled in
 actual prod clusters.

 HTH,



 Thanks,
 Rahul


 On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi Users,

 I am new to Hadoop and confused about task slots in a cluster. How
 would I know how many task slots would be required for a job. Is there any
 empirical formula or on what basis should I set the number of task slots.

 Advanced Thanks







Re: Need help about task slots

2013-05-12 Thread Rahul Bhattacharjee
Sorry for my blunder as well. My previous post for Tariq went to the wrong
thread.

Thanks.
Rahul


On Sun, May 12, 2013 at 6:03 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 Oh! I though distcp works on complete files rather then mappers per
 datablock.
 So I guess parallelism would still be there if there are multipel files..
 please correct if ther is anything wrong.

 Thank,
 Rahul


 On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.comwrote:

 @Rahul : I'm sorry as I am not aware of any such document. But you could
 use distcp for local to HDFS copy :
 *bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/*
 *
 *
 And yes. When you use distcp from local to HDFS, you can't take the
 pleasure of parallelism as the data is stored in a non distributed fashion.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.comwrote:

 Hello guys,

  My 2 cents :

 Actually no. of mappers is primarily governed by the no. of InputSplits
 created by the InputFormat you are using and the no. of reducers by the no.
 of partitions you get after the map phase. Having said that, you should
 also keep the no of slots, available per slave, in mind, along with the
 available memory. But as a general rule you could use this approach :

 Take the no. of virtual CPUs*.75 and that's the no. of slots you can
 configure. For example, if you have 12 physical cores (or 24 virtual
 cores), you would have (24*.75)=18 slots. Now, based on your requirement
 you could choose how many mappers and reducers you want to use. With 18 MR
 slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers
 or whatever you think is OK with you.

 I don't know if it ,makes much sense, but it helps me pretty decently.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Hi,

 I am also new to Hadoop world , here is my take on your question , if
 there is something missing then others would surely correct that.

 For per-YARN , the slots are fixed and computed based on the crunching
 capacity of the datanode hardware , once the slots per data node is
 ascertained , they are divided into Map and reducer slots and that goes
 into the config files and remain fixed , until changed.In YARN , its
 decided at runtime based on the kind of requirement of particular task.Its
 very much possible that a datanode at certain point of time running  10
 tasks and another similar datanode is only running 4 tasks.

 Coming to your question. Based of the data set size , block size of dfs
 and input formater , the number of map tasks are decided , generally for
 file based inputformats its one mapper per data block , however there are
 way to change this using configuration settings.Reduce tasks are set using
 job configuration.

 General rule as I have read from various documents is that Mappers
 should run atleast a minute , so you can run a sample to find out a good
 size of data block which would make you mapper run more than a minute. Now
 it again depends on your SLA , in case you are not looking for a very small
 SLA you can choose to run less mappers at the expense of higher runtime.

 But again its all theory , not sure how these things are handled in
 actual prod clusters.

 HTH,



 Thanks,
 Rahul


 On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi Users,

 I am new to Hadoop and confused about task slots in a cluster. How
 would I know how many task slots would be required for a job. Is there any
 empirical formula or on what basis should I set the number of task slots.

 Advanced Thanks








Re: Need help about task slots

2013-05-12 Thread Mohammad Tariq
Hahaha... I think we could continue this over there.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:04 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 sorry for my blunder as well. my previous post for for Tariq in a wrong
 post.

 Thanks.
 Rahul


 On Sun, May 12, 2013 at 6:03 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Oh! I though distcp works on complete files rather then mappers per
 datablock.
 So I guess parallelism would still be there if there are multipel files..
 please correct if ther is anything wrong.

 Thank,
 Rahul


 On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.comwrote:

 @Rahul : I'm sorry as I am not aware of any such document. But you could
 use distcp for local to HDFS copy :
 *bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/*
 *
 *
 And yes. When you use distcp from local to HDFS, you can't take the
 pleasure of parallelism as the data is stored in a non distributed fashion.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.comwrote:

 Hello guys,

  My 2 cents :

 Actually no. of mappers is primarily governed by the no. of InputSplits
 created by the InputFormat you are using and the no. of reducers by the no.
 of partitions you get after the map phase. Having said that, you should
 also keep the no of slots, available per slave, in mind, along with the
 available memory. But as a general rule you could use this approach :

 Take the no. of virtual CPUs*.75 and that's the no. of slots you can
 configure. For example, if you have 12 physical cores (or 24 virtual
 cores), you would have (24*.75)=18 slots. Now, based on your requirement
 you could choose how many mappers and reducers you want to use. With 18 MR
 slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers
 or whatever you think is OK with you.

 I don't know if it ,makes much sense, but it helps me pretty decently.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Hi,

 I am also new to Hadoop world , here is my take on your question , if
 there is something missing then others would surely correct that.

 For per-YARN , the slots are fixed and computed based on the crunching
 capacity of the datanode hardware , once the slots per data node is
 ascertained , they are divided into Map and reducer slots and that goes
 into the config files and remain fixed , until changed.In YARN , its
 decided at runtime based on the kind of requirement of particular task.Its
 very much possible that a datanode at certain point of time running  10
 tasks and another similar datanode is only running 4 tasks.

 Coming to your question. Based of the data set size , block size of
 dfs and input formater , the number of map tasks are decided , generally
 for file based inputformats its one mapper per data block , however there
 are way to change this using configuration settings.Reduce tasks are set
 using job configuration.

 General rule as I have read from various documents is that Mappers
 should run atleast a minute , so you can run a sample to find out a good
 size of data block which would make you mapper run more than a minute. Now
 it again depends on your SLA , in case you are not looking for a very 
 small
 SLA you can choose to run less mappers at the expense of higher runtime.

 But again its all theory , not sure how these things are handled in
 actual prod clusters.

 HTH,



 Thanks,
 Rahul


 On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi Users,

 I am new to Hadoop and confused about task slots in a cluster. How
 would I know how many task slots would be required for a job. Is there 
 any
 empirical formula or on what basis should I set the number of task slots.

 Advanced Thanks









Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
No. distcp is actually a mapreduce job under the hood.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 Thanks to both of you!

 Rahul


 On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 you can do that using file:///

 example:

 hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/







 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Tariq can you point me to some resource which shows how distcp is used
 to upload files from local to hdfs.

 isn't distcp a MR job ? wouldn't it need the data to be already present
 in the hadoop's fs?

  Rahul


 On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.comwrote:

 You'r welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq 
 donta...@gmail.comwrote:

 @Rahul : Yes. distcp can do that.

 And, bigger the files lesser the metadata hence lesser memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO,I think the statement about NN with regard to block metadata is
 more like a general statement. Even if you put lots of small files of
 combined size 10 TB , you need to have a capable NN.

 can disct cp be used to copy local - to - hdfs ?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 absolutely rite Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq donta...@gmail.com
  wrote:

 Sorry for barging in guys. I think Nitin is talking about this :

 Every file and block in HDFS is treated as an object and for each
 object around 200B of metadata get created. So the NN should be 
 powerful
 enough to handle that much metadata, since it is going to be 
 in-memory.
 Actually memory is the most important metric when it comes to NN.

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much data
 you don't actually just do a put. You could use something like 
 distcp
 for parallel copying. A better approach would be to use a data 
 aggregation
 tool like Flume or Chukwa, as Nitin has already pointed. Facebook uses
 their own data aggregation tool, called Scribe for this purpose.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 NN would still be in picture because it will be writing a lot of
 meta data for each individual file. so you will need a NN capable 
 enough
 which can store the metadata for your entire dataset. Data will 
 never go to
 NN but lot of metadata about data will be on NN so its always good 
 idea to
 have a strong NN.


 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Nitin , parallel dfs to write to hdfs is great , but could not
 understand the meaning of capable NN. As I know , the NN would not 
 be a
 part of the actual data write pipeline , means that the data would 
 not
 travel through the NN , the dfs would contact the NN from time to 
 time to
 get locations of DN as where to store the data blocks.

 Thanks,
 Rahul



 On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 is it safe? .. there is no direct answer yes or no

 when you say , you have files worth 10TB files and you want to
 upload  to HDFS, several factors come into picture

 1) Is the machine in the same network as your hadoop cluster?
 2) If there guarantee that network will not go down?

 and Most importantly I assume that you have a capable hadoop
 cluster. By that I mean you have a capable namenode.

 I would definitely not write files sequentially in HDFS. I
 would prefer to write files in parallel to hdfs to utilize the DFS 
 write
 features to speed up the process.
 you can hdfs put command in parallel manner and in my
 experience it has not failed when we write a lot of data.


 On Sat, May 11, 2013 at 4:38 PM, maisnam ns 
 maisnam...@gmail.com wrote:

 @Nitin Pawar , thanks for clearing my doubts .

 But I have one more question , say I have 10 TB data in the
 pipeline .

 Is it perfectly OK to use hadopo fs put command to upload
 these files of size 10 TB and is there any limit to the file size 
  using
 hadoop command line . Can hadoop put command line work with huge 
 data.

 Thanks in advance


 On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 first of all .. most of the companies do not get 100 PB of
 data in one go. Its an accumulating process and most of the 
 companies do
 have a data pipeline in place where the data is written to hdfs 
 on a
 frequency basis and  then its retained on hdfs for some duration 
 as per
 needed and from there its sent to 

Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
I had said that if you use distcp to copy data *from localFS to HDFS* then
you won't be able to exploit parallelism, as the entire file is present on a
single machine. So no multiple TTs.

Please comment if you think I am wrong somewhere.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 Yes, it's an MR job under the hood. My question was that you wrote that
 using distcp you lose the benefits of parallel processing of Hadoop. I
 think the MR job of distcp divides files into individual map tasks based on
 the total size of the transfer, so multiple mappers would still be spawned
 if the size of the transfer is huge, and they would work in parallel.

 Correct me if there is anything wrong!

 Thanks,
 Rahul


 On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq donta...@gmail.comwrote:

 No. distcp is actually a mapreduce job under the hood.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks to both of you!

 Rahul


 On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 you can do that using file:///

 example:

 hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/












 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Tariq can you point me to some resource which shows how distcp is
 used to upload files from local to hdfs.

 isn't distcp a MR job ? wouldn't it need the data to be already
 present in the hadoop's fs?

  Rahul


 On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq 
 donta...@gmail.comwrote:

 You'r welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq donta...@gmail.com
  wrote:

 @Rahul : Yes. distcp can do that.

 And, bigger the files lesser the metadata hence lesser memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO,I think the statement about NN with regard to block metadata
 is more like a general statement. Even if you put lots of small files 
 of
 combined size 10 TB , you need to have a capable NN.

 can disct cp be used to copy local - to - hdfs ?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 absolutely rite Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq 
 donta...@gmail.com wrote:

 Sorry for barging in guys. I think Nitin is talking about this :

 Every file and block in HDFS is treated as an object and for
 each object around 200B of metadata get created. So the NN should be
 powerful enough to handle that much metadata, since it is going to 
 be
 in-memory. Actually memory is the most important metric when it 
 comes to
 NN.

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much data
 you don't actually just do a put. You could use something like 
 distcp
 for parallel copying. A better approach would be to use a data 
 aggregation
 tool like Flume or Chukwa, as Nitin has already pointed. Facebook 
 uses
 their own data aggregation tool, called Scribe for this purpose.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 NN would still be in picture because it will be writing a lot
 of meta data for each individual file. so you will need a NN 
 capable enough
 which can store the metadata for your entire dataset. Data will 
 never go to
 NN but lot of metadata about data will be on NN so its always good 
 idea to
 have a strong NN.


 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Nitin , parallel dfs to write to hdfs is great , but could
 not understand the meaning of capable NN. As I know , the NN 
 would not be a
 part of the actual data write pipeline , means that the data 
 would not
 travel through the NN , the dfs would contact the NN from time to 
 time to
 get locations of DN as where to store the data blocks.

 Thanks,
 Rahul



 On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 is it safe? .. there is no direct answer yes or no

 when you say , you have files worth 10TB files and you want
 to upload  to HDFS, several factors come into picture

 1) Is the machine in the same network as your hadoop cluster?
 2) If there guarantee that network will not go down?

 and Most importantly I assume that you have a capable hadoop
 cluster. By that I mean you have a capable namenode.

 I would definitely not write files sequentially in HDFS. I
 would prefer to write files in parallel to hdfs to utilize the 
 DFS write
 features to speed up the process.
 you can hdfs put command in parallel manner and in my
 

Re: Hadoop noob question

2013-05-12 Thread Rahul Bhattacharjee
Yeah, you are right, I misread your earlier post.

Thanks,
Rahul


On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq donta...@gmail.com wrote:

 I had said that if you use distcp to copy data *from localFS to HDFS*then you 
 won't be able to exploit parallelism as entire file is present on
 a single machine. So no multiple TTs.

 Please comment if you think I am wring somewhere.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Yes , it's a MR job under the hood . my question was that you wrote that
 using distcp you loose the benefits  of parallel processing of Hadoop. I
 think the MR job of distcp divides files into individual map tasks based on
 the total size of the transfer , so multiple mappers would still be spawned
 if the size of transfer is huge and they would work in parallel.

 Correct me if there is anything wrong!

 Thanks,
 Rahul


 On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq donta...@gmail.comwrote:

 No. distcp is actually a mapreduce job under the hood.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks to both of you!

 Rahul


 On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar 
 nitinpawar...@gmail.comwrote:

 you can do that using file:///

 example:


 hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/














 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Tariq can you point me to some resource which shows how distcp is
 used to upload files from local to hdfs.

 isn't distcp a MR job ? wouldn't it need the data to be already
 present in the hadoop's fs?

  Rahul


 On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq 
 donta...@gmail.comwrote:

 You'r welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq 
 donta...@gmail.com wrote:

 @Rahul : Yes. distcp can do that.

 And, bigger the files lesser the metadata hence lesser memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO,I think the statement about NN with regard to block metadata
 is more like a general statement. Even if you put lots of small 
 files of
 combined size 10 TB , you need to have a capable NN.

 can disct cp be used to copy local - to - hdfs ?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 absolutely rite Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq 
 donta...@gmail.com wrote:

 Sorry for barging in guys. I think Nitin is talking about this :

 Every file and block in HDFS is treated as an object and for
 each object around 200B of metadata get created. So the NN should 
 be
 powerful enough to handle that much metadata, since it is going to 
 be
 in-memory. Actually memory is the most important metric when it 
 comes to
 NN.

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much
 data you don't actually just do a put. You could use something 
 like
 distcp for parallel copying. A better approach would be to use a 
 data
 aggregation tool like Flume or Chukwa, as Nitin has already 
 pointed.
 Facebook uses their own data aggregation tool, called Scribe for 
 this
 purpose.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 NN would still be in picture because it will be writing a lot
 of meta data for each individual file. so you will need a NN 
 capable enough
 which can store the metadata for your entire dataset. Data will 
 never go to
 NN but lot of metadata about data will be on NN so its always 
 good idea to
 have a strong NN.


 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Nitin, parallel dfs to write to hdfs is great, but could
 not understand the meaning of capable NN. As I know, the NN would not be a
 part of the actual data write pipeline, meaning that the data would not
 travel through the NN; the dfs would contact the NN from time to time to
 get the locations of DNs where to store the data blocks.

 Thanks,
 Rahul



 On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 Is it safe? There is no direct answer, yes or no.

 When you say you have files worth 10 TB and you want
 to upload them to HDFS, several factors come into the picture:

 1) Is the machine in the same network as your hadoop cluster?
 2) Is there a guarantee that the network will not go down?

 and most importantly I assume that you have a capable hadoop
 cluster. By that I mean you have a capable namenode.

 I would definitely not write files sequentially in HDFS. I
 would 

Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
This is what I would say :

The number of maps is decided as follows. Since it’s a good idea to get
each map to copy a reasonable amount of data to minimize overheads in task
setup, each map copies at least 256 MB (unless the total size of the input
is less, in which case one map handles it all). For example, 1 GB of files
will be given four map tasks. When the data size is very large, it becomes
necessary to limit the number of maps in order to limit bandwidth and
cluster utilization. By default, the maximum number of maps is 20 per
(tasktracker) cluster node. For example, copying 1,000 GB of files to a
100-node cluster will allocate 2,000 maps (20 per node), so each will copy
512 MB on average. This can be reduced by specifying the -m argument to *
distcp*. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB
on average.
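
For illustration, capping a copy at 100 maps would look like the following
(the paths and namenode addresses are assumptions):

hadoop distcp -m 100 hdfs://nn1:8020/src hdfs://nn2:8020/dst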

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:35 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 Soon after replying I realized something else related to this.

 Say we have a single file in HDFS (hdfs configured for default block size
 64 MB) and the size of the file is 1 GB. Now if we use distcp to move it
 from the current hdfs to another one , then
 whether there would be any parallelism or just a single map task would be
 fired?

 As per what I have read , a mapper is launcher for a complete file or a
 set of files. It doesn't operate at block level.So no parallelism even if
 the file resides in HDFS.

 Thanks,
 Rahul


 On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Yeah, you are right. I misread your earlier post.

 Thanks,
 Rahul


 On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq donta...@gmail.comwrote:

 I had said that if you use distcp to copy data *from localFS to HDFS* then
 you won't be able to exploit parallelism, as the entire file is present on
 a single machine. So no multiple TTs.

 Please comment if you think I am wrong somewhere.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Yes, it's an MR job under the hood. My question was that you wrote
 that using distcp you lose the benefits of parallel processing of Hadoop.
 I think the MR job of distcp divides files into individual map tasks based
 on the total size of the transfer, so multiple mappers would still be
 spawned if the size of the transfer is huge and they would work in parallel.

 Correct me if there is anything wrong!

 Thanks,
 Rahul


 On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq donta...@gmail.comwrote:

 No. distcp is actually a mapreduce job under the hood.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks to both of you!

 Rahul


 On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.com
  wrote:

 you can do that using file:///

 example:



 hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 @Tariq can you point me to some resource which shows how distcp is
 used to upload files from local to hdfs.

 isn't distcp a MR job ? wouldn't it need the data to be already
 present in the hadoop's fs?

  Rahul


 On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq 
 donta...@gmail.com wrote:

 You're welcome :)

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 Thanks Tariq!


 On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq 
 donta...@gmail.com wrote:

 @Rahul : Yes. distcp can do that.

 And, bigger the files lesser the metadata hence lesser memory
 consumption.

 Warm Regards,
 Tariq
 cloudfront.blogspot.com


 On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 IMHO, I think the statement about NN with regard to block
 metadata is more like a general statement. Even if you put lots of
 small files of combined size 10 TB, you need to have a capable NN.

 Can distcp be used to copy local - to - hdfs?

 Thanks,
 Rahul


 On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 absolutely right, Mohammad


 On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq 
 donta...@gmail.com wrote:

 Sorry for barging in guys. I think Nitin is talking about
 this :

 Every file and block in HDFS is treated as an object and for
 each object around 200B of metadata get created. So the NN 
 should be
 powerful enough to handle that much metadata, since it is going 
 to be
 in-memory. Actually memory is the most important metric when it 
 comes to
 NN.

 Am I correct @Nitin?

 @Thoihen : As Nitin has said, when you talk about that much
 data you don't actually just do a put. You could use something 
 like
 distcp for parallel copying. A better approach would be to use 

Re: Submitting a hadoop job in large clusters.

2013-05-12 Thread shashwat shriparv
On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar nitinpawar...@gmail.comwrote:


 normally if you want to copy the jar then hadoop admins setu


Submit your job to the JobTracker; it will distribute it throughout the
tasktrackers.

*Thanks  Regards*

∞
Shashwat Shriparv


Re: Permissions

2013-05-12 Thread shashwat shriparv
The user through which you are trying to run the task should have
permission on HDFS. Just verify that.

*Thanks  Regards*

∞
Shashwat Shriparv



On Sat, May 11, 2013 at 1:02 AM, Amal G Jose amalg...@gmail.com wrote:

 After starting HDFS, i.e. NN, SN and DN, create an HDFS directory
 structure of the form /hadoop.tmp.dir/mapred/staging.
 Then give 777 permission to staging. After that, change the ownership of
 the mapred directory to the mapred user.
 After doing this start the jobtracker; it will start. Otherwise, it will not
 start.
 The reason for not showing any datanodes may be a firewall. Check
 whether the necessary ports are open.
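
 For illustration, the staging-directory setup described above could be done
 with hadoop fs commands along these lines (the placeholder <hadoop.tmp.dir>
 stands for whatever hadoop.tmp.dir is set to on the cluster):

 hadoop fs -mkdir <hadoop.tmp.dir>/mapred/staging
 hadoop fs -chmod -R 777 <hadoop.tmp.dir>/mapred/staging
 hadoop fs -chown -R mapred <hadoop.tmp.dir>/mapred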



 On Tue, Apr 30, 2013 at 2:28 AM, rkevinbur...@charter.net wrote:

 I look in the name node log and I get the following errors:

 2013-04-29 15:25:11,646 ERROR
 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
 as:mapred (auth:SIMPLE)
 cause:org.apache.hadoop.security.AccessControlException: Permission denied:
 *user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x

 2013-04-29 15:25:11,646 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 6 on 9000, call
 org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from
 172.16.26.68:45044: error:
 org.apache.hadoop.security.AccessControlException: Permission denied: *
 user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x
 org.apache.hadoop.security.AccessControlException: Permission denied: *
 user=mapred,* access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x
 at
 org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:205)
 at
 org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:186)

 When I create the file system I have the user hdfs on the root folder
 (/). I am not sure how to have both the user mapred and hdfs have access to
 the root (which is what these errors seem to be indicating).

 I get a page from 50070, but when I try to browse the filesystem from the
 web UI I get an error that there are no nodes listening (I have 3 data
 nodes and 1 namenode). The browser indicates that there is nothing
 listening on port 50030, so it seems that the JobTracker is not up.





Re: issues with decrease the default.block.size

2013-05-12 Thread shashwat shriparv
The block size is for allocation, not for how much is stored on the disk (a
block that is not full does not take up a full block of disk space).
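
For reference, on Hadoop 1.x the block size is usually set in hdfs-site.xml;
a sketch for the 16 MB case discussed below, assuming the 1.x-style property
name dfs.block.size (value in bytes):

<property>
  <name>dfs.block.size</name>
  <value>16777216</value>
</property>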

*Thanks  Regards*

∞
Shashwat Shriparv



On Fri, May 10, 2013 at 8:54 PM, Harsh J ha...@cloudera.com wrote:

 Thanks. I failed to add: It should be okay to do if those cases are
 true and the cluster seems under-utilized right now.

 On Fri, May 10, 2013 at 8:29 PM, yypvsxf19870706
 yypvsxf19870...@gmail.com wrote:
  Hi harsh
 
  Yep.
 
 
 
  Regards
 
 
 
 
 
 
  Sent from my iPhone
 
  On 2013-5-10, 13:27, Harsh J ha...@cloudera.com wrote:
 
  Are you looking to decrease it to get more parallel map tasks out of
  the small files? Are you currently CPU bound on processing these small
  files?
 
  On Thu, May 9, 2013 at 9:12 PM, YouPeng Yang yypvsxf19870...@gmail.com
 wrote:
  hi ALL
 
  I am going to set up a new hadoop environment. Because there are
  lots of small files, I would like to change the default.block.size to
  16MB rather than adopting the approach of merging the files into large
  enough ones (e.g. using sequencefiles).
 I want to ask: are there any bad influences or issues?
 
  Regards
 
 
 
  --
  Harsh J



 --
 Harsh J



Re: hadoop map-reduce errors

2013-05-12 Thread shashwat shriparv
Your connection settings to MySQL may not be correct; check that.

*Thanks  Regards*

∞
Shashwat Shriparv



On Fri, May 10, 2013 at 6:12 PM, Shahab Yunus shahab.yu...@gmail.comwrote:

 Have you checked your connection settings to the MySQL DB? Where and how
 are you passing the connection properties for the database? Is it
 accessible from the machine you are running this? Is the db up?


 On Thu, May 9, 2013 at 9:32 PM, 丙子 woyaof...@gmail.com wrote:

 When I run a hadoop job, there are some errors like this:
 13/05/10 08:20:59 ERROR manager.SqlManager: Error executing statement:
 com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications
 link failure

 The last packet successfully received from the server was 28,484
 milliseconds ago.  The last packet sent successfully to the server was 1
 milliseconds ago.
 com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications
 link failure

 The last packet successfully received from the server was 28,484
 milliseconds ago.  The last packet sent successfully to the server was 1
 milliseconds ago.
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)

 ……
 ……
 at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
 at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
 Caused by: java.io.EOFException: Can not read response from server.
 Expected to read 4 bytes, read 0 bytes before connection was unexpectedly
 lost.
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3039)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3489)
 ... 24 more


 How can I resolve it?





Re: Problem while running simple WordCount program(hadoop-1.0.4) on eclipse.

2013-05-12 Thread shashwat shriparv
The user through which you are running your hadoop: set permission on the
tmp dir for that user.

*Thanks  Regards*

∞
Shashwat Shriparv



On Fri, May 10, 2013 at 5:24 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 What are the permission of your /tmp/ folder?
 On May 10, 2013 5:03 PM, Khaleel Khalid khale...@suntecgroup.com
 wrote:

  Hi all,

 I am facing the following error when I run a simple WordCount
 program using hadoop-1.0.4 on eclipse(Galileo).  The map/reduce plugin
 version I use is 1.0.4 as well.  It would be really helpful if
 someone gives me a solution for the problem.

 ERROR:

 13/05/10 16:53:51 WARN util.NativeCodeLoader: Unable to load
 native-hadoop library for your platform... using builtin-java classes where
 applicable

 13/05/10 16:53:51 ERROR security.UserGroupInformation:
 *PriviledgedActionException* as:khaleelk *cause:java.io.IOException*:
 Failed to set permissions of path:
 \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700

 Exception in thread main
 *java.io.IOException*: Failed to set permissions of path:
 \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700

 at org.apache.hadoop.fs.FileUtil.checkReturnValue(
 *FileUtil.java:689*)

 at org.apache.hadoop.fs.FileUtil.setPermission(
 *FileUtil.java:662*)

 at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(
 *RawLocalFileSystem.java:509*)

 at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(
 *RawLocalFileSystem.java:344*)

 at org.apache.hadoop.fs.FilterFileSystem.mkdirs(
 *FilterFileSystem.java:189*)

 at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(
 *JobSubmissionFiles.java:116*)

 at org.apache.hadoop.mapred.JobClient$2.run(
 *JobClient.java:856*)

 at org.apache.hadoop.mapred.JobClient$2.run(
 *JobClient.java:850*)

 at java.security.AccessController.doPrivileged(
 *Native Method*)

 at javax.security.auth.Subject.doAs(Unknown Source)

 at org.apache.hadoop.security.UserGroupInformation.doAs(
 *UserGroupInformation.java:1121*)

 at org.apache.hadoop.mapred.JobClient.submitJobInternal(
 *JobClient.java:850*)

 at org.apache.hadoop.mapreduce.Job.submit(
 *Job.java:500*)

 at org.apache.hadoop.mapreduce.Job.waitForCompletion(
 *Job.java:530*)

 at WordCount.main(
 *WordCount.java:65*)



 Thank you in advance.


Re: Problem while running simple WordCount program(hadoop-1.0.4) on eclipse.

2013-05-12 Thread Nitin Pawar
It's a /tmp/ folder, so I guess all the users will need access to it. Better
to make it like a routine linux /tmp folder.


On Sun, May 12, 2013 at 11:12 PM, shashwat shriparv 
dwivedishash...@gmail.com wrote:

 the user through which you are running your hadoop, set permission to tmp
 dir for that user.

 *Thanks  Regards*

 ∞
 Shashwat Shriparv



 On Fri, May 10, 2013 at 5:24 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 What are the permission of your /tmp/ folder?
 On May 10, 2013 5:03 PM, Khaleel Khalid khale...@suntecgroup.com
 wrote:

  Hi all,

 I am facing the following error when I run a simple WordCount
 program using hadoop-1.0.4 on eclipse(Galileo).  The map/reduce plugin
 version I use is 1.0.4 as well.  It would be really helpful if
 someone gives me a solution for the problem.

 ERROR:

 13/05/10 16:53:51 WARN util.NativeCodeLoader: Unable to load
 native-hadoop library for your platform... using builtin-java classes where
 applicable

 13/05/10 16:53:51 ERROR security.UserGroupInformation:
 *PriviledgedActionException* as:khaleelk *cause:java.io.IOException*:
 Failed to set permissions of path:
 \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700

 Exception in thread main
 *java.io.IOException*: Failed to set permissions of path:
 \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700

 at org.apache.hadoop.fs.FileUtil.checkReturnValue(
 *FileUtil.java:689*)

 at org.apache.hadoop.fs.FileUtil.setPermission(
 *FileUtil.java:662*)

 at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(
 *RawLocalFileSystem.java:509*)

 at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(
 *RawLocalFileSystem.java:344*)

 at org.apache.hadoop.fs.FilterFileSystem.mkdirs(
 *FilterFileSystem.java:189*)

 at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(
 *JobSubmissionFiles.java:116*)

 at org.apache.hadoop.mapred.JobClient$2.run(
 *JobClient.java:856*)

 at org.apache.hadoop.mapred.JobClient$2.run(
 *JobClient.java:850*)

 at java.security.AccessController.doPrivileged(
 *Native Method*)

 at javax.security.auth.Subject.doAs(Unknown Source)

 at org.apache.hadoop.security.UserGroupInformation.doAs(
 *UserGroupInformation.java:1121*)

 at org.apache.hadoop.mapred.JobClient.submitJobInternal(
 *JobClient.java:850*)

 at org.apache.hadoop.mapreduce.Job.submit(
 *Job.java:500*)

 at org.apache.hadoop.mapreduce.Job.waitForCompletion(
 *Job.java:530*)

 at WordCount.main(
 *WordCount.java:65*)



 Thank you in advance.

-- 
Nitin Pawar


Re: Submitting a hadoop job in large clusters.

2013-05-12 Thread Shashidhar Rao
@shashwat shriparv

Can a hadoop job be submitted to any datanode in the cluster and not to the
JobTracker?

Correct me if I am wrong: I was told that a hadoop job can be submitted to a
datanode as well, apart from the JobTracker. Is that correct?

Advanced thanks


On Sun, May 12, 2013 at 11:02 PM, shashwat shriparv 
dwivedishash...@gmail.com wrote:


 On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar nitinpawar...@gmail.comwrote:


 normally if you want to copy the jar then hadoop admins setu


 Submit you job to Job tracker it will distribute throughout the
 tasktrackers.

 *Thanks  Regards*

 ∞
 Shashwat Shriparv




Re: Submitting a hadoop job in large clusters.

2013-05-12 Thread Nitin Pawar
Nope.
In MRv1 only the jobtracker can accept jobs. You cannot trigger a job on any
other process in hadoop other than the jobtracker.


On Sun, May 12, 2013 at 11:25 PM, Shashidhar Rao raoshashidhar...@gmail.com
 wrote:

 @shashwat shriparv

 Can the a hadoop job be submitted to any datanode in the cluster and not
 to jobTracker.

 Correct me if it I am wrong , I was told that a hadoop job can be
 submitted to datanode also apart from JobTracker. Is it correct?

 Advanced thanks


 On Sun, May 12, 2013 at 11:02 PM, shashwat shriparv 
 dwivedishash...@gmail.com wrote:


 On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar nitinpawar...@gmail.comwrote:


 normally if you want to copy the jar then hadoop admins setu


  Submit you job to Job tracker it will distribute throughout the
 tasktrackers.

 *Thanks  Regards*

 ∞
 Shashwat Shriparv





-- 
Nitin Pawar


Re: Submitting a hadoop job in large clusters.

2013-05-12 Thread shashwat shriparv
As Nitin said, it is the responsibility of the JobTracker to distribute the
job as tasks to the tasktrackers, so you need to submit the job to the job tracker.

*Thanks  Regards*

∞
Shashwat Shriparv



On Sun, May 12, 2013 at 11:26 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 nope
 in MRv1 only jobtracker can accept jobs. You can not trigger job on any
 other process in hadoop other than jobtracker.


 On Sun, May 12, 2013 at 11:25 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 @shashwat shriparv

 Can the a hadoop job be submitted to any datanode in the cluster and not
 to jobTracker.

 Correct me if it I am wrong , I was told that a hadoop job can be
 submitted to datanode also apart from JobTracker. Is it correct?

 Advanced thanks


 On Sun, May 12, 2013 at 11:02 PM, shashwat shriparv 
 dwivedishash...@gmail.com wrote:


 On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar 
 nitinpawar...@gmail.comwrote:


 normally if you want to copy the jar then hadoop admins setu


  Submit you job to Job tracker it will distribute throughout the
 tasktrackers.

 *Thanks  Regards*

 ∞
 Shashwat Shriparv





 --
 Nitin Pawar



Eclipse Plugin: HADOOP 2.0.3

2013-05-12 Thread Gourav Sengupta
Hi,

Is there a method to build the ECLIPSE plugin using HADOOP 2.0.3?

I am looking at the details in  http://wiki.apache.org/hadoop/EclipsePlugIn,
but I am not able to find any eclipse-plugin folder in the src.


Thanks and Regards,
Gourav


Re: Submitting a hadoop job in large clusters.

2013-05-12 Thread Bertrand Dechoux
Which doesn't imply that you should log in to the physical machine
where the JobTracker is hosted. It only implies that the hadoop client must
be able to reach the JobTracker. It could be from any of the physical machines
hosting the slaves (DataNode, Tasktracker), but that is rarely the case.
Often, jobs are submitted from a machine which doesn't belong to the cluster
but can reach every machine in it.
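
For illustration, a client on such a machine only needs its configuration to
point at the NameNode and the JobTracker; a minimal MRv1-style sketch (host
names, ports and the job name below are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitFromClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the cluster; hosts/ports are placeholders.
    conf.set("fs.default.name", "hdfs://namenode-host:9000");
    conf.set("mapred.job.tracker", "jobtracker-host:9001");

    Job job = new Job(conf, "example-job");
    // ... set the job jar, mapper/reducer classes and input/output paths here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}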

Regards

Bertrand



On Sun, May 12, 2013 at 7:59 PM, shashwat shriparv 
dwivedishash...@gmail.com wrote:

 As nitin said , its responsibility of Jobtracker to distribute the job to
 task to the tasktrackers so you need to submitt the job to the job tracker

 *Thanks  Regards*

 ∞
 Shashwat Shriparv



 On Sun, May 12, 2013 at 11:26 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 nope
 in MRv1 only jobtracker can accept jobs. You can not trigger job on any
 other process in hadoop other than jobtracker.


 On Sun, May 12, 2013 at 11:25 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 @shashwat shriparv

 Can the a hadoop job be submitted to any datanode in the cluster and not
 to jobTracker.

 Correct me if it I am wrong , I was told that a hadoop job can be
 submitted to datanode also apart from JobTracker. Is it correct?

 Advanced thanks


 On Sun, May 12, 2013 at 11:02 PM, shashwat shriparv 
 dwivedishash...@gmail.com wrote:


 On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar 
 nitinpawar...@gmail.comwrote:


 normally if you want to copy the jar then hadoop admins setu


  Submit you job to Job tracker it will distribute throughout the
 tasktrackers.

 *Thanks  Regards*

 ∞
 Shashwat Shriparv





 --
 Nitin Pawar





Re: Wrapping around BitSet with the Writable interface

2013-05-12 Thread Harsh J
You can perhaps consider using the experimental JavaSerialization [1]
enhancement to skip transforming to
Writables/other-serialization-formats. It may be slower but looks like
you are looking for a way to avoid transforming objects.

Enable by adding the class
org.apache.hadoop.io.serializer.JavaSerialization to the list of
io.serializations like so in your client configuration:

<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.hadoop.io.serializer.JavaSerialization</value>
</property>

And you should then be able to rely on Java's inbuilt serialization to
directly serialize your BitSet object?

[1] - 
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/serializer/JavaSerialization.html
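
For illustration, the same thing can be set programmatically from the job
driver; a minimal sketch (the driver class name and job name are placeholders,
and it relies on java.util.BitSet implementing java.io.Serializable):

import java.util.BitSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BitSetOrDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Append JavaSerialization to the default Writable serialization.
    conf.set("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization,"
      + "org.apache.hadoop.io.serializer.JavaSerialization");

    Job job = new Job(conf, "bitset-or");
    // BitSet is Serializable, so JavaSerialization can carry it as a value type.
    job.setMapOutputValueClass(BitSet.class);
    // ... set mapper/reducer classes and input/output paths here ...
  }
}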

On Sun, May 12, 2013 at 11:54 PM, Jim Twensky jim.twen...@gmail.com wrote:
 I have large java.util.BitSet objects that I want to bitwise-OR using a
 MapReduce job. I decided to wrap around each object using the Writable
 interface. Right now I convert each BitSet to a byte array and serialize the
 byte array on disk.

 Converting them to byte arrays is a bit inefficient but I could not find a
 work around to write them directly to the DataOutput. Is there a way to skip
 this and serialize the object directly? Here is what my current
 implementation looks like:

 public class BitSetWritable implements Writable {

   private BitSet bs;

   public BitSetWritable() {
 this.bs = new BitSet();
   }

   @Override
   public void write(DataOutput out) throws IOException {

 ByteArrayOutputStream bos = new ByteArrayOutputStream(bs.size()/8);
 ObjectOutputStream oos = new ObjectOutputStream(bos);
 oos.writeObject(bs);
 byte[] bytes = bos.toByteArray();
 oos.close();
 out.writeInt(bytes.length);
 out.write(bytes);

   }

   @Override
   public void readFields(DataInput in) throws IOException {

 int len = in.readInt();
 byte[] bytes = new byte[len];
 in.readFully(bytes);

 ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
 ObjectInputStream ois = new ObjectInputStream(bis);
 try {
   bs = (BitSet) ois.readObject();
 } catch (ClassNotFoundException e) {
   throw new IOException(e);
 }

 ois.close();
   }

 }



--
Harsh J


Re: Wrapping around BitSet with the Writable interface

2013-05-12 Thread Bertrand Dechoux
In order to make the code more readable, you could start by using the
methods toByteArray() and valueOf(bytes)

http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#toByteArray%28%29
http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#valueOf%28byte[]%29
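
With those two methods, a minimal sketch of the Writable (Java 7+, keeping the
class and field names from the original post) could drop the ObjectOutputStream
round-trip entirely:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.BitSet;
import org.apache.hadoop.io.Writable;

public class BitSetWritable implements Writable {

  private BitSet bs = new BitSet();

  @Override
  public void write(DataOutput out) throws IOException {
    // toByteArray() snapshots the set bits; length-prefix it so readFields knows how much to read.
    byte[] bytes = bs.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int len = in.readInt();
    byte[] bytes = new byte[len];
    in.readFully(bytes);
    bs = BitSet.valueOf(bytes);
  }
}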

Regards

Bertrand


On Sun, May 12, 2013 at 8:24 PM, Jim Twensky jim.twen...@gmail.com wrote:

 I have large java.util.BitSet objects that I want to bitwise-OR using a
 MapReduce job. I decided to wrap around each object using the Writable
 interface. Right now I convert each BitSet to a byte array and serialize
 the byte array on disk.

 Converting them to byte arrays is a bit inefficient but I could not find a
 work around to write them directly to the DataOutput. Is there a way to
 skip this and serialize the object directly? Here is what my current
 implementation looks like:

 public class BitSetWritable implements Writable {

   private BitSet bs;

   public BitSetWritable() {
 this.bs = new BitSet();
   }

   @Override
   public void write(DataOutput out) throws IOException {

 ByteArrayOutputStream bos = new ByteArrayOutputStream(bs.size()/8);
 ObjectOutputStream oos = new ObjectOutputStream(bos);
 oos.writeObject(bs);
 byte[] bytes = bos.toByteArray();
 oos.close();
 out.writeInt(bytes.length);
 out.write(bytes);

   }

   @Override
   public void readFields(DataInput in) throws IOException {

 int len = in.readInt();
 byte[] bytes = new byte[len];
 in.readFully(bytes);

 ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
 ObjectInputStream ois = new ObjectInputStream(bis);
 try {
   bs = (BitSet) ois.readObject();
 } catch (ClassNotFoundException e) {
   throw new IOException(e);
 }

 ois.close();
   }

 }




-- 
Bertrand Dechoux


Re: issues with decrease the default.block.size

2013-05-12 Thread Ted Dunning
The block size controls lots of things in Hadoop.

It affects read parallelism, scalability, block allocation and other
aspects of operations either directly or indirectly.


On Sun, May 12, 2013 at 10:38 AM, shashwat shriparv 
dwivedishash...@gmail.com wrote:

 The block size is for allocation not storage on the disk.

 *Thanks  Regards*

 ∞
 Shashwat Shriparv



 On Fri, May 10, 2013 at 8:54 PM, Harsh J ha...@cloudera.com wrote:

 Thanks. I failed to add: It should be okay to do if those cases are
 true and the cluster seems under-utilized right now.

 On Fri, May 10, 2013 at 8:29 PM, yypvsxf19870706
 yypvsxf19870...@gmail.com wrote:
  Hi harsh
 
  Yep.
 
 
 
  Regards
 
 
 
 
 
 
  Sent from my iPhone
 
  On 2013-5-10, 13:27, Harsh J ha...@cloudera.com wrote:
 
  Are you looking to decrease it to get more parallel map tasks out of
  the small files? Are you currently CPU bound on processing these small
  files?
 
  On Thu, May 9, 2013 at 9:12 PM, YouPeng Yang 
 yypvsxf19870...@gmail.com wrote:
  hi ALL
 
  I am going to set up a new hadoop environment. Because there are
  lots of small files, I would like to change the default.block.size to
  16MB rather than adopting the approach of merging the files into large
  enough ones (e.g. using sequencefiles).
 I want to ask: are there any bad influences or issues?
 
  Regards
 
 
 
  --
  Harsh J



 --
 Harsh J





Re: Wrapping around BitSet with the Writable interface

2013-05-12 Thread Ted Dunning
Another interesting alternative is the EWAH implementation of java bitsets
that allow efficient compressed bitsets with very fast OR operations.

https://github.com/lemire/javaewah

See also https://code.google.com/p/sparsebitmap/ by the same authors.


On Sun, May 12, 2013 at 1:11 PM, Bertrand Dechoux decho...@gmail.comwrote:

 In order to make the code more readable, you could start by using the
 methods toByteArray() and valueOf(bytes)


 http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#toByteArray%28%29

 http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#valueOf%28byte[]%29

 Regards

 Bertrand


 On Sun, May 12, 2013 at 8:24 PM, Jim Twensky jim.twen...@gmail.comwrote:

 I have large java.util.BitSet objects that I want to bitwise-OR using a
 MapReduce job. I decided to wrap around each object using the Writable
 interface. Right now I convert each BitSet to a byte array and serialize
 the byte array on disk.

 Converting them to byte arrays is a bit inefficient but I could not find
 a work around to write them directly to the DataOutput. Is there a way to
 skip this and serialize the object directly? Here is what my current
 implementation looks like:

 public class BitSetWritable implements Writable {

   private BitSet bs;

   public BitSetWritable() {
 this.bs = new BitSet();
   }

   @Override
   public void write(DataOutput out) throws IOException {

 ByteArrayOutputStream bos = new ByteArrayOutputStream(bs.size()/8);
 ObjectOutputStream oos = new ObjectOutputStream(bos);
 oos.writeObject(bs);
 byte[] bytes = bos.toByteArray();
 oos.close();
 out.writeInt(bytes.length);
 out.write(bytes);

   }

   @Override
   public void readFields(DataInput in) throws IOException {

 int len = in.readInt();
 byte[] bytes = new byte[len];
 in.readFully(bytes);

 ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
 ObjectInputStream ois = new ObjectInputStream(bis);
 try {
   bs = (BitSet) ois.readObject();
 } catch (ClassNotFoundException e) {
   throw new IOException(e);
 }

 ois.close();
   }

 }




 --
 Bertrand Dechoux



Re: Wrapping around BitSet with the Writable interface

2013-05-12 Thread Bertrand Dechoux
You can disregard my links as they are only valid for Java 1.7+.
The JavaSerialization might clean your code but shouldn't bring a
significant boost in performance.
The EWAH implementation has, at least, the methods you are looking for :
serialize / deserialize.
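
For illustration, a Writable wrapper around the EWAH bitmap might look like the
sketch below, assuming the javaewah jar is on the classpath and that the class
name (com.googlecode.javaewah.EWAHCompressedBitmap) and its serialize/deserialize
methods match the version in use:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import com.googlecode.javaewah.EWAHCompressedBitmap;

public class EwahBitmapWritable implements Writable {

  private EWAHCompressedBitmap bitmap = new EWAHCompressedBitmap();

  @Override
  public void write(DataOutput out) throws IOException {
    // The compressed bitmap writes itself straight to the DataOutput.
    bitmap.serialize(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Rebuild the bitmap from the DataInput.
    bitmap = new EWAHCompressedBitmap();
    bitmap.deserialize(in);
  }

  public EWAHCompressedBitmap get() {
    return bitmap;
  }
}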

Regards

Bertrand

Note to myself : I have to remember this one.


On Sun, May 12, 2013 at 10:27 PM, Ted Dunning tdunn...@maprtech.com wrote:

 Another interesting alternative is the EWAH implementation of java bitsets
 that allow efficient compressed bitsets with very fast OR operations.

 https://github.com/lemire/javaewah

 See also https://code.google.com/p/sparsebitmap/ by the same authors.


 On Sun, May 12, 2013 at 1:11 PM, Bertrand Dechoux decho...@gmail.comwrote:

 In order to make the code more readable, you could start by using the
 methods toByteArray() and valueOf(bytes)


 http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#toByteArray%28%29

 http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#valueOf%28byte[]%29

 Regards

 Bertrand


 On Sun, May 12, 2013 at 8:24 PM, Jim Twensky jim.twen...@gmail.comwrote:

 I have large java.util.BitSet objects that I want to bitwise-OR using a
 MapReduce job. I decided to wrap around each object using the Writable
 interface. Right now I convert each BitSet to a byte array and serialize
 the byte array on disk.

 Converting them to byte arrays is a bit inefficient but I could not find
 a work around to write them directly to the DataOutput. Is there a way to
 skip this and serialize the object directly? Here is what my current
 implementation looks like:

 public class BitSetWritable implements Writable {

   private BitSet bs;

   public BitSetWritable() {
 this.bs = new BitSet();
   }

   @Override
   public void write(DataOutput out) throws IOException {

 ByteArrayOutputStream bos = new ByteArrayOutputStream(bs.size()/8);
 ObjectOutputStream oos = new ObjectOutputStream(bos);
 oos.writeObject(bs);
 byte[] bytes = bos.toByteArray();
 oos.close();
 out.writeInt(bytes.length);
 out.write(bytes);

   }

   @Override
   public void readFields(DataInput in) throws IOException {

 int len = in.readInt();
 byte[] bytes = new byte[len];
 in.readFully(bytes);

 ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
 ObjectInputStream ois = new ObjectInputStream(bis);
 try {
   bs = (BitSet) ois.readObject();
 } catch (ClassNotFoundException e) {
   throw new IOException(e);
 }

 ois.close();
   }

 }




 --
 Bertrand Dechoux





-- 
Bertrand Dechoux


The minimum memory requirements to datanode and namenode?

2013-05-12 Thread sam liu
Hi,

I setup a cluster with 3 nodes, and after that I did not submit any job on
it. But, after few days, I found the cluster is unhealthy:
- No result returned after issuing command 'hadoop dfs -ls /' or 'hadoop
dfsadmin -report' for a while
- The page of 'http://namenode:50070' could not be opened as expected...
- ...

I did not find any useful info in the logs, but found the available memory
of the cluster nodes was very low at that time:
- node1(NN,JT,DN,TT): 158 mb mem is available
- node2(DN,TT): 75 mb mem is available
- node3(DN,TT): 174 mb mem is available

I guess the issue with my cluster is caused by a lack of memory, and my
questions are:
- Without running jobs, what are the minimum memory requirements for the
datanode and namenode?
- How do I determine the minimum memory for the datanode and namenode?
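
For reference, in Hadoop 1.x the daemon heap sizes are normally controlled from
conf/hadoop-env.sh; a minimal sketch (the values below are illustrative
assumptions, not recommendations):

# Default heap for all Hadoop daemons, in MB.
export HADOOP_HEAPSIZE=256
# Per-daemon overrides (appended to whatever is already set).
export HADOOP_NAMENODE_OPTS="-Xmx512m $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Xmx256m $HADOOP_DATANODE_OPTS"

The real minimum is driven mostly by how much file and block metadata the
NameNode has to keep in memory.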

Thanks!

Sam Liu


Re: The minimum memory requirements to datanode and namenode?

2013-05-12 Thread Rishi Yadav
Do you get any error when trying to connect to the cluster, something like
'tried n times' or 'replicated 0 times'?




On Sun, May 12, 2013 at 7:28 PM, sam liu samliuhad...@gmail.com wrote:

 Hi,

 I setup a cluster with 3 nodes, and after that I did not submit any job on
 it. But, after few days, I found the cluster is unhealthy:
 - No result returned after issuing command 'hadoop dfs -ls /' or 'hadoop
 dfsadmin -report' for a while
 - The page of 'http://namenode:50070' could not be opened as expected...
 - ...

 I did not find any usefull info in the logs, but found the avaible memory
 of the cluster nodes are very low at that time:
 - node1(NN,JT,DN,TT): 158 mb mem is available
 - node2(DN,TT): 75 mb mem is available
 - node3(DN,TT): 174 mb mem is available

 I guess the issue of my cluster is caused by lacking of memeory, and my
 questions are:
 - Without running jobs, what's the minimum memory requirements to datanode
 and namenode?
 - How to define the minimum memeory for datanode and namenode?

 Thanks!

 Sam Liu



Re: The minimum memory requirements to datanode and namenode?

2013-05-12 Thread sam liu
Got some exceptions on node3:
1. datanode log:
2013-04-17 11:13:44,719 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_2478755809192724446_1477 received exception
java.net.SocketTimeoutException: 63000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371 remote=/
9.50.102.79:50010]
2013-04-17 11:13:44,721 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
9.50.102.80:50010, storageID=DS-2038715921-9.50.102.80-50010-1366091297051,
infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 63000 millis timeout while waiting for
channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371 remote=/
9.50.102.79:50010]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116)
at java.io.DataInputStream.readShort(DataInputStream.java:306)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:359)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
at java.lang.Thread.run(Thread.java:738)
2013-04-17 11:13:44,818 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_8413378381769505032_1477 src: /9.50.102.81:35279 dest: /
9.50.102.80:50010


2. tasktracker log:
2013-04-23 11:48:26,783 INFO org.apache.hadoop.mapred.UserLogCleaner:
Deleting user log path job_201304152248_0011
2013-04-30 14:48:15,506 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call to node1/9.50.102.81:9001 failed on
local exception: java.io.IOException: Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1144)
at org.apache.hadoop.ipc.Client.call(Client.java:1112)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at org.apache.hadoop.mapred.$Proxy2.heartbeat(Unknown Source)
at
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:2008)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1802)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2654)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3909)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:210)
at sun.nio.ch.IOUtil.read(IOUtil.java:183)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:257)
at
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:127)
at
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:361)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
at java.io.DataInputStream.readInt(DataInputStream.java:381)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:841)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:786)

2013-04-30 14:48:15,517 INFO org.apache.hadoop.mapred.TaskTracker:
Resending 'status' to 'node1' with reponseId '-12904
2013-04-30 14:48:16,404 INFO org.apache.hadoop.mapred.TaskTracker:
SHUTDOWN_MSG:



2013/5/13 Rishi Yadav ri...@infoobjects.com

 do you get any error when trying to connect to cluster, something like
 'tried n times' or replicated 0 times.




 On Sun, May 12, 2013 at 7:28 PM, sam liu samliuhad...@gmail.com wrote:

 Hi,

 I setup a cluster with 3 nodes, and after that I did not submit any job
 on it. But, after few days, I found the cluster is unhealthy:
 - No result returned after issuing command 'hadoop dfs -ls /' or 'hadoop
 dfsadmin -report' for a while
 - The page of 'http://namenode:50070' could not be opened as expected...
 - ...

 I did not find any usefull info in the logs, but found the avaible memory
 of the cluster nodes are very low at that time:
 - node1(NN,JT,DN,TT): 158 mb mem is available
 - node2(DN,TT): 75 mb mem is available
 - node3(DN,TT): 174 mb mem is available

 I guess the issue of my cluster is caused by lacking of memeory, and my
 questions are:
 - Without running jobs, what's the minimum memory 

Re: The minimum memory requirements to datanode and namenode?

2013-05-12 Thread sam liu
For node3, the memory is:
                   total       used       free     shared    buffers     cached
Mem:                3834       3666        167          0        187       1136
-/+ buffers/cache:              2342       1491
Swap:               8196          0       8196

For a 3-node cluster like mine, what is the required minimum free/available
memory for the datanode process and tasktracker process, without running
any map/reduce task?
Is there any formula to determine it?


2013/5/13 Rishi Yadav ri...@infoobjects.com

 Can you tell the specs of node3? Even on a test/demo cluster, anything below 4
 GB RAM makes the node almost inaccessible, as per my experience.



 On Sun, May 12, 2013 at 8:25 PM, sam liu samliuhad...@gmail.com wrote:

 Got some exceptions on node3:
 1. datanode log:
 2013-04-17 11:13:44,719 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
 blk_2478755809192724446_1477 received exception
 java.net.SocketTimeoutException: 63000 millis timeout while waiting for
 channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371remote=/
 9.50.102.79:50010]
 2013-04-17 11:13:44,721 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
 9.50.102.80:50010,
 storageID=DS-2038715921-9.50.102.80-50010-1366091297051, infoPort=50075,
 ipcPort=50020):DataXceiver
 java.net.SocketTimeoutException: 63000 millis timeout while waiting for
 channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371remote=/
 9.50.102.79:50010]
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116)
 at java.io.DataInputStream.readShort(DataInputStream.java:306)
 at
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:359)
 at
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
 at java.lang.Thread.run(Thread.java:738)
 2013-04-17 11:13:44,818 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
 blk_8413378381769505032_1477 src: /9.50.102.81:35279 dest: /
 9.50.102.80:50010


 2. tasktracker log:
 2013-04-23 11:48:26,783 INFO org.apache.hadoop.mapred.UserLogCleaner:
 Deleting user log path job_201304152248_0011
 2013-04-30 14:48:15,506 ERROR org.apache.hadoop.mapred.TaskTracker:
 Caught exception: java.io.IOException: Call to node1/9.50.102.81:9001failed 
 on local exception: java.io.IOException: Connection reset by peer
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:1144)
 at org.apache.hadoop.ipc.Client.call(Client.java:1112)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
 at org.apache.hadoop.mapred.$Proxy2.heartbeat(Unknown Source)
 at
 org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:2008)
 at
 org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1802)
 at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2654)
 at
 org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3909)
 Caused by: java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:210)
 at sun.nio.ch.IOUtil.read(IOUtil.java:183)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:257)
 at
 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
 at java.io.FilterInputStream.read(FilterInputStream.java:127)
 at
 org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:361)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
 at java.io.DataInputStream.readInt(DataInputStream.java:381)
 at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:841)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:786)

 2013-04-30 14:48:15,517 INFO org.apache.hadoop.mapred.TaskTracker:
 Resending 'status' to 'node1' with reponseId '-12904
 2013-04-30 14:48:16,404 INFO org.apache.hadoop.mapred.TaskTracker:
 SHUTDOWN_MSG:



 2013/5/13 Rishi Yadav ri...@infoobjects.com

 do you get any error when trying to connect to cluster, something like
 'tried n times' or replicated 0 times.




 On Sun,