Running Hadoop client as a different user
I have been running Hadoop on a cluster set to not check permissions. I run a Java client on my local machine, and it runs as the local user on the cluster. My code does something like:

String connectString = "hdfs://" + host + ":" + port + "/";
Configuration config = new Configuration();
config.set("fs.default.name", connectString);
FileSystem fs = FileSystem.get(config);

The above code works. I am now trying to port to a cluster where permissions are checked. I have an account, but I need to set a user and password to avoid AccessControlExceptions. How do I do this? And if I can only access certain directories, how do I do that? Also, are there directories my code MUST be able to access outside those for my user only? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
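A minimal sketch of one way to do this on a cluster that uses simple (non-Kerberos) authentication. The account name "slewis" and the address namenode:9000 are placeholders, not anything from the original message. With simple authentication there is no password; the NameNode checks permissions against whatever user name the client presents, so wrapping the calls in a UserGroupInformation.doAs block is usually enough:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class HdfsClientAsUser {
    public static void main(String[] args) throws Exception {
        final Configuration config = new Configuration();
        config.set("fs.default.name", "hdfs://namenode:9000/"); // placeholder host:port

        // Act as the cluster account instead of the local OS user (assumed account name).
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("slewis");
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                FileSystem fs = FileSystem.get(config);
                // Directories under /user/<account> are normally writable by the account;
                // anything else depends on how the cluster admins have set permissions.
                System.out.println(fs.exists(new Path("/user/slewis")));
                fs.close();
                return null;
            }
        });
    }
}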
Re: Hadoop noob question
@Tariq, can you point me to some resource that shows how distcp is used to upload files from local to HDFS? Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's file system? Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.com wrote: You're welcome :) Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq donta...@gmail.com wrote: @Rahul : Yes, distcp can do that. And the bigger the files, the less metadata there is, hence less memory consumption. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need a capable NN. Can distcp be used to copy local to HDFS? Thanks, Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar nitinpawar...@gmail.com wrote: Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq donta...@gmail.com wrote: Sorry for barging in, guys. I think Nitin is talking about this: every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in memory. Actually, memory is the most important metric when it comes to the NN. Am I correct, @Nitin? @Thoihen : As Nitin has said, when you talk about that much data you don't actually just do a put. You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own data aggregation tool, called Scribe, for this purpose. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar nitinpawar...@gmail.com wrote: The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will be on the NN, so it is always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: @Nitin, parallel DFS writes to HDFS are great, but I could not understand the meaning of a "capable NN". As I know, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the DFS client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored. Thanks, Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar nitinpawar...@gmail.com wrote: Is it safe? There is no direct yes-or-no answer. When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture: 1) Is the machine in the same network as your Hadoop cluster? 2) Is there a guarantee that the network will not go down? And most importantly, I assume that you have a capable Hadoop cluster, by which I mean a capable NameNode. I would definitely not write files sequentially to HDFS. I would prefer to write files in parallel to HDFS, to utilize the DFS write features and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.

On Sat, May 11, 2013 at 4:38 PM, maisnam ns maisnam...@gmail.com wrote: @Nitin Pawar, thanks for clearing my doubts. But I have one more question: say I have 10 TB of data in the pipeline. Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size using the Hadoop command line? Can the hadoop put command line work with huge data? Thanks in advance.

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar nitinpawar...@gmail.com wrote: First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where the data is written to HDFS on a regular frequency, retained on HDFS for whatever duration is needed, and from there sent to archival or deleted. For data management products, you can look at Falcon, which was open-sourced by InMobi along with Hortonworks. In any case, if you want to write files to HDFS there are a few options available to you (a small sketch of the first option follows this message): 1) Write your own DFS client which writes to DFS 2) Use the HDFS proxy 3) Use WebHDFS 4) Use the hdfs command line 5) Data collection tools such as Flume come with support for writing to HDFS

On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam thoihen...@gmail.com wrote: Hi All, Can anyone help me know how does
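A minimal sketch of option (1) above: a small client that pushes a local file into HDFS through the FileSystem API. The NameNode address and the paths are placeholders, and for a large backlog you would run several copies of this in parallel rather than one sequential loop, as Nitin suggests:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020/"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        // Copy one local file into HDFS; launch several of these processes side by side
        // to get parallel writes.
        fs.copyFromLocalFile(new Path("file:///data/incoming/part-0001.log"),
                             new Path("/user/thoihen/incoming/part-0001.log"));
        fs.close();
    }
}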
Re: Hadoop noob question
You can do that using file:///. Example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: @Tariq, can you point me to some resource that shows how distcp is used to upload files from local to HDFS? Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's file system? Rahul
Re: Need help about task slots
@Rahul : I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy: *bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/* And yes, when you use distcp from local to HDFS you can't take advantage of parallelism, as the data is stored in a non-distributed fashion. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq donta...@gmail.com wrote: Hello guys, my 2 cents: the number of mappers is primarily governed by the number of InputSplits created by the InputFormat you are using, and the number of reducers by the number of partitions you get after the map phase. Having said that, you should also keep the number of slots available per slave in mind, along with the available memory. But as a general rule you could use this approach: take the number of virtual CPUs * 0.75, and that's the number of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirement, you could choose how many mappers and reducers you want to use. With 18 MR slots you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split is OK for you. I don't know if it makes much sense, but it works pretty decently for me. Warm Regards, Tariq cloudfront.blogspot.com

On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I am also new to the Hadoop world; here is my take on your question, and if something is missing then others will surely correct it. Pre-YARN, the slots are fixed and computed based on the crunching capacity of the datanode hardware; once the slots per datanode are ascertained, they are divided into map and reduce slots, and that goes into the config files and remains fixed until changed. In YARN, it is decided at runtime based on the requirements of the particular task. It is quite possible that at a certain point in time one datanode is running 10 tasks while another similar datanode is only running 4 tasks. Coming to your question: based on the data set size, the DFS block size, and the input format, the number of map tasks is decided. Generally, for file-based input formats it is one mapper per data block, although there are ways to change this via configuration settings. The number of reduce tasks is set in the job configuration. A general rule I have read in various documents is that mappers should run for at least a minute, so you can run a sample to find a good data block size that makes your mappers run for more than a minute. It then also depends on your SLA; if you are not looking for a very tight SLA, you can choose to run fewer mappers at the expense of a higher runtime. But again, this is all theory; I am not sure how these things are handled in actual production clusters. HTH, Thanks, Rahul

On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi Users, I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots would be required for a job? Is there any empirical formula, or on what basis should I set the number of task slots? Thanks in advance
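A small worked version of the rule of thumb above; the numbers and the even map/reduce split are only illustrative assumptions, not anything Hadoop computes for you:

public class SlotEstimate {
    // Tariq's heuristic: roughly 0.75 slots per virtual core.
    static int estimateSlots(int virtualCores) {
        return (int) (virtualCores * 0.75);
    }

    public static void main(String[] args) {
        int slots = estimateSlots(24);        // 24 virtual cores -> 18 slots
        int maps = slots / 2;                 // one possible split: 9 map slots
        int reduces = slots - maps;           // and 9 reduce slots
        System.out.println(slots + " slots: " + maps + " map + " + reduces + " reduce");
    }
}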
Re: Hadoop noob question
@Rahul : I'm sorry, I answered this on the wrong thread by mistake. You could do it as Nitin has shown. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can do that using file:/// example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
Re: Need help about task slots
Sorry for the blunder, guys. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.com wrote: @Rahul : I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy: *bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/* And yes, when you use distcp from local to HDFS you can't take advantage of parallelism, as the data is stored in a non-distributed fashion.
Re: Hadoop noob question
Thanks to both of you! Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can do that using file:/// example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
Re: Need help about task slots
Hi, the concept of task slots is used in MRv1. The newer version of Hadoop, MRv2, uses YARN instead of slots. You can read about it in Hadoop: The Definitive Guide, 3rd edition. Sent from my iPhone

On 2013-5-12, at 20:11, Mohammad Tariq donta...@gmail.com wrote: Sorry for the blunder guys. Warm Regards, Tariq cloudfront.blogspot.com
Re: Need help about task slots
Oh! I thought distcp worked on complete files rather than one mapper per data block, so I guessed parallelism would still be there if there are multiple files. Please correct me if anything is wrong. Thanks, Rahul

On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq donta...@gmail.com wrote: @Rahul : I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy: *bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/* And yes, when you use distcp from local to HDFS you can't take advantage of parallelism, as the data is stored in a non-distributed fashion.
Re: Need help about task slots
Sorry for my blunder as well; my previous post was meant for Tariq but went to the wrong thread. Thanks, Rahul

On Sun, May 12, 2013 at 6:03 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Oh! I thought distcp worked on complete files rather than one mapper per data block, so I guessed parallelism would still be there if there are multiple files. Please correct me if anything is wrong. Thanks, Rahul
Re: Need help about task slots
Hahaha.. I think we could continue this over there. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Sorry for my blunder as well; my previous post was meant for Tariq but went to the wrong thread. Thanks. Rahul
Re: Hadoop noob question
No. distcp is actually a MapReduce job under the hood. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks to both of you! Rahul
Re: Hadoop noob question
I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism, as the entire file is present on a single machine. So no multiple TTs. Please comment if you think I am wrong somewhere. Warm Regards, Tariq cloudfront.blogspot.com

On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Yes, it's an MR job under the hood. My question was about your statement that with distcp you lose the benefits of Hadoop's parallel processing. I think the distcp MR job divides files into individual map tasks based on the total size of the transfer, so multiple mappers would still be spawned if the transfer is huge, and they would work in parallel. Correct me if there is anything wrong! Thanks, Rahul

On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq donta...@gmail.com wrote: No. distcp is actually a MapReduce job under the hood.
Re: Hadoop noob question
Yeah, you are right, I misread your earlier post. Thanks, Rahul

On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq donta...@gmail.com wrote: I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism, as the entire file is present on a single machine. So no multiple TTs. Please comment if you think I am wrong somewhere.
Re: Hadoop noob question
This is what I would say : The number of maps is decided as follows. Since it’s a good idea to get each map to copy a reasonable amount of data to minimize overheads in task setup, each map copies at least 256 MB (unless the total size of the input is less, in which case one map handles it all). For example, 1 GB of files will be given four map tasks. When the data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization. By default, the maximum number of maps is 20 per (tasktracker) cluster node. For example, copying 1,000 GB of files to a 100-node cluster will allocate 2,000 maps (20 per node), so each will copy 512 MB on average. This can be reduced by specifying the-m argument to * distcp*. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB on average. HTH Warm Regards, Tariq cloudfront.blogspot.com On Sun, May 12, 2013 at 6:35 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Soon after replying I realized something else related to this. Say we have a single file in HDFS (hdfs configured for default block size 64 MB) and the size of the file is 1 GB. Now if we use distcp to move it from the current hdfs to another one , then whether there would be any parallelism or just a single map task would be fired? As per what I have read , a mapper is launcher for a complete file or a set of files. It doesn't operate at block level.So no parallelism even if the file resides in HDFS. Thanks, Rahul On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: yeah you are right I mis read your earlier post. Thanks, Rahul On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq donta...@gmail.comwrote: I had said that if you use distcp to copy data *from localFS to HDFS*then you won't be able to exploit parallelism as entire file is present on a single machine. So no multiple TTs. Please comment if you think I am wring somewhere. Warm Regards, Tariq cloudfront.blogspot.com On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Yes , it's a MR job under the hood . my question was that you wrote that using distcp you loose the benefits of parallel processing of Hadoop. I think the MR job of distcp divides files into individual map tasks based on the total size of the transfer , so multiple mappers would still be spawned if the size of transfer is huge and they would work in parallel. Correct me if there is anything wrong! Thanks, Rahul On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq donta...@gmail.comwrote: No. distcp is actually a mapreduce job under the hood. Warm Regards, Tariq cloudfront.blogspot.com On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks to both of you! Rahul On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can do that using file:/// example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/ On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: @Tariq can you point me to some resource which shows how distcp is used to upload files from local to hdfs. isn't distcp a MR job ? wouldn't it need the data to be already present in the hadoop's fs? Rahul On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq donta...@gmail.com wrote: You'r welcome :) Warm Regards, Tariq cloudfront.blogspot.com On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks Tariq! 
Re: Submitting a hadoop job in large clusters.
On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar nitinpawar...@gmail.com wrote: normally if you want to copy the jar then hadoop admins setu Submit your job to the JobTracker; it will distribute it throughout the tasktrackers. *Thanks Regards* ∞ Shashwat Shriparv
Re: Permissions
The user through which you are trying to run the task should have permission on HDFS. Just verify that. *Thanks Regards* ∞ Shashwat Shriparv On Sat, May 11, 2013 at 1:02 AM, Amal G Jose amalg...@gmail.com wrote: After starting the hdfs, i.e. NN, SN and DN, create an hdfs directory structure in the form /hadoop.tmp.dir/mapred/staging. Then give 777 permission to staging. After that change the ownership of the mapred directory to the mapred user. After doing this start the jobtracker; it will start. Otherwise, it will not start. The reason for not showing any datanodes may be a firewall. Check whether the necessary ports are open. On Tue, Apr 30, 2013 at 2:28 AM, rkevinbur...@charter.net wrote: I looked in the namenode log and I get the following errors: 2013-04-29 15:25:11,646 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Permission denied: *user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x 2013-04-29 15:25:11,646 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 172.16.26.68:45044: error: org.apache.hadoop.security.AccessControlException: Permission denied: *user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x org.apache.hadoop.security.AccessControlException: Permission denied: *user=mapred*, access=WRITE, inode=/:hdfs:supergroup:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:205) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:186) When I create the file system I have the user hdfs on the root folder (/). I am not sure how to have both the user mapred and hdfs have access to the root (which it seems these errors are indicating). I get a page from 50070, but when I try to browse the filesystem from the web UI I get an error that there are no nodes listening (I have 3 datanodes and 1 namenode). The browser indicates that there is nothing listening to port 50030, so it seems that the JobTracker is not up.
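A minimal sketch of the staging-directory setup Amal describes, assuming the HDFS superuser is named hdfs and hadoop.tmp.dir is /app/hadoop/tmp (both are placeholders; substitute the values from your own configuration):

sudo -u hdfs hadoop fs -mkdir /app/hadoop/tmp/mapred/staging
sudo -u hdfs hadoop fs -chmod 777 /app/hadoop/tmp/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /app/hadoop/tmp/mapred

This gives the mapred user a place it can write to, which is what the "Permission denied: user=mapred" errors above point at.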
Re: issues with decrease the default.block.size
The block size is for allocation, not storage on the disk. *Thanks Regards* ∞ Shashwat Shriparv On Fri, May 10, 2013 at 8:54 PM, Harsh J ha...@cloudera.com wrote: Thanks. I failed to add: It should be okay to do if those cases are true and the cluster seems under-utilized right now. On Fri, May 10, 2013 at 8:29 PM, yypvsxf19870706 yypvsxf19870...@gmail.com wrote: Hi Harsh, Yep. Regards Sent from my iPhone On 2013-5-10, 13:27, Harsh J ha...@cloudera.com wrote: Are you looking to decrease it to get more parallel map tasks out of the small files? Are you currently CPU bound on processing these small files? On Thu, May 9, 2013 at 9:12 PM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi all, I am going to set up a new hadoop environment. Because there are lots of small files, I would like to change the default.block.size to 16 MB rather than merging the files into large enough ones (e.g. using SequenceFiles). I want to ask: are there any bad influences or issues? Regards -- Harsh J -- Harsh J
Re: hadoop map-reduce errors
Your connection settings to MySQL may not be correct; check that. *Thanks Regards* ∞ Shashwat Shriparv On Fri, May 10, 2013 at 6:12 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Have you checked your connection settings to the MySQL DB? Where and how are you passing the connection properties for the database? Is it accessible from the machine you are running this on? Is the db up? On Thu, May 9, 2013 at 9:32 PM, 丙子 woyaof...@gmail.com wrote: When I run a hadoop job, there are some errors like this: 13/05/10 08:20:59 ERROR manager.SqlManager: Error executing statement: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 28,484 milliseconds ago. The last packet sent successfully to the server was 1 milliseconds ago. com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 28,484 milliseconds ago. The last packet sent successfully to the server was 1 milliseconds ago. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) …… …… at org.apache.sqoop.Sqoop.main(Sqoop.java:238) at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57) Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3039) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3489) ... 24 more How can I resolve it?
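Before digging into the Hadoop side, it can help to confirm the connection details from the same machine with the plain MySQL client and a lightweight Sqoop call; the host, database and user below are placeholders:

mysql -h dbhost -P 3306 -u myuser -p mydb
sqoop list-tables --connect jdbc:mysql://dbhost:3306/mydb --username myuser -P

If either of these hangs or fails, the problem is likely the network path or the MySQL configuration (e.g. bind-address, firewall) rather than the job itself.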
Re: Problem while running simple WordCount program(hadoop-1.0.4) on eclipse.
The user through which you are running your Hadoop: set permissions on the tmp dir for that user. *Thanks Regards* ∞ Shashwat Shriparv On Fri, May 10, 2013 at 5:24 PM, Nitin Pawar nitinpawar...@gmail.com wrote: What are the permissions of your /tmp/ folder? On May 10, 2013 5:03 PM, Khaleel Khalid khale...@suntecgroup.com wrote: Hi all, I am facing the following error when I run a simple WordCount program using hadoop-1.0.4 on Eclipse (Galileo). The map/reduce plugin version I use is 1.0.4 as well. It would be really helpful if someone gives me a solution for the problem. ERROR: 13/05/10 16:53:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 13/05/10 16:53:51 ERROR security.UserGroupInformation: PriviledgedActionException as:khaleelk cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700 Exception in thread main java.io.IOException: Failed to set permissions of path: \tmp\hadoop-khaleelk\mapred\staging\khaleelk-1067522586\.staging to 0700 at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) at org.apache.hadoop.mapreduce.Job.submit(Job.java:500) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530) at WordCount.main(WordCount.java:65) Thank you in advance.
Re: Problem while running simple WordCount program(hadoop-1.0.4) on eclipse.
It's a /tmp/ folder, so I guess all the users will need access to it. Better to make it a routine Linux-like /tmp folder. On Sun, May 12, 2013 at 11:12 PM, shashwat shriparv dwivedishash...@gmail.com wrote: The user through which you are running your Hadoop: set permissions on the tmp dir for that user. *Thanks Regards* ∞ Shashwat Shriparv -- Nitin Pawar
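On a Linux machine the "routine" /tmp setup Nitin refers to is world-writable with the sticky bit set (mode 1777):

sudo chmod 1777 /tmp
ls -ld /tmp    # should show drwxrwxrwt

Note that the backslash-separated paths in the error above suggest the job was run on Windows, where the local-filesystem permission check in hadoop-1.0.4 is a separate, known problem.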
Re: Submitting a hadoop job in large clusters.
@shashwat shriparv Can a hadoop job be submitted to any datanode in the cluster and not to the JobTracker? Correct me if I am wrong, I was told that a hadoop job can be submitted to a datanode also, apart from the JobTracker. Is it correct? Thanks in advance On Sun, May 12, 2013 at 11:02 PM, shashwat shriparv dwivedishash...@gmail.com wrote: On Sun, May 12, 2013 at 12:19 AM, Nitin Pawar nitinpawar...@gmail.com wrote: normally if you want to copy the jar then hadoop admins setu Submit your job to the JobTracker; it will distribute it throughout the tasktrackers. *Thanks Regards* ∞ Shashwat Shriparv
Re: Submitting a hadoop job in large clusters.
Nope, in MRv1 only the jobtracker can accept jobs. You cannot trigger a job on any other process in hadoop other than the jobtracker. On Sun, May 12, 2013 at 11:25 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: @shashwat shriparv Can a hadoop job be submitted to any datanode in the cluster and not to the JobTracker? Correct me if I am wrong, I was told that a hadoop job can be submitted to a datanode also, apart from the JobTracker. Is it correct? Thanks in advance -- Nitin Pawar
Re: Submitting a hadoop job in large clusters.
As Nitin said, it is the responsibility of the JobTracker to distribute the job as tasks to the tasktrackers, so you need to submit the job to the JobTracker. *Thanks Regards* ∞ Shashwat Shriparv On Sun, May 12, 2013 at 11:26 PM, Nitin Pawar nitinpawar...@gmail.com wrote: Nope, in MRv1 only the jobtracker can accept jobs. You cannot trigger a job on any other process in hadoop other than the jobtracker. -- Nitin Pawar
Eclipse Plugin: HADOOP 2.0.3
Hi, Is there a method to build the ECLIPSE plugin using HADOOP 2.0.3? I am looking at the details in http://wiki.apache.org/hadoop/EclipsePlugIn, but I am not able to find any eclipse-plugin folder in the src. Thanks and Regards, Gourav
Re: Submitting a hadoop job in large clusters.
Which doesn't imply that you should log in to the physical machine where the JobTracker is hosted. It only implies that the hadoop client must be able to reach the JobTracker. It could be from any of the physical machines hosting the slaves (DataNode, TaskTracker), but that is rarely the case. Often, jobs are submitted from a machine which doesn't belong to the cluster but can reach every machine of it. Regards Bertrand On Sun, May 12, 2013 at 7:59 PM, shashwat shriparv dwivedishash...@gmail.com wrote: As Nitin said, it is the responsibility of the JobTracker to distribute the job as tasks to the tasktrackers, so you need to submit the job to the JobTracker. *Thanks Regards* ∞ Shashwat Shriparv
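A rough sketch of what client-side submission looks like in MRv1 terms; the hostnames, ports and class names are placeholders, and passing properties with -D only works when the driver goes through ToolRunner/GenericOptionsParser (otherwise put the same properties in the client's core-site.xml and mapred-site.xml):

hadoop jar my-job.jar com.example.MyDriver \
  -D fs.default.name=hdfs://nnhost:8020 \
  -D mapred.job.tracker=jthost:9001 \
  /input /output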
Re: Wrapping around BitSet with the Writable interface
You can perhaps consider using the experimental JavaSerialization [1] enhancement to skip transforming to Writables/other-serialization-formats. It may be slower but looks like you are looking for a way to avoid transforming objects. Enable by adding the class org.apache.hadoop.io.serializer.JavaSerialization to the list of io.serializations like so in your client configuration:

<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.hadoop.io.serializer.JavaSerialization</value>
</property>

And you should then be able to rely on Java's inbuilt serialization to directly serialize your BitSet object? [1] - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/serializer/JavaSerialization.html On Sun, May 12, 2013 at 11:54 PM, Jim Twensky jim.twen...@gmail.com wrote: I have large java.util.BitSet objects that I want to bitwise-OR using a MapReduce job. I decided to wrap around each object using the Writable interface. Right now I convert each BitSet to a byte array and serialize the byte array on disk. Converting them to byte arrays is a bit inefficient but I could not find a work around to write them directly to the DataOutput. Is there a way to skip this and serialize the object directly? Here is what my current implementation looks like:

public class BitSetWritable implements Writable {

  private BitSet bs;

  public BitSetWritable() {
    this.bs = new BitSet();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream(bs.size()/8);
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(bs);
    byte[] bytes = bos.toByteArray();
    oos.close();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int len = in.readInt();
    byte[] bytes = new byte[len];
    in.readFully(bytes);
    ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
    ObjectInputStream ois = new ObjectInputStream(bis);
    try {
      bs = (BitSet) ois.readObject();
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
    ois.close();
  }
}

-- Harsh J
Re: Wrapping around BitSet with the Writable interface
In order to make the code more readable, you could start by using the methods toByteArray() and valueOf(bytes): http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#toByteArray%28%29 http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html#valueOf%28byte[]%29 Regards Bertrand -- Bertrand Dechoux
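A minimal sketch of what that change looks like, assuming Java 7+ (toByteArray() and valueOf() were added in 1.7); the get() and or() helpers are only there to illustrate the bitwise-OR use case from the original question:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.BitSet;

import org.apache.hadoop.io.Writable;

public class BitSetWritable implements Writable {

  private BitSet bs = new BitSet();

  @Override
  public void write(DataOutput out) throws IOException {
    // toByteArray() gives the byte image of the set directly, so no
    // ObjectOutputStream or intermediate ByteArrayOutputStream is needed.
    byte[] bytes = bs.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    bs = BitSet.valueOf(bytes);
  }

  public BitSet get() {
    return bs;
  }

  // OR-combine another bitmap into this one, e.g. inside a reducer.
  public void or(BitSetWritable other) {
    bs.or(other.bs);
  }
}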
Re: issues with decrease the default.block.size
The block size controls lots of things in Hadoop. It affects read parallelism, scalability, block allocation and other aspects of operations, either directly or indirectly. On Sun, May 12, 2013 at 10:38 AM, shashwat shriparv dwivedishash...@gmail.com wrote: The block size is for allocation, not storage on the disk. *Thanks Regards* ∞ Shashwat Shriparv
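For completeness, in 1.x the cluster-wide default lives in hdfs-site.xml under dfs.block.size and only affects files written after the change; existing files keep the block size they were created with. A 16 MB setting as discussed above would look like:

<property>
  <name>dfs.block.size</name>
  <value>16777216</value>
</property>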
Re: Wrapping around BitSet with the Writable interface
Another interesting alternative is the EWAH implementation of java bitsets, which allows efficient compressed bitsets with very fast OR operations. https://github.com/lemire/javaewah See also https://code.google.com/p/sparsebitmap/ by the same authors.
Re: Wrapping around BitSet with the Writable interface
You can disregard my links as they are only valid for Java 1.7+. The JavaSerialization might clean up your code but shouldn't bring a significant boost in performance. The EWAH implementation has, at least, the methods you are looking for: serialize / deserialize. Regards Bertrand Note to myself: I have to remember this one. On Sun, May 12, 2013 at 10:27 PM, Ted Dunning tdunn...@maprtech.com wrote: Another interesting alternative is the EWAH implementation of java bitsets, which allows efficient compressed bitsets with very fast OR operations. https://github.com/lemire/javaewah See also https://code.google.com/p/sparsebitmap/ by the same authors. -- Bertrand Dechoux
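A sketch of the EWAH-based variant; the wrapper class and its or() helper are hypothetical, and the exact javaewah method signatures (serialize/deserialize/or, as mentioned above) should be checked against the library version in use:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

import com.googlecode.javaewah.EWAHCompressedBitmap;

public class EwahBitmapWritable implements Writable {

  private EWAHCompressedBitmap bitmap = new EWAHCompressedBitmap();

  @Override
  public void write(DataOutput out) throws IOException {
    // EWAH bitmaps write themselves to a DataOutput, so no intermediate
    // byte[] copy is required.
    bitmap.serialize(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    bitmap = new EWAHCompressedBitmap();
    bitmap.deserialize(in);
  }

  // or() returns a new compressed bitmap holding the union of the two inputs.
  public void or(EwahBitmapWritable other) {
    bitmap = bitmap.or(other.bitmap);
  }
}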
The minimum memory requirements to datanode and namenode?
Hi, I set up a cluster with 3 nodes, and after that I did not submit any job on it. But after a few days, I found the cluster is unhealthy: - No result returned after issuing the command 'hadoop dfs -ls /' or 'hadoop dfsadmin -report' for a while - The page at 'http://namenode:50070' could not be opened as expected... - ... I did not find any useful info in the logs, but found the available memory of the cluster nodes was very low at that time: - node1 (NN, JT, DN, TT): 158 MB mem is available - node2 (DN, TT): 75 MB mem is available - node3 (DN, TT): 174 MB mem is available I guess the issue of my cluster is caused by lack of memory, and my questions are: - Without running jobs, what are the minimum memory requirements for the datanode and namenode? - How to define the minimum memory for the datanode and namenode? Thanks! Sam Liu
Re: The minimum memory requirements to datanode and namenode?
Do you get any error when trying to connect to the cluster, something like 'tried n times' or 'replicated 0 times'?
Re: The minimum memory requirements to datanode and namenode?
Got some exceptions on node3: 1. datanode log: 2013-04-17 11:13:44,719 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_2478755809192724446_1477 received exception java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371 remote=/ 9.50.102.79:50010] 2013-04-17 11:13:44,721 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 9.50.102.80:50010, storageID=DS-2038715921-9.50.102.80-50010-1366091297051, infoPort=50075, ipcPort=50020):DataXceiver java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371 remote=/ 9.50.102.79:50010] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116) at java.io.DataInputStream.readShort(DataInputStream.java:306) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:359) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112) at java.lang.Thread.run(Thread.java:738) 2013-04-17 11:13:44,818 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_8413378381769505032_1477 src: /9.50.102.81:35279 dest: / 9.50.102.80:50010 2. tasktracker log: 2013-04-23 11:48:26,783 INFO org.apache.hadoop.mapred.UserLogCleaner: Deleting user log path job_201304152248_0011 2013-04-30 14:48:15,506 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.io.IOException: Call to node1/9.50.102.81:9001 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:1144) at org.apache.hadoop.ipc.Client.call(Client.java:1112) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) at org.apache.hadoop.mapred.$Proxy2.heartbeat(Unknown Source) at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:2008) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1802) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2654) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3909) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:210) at sun.nio.ch.IOUtil.read(IOUtil.java:183) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:257) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:127) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:361) at java.io.BufferedInputStream.fill(BufferedInputStream.java:229) at java.io.BufferedInputStream.read(BufferedInputStream.java:248) at java.io.DataInputStream.readInt(DataInputStream.java:381) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:841) at 
org.apache.hadoop.ipc.Client$Connection.run(Client.java:786) 2013-04-30 14:48:15,517 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'node1' with reponseId '-12904 2013-04-30 14:48:16,404 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
Re: The minimum memory requirements to datanode and namenode?
For node3, the memory is:
                     total     used     free   shared  buffers   cached
Mem:                  3834     3666      167        0      187     1136
-/+ buffers/cache:             2342     1491
Swap:                 8196        0     8196
For a 3-node cluster like mine, what's the required minimum free/available memory for the datanode process and tasktracker process, without running any map/reduce task? Any formula to determine it? 2013/5/13 Rishi Yadav ri...@infoobjects.com Can you tell the specs of node3? Even on a test/demo cluster, anything below 4 GB RAM makes the node almost inaccessible, as per my experience.