Re: Hadoop noob question

2013-05-12 Thread Rahul Bhattacharjee
@Tariq can you point me to some resource which shows how distcp is used to upload files from local to HDFS? Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's FS? Rahul

Re: Hadoop noob question

2013-05-12 Thread Nitin Pawar
You can do that using file:///. Example: hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
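
For the upload direction Rahul asked about (local FS to HDFS), the source and destination URIs are simply swapped; a hedged sketch with made-up paths, keeping in mind that the file:/// source has to be readable from the node that runs the copy:

    # copy a local directory into HDFS via distcp (paths are illustrative)
    hadoop distcp file:///data/staging/logs hdfs://localhost:8020/user/rahul/logs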

Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
@Rahul : I'm sorry, I answered this on the wrong thread by mistake. You could do that as Nitin has shown. Warm Regards, Tariq cloudfront.blogspot.com

Re: Hadoop noob question

2013-05-12 Thread Rahul Bhattacharjee
Thanks to both of you! Rahul

Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
No. distcp is actually a MapReduce job under the hood. Warm Regards, Tariq cloudfront.blogspot.com

Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
I had said that if you use distcp to copy data *from localFS to HDFS* then you won't be able to exploit parallelism, as the entire file is present on a single machine. So no multiple TTs. Please comment if you think I am wrong somewhere. Warm Regards, Tariq cloudfront.blogspot.com

Re: Hadoop noob question

2013-05-12 Thread Rahul Bhattacharjee
Yeah, you are right. I misread your earlier post. Thanks, Rahul

Re: Hadoop noob question

2013-05-12 Thread Mohammad Tariq
This is what I would say: The number of maps is decided as follows. Since it’s a good idea to get each map to copy a reasonable amount of data to minimize overheads in task setup, each map copies at least 256 MB (unless the total size of the input is less, in which case one map handles it all).
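
As a back-of-envelope illustration of that rule (the sizes here are made up): copying 10 GB would give roughly 10,240 MB / 256 MB ≈ 40 maps, and the -m option caps the count explicitly:

    # limit the copy to 20 map tasks (paths and the limit are illustrative)
    hadoop distcp -m 20 hdfs://nn1:8020/data/logs hdfs://nn2:8020/backup/logs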

Hadoop noob question

2013-05-11 Thread Thoihen Maibam
Hi All, Can anyone help me understand how companies like Facebook, Yahoo, etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and after processing how they download those files from HDFS to the local file system? I don't think they would be using the command …

Re: Hadoop noob question

2013-05-11 Thread Nitin Pawar
First of all, most of the companies do not get 100 PB of data in one go. It's an accumulating process, and most of the companies have a data pipeline in place where the data is written to HDFS on a frequency basis; it is then retained on HDFS for some duration as needed, and from there it's …

Re: Hadoop noob question

2013-05-11 Thread maisnam ns
@Nitin Pawar, thanks for clearing my doubts. But I have one more question: say I have 10 TB of data in the pipeline. Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size using the hadoop command line? Can the hadoop put command …
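
For reference, the command being asked about looks like the sketch below (paths are made up); -put streams the data through the single client machine, so the upload rate is bounded by that machine's disk and network:

    # upload a local file or directory into HDFS with the FsShell
    hadoop fs -put /data/staging/events.log /user/thoihen/input/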

Re: Hadoop noob question

2013-05-11 Thread Nitin Pawar
Is it safe? There is no direct yes or no answer. When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture: 1) Is the machine in the same network as your Hadoop cluster? 2) Is there a guarantee that the network will not go down? And most …

Re: Hadoop noob question

2013-05-11 Thread Rahul Bhattacharjee
@Nitin, parallel dfs to write to HDFS is great, but I could not understand the meaning of a capable NN. As I know, the NN would not be a part of the actual data write pipeline, meaning that the data would not travel through the NN; the dfs would contact the NN from time to time to get locations of …

Re: Hadoop noob question

2013-05-11 Thread Nitin Pawar
The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need an NN capable enough to store the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will be on the NN, so it's always a good idea to have a …

Re: Hadoop noob question

2013-05-11 Thread Mohammad Tariq
Sorry for barging in, guys. I think Nitin is talking about this: every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in memory. Actually memory is …
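
A rough back-of-envelope illustration of why bigger files mean less NN memory (the 128 MB block size and the file counts are assumptions made up for this example):

    10 TB as 10,240 x 1 GB files: 10,240 files + 81,920 blocks ≈ 92,000 objects ≈ 18 MB of NN heap at ~200 B each
    10 TB as 10,000,000 x 1 MB files: 10,000,000 files + 10,000,000 blocks = 20,000,000 objects ≈ 4 GB of NN heap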

Re: Hadoop noob question

2013-05-11 Thread Nitin Pawar
Absolutely right, Mohammad.

Re: Hadoop noob question

2013-05-11 Thread Shahab Yunus
@Thoihen: If the data that you are trying to load is not streaming, or the data loading is not real-time in nature, then why don't you use Sqoop? It is relatively easy to use, with not much of a learning curve. Regards, Shahab
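
For context, Sqoop pulls data out of relational databases rather than plain files; a minimal sketch of an import, with a made-up connection string, credentials, table and target path:

    # import one table into HDFS using 8 parallel map tasks
    sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
        --table orders --target-dir /user/thoihen/orders -m 8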

Re: Hadoop noob question

2013-05-11 Thread Rahul Bhattacharjee
IMHO, I think the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files of combined size 10 TB, you need to have a capable NN. Can distcp be used to copy local-to-HDFS? Thanks, Rahul

Re: Hadoop noob question

2013-05-11 Thread Mohammad Tariq
@Rahul : Yes, distcp can do that. And the bigger the files, the less the metadata, hence less memory consumption. Warm Regards, Tariq cloudfront.blogspot.com

Re: Hadoop noob question

2013-05-11 Thread Rahul Bhattacharjee
Thanks Tariq!

Re: Hadoop noob question

2013-05-11 Thread Mohammad Tariq
You're welcome :) Warm Regards, Tariq cloudfront.blogspot.com

Re: Hadoop noob question

2013-05-11 Thread shashwat shriparv
In our case we have written our own HDFS client to write the data and download it. *Thanks Regards* ∞ Shashwat Shriparv