Thanks to both of you!

Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

> You can do that using file:///
>
> Example:
>
>   hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
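>
> For the opposite direction (local to HDFS) the same idea should work in
> reverse, something like this (only a sketch, the paths are placeholders;
> since distcp runs as a MapReduce job, the file:/// path has to be readable
> from the nodes that run the copy tasks, so in practice this only makes
> sense when the "local" data sits on a shared mount):
>
>   hadoop distcp file:///mnt/shared/input hdfs://localhost:8020/user/rahul/input
>
> If the data lives only on one machine's own disk, a plain hadoop fs -put
> (possibly several in parallel) is usually the simpler choice.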
>
> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:
>
>> @Tariq can you point me to some resource which shows how distcp is used
>> to upload files from local to HDFS?
>>
>> Isn't distcp an MR job? Wouldn't it need the data to already be present
>> in Hadoop's fs?
>>
>> Rahul
>>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>>
>>> You're welcome :)
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:
>>>
>>>> Thanks Tariq!
>>>>
>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>>>>
>>>>> @Rahul : Yes, distcp can do that.
>>>>>
>>>>> And the bigger the files, the less metadata there is, and hence less
>>>>> memory consumption.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:
>>>>>
>>>>>> IMHO, the statement about the NN with regard to block metadata is more
>>>>>> of a general statement. Even if you put lots of small files of combined
>>>>>> size 10 TB, you need to have a capable NN.
>>>>>>
>>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> Absolutely right, Mohammad.
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>
>>>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>>>> object around 200 B of metadata gets created. So the NN should be
>>>>>>>> powerful enough to handle that much metadata, since it is all kept
>>>>>>>> in memory. Memory is actually the most important metric when it
>>>>>>>> comes to the NN.
>>>>>>>>
>>>>>>>> Am I correct @Nitin?
>>>>>>>>
>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>>>> don't actually just do a "put". You could use something like "distcp"
>>>>>>>> for parallel copying. A better approach would be to use a data
>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed
>>>>>>>> out. Facebook uses their own data aggregation tool, called Scribe,
>>>>>>>> for this purpose.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The NN would still be in the picture because it will be writing a
>>>>>>>>> lot of metadata for each individual file, so you will need an NN
>>>>>>>>> capable enough to store the metadata for your entire dataset. The
>>>>>>>>> data itself never goes to the NN, but a lot of metadata about the
>>>>>>>>> data will be on the NN, so it is always a good idea to have a
>>>>>>>>> strong NN.
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin, parallel DFS writes to HDFS are great, but I could not
>>>>>>>>>> understand the meaning of a capable NN. As far as I know, the NN
>>>>>>>>>> is not part of the actual data write pipeline, meaning the data
>>>>>>>>>> does not travel through the NN; the DFS client only contacts the
>>>>>>>>>> NN from time to time to get the locations of the DNs where the
>>>>>>>>>> data blocks should be stored.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>>>>>
>>>>>>>>>>> When you say you have files worth 10 TB that you want to upload
>>>>>>>>>>> to HDFS, several factors come into the picture:
>>>>>>>>>>>
>>>>>>>>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>
>>>>>>>>>>> And most importantly, I assume that you have a capable Hadoop
>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>
>>>>>>>>>>> I would definitely not write the files to HDFS sequentially. I
>>>>>>>>>>> would prefer to write files to HDFS in parallel, to make use of
>>>>>>>>>>> the DFS write features and speed up the process. You can run the
>>>>>>>>>>> hdfs put command in a parallel manner, and in my experience it
>>>>>>>>>>> has not failed even when we write a lot of data.
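>>>>>>>>>>>
>>>>>>>>>>> Roughly, something like this is what I mean by parallel puts
>>>>>>>>>>> (only a sketch; the directory names are made up, the target
>>>>>>>>>>> directory is assumed to exist, and the filenames must not
>>>>>>>>>>> contain spaces):
>>>>>>>>>>>
>>>>>>>>>>>   # push everything under /data/staging into HDFS, 8 puts at a time
>>>>>>>>>>>   ls /data/staging | xargs -P 8 -I {} hadoop fs -put /data/staging/{} /user/etl/incoming/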
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>>>
>>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>>> pipeline. Is it perfectly OK to use the hadoop fs put command to
>>>>>>>>>>>> upload these files of size 10 TB, and is there any limit to the
>>>>>>>>>>>> file size using the hadoop command line? Can the hadoop put
>>>>>>>>>>>> command line work with huge data?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> First of all, most companies do not get 100 PB of data in one
>>>>>>>>>>>>> go. It is an accumulating process, and most companies have a
>>>>>>>>>>>>> data pipeline in place where the data is written to HDFS on a
>>>>>>>>>>>>> regular frequency, retained on HDFS for some duration as
>>>>>>>>>>>>> needed, and from there sent to archival storage or deleted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For data management products, you can look at Falcon, which is
>>>>>>>>>>>>> open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, if you want to write files to HDFS, there are a
>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>> 1) Write your own DFS client which writes to DFS
>>>>>>>>>>>>> 2) Use the HDFS proxy
>>>>>>>>>>>>> 3) There is WebHDFS
>>>>>>>>>>>>> 4) The hdfs command line
>>>>>>>>>>>>> 5) Data collection tools that come with support for writing to
>>>>>>>>>>>>> HDFS, like Flume etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>> Yahoo etc. upload bulk files, say to the tune of 100
>>>>>>>>>>>>>> petabytes, to a Hadoop HDFS cluster for processing, and after
>>>>>>>>>>>>>> processing how they download those files from HDFS to the
>>>>>>>>>>>>>> local file system?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think they would be using the command line hadoop fs
>>>>>>>>>>>>>> put to upload the files, as it would take too long. Or do they
>>>>>>>>>>>>>> divide the data into, say, 10 parts of 10 petabytes each,
>>>>>>>>>>>>>> compress, and use the command line hadoop fs put? Or do they
>>>>>>>>>>>>>> use some tool to upload huge files?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> thoihen
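
P.S. Regarding option 3 (WebHDFS) in Nitin's list above, this is roughly what a
REST upload looks like with curl (only a sketch; the host, port, user and paths
are placeholders, and it assumes WebHDFS is enabled on the cluster via
dfs.webhdfs.enabled). The create is a two-step operation: the namenode answers
the first request with a redirect to a datanode, and the second request sends
the actual bytes there:

  # step 1: ask the namenode; it replies with a 307 redirect (Location header)
  curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/rahul/big.dat?op=CREATE&user.name=rahul"

  # step 2: upload the file to the datanode URL taken from the Location header
  curl -i -X PUT -T big.dat "<URL from the Location header>"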