Thanks Tariq!
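To add a concrete sketch for the archives: distcp can take a file:// source, so a local-to-HDFS copy looks roughly like the lines below. The namenode address and paths are made up, and since distcp runs as a MapReduce job, a file:// source only really helps if that path is visible to the nodes doing the copy (for example an NFS mount); otherwise a plain put from the client machine is the simpler route.

    # parallel copy from a (shared) local path into HDFS -- hypothetical host and paths
    hadoop distcp file:///data/incoming hdfs://namenode:8020/user/thoihen/incoming

    # simple sequential alternative run from the client machine
    hadoop fs -put /data/incoming /user/thoihen/incoming
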
On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <donta...@gmail.com> wrote:

> @Rahul : Yes, distcp can do that.
>
> And the bigger the files, the less metadata there is, and hence less memory consumption on the NN.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:
>
>> IMHO, the statement about the NN and block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you still need a capable NN.
>>
>> Can distcp be used to copy from local to HDFS?
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Absolutely right, Mohammad.
>>>
>>>
>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>>>
>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>
>>>> Every file and block in HDFS is treated as an object, and for each object around 200 bytes of metadata get created. So the NN should be powerful enough to handle that much metadata, since it is all held in memory. Memory is actually the most important resource when it comes to the NN.
>>>>
>>>> Am I correct, @Nitin?
>>>>
>>>> @Thoihen : As Nitin has said, with that much data you don't just do a "put". You could use something like distcp for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own data aggregation tool, called Scribe, for this purpose.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> The NN is still in the picture because it has to record a lot of metadata for each individual file. So you will need a NN capable of holding the metadata for your entire dataset. The data itself never goes to the NN, but a lot of metadata about that data lives on the NN, so it is always a good idea to have a strong NN.
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:
>>>>>
>>>>>> @Nitin, writing to HDFS with parallel dfs clients is great, but I could not understand what you mean by a capable NN. As far as I know, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the dfs client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>
>>>>>>> When you say you have files worth 10 TB that you want to upload to HDFS, several factors come into the picture:
>>>>>>>
>>>>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>
>>>>>>> And most importantly, I assume that you have a capable Hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>
>>>>>>> I would definitely not write the files sequentially to HDFS. I would prefer to write files in parallel to make use of the DFS write path and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.
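To put rough numbers on the NN point above (the ~200 bytes per object figure is from Tariq's mail; the block size and file counts are only assumptions for illustration), and to sketch one way of doing the parallel puts Nitin describes:

    # back-of-envelope NN heap for 10 TB, at roughly 200 bytes of metadata per object:
    #   10 TB as large files with 128 MB blocks  -> ~80,000 block objects  -> tens of MB of heap
    #   10 TB as ~1 MB small files               -> 10,000,000+ objects    -> a few GB of heap
    # i.e. the number of files/blocks matters far more than the total volume.

    # hypothetical parallel upload: one "hadoop fs -put" per top-level directory,
    # capped at 8 concurrent uploads (paths are made up)
    find /data/incoming -mindepth 1 -maxdepth 1 -type d \
      | xargs -P 8 -I {} hadoop fs -put {} /user/thoihen/incoming/
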
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam...@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>
>>>>>>>> But I have one more question. Say I have 10 TB of data in the pipeline.
>>>>>>>>
>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size when using the hadoop command line? Can the hadoop put command work with huge data?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where data is written to HDFS on a regular frequency, retained on HDFS for as long as needed, and from there sent to archival storage or deleted.
>>>>>>>>>
>>>>>>>>> For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.
>>>>>>>>>
>>>>>>>>> In any case, if you want to write files to HDFS there are a few options available to you:
>>>>>>>>> 1) write your own dfs client which writes to dfs
>>>>>>>>> 2) use hdfs proxy
>>>>>>>>> 3) use webhdfs
>>>>>>>>> 4) use the hdfs command line
>>>>>>>>> 5) data collection tools such as Flume come with support for writing to hdfs
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo, etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS back to the local file system?
>>>>>>>>>>
>>>>>>>>>> I don't think they are using the command line "hadoop fs put" to upload files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line "hadoop fs put"?
>>>>>>>>>>
>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>
>>>>>>>>>> Please help me.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> thoihen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>
>>>
>>> --
>>> Nitin Pawar
>>
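For completeness, rough sketches of options 3) and 4) from Nitin's list (WebHDFS and the hadoop command line). The host names and paths below are made up, the 50070/50075 ports are only the common defaults, and WebHDFS has to be enabled on the cluster (dfs.webhdfs.enabled) for the REST calls to work:

    # 4) command line: the client streams the file block by block, so there is no practical
    #    client-side size limit beyond time and network bandwidth (hypothetical paths)
    hadoop fs -put /data/big/file1.dat /user/thoihen/input/

    # 3) WebHDFS over HTTP: step 1 asks the namenode, which answers with a 307 redirect
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/input/file1.dat?op=CREATE"
    # step 2: send the data to the datanode URL returned in the Location header above
    curl -i -X PUT -T /data/big/file1.dat "<Location header URL from step 1>"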