Yeah, you are right. I misread your earlier post.

Thanks,
Rahul
On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <donta...@gmail.com> wrote:

I had said that if you use distcp to copy data *from local FS to HDFS* then you won't be able to exploit parallelism, as the entire file is present on a single machine. So no multiple TTs.

Please comment if you think I am wrong somewhere.

On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:

Yes, it's an MR job under the hood. My question was that you wrote that using distcp you lose the benefits of parallel processing in Hadoop. I think the MR job of distcp divides the files into individual map tasks based on the total size of the transfer, so multiple mappers would still be spawned if the size of the transfer is huge, and they would work in parallel.

Correct me if there is anything wrong!

Thanks,
Rahul

On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <donta...@gmail.com> wrote:

No. distcp is actually a MapReduce job under the hood.

On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:

Thanks to both of you!

Rahul

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

You can do that using file:///

Example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/

On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:

@Tariq can you point me to some resource which shows how distcp is used to upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to already be present in Hadoop's FS?

Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <donta...@gmail.com> wrote:

You're welcome :)

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:

Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <donta...@gmail.com> wrote:

@Rahul: Yes, distcp can do that.

And the bigger the files, the less the metadata, hence less memory consumption.

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:

IMHO, I think the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need to have a capable NN.

Can distcp be used to copy local-to-HDFS?

Thanks,
Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

Absolutely right, Mohammad.
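For reference, distcp treats file:/// as just another FileSystem URI, so the upload direction looks like Nitin's download example with the arguments reversed (the hostnames and paths below are placeholders, not taken from the thread):

    hadoop distcp file:///data/staging/day1 hdfs://namenode:8020/user/rahul/ingest/day1    # local FS -> HDFS
    hadoop distcp hdfs://namenode:8020/user/rahul/ingest/day1 file:///data/restore/day1    # HDFS -> local FS

As Tariq points out above, with a file:/// source all the data sits on one machine, so the map tasks cannot spread the read load; the local path also has to be readable from the nodes where the maps actually run (e.g. a shared mount), otherwise a plain hadoop fs -put from that machine is the simpler choice.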
On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <donta...@gmail.com> wrote:

Sorry for barging in, guys. I think Nitin is talking about this:

Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is going to be in memory. Actually, memory is the most important metric when it comes to the NN.

Am I correct, @Nitin?

@Thoihen: As Nitin has said, when you talk about that much data you don't actually just do a "put". You could use something like "distcp" for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses their own data aggregation tool, called Scribe, for this purpose.

On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

The NN would still be in the picture because it will be writing a lot of metadata for each individual file. So you will need a NN capable enough to store the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will be on the NN, so it's always a good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec....@gmail.com> wrote:

@Nitin, parallel DFS writes to HDFS are great, but I could not understand the meaning of a capable NN. As I know, the NN is not part of the actual data write pipeline, meaning that the data does not travel through the NN; the DFS client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.

Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

Is it safe? There is no direct yes-or-no answer.

When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:

1) Is the machine in the same network as your Hadoop cluster?
2) Is there a guarantee that the network will not go down?

And most importantly, I assume that you have a capable Hadoop cluster. By that I mean you have a capable NameNode.

I would definitely not write files sequentially to HDFS. I would prefer to write files in parallel to HDFS to utilize the DFS write features and speed up the process. You can run the hdfs put command in a parallel manner, and in my experience it has not failed when we write a lot of data.
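As a rough sketch of the "run hdfs put in parallel" approach Nitin describes (the directory names and degree of parallelism here are made up for illustration), one simple way is to background several put commands from the client machine and wait for them all:

    for f in /data/staging/part-*; do
        hadoop fs -put "$f" /user/rahul/ingest/ &    # each put streams one file into HDFS
    done
    wait                                             # block until every background upload has finished

In practice you would cap the number of concurrent puts (for example with xargs -P) so that the client machine's disks and network link are not oversubscribed.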
On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam...@gmail.com> wrote:

@Nitin Pawar, thanks for clearing my doubts.

But I have one more question. Say I have 10 TB of data in the pipeline. Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size using the hadoop command line? Can the hadoop put command line work with huge data?

Thanks in advance

On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

First of all, most companies do not get 100 PB of data in one go. It is an accumulating process, and most companies have a data pipeline in place where the data is written to HDFS on a frequent basis, retained on HDFS for some duration as needed, and from there sent to archivers or deleted.

For data management products, you can look at Falcon, which is open sourced by InMobi along with Hortonworks.

In any case, if you want to write files to HDFS there are a few options available to you:
1) Write your own DFS client which writes to DFS
2) Use HDFS proxy
3) There is WebHDFS
4) Command line hdfs
5) Data collection tools come with support to write to HDFS, like Flume etc.
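To make option 3) above a bit more concrete, a minimal WebHDFS upload is a two-step REST call; the hostname, port and paths below are placeholders, and 50070 is only the usual NameNode HTTP port for Hadoop of that era:

    # step 1: ask the NameNode to create the file; no data is sent, the reply is a 307 redirect to a DataNode
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/rahul/ingest/file1.dat?op=CREATE&overwrite=false"

    # step 2: upload the bytes to the URL given in the Location header of that redirect
    curl -i -X PUT -T /data/staging/file1.dat "<Location header URL from step 1>"

Depending on how the cluster is secured you may also need to pass a user.name parameter or authenticate; data aggregation tools like Flume (option 5) handle this plumbing for you.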
On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen...@gmail.com> wrote:

Hi All,

Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and after processing how they download those files from HDFS to the local file system?

I don't think they would be using the command line hadoop fs put to upload files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line hadoop fs put?

Or do they use some tool to upload huge files?

Please help me.

Thanks
thoihen