Hi,

On Tue, Jun 21, 2011 at 16:14, Mapred Learn <mapred.le...@gmail.com> wrote:
> The problem is that when one text file goes onto HDFS as a 60 GB file, one
> mapper takes more than an hour to convert it to a sequence file and finally
> fails.
>
> I was thinking how to split it from the client box before uploading to HDFS.
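A minimal sketch of the client-side split mentioned in the quote above, cutting only at line boundaries so no record is split in half. The function name, the 500 MB default, and the output layout are illustrative assumptions, not anything from this thread:

```python
import os

def split_file(path, part_size=500 * 1024 * 1024, out_dir="parts"):
    """Split a large text file into ~part_size chunks, cutting only at
    newlines. Illustrative sketch only -- names and defaults are assumed."""
    os.makedirs(out_dir, exist_ok=True)
    part, written, out = 0, 0, None
    with open(path, "rb") as src:
        for line in src:
            # Start a new part file once the current one reaches part_size.
            if out is None or written >= part_size:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, f"part-{part:05d}"), "wb")
                part, written = part + 1, 0
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return part  # number of parts written
```

Each part could then be uploaded separately (e.g. with `hadoop fs -put`), so the MR job gets one mapper per part instead of one mapper for the whole file.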
Have a look at this:
http://stackoverflow.com/questions/3960651/splitting-gzipped-logfiles-without-storing-the-ungzipped-splits-on-disk

> If I read the file and split it with FileStream.Read() based on size, it
> takes 2 hours to process one 60 GB file and upload it to HDFS as 120 500-MB
> files.
>
> Sent from my iPhone
>
> On Jun 21, 2011, at 2:57 AM, Evert Lammerts <evert.lamme...@sara.nl> wrote:
>
> What we did was on non-Hadoop hardware. We streamed the file from a storage
> cluster to a single machine and cut it up while streaming the pieces back to
> the storage cluster. That will probably not work for you, unless you have
> the hardware for it. But even then it's inefficient.
>
> You should be able to unzip your file in an MR job. If you still want to use
> compression you can install LZO and rezip the file from within the same job.
> (LZO uses block compression, which allows Hadoop to process all blocks in
> parallel.) Note that you'll need enough storage capacity. I don't have
> example code, but I'm guessing Google can help.
>
> From: Mapred Learn [mailto:mapred.le...@gmail.com]
> Sent: Monday, 20 June 2011 18:09
> To: Niels Basjes; Evert Lammerts
> Subject: Re: AW: How to split a big file in HDFS by size
>
> Thanks for sharing.
>
> Could you guys share how you are dividing your 2.7 TB into 10 GB files each
> on HDFS? That would be helpful for me!
>
> On Mon, Jun 20, 2011 at 8:39 AM, Marcos Ortiz <mlor...@uci.cu> wrote:
>
> Evert Lammerts at SARA did something similar to your problem, splitting a
> big 2.7 TB file into chunks of 10 GB.
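In the spirit of the Stack Overflow link above: a hedged sketch of splitting a gzipped file while streaming, so the full uncompressed file never touches disk. Names, the 500 MB default, and the output layout are assumptions for illustration:

```python
import gzip
import os

def split_gzip_stream(gz_path, part_size=500 * 1024 * 1024, out_dir="gz-parts"):
    """Stream-decompress one big .gz and re-gzip it as size-bounded parts,
    without ever materialising the full uncompressed file on disk.
    Illustrative sketch only -- names and sizes are assumed."""
    os.makedirs(out_dir, exist_ok=True)
    part, written, out = 0, 0, None
    with gzip.open(gz_path, "rb") as src:       # decompresses lazily
        for line in src:
            # Roll over to a new gzipped part once enough uncompressed
            # bytes have been written to the current one.
            if out is None or written >= part_size:
                if out:
                    out.close()
                out = gzip.open(
                    os.path.join(out_dir, f"part-{part:05d}.gz"), "wb")
                part, written = part + 1, 0
            out.write(line)
            written += len(line)  # counts uncompressed bytes per part
    if out:
        out.close()
    return part
```

Because each part is a complete, independent gzip file, uploading the parts to HDFS would give the job one mapper per part rather than one mapper for the whole 60 GB.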
> This work was presented at the BioAssist Programmers' Day in January of this
> year, under the title
> "Large-Scale Data Storage and Processing for Scientists in The Netherlands":
> http://www.slideshare.net/evertlammerts
>
> P.S.: I sent the message with a copy to him.
>
> On 6/20/2011 10:38 AM, Niels Basjes wrote:
>
> Hi,
>
> On Mon, Jun 20, 2011 at 16:13, Mapred Learn <mapred.le...@gmail.com> wrote:
>
> But this file is a gzipped text file. In this case it will go to only one
> mapper, unlike the case where it is split into 60 1 GB files, which would
> make the map-reduce job finish earlier than one 60 GB file, since it would
> have 60 mappers running in parallel. Isn't that so?
>
> Yes, that is very true.
>
> --
> Marcos Luís Ortíz Valmaseda
> Software Engineer (UCI)
> http://marcosluis2186.posterous.com
> http://twitter.com/marcosluis2186

--
Best regards / Met vriendelijke groeten,

Niels Basjes
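The splittability point in the exchange above can be demonstrated in miniature: a deflate stream carries no record boundaries, so a reader cannot start decoding from the middle of a .gz file, which is why Hadoop hands the whole file to a single mapper. An illustrative sketch (not from the thread):

```python
import gzip
import zlib

# A small stand-in for the 60 GB gzipped file.
payload = b"some log line\n" * 50_000
data = gzip.compress(payload)

# A reader starting at byte 0 (the single mapper) decodes the whole stream.
assert gzip.decompress(data) == payload

# A reader dropped into the middle of the stream (a hypothetical second
# mapper working on the back half) cannot resynchronise.
mid_stream_fails = False
try:
    # wbits=31 tells zlib to expect gzip framing, as a fresh reader would.
    zlib.decompressobj(wbits=31).decompress(data[len(data) // 2 :])
except zlib.error:
    mid_stream_fails = True
print("second mapper could decode its half:", not mid_stream_fails)
```

This is exactly what 60 separate 1 GB gzip files (or LZO's block compression) avoid: each file, or each block, is an independently decodable unit.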