Re: AW: How to split a big file in HDFS by size

2011-06-21 Thread Niels Basjes
, which allows Hadoop to process all blocks in parallel.) Note that you’ll need enough storage capacity. I don’t have example code, but I’m guessing Google can help. From: Mapred Learn [mailto:mapred.le...@gmail.com] Sent: Monday, 20 June 2011 18:0
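The message above is cut off, but in the spirit of the thread's idea of pre-splitting the 60 GB gzip into roughly 1 GB gzipped chunks before uploading, a minimal standalone sketch could look like the following. The SplitGzip class name, file names, and chunk size are illustrative only, not taken from the original mail.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Re-splits one large gzipped text file into smaller gzipped chunks so that
    // each chunk can later be handled by its own map task.
    public class SplitGzip {
      public static void main(String[] args) throws Exception {
        String input = args[0];                      // e.g. bigfile.txt.gz (illustrative)
        long chunkBytes = 1024L * 1024L * 1024L;     // ~1 GB of *uncompressed* text per chunk
        BufferedReader in = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(input)), "UTF-8"));
        int part = 0;
        long written = 0;
        Writer out = newChunk(input, part);
        String line;
        while ((line = in.readLine()) != null) {
          if (written >= chunkBytes) {               // rotate on a line boundary, never mid-record
            out.close();
            out = newChunk(input, ++part);
            written = 0;
          }
          out.write(line);
          out.write('\n');
          written += line.length() + 1;              // rough byte count, good enough for sizing
        }
        out.close();
        in.close();
      }

      private static Writer newChunk(String input, int part) throws Exception {
        String name = String.format("%s.part-%05d.gz", input, part);
        return new OutputStreamWriter(
            new GZIPOutputStream(new FileOutputStream(name)), "UTF-8");
      }
    }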

Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Marcos Ortiz
Evert Lammerts at Sara.nl did something similar to your problem, splitting a big 2.7 TB file into chunks of 10 GB. This work was presented at the BioAssist Programmers' Day in January of this year under the title "Large-Scale Data Storage and Processing for Scientist in The Netherlands" http://www

Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Niels Basjes
Hi, On Mon, Jun 20, 2011 at 16:13, Mapred Learn wrote: > But this file is a gzipped text file. In this case, it will only go to 1 mapper, unlike the case where it is split into 60 1 GB files, which will make the map-red job finish earlier than one 60 GB file, as it will have 60 mappers running in parallel.

AW: AW: How to split a big file in HDFS by size

2011-06-20 Thread Christoph Schmitz
Message----- From: Mapred Learn [mailto:mapred.le...@gmail.com] Sent: Monday, 20 June 2011 16:14 To: mapreduce-user@hadoop.apache.org Cc: mapreduce-user@hadoop.apache.org Subject: Re: AW: How to split a big file in HDFS by size But this file is a gzipped text file. In this case, it will

Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Mapred Learn
But this file is a gzipped text file. In this case, it will only go to 1 mapper, unlike the case where it is split into 60 1 GB files, which will make the map-red job finish earlier than one 60 GB file, as it will have 60 mappers running in parallel. Isn't it so? Sent from my iPhone On Jun 20, 2011, at 12
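Yes: this mirrors the check TextInputFormat makes before splitting a file. A plain .gz resolves to GzipCodec, which is not splittable, so the whole file goes to a single mapper. A small sketch of that check is below; the class name and path are illustrative, and SplittableCompressionCodec only exists in newer Hadoop releases (older ones treat any compressed file as non-splittable).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    // Reports whether TextInputFormat would consider a given file splittable.
    // A plain gzip file maps to GzipCodec, which is not splittable.
    public class CheckSplittable {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path file = new Path(args[0]);     // e.g. /data/bigfile.txt.gz (illustrative)
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        boolean splittable =
            codec == null || codec instanceof SplittableCompressionCodec;
        System.out.println(file + " splittable: " + splittable);
      }
    }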

AW: How to split a big file in HDFS by size

2011-06-20 Thread Christoph Schmitz
Simple answer: don't. The Hadoop framework will take care of that for you and split the file. The logical 60 GB file you see in the HDFS actually *is* split into smaller chunks (default size is 64 MB) and physically distributed across the cluster. Regards, Christoph -Original Message
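If you want to see that physical layout for yourself, the FileSystem API reports a file's block size and block locations. A small sketch (the ShowBlocks class name and the path argument are illustrative):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints the block size and the physical block locations of a file in HDFS.
    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);      // e.g. /data/bigfile.txt (illustrative)
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size: " + status.getBlockSize() + " bytes");
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset=" + b.getOffset()
              + " length=" + b.getLength()
              + " hosts=" + Arrays.toString(b.getHosts()));
        }
      }
    }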

AW: How to split a big file in HDFS by size

2011-06-19 Thread Christoph Schmitz
JJ, uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible, try to get the files in smaller chunks where they are created, and upload them in parallel with a simple MapReduce job that only passes the data through (i.e. uses the standard Mapper and Reducer classes).
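A pass-through job along those lines might look like the sketch below. It is a map-only variant: instead of the stock identity Mapper and Reducer, it uses a tiny mapper that drops the byte-offset key so the output lines stay identical to the input. Class and path names are illustrative, and where the input chunks live must of course be readable by the cluster.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only pass-through job: reads text input and writes it back out
    // unchanged, so many map tasks copy the data in parallel.
    public class PassThrough {

      public static class PassThroughMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(NullWritable.get(), line);  // drop the byte-offset key, keep the line
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "pass-through copy");
        job.setJarByClass(PassThrough.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);               // no reduce phase needed for a plain copy
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }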