One thing I had done to speed up copy/put speeds was to write a simple map-reduce job that does parallel copies of files from an input directory (in our case the input directory is NFS-mounted on all task nodes). It gives us a huge speed bump.
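A minimal sketch of such a job against the classic mapred API (the HDFS target directory is hypothetical, and it assumes the job's input is a set of small text files, each listing the NFS paths one map task should copy):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class ParallelCopy {

      // Map-only job: each input line is the absolute path of a file on the
      // NFS mount; the mapper copies it into HDFS with copyFromLocalFile().
      public static class CopyMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {

        private FileSystem fs;
        private Path target;

        public void configure(JobConf job) {
          try {
            fs = FileSystem.get(job);
            // Hypothetical HDFS destination directory.
            target = new Path(job.get("parallelcopy.target", "/user/data/incoming"));
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          String src = value.toString().trim();
          if (src.length() == 0) return;
          Path dst = new Path(target, new Path(src).getName());
          fs.copyFromLocalFile(new Path(src), dst);
          reporter.setStatus("copied " + src);
          out.collect(new Text(src), new Text(dst.toString()));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ParallelCopy.class);
        conf.setJobName("parallel-copy");
        // args[0]: directory of small text files, each listing the NFS paths
        // for one map task; args[1]: throw-away output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(CopyMapper.class);
        conf.setNumReduceTasks(0);   // map-only: one copy task per list file
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        JobClient.runJob(conf);
      }
    }

Parallelism is then set by how many list files you generate: each list file (being under the block size) becomes one split, and therefore one map task.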
It's trivial to roll your own - but I would be happy to share as well.

-----Original Message-----
From: C G [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 31, 2007 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Compression using Hadoop...

My input is typical row-based stuff across which we run a large stack of aggregations/rollups. After reading earlier posts on this thread, I modified my loader to split the input up into 1M-row partitions (literally gunzip -cd input.gz | split...). I then ran an experiment using 50M rows (i.e., 50 gz files loaded into HDFS) on an 8-node cluster. Ted, from what you are saying I should be using at least 80 files given the cluster size, and I should modify the loader to be aware of the number of nodes and split accordingly. Do you concur?

Load time to HDFS may be the next challenge. My HDFS configuration on 8 nodes uses a replication factor of 3. Sequentially copying my data to HDFS using -copyFromLocal took 23 minutes to move 266 MB in individual files of 5.7 MB each. Does anybody find this result surprising? Note that this is on EC2, where there is no such thing as rack-level or switch-level locality. Should I expect dramatically better performance on real iron?

Once I get this prototyping/education under my belt, my plan is to deploy a 64-node grid of 4-way machines with a terabyte of local storage on each node.

Thanks for the discussion...the Hadoop community is very helpful!

C G

Ted Dunning <[EMAIL PROTECTED]> wrote:

They will only be a non-issue if you have enough of them to get the parallelism you want. If you have number of gzip files > 10 * number of task nodes, you should be fine.

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...

ted, will the gzip files be a non-issue as far as splitting goes if they are under the default block size?

C G, glad i could help a little.

-jason

On 8/31/07, C G wrote:
> Thanks Ted and Jason for your comments. Ted, your comment about gzip not being splittable was very timely...I'm watching my 8-node cluster saturate one node (with one gz file) and was wondering why. Thanks for the "answer in advance" :-).
>
> Ted Dunning wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzip'ed files aren't splittable). This is often not a problem since most
> people can arrange to have dozens to hundreds of input files more easily than
> they can arrange to have dozens to hundreds of CPU cores working on their
> data.
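As a quick worked version of Ted's rule of thumb (the 10x figure and the 8-node / 50M-row numbers come straight from the thread; the class and method names are made up for illustration):

    public class SplitPlanner {

      // Rule of thumb from the thread: at least 10 gzip files per task node,
      // so every map slot has work even though gzip'ed files aren't splittable.
      static long minPartitions(int taskNodes) {
        return 10L * taskNodes;
      }

      static long rowsPerPartition(long totalRows, int taskNodes) {
        long parts = minPartitions(taskNodes);
        return (totalRows + parts - 1) / parts;   // ceiling division
      }

      public static void main(String[] args) {
        System.out.println(minPartitions(8));                 // 80 files
        System.out.println(rowsPerPartition(50000000L, 8));   // 625000 rows per file
      }
    }

In other words, for the 8-node, 50M-row experiment above, a node-aware loader would cut the input into at least 80 files of roughly 625K rows each rather than 50 files of 1M rows.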