On Mon, 2011-01-03 at 12:20 -0800, Buttler, David wrote:
> Hi Ron,
> Loading into HDFS and HBase are two different issues.
>
> HDFS: if you have a large number of files to load from your NFS file system
> into HDFS, it is not clear that parallelizing the load will help.
It's not NFS. It's a parallel file system.

> You have two sources of bottlenecks: the NFS file system and the HDFS file
> system. In your parallel example, you will likely saturate your NFS file
> system first.

Unlikely in this case. We're in the unusual position of our archive cluster
being faster than our Hadoop cluster.

> If they are actually local files, then loading them via M/R is a
> non-starter, as you have no control over which machine will get a map task.

If the same files are "local" on each node, does it matter? Shouldn't the map
tasks all be scheduled in a way that spreads out the load?

Thanks,
Kevin

> Unless all of the machines have files in the same directory and you are
> just going to look in that directory to upload. Then it sounds like more of
> a job for a parallel shell command and less of a map/reduce command.
>
> HBase: So far my strategy has been to get the files into HDFS first, and
> then write a Map job to load them into HBase. You can try this and see if
> direct inserts into HBase are fast enough for your use case. But if you are
> going to TBs/week, then you will likely want to investigate the bulk load
> features. I haven't yet incorporated that into my workflow, so I can't
> offer much advice there. Just be sure your cluster is sized appropriately.
> E.g., with compression turned on in HBase, see how much a 1 GB input file
> expands to inside HBase/HDFS. That should give you a feeling for how much
> space you will need for your expected data load.
>
> Dave
>
>
> -----Original Message-----
> From: Taylor, Ronald C [mailto:[email protected]]
> Sent: Tuesday, December 28, 2010 2:05 PM
> To: '[email protected]'; '[email protected]'
> Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
> Subject: What is the fastest way to get a large amount of data into the
> Hadoop HDFS file system (or HBase)?
>
> Folks,
>
> We plan on uploading large amounts of data on a regular basis onto a Hadoop
> cluster, with HBase operating on top of Hadoop. Figure eventually on the
> order of multiple terabytes per week. So we are concerned about doing the
> uploads themselves as fast as possible from our native Linux file system
> into HDFS. Figure files will be in, roughly, the 1 to 300 GB range.
>
> Off the top of my head, I'm thinking that doing this in parallel using a
> Java MapReduce program would work fastest. So my idea would be to have a
> file listing all the data files (full paths) to be uploaded, one per line,
> and then use that listing file as input to a MapReduce program.
>
> Each Mapper would then upload one of the data files (using "hadoop fs
> -copyFromLocal <source> <dest>") in parallel with all the other Mappers,
> with the Mappers operating on all the nodes of the cluster, spreading out
> the file upload across the nodes.
>
> Does that sound like a wise way to approach this? Are there better methods?
> Anything else out there for doing automated upload in parallel? We would
> very much appreciate advice in this area, since we believe upload speed
> might become a bottleneck.
>
> - Ron Taylor
>
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
>
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, Mail Stop J4-33
> Richland, WA 99352 USA
> Office: 509-372-6568
> Email: [email protected]
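P.S. Dave's "parallel shell command" suggestion can be sketched roughly like
this, driven by the same listing file Ron describes. The listing contents,
the DEST path, and the parallelism of 8 are all made-up illustrations, and
the leading "echo" makes it a dry run; drop the echo on a machine with the
hadoop CLI installed to actually copy files:

```shell
# Dry-run sketch: upload every path in a listing file to HDFS, several
# files at a time, via GNU xargs. All names here are illustrative.
LISTING=$(mktemp)
printf '%s\n' /archive/run1.dat /archive/run2.dat /archive/run3.dat > "$LISTING"
DEST=/user/loader/incoming

# -I{}: substitute one listing line per command; -P 8: keep up to 8
# copies in flight at once. Tune -P to what the source file system and
# the HDFS cluster can sustain. Remove "echo" to really run the copies.
xargs -P 8 -I{} echo hadoop fs -copyFromLocal {} "$DEST/" < "$LISTING"
```

One advantage over the MapReduce approach: you keep direct control over how
many transfers run at once per node, rather than leaving it to the scheduler.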
