Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block-compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC.
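Something along these lines, say (just a rough sketch, not tested; the class name, paths, and the Text-key / BytesWritable-value layout are placeholders, and it uses the pre-0.21 SequenceFile.createWriter(fs, conf, path, keyClass, valClass, compressionType) call):

import java.io.File;
import java.io.FileInputStream;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch only: one writer per output SequenceFile, run several of these
// in their own threads, each over a disjoint slice of the local file list.
public class LocalToSeqFile implements Runnable {
    private final Configuration conf;
    private final Path output;       // e.g. an HDFS path like /user/mark/part-00003.seq
    private final List<File> slice;  // the local files this writer is responsible for

    public LocalToSeqFile(Configuration conf, Path output, List<File> slice) {
        this.conf = conf;
        this.output = output;
        this.slice = slice;
    }

    public void run() {
        try {
            FileSystem fs = output.getFileSystem(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
            try {
                for (File f : slice) {
                    // key = original local path, value = raw file contents
                    // (assumes each file fits comfortably in memory)
                    writer.append(new Text(f.getPath()),
                                  new BytesWritable(readFully(f)));
                }
            } finally {
                writer.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static byte[] readFully(File f) throws Exception {
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
            int off = 0;
            while (off < buf.length) {
                int n = in.read(buf, off, buf.length - off);
                if (n < 0) break;
                off += n;
            }
        } finally {
            in.close();
        }
        return buf;
    }
}

Drive a handful of these from a fixed-size thread pool, giving each its own output path and slice of the file list, and you get the parallelism without any tarring step.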
Tom

On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
> Truly, I do not see any advantage to doing this, as opposed to writing
> (Java) code which will copy files to HDFS, because then tarring becomes my
> bottleneck, unless I write code to measure the file sizes and prepare
> pointers for multiple tarring tasks. It becomes pretty complex, though, and
> I had something simpler in mind. I might as well accept that copying one
> hard drive to HDFS is not going to be parallelized.
>
> Mark
>
> On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
> <f...@infochimps.org> wrote:
>
>> Could you tar.bz2 them up (setting up the tar so that it made a few dozen
>> files), toss them onto the HDFS, and use
>> http://stuartsierra.com/2008/04/24/a-million-little-files
>> to go into SequenceFile?
>>
>> This lets you preserve the originals and do the sequence file conversion
>> across the cluster. It's only really helpful, of course, if you also want
>> to prepare a .tar.bz2 so you can clear out the sprawl.
>>
>> flip
>>
>> On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <markkerz...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I am writing an application to copy all files from a regular PC to a
>> > SequenceFile. I can surely do this by simply recursing all directories
>> > on my PC, but I wonder if there is any way to parallelize this, perhaps
>> > even as a MapReduce task. Tom White's book seems to imply that it will
>> > have to be a custom application.
>> >
>> > Thank you,
>> > Mark
>>
>> --
>> http://www.infochimps.org
>> Connected Open Free Data