Yes. SequenceFile is splittable, which means it can be broken into chunks, called splits, each of which can be processed by a separate map task.
Tom

On Mon, Feb 2, 2009 at 3:46 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
> No, no reason for a single file - just a little simpler to think about.
> By the way, can multiple MapReduce workers read the same SequenceFile
> simultaneously?
>
> On Mon, Feb 2, 2009 at 9:42 AM, Tom White <t...@cloudera.com> wrote:
>
>> Is there any reason why it has to be a single SequenceFile? You could
>> write a local program to write several block-compressed SequenceFiles
>> in parallel (to HDFS), each containing a portion of the files on your
>> PC.
>>
>> Tom
>>
>> On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>>
>>> Truly, I do not see any advantage to doing this, as opposed to writing
>>> (Java) code which will copy files to HDFS, because then tarring becomes
>>> my bottleneck. Unless I write code to measure the file sizes and prepare
>>> pointers for multiple tarring tasks. It becomes pretty complex though,
>>> and I thought of something simple. I might as well accept that copying
>>> one hard drive to HDFS is not going to be parallelized.
>>> Mark
>>>
>>> On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer <f...@infochimps.org> wrote:
>>>
>>>> Could you tar.bz2 them up (setting up the tar so that it made a few
>>>> dozen files), toss them onto the HDFS, and use
>>>> http://stuartsierra.com/2008/04/24/a-million-little-files
>>>> to go into SequenceFile?
>>>>
>>>> This lets you preserve the originals and do the sequence file
>>>> conversion across the cluster. It's only really helpful, of course, if
>>>> you also want to prepare a .tar.bz2 so you can clear out the sprawl.
>>>>
>>>> flip
>>>>
>>>> On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am writing an application to copy all files from a regular PC to a
>>>>> SequenceFile. I can surely do this by simply recursing all the
>>>>> directories on my PC, but I wonder if there is any way to parallelize
>>>>> this, perhaps even as a MapReduce task. Tom White's book seems to
>>>>> imply that it will have to be a custom application.
>>>>>
>>>>> Thank you,
>>>>> Mark
>>>>>
>>>>
>>>> --
>>>> http://www.infochimps.org
>>>> Connected Open Free Data
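For what it's worth, the approach Mark and Tom sketch above (recurse the local disk, measure file sizes, and split the file list into groups so several writers can each build one block-compressed SequenceFile in parallel) can be outlined in a few lines. This is only an illustrative sketch, not Hadoop code: the function names `list_files` and `partition_by_size` are made up for this example, and the actual SequenceFile writing would still be done per group with Hadoop's `SequenceFile.Writer` in Java.

```python
import heapq
import os


def list_files(root):
    """Recurse over all directories under root, yielding (path, size)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield path, os.path.getsize(path)


def partition_by_size(files, num_groups):
    """Split (path, size) pairs into num_groups groups of roughly equal
    total bytes, so each parallel writer gets a similar amount of work.

    Greedy longest-processing-time heuristic: repeatedly hand the largest
    remaining file to the currently smallest group.
    """
    # Heap entries: (total_bytes_so_far, group_index, list_of_paths).
    heap = [(0, i, []) for i in range(num_groups)]
    heapq.heapify(heap)
    for path, size in sorted(files, key=lambda fs: -fs[1]):
        total, idx, group = heapq.heappop(heap)
        group.append(path)
        heapq.heappush(heap, (total + size, idx, group))
    return [group for _total, _idx, group in sorted(heap, key=lambda t: t[1])]
```

Each resulting group would then be fed to one writer task that appends its files (say, key = file path, value = file bytes) to its own SequenceFile on HDFS, which is essentially Tom's "several block-compressed SequenceFiles in parallel" suggestion without the tarring bottleneck.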