Yes. SequenceFile is splittable, which means it can be broken into chunks, called splits, each of which can be processed by a separate map task.
Tom

On Mon, Feb 2, 2009 at 3:46 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
> No, no reason for a single file - just a little simpler to think about.
> By the way, can multiple MapReduce workers read the same SequenceFile
> simultaneously?
>
> On Mon, Feb 2, 2009 at 9:42 AM, Tom White <t...@cloudera.com> wrote:
>
>> Is there any reason why it has to be a single SequenceFile? You could
>> write a local program to write several block-compressed SequenceFiles
>> in parallel (to HDFS), each containing a portion of the files on your
>> PC.
>>
>> Tom
>>
>> On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>>
>>> Truly, I do not see any advantage to doing this, as opposed to writing
>>> (Java) code which will copy files to HDFS, because then tarring becomes
>>> my bottleneck. Unless I write code to measure the file sizes and prepare
>>> pointers for multiple tarring tasks. It becomes pretty complex though,
>>> and I thought of something simple. I might as well accept that copying
>>> one hard drive to HDFS is not going to be parallelized.
>>> Mark
>>>
>>> On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer <f...@infochimps.org> wrote:
>>>
>>>> Could you tar.bz2 them up (setting up the tar so that it made a few
>>>> dozen files), toss them onto the HDFS, and use
>>>> http://stuartsierra.com/2008/04/24/a-million-little-files
>>>> to go into SequenceFile?
>>>>
>>>> This lets you preserve the originals and do the sequence file
>>>> conversion across the cluster. It's only really helpful, of course, if
>>>> you also want to prepare a .tar.bz2 so you can clear out the sprawl.
>>>>
>>>> flip
>>>>
>>>> On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am writing an application to copy all files from a regular PC to a
>>>>> SequenceFile. I can surely do this by simply recursing all the
>>>>> directories on my PC, but I wonder if there is any way to parallelize
>>>>> this, perhaps even as a MapReduce task. Tom White's book seems to
>>>>> imply that it will have to be a custom application.
>>>>>
>>>>> Thank you,
>>>>> Mark
>>>>>
>>>>
>>>> --
>>>> http://www.infochimps.org
>>>> Connected Open Free Data
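For what it's worth, the approach Mark and Tom sketch above (recurse the local disk, measure file sizes, and split the file list into groups so several writers can each build one block-compressed SequenceFile in parallel) can be outlined in a few lines. This is only an illustrative sketch, not Hadoop code: the function names `list_files` and `partition_by_size` are made up for this example, and the actual SequenceFile writing would still be done per group with Hadoop's `SequenceFile.Writer` in Java.

```python
import heapq
import os


def list_files(root):
    """Recurse over all directories under root, yielding (path, size)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield path, os.path.getsize(path)


def partition_by_size(files, num_groups):
    """Split (path, size) pairs into num_groups groups of roughly equal
    total bytes, so each parallel writer gets a similar amount of work.

    Greedy longest-processing-time heuristic: repeatedly hand the largest
    remaining file to the currently smallest group.
    """
    # Heap entries: (total_bytes_so_far, group_index, list_of_paths).
    heap = [(0, i, []) for i in range(num_groups)]
    heapq.heapify(heap)
    for path, size in sorted(files, key=lambda fs: -fs[1]):
        total, idx, group = heapq.heappop(heap)
        group.append(path)
        heapq.heappush(heap, (total + size, idx, group))
    return [group for _total, _idx, group in sorted(heap, key=lambda t: t[1])]
```

Each resulting group would then be fed to one writer task that appends its files (say, key = file path, value = file bytes) to its own SequenceFile on HDFS, which is essentially Tom's "several block-compressed SequenceFiles in parallel" suggestion without the tarring bottleneck.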