Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block-compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC.
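Something along these lines, say (just a rough sketch, not tested; the class name, paths, and the Text-key / BytesWritable-value layout are placeholders, and it uses the pre-0.21 SequenceFile.createWriter(fs, conf, path, keyClass, valClass, compressionType) call):

import java.io.File;
import java.io.FileInputStream;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch only: one writer per output SequenceFile, run several of these
// in their own threads, each over a disjoint slice of the local file list.
public class LocalToSeqFile implements Runnable {
    private final Configuration conf;
    private final Path output;       // e.g. an HDFS path like /user/mark/part-00003.seq
    private final List<File> slice;  // the local files this writer is responsible for

    public LocalToSeqFile(Configuration conf, Path output, List<File> slice) {
        this.conf = conf;
        this.output = output;
        this.slice = slice;
    }

    public void run() {
        try {
            FileSystem fs = output.getFileSystem(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
            try {
                for (File f : slice) {
                    // key = original local path, value = raw file contents
                    // (assumes each file fits comfortably in memory)
                    writer.append(new Text(f.getPath()),
                                  new BytesWritable(readFully(f)));
                }
            } finally {
                writer.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static byte[] readFully(File f) throws Exception {
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
            int off = 0;
            while (off < buf.length) {
                int n = in.read(buf, off, buf.length - off);
                if (n < 0) break;
                off += n;
            }
        } finally {
            in.close();
        }
        return buf;
    }
}

Drive a handful of these from a fixed-size thread pool, giving each its own output path and slice of the file list, and you get the parallelism without any tarring step.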
Tom

On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
> Truly, I do not see any advantage to doing this, as opposed to writing
> (Java) code which will copy files to HDFS, because then tarring becomes my
> bottleneck, unless I write code to measure the file sizes and prepare
> pointers for multiple tarring tasks. It becomes pretty complex, though, and
> I had something simpler in mind. I might as well accept that copying one
> hard drive to HDFS is not going to be parallelized.
>
> Mark
>
> On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
> <f...@infochimps.org> wrote:
>
>> Could you tar.bz2 them up (setting up the tar so that it made a few dozen
>> files), toss them onto the HDFS, and use
>> http://stuartsierra.com/2008/04/24/a-million-little-files
>> to go into SequenceFile?
>>
>> This lets you preserve the originals and do the sequence file conversion
>> across the cluster. It's only really helpful, of course, if you also want
>> to prepare a .tar.bz2 so you can clear out the sprawl.
>>
>> flip
>>
>> On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner <markkerz...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I am writing an application to copy all files from a regular PC to a
>> > SequenceFile. I can surely do this by simply recursing all directories
>> > on my PC, but I wonder if there is any way to parallelize this, perhaps
>> > even as a MapReduce task. Tom White's book seems to imply that it will
>> > have to be a custom application.
>> >
>> > Thank you,
>> > Mark
>>
>> --
>> http://www.infochimps.org
>> Connected Open Free Data