Why not write a Hadoop map task that fetches the remote files into
memory and then emits them as key-value pairs into a SequenceFile?
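
Something along these lines should work (untested sketch; I'm assuming the
job's input is a plain text file with one remote URL per line and that the
files are reachable as URL streams; swap in commons-net FTPClient or JSch
for FTP/SCP as needed):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RemoteFetchMapper
            extends Mapper<LongWritable, Text, Text, BytesWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String url = line.toString().trim();
            if (url.isEmpty()) {
                return;
            }

            // Pull the whole remote file into memory; fine for small files.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            InputStream in = new URL(url).openStream();
            try {
                IOUtils.copyBytes(in, buffer, 4096, false);
            } finally {
                in.close();
            }

            // Emit (remote path, file bytes); the output format does the rest.
            context.write(new Text(url), new BytesWritable(buffer.toByteArray()));
        }
    }

In the driver, set SequenceFileOutputFormat as the output format, with Text
and BytesWritable as the output key/value classes, and the framework writes
the SequenceFile for you. No temp file on local disk, no manual createWriter
or append, and the fetches get parallelized across map tasks for free.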

Zak


On Fri, Mar 12, 2010 at 8:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:
> Hi -
>
> I'd like to create a job that pulls small files from a remote server (using 
> FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking 
> at the SequenceFile API, I don't see an obvious way to do this.  It looks
> like what I have to do is pull the remote file to disk, then read the file 
> into memory to place in the sequence file.  Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>            writer = SequenceFile.createWriter(context.getConfiguration(),
>                    outputStream, Text.class, BytesWritable.class, null, null);
>
>            // read in file to remotefilebytes
>
>            writer.append(filekey, remotefilebytes);
>
>
> The alternative would be to have one job pull the remote files, and a 
> secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is based on Hadoop 0.20.1
>
> Thanks.
>
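
P.S. If you do stick with driving SequenceFile.Writer yourself, you
shouldn't need the trip through local disk at all: read the remote stream
straight into a byte buffer in memory and append that. A rough sketch,
assuming commons-net FTPClient (substitute whatever client you're actually
using; host, credentials and paths below are made up), with 'writer'
created as in your snippet above:

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    import org.apache.commons.net.ftp.FTPClient;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;

    FTPClient ftp = new FTPClient();
    ftp.connect("ftp.example.com");        // hypothetical host
    ftp.login("user", "password");         // hypothetical credentials

    // Stream the remote file directly into memory, no temp file.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    InputStream in = ftp.retrieveFileStream("/remote/path/file.dat");
    try {
        IOUtils.copyBytes(in, buffer, 4096, false);
    } finally {
        in.close();
    }
    ftp.completePendingCommand();

    // Append straight from memory into the SequenceFile.
    writer.append(new Text("/remote/path/file.dat"),
                  new BytesWritable(buffer.toByteArray()));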
