Have you looked at TFile?
On Mar 12, 2010, at 5:22 AM, Scott Whitecross wrote:
Hi -
I'd like to create a job that pulls small files from a remote server
(using FTP, SCP, etc.) and stores them directly to sequence files on
HDFS. Looking at the SequenceFile API, I don't see an obvious way
to do this. It looks like what I have to do is pull the remote file
to disk, then read the file into memory to place in the sequence
file. Is there a better way?
Looking at the API, am I forced to use the append method?
FileSystem hdfs = FileSystem.get(context.getConfiguration());
FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
SequenceFile.Writer writer = SequenceFile.createWriter(
    context.getConfiguration(), outputStream,
    Text.class, BytesWritable.class, null, null);
// read in file to remotefilebytes
writer.append(filekey, remotefilebytes);
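For what it's worth, one way to skip the intermediate write to local disk is to buffer the remote stream straight into memory and wrap the bytes for append(). A minimal sketch of just that buffering step in plain Java, assuming the FTP/SCP client hands back an InputStream (the ByteArrayInputStream below is only a stand-in for the remote connection):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamToBytes {

    // Buffer an InputStream fully into memory. The resulting byte[] can be
    // wrapped in a BytesWritable and passed to writer.append(filekey, ...),
    // so the remote file never touches local disk.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the remote stream; a real job would get this from
        // its FTP/SCP client instead.
        InputStream remote =
            new ByteArrayInputStream("hello".getBytes("UTF-8"));
        byte[] remotefilebytes = readFully(remote);
        System.out.println(remotefilebytes.length);
    }
}
```

Since the whole file is held in memory, this only makes sense for small files, which sounds like your case.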
The alternative would be to have one job pull the remote files, and
a secondary job write them into sequence files.
I'm using the latest Cloudera release, which I believe is based on Hadoop 0.20.1.
Thanks.