I just looked at the javadocs, but it is unclear to me what the difference is between a TFile and a SequenceFile. It also looks like you need to append the data in much the same way as with normal sequence files.
On Mar 12, 2010, at 2:15 PM, Hong Tang wrote:

> Have you looked at TFile?
>
> On Mar 12, 2010, at 5:22 AM, Scott Whitecross wrote:
>
>> Hi -
>>
>> I'd like to create a job that pulls small files from a remote server
>> (using FTP, SCP, etc.) and stores them directly to sequence files on
>> HDFS. Looking at the sequence file API, I don't see an obvious way
>> to do this. It looks like what I have to do is pull the remote file
>> to disk, then read the file into memory to place in the sequence
>> file. Is there a better way?
>>
>> Looking at the API, am I forced to use the append method?
>>
>>     FileSystem hdfs = FileSystem.get(context.getConfiguration());
>>     FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>>     writer = SequenceFile.createWriter(context.getConfiguration(),
>>         outputStream, Text.class, BytesWritable.class, null, null);
>>
>>     // read in file to remotefilebytes
>>
>>     writer.append(filekey, remotefilebytes);
>>
>> The alternative would be to have one job pull the remote files, and
>> a secondary job write them into sequence files.
>>
>> I'm using the latest Cloudera release, which I believe is Hadoop 20.1
>>
>> Thanks.
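
On the "pull the remote file to disk, then read it into memory" step quoted above: a rough sketch of one way to skip the local disk, assuming the FTP/SCP client in use can hand back a plain java.io.InputStream for the remote file. The class name StreamToBytes and the helper readFully are made up for illustration, and the ByteArrayInputStream only stands in for the real remote stream. Since the files are small, the whole stream is drained into a byte array, which could then be wrapped in a BytesWritable and passed to writer.append(...) as in the quoted snippet.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToBytes {

    /** Drains the stream into memory without staging it on local disk. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the remote stream; in the real job this would come
        // from whatever FTP/SCP client library is used (an assumption here).
        InputStream remote =
            new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));

        byte[] remoteFileBytes = readFully(remote);
        System.out.println(remoteFileBytes.length); // prints 5

        // In the job, roughly:
        //   writer.append(new Text(fileName), new BytesWritable(remoteFileBytes));
    }
}
```

This still goes through append, one record per remote file, so it does not avoid holding each file in memory; it only removes the intermediate copy on local disk.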