Zheng,

I'll take a look at that.

It seems the easiest thing would be to subclass SequenceFileInputFormat and override getRecordReader() to return a RecordReader that wraps SequenceFileRecordReader and overrides RecordReader.next()... right?

Is it safe to assume that K and V are both Text Writables, so I can just append the bytes of one to the other?
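If so, a minimal sketch of that wrapper might look like the following. This is only an illustration, assuming the old `org.apache.hadoop.mapred` API, Text keys and values, and Hive's default Ctrl-A (\001) field delimiter; the class name is made up:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical sketch: wraps SequenceFileRecordReader and prepends the
// key to the value, separated by Hive's default field delimiter (Ctrl-A).
// Assumes both key and value are Text.
public class KeyPrependingSequenceFileInputFormat
    extends SequenceFileInputFormat<Text, Text> {

  @Override
  public RecordReader<Text, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final SequenceFileRecordReader<Text, Text> inner =
        new SequenceFileRecordReader<Text, Text>(job, (FileSplit) split);
    return new RecordReader<Text, Text>() {
      private final Text innerKey = inner.createKey();
      private final Text innerValue = inner.createValue();

      public boolean next(Text key, Text value) throws IOException {
        if (!inner.next(innerKey, innerValue)) {
          return false;
        }
        key.set(innerKey);
        // Build "key \001 value" so the Hive SerDe sees both fields.
        value.set(innerKey);
        value.append(new byte[] {1}, 0, 1);  // Ctrl-A separator
        value.append(innerValue.getBytes(), 0, innerValue.getLength());
        return true;
      }

      public Text createKey() { return new Text(); }
      public Text createValue() { return new Text(); }
      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}
```

Since Hive only hands the value to the SerDe, appending the key into the value (with a separator the table's row format knows about) is what makes both columns visible in the `mytable (key STRING, value STRING)` schema.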

Bobby
On Oct 6, 2009, at 8:10 PM, Zheng Shao wrote:

Hi Bobby,

We just need a special FileInputFormat: it should be able to read the SequenceFile, and then prepend the key to the value before the record is returned to the Hive framework.

Then in Hive language, we can say:

add jar my.jar;
CREATE TABLE mytable (key STRING, value STRING)
STORED AS INPUTFORMAT 'com.my.inputformat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat';

See http://issues.apache.org/jira/browse/HIVE-177

You may also want to write your own OutputFormat, which splits the row passed in into key and value and stores them separately. But that is not needed unless you want to use Hive to INSERT into this table (LOAD does NOT need this).

Zheng

On Tue, Oct 6, 2009 at 6:19 PM, Bobby Rullo <[email protected]> wrote:
Hi there,

It seems that Hive ignores the key when reading Hadoop sequence files. Is there a way to make it not do that?

If there's no way to do this with a 'stock' Hive build, could someone point me to the code that reads sequence files in Hive and I can have a go at it? It's sort of a show-stopper for us - we have a bunch of large files where the key field is important.

Bobby



--
Yours,
Zheng