Hi,
I seem to have a problem getting Hive to use a custom InputFormat.
I am using Hive version 0.10.0 with Hadoop 1.0.4 on CentOS 6.3,
currently in standalone mode. At this stage I am just experimenting.
I have a file with 10 records which I am using for testing.
I've created a table called zownvehead to access this file.
So if I do
select * from zownvehead;
I get the 10 records and if I do
select count(1) from zownvehead;
then I get the result 10. No surprises.
Now I've created my own class
package com.trilliumsoftware.loader.duality;
public class WrappedInputFormat implements InputFormat<LongWritable, Text>,
JobConfigurable {
I've written this class to restrict the number of records. Specifically, in
the getSplits method, instead of returning splits covering the whole file,
I return two splits which effectively limit the data scanned to two records
instead of 10.
(Inside my class I create an instance of TextInputFormat and delegate all
calls to this instance, apart from getSplits. There I call the method on
TextInputFormat and then use the result to build two new FileSplits, which
I return instead.)
I drop the table and re-create it with the following:
CREATE EXTERNAL TABLE zownvehead (PID STRING,
... lots of other columns elided...
AHM_STAT_CODE STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS
INPUTFORMAT
'com.trilliumsoftware.loader.duality.WrappedInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Now when I perform
select * from zownvehead;
then, much to my delight, I see only the two records.
However when I perform
select count(1) from zownvehead;
I get the result 10 and not 2, as I would expect.
So the results of the two queries are inconsistent.
When I investigate I can see that, in the second query, the class
CombineHiveInputFormat is being used. I can see that an
instance of my class WrappedInputFormat is being constructed
and configured. I can also see that, when the query runs, this
instance of my class is being used to obtain a record reader
(that is, the
public RecordReader<LongWritable,Text> getRecordReader(InputSplit split,
JobConf jc, Reporter rprtr) throws IOException {
method is being invoked). However, the getSplits method
is _not_ being invoked, and the split being passed to the getRecordReader
method is a FileSplit (or derived class) for the whole file.
I've had a look at the source of CombineHiveInputFormat and it
seems to look up, based on the path, an InputFormat class on which to
invoke getSplits. But I can't see why it might get it wrong, or
what I can do to help it get it right. I suppose that I could build
my own version of Hive with instrumentation to see exactly
what's going on, but I'd like to avoid that if I can.
So can anyone tell me why CombineHiveInputFormat is not calling
getSplits on the class it wraps (mine)? And why this only seems to
happen if a Map/Reduce job is required? And, most importantly, what do
I have to do to get it to work the way that I expect?
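One avenue that may be worth testing, offered as an assumption rather than a confirmed fix: Hive's hive.input.format property selects the input format used when a query runs as a Map/Reduce job, and CombineHiveInputFormat is its usual default. Switching the property to HiveInputFormat for the session might cause Hive to ask the table's own InputFormat for its splits. A minimal sketch:

```sql
-- Assumption: the session default for hive.input.format is
-- org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
-- Switch to the non-combining input format for this session only.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Re-run the query that previously returned 10.
select count(1) from zownvehead;
```

If the count then matches the select * result, that would suggest the combining layer is computing the splits itself rather than delegating to WrappedInputFormat.getSplits.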
Any help or comments would be welcome.
Peter Marron
Trillium Software UK Limited
Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: [email protected]