Hive Dev Team,
Greetings!
We have encountered an issue with Hive 0.8.1.8 and Hive 0.11.0. After some
investigation, we believe it is a bug in Hive, so I am writing to report it
and to confirm it with you. Please let me know if this is not the correct
mailing list for this kind of topic.
The issue concerns indexed queries on external tables stored as sequence
files. For example, consider a simple table created as follows:
CREATE TABLE hive_test
(
  id int,
  name string,
  info string
)
STORED AS SEQUENCEFILE;
We first insert 5000 rows with the same id (e.g., id = 1) into this table. We
then count the rows in the table with the query below and get the correct
result, 5000:
select count(*) from hive_test where id = 1;
After this, we create an index on id, rebuild it, and enable index-based
filtering:
CREATE INDEX test_index ON TABLE hive_test(id)
  AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  WITH DEFERRED REBUILD;
ALTER INDEX test_index ON hive_test REBUILD;
set hive.optimize.index.filter=true;
set hive.optimize.index.filter.compact.minsize=0;
Then, we run the same query 'select count(*) from hive_test where id = 1;'
again but get a different result (count > 5000).
We dug into the Hive source code and found the following piece of code in
HiveIndexedInputFormat.java, which might be the root cause of the duplicated
rows:
if (split.inputFormatClassName().contains("RCFile") ||
    split.inputFormatClassName().contains("SequenceFile")) {
  if (split.getStart() > SequenceFile.SYNC_INTERVAL) {
    newSplit = new HiveInputSplit(new FileSplit(split.getPath(),
        split.getStart() - SequenceFile.SYNC_INTERVAL,
        split.getLength() + SequenceFile.SYNC_INTERVAL,
        split.getLocations()),
        split.inputFormatClassName());
  }
}
According to my understanding of SequenceFile and SequenceFileRecordReader, I
think it is unnecessary and incorrect to move the start of each input split
back by SequenceFile.SYNC_INTERVAL (2000 bytes). The widened split overlaps
the end of the preceding split, so rows in the overlapping region can be
processed by two mappers and counted twice. Please correct me if I'm wrong.
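To make the reasoning concrete, here is a minimal Java sketch of what I think
happens. It is a simplified model and not Hive or Hadoop code: the class
SplitOverlapSketch, its methods, the sync-marker offsets, and the split sizes
are all made up for illustration. It assumes that a reader processes every
sync block whose sync marker lies inside its split, which is my understanding
of SequenceFileRecordReader but may not match every boundary case.

```java
public class SplitOverlapSketch {
    // Same value as the 2000 bytes mentioned above (assumed here, not imported from Hadoop).
    static final long SYNC_INTERVAL = 2000;

    // Simplified model: a split [start, start + length) reads every sync block
    // whose sync marker offset falls inside the split.
    static long recordsRead(long[] syncOffsets, int recordsPerBlock,
                            long start, long length) {
        long end = start + length;
        long count = 0;
        for (long offset : syncOffsets) {
            if (offset >= start && offset < end) {
                count += recordsPerBlock;
            }
        }
        return count;
    }

    // Mirrors the adjustment quoted above: move the start back by SYNC_INTERVAL
    // and grow the length by the same amount, if the start is large enough.
    static long[] widen(long start, long length) {
        if (start > SYNC_INTERVAL) {
            return new long[] {start - SYNC_INTERVAL, length + SYNC_INTERVAL};
        }
        return new long[] {start, length};
    }

    // Sums the records read by all splits; each split is {start, length}.
    static long totalRead(long[] syncOffsets, int recordsPerBlock,
                          long[][] splits, boolean widenSplits) {
        long total = 0;
        for (long[] split : splits) {
            long[] s = widenSplits ? widen(split[0], split[1]) : split;
            total += recordsRead(syncOffsets, recordsPerBlock, s[0], s[1]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical file: 10 sync blocks of 10 records each, 21000 bytes long.
        long[] syncOffsets = {0, 2100, 4200, 6300, 8400,
                              10500, 12600, 14700, 16800, 18900};
        long[][] splits = {{0, 10000}, {10000, 11000}};

        System.out.println("without widening: "
                + totalRead(syncOffsets, 10, splits, false)); // 100
        System.out.println("with widening:    "
                + totalRead(syncOffsets, 10, splits, true));  // 110
    }
}
```

In this model the second split is widened from [10000, 21000) to
[8000, 21000), so the sync block at offset 8400 is read by both splits and its
10 records are counted twice. That matches the symptom we see (count > 5000),
although the real boundary behavior in Hadoop may differ in detail.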
Thank you,
Xing