Nandor Kollar created ORC-517:
---------------------------------
Summary: Incorrect statistics written for decimal values
Key: ORC-517
URL: https://issues.apache.org/jira/browse/ORC-517
Project: ORC
Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Nandor Kollar
I came across the following problem with min-max statistics while writing
test cases for ORC with Spark (latest master). I created a table stored as ORC
with a single decimal field, added a couple of negative numbers to this table,
and used ORC tools to print the details of the resulting ORC file. I noticed that
although the minimum value was correct, the maximum was 0 (instead of the
largest negative number added). To illustrate the problem, here is a
unit test that demonstrates it:
{code}
@Test
public void testDecimalMinMaxStatistics() throws Exception {
  // Single decimal(7, 2) column as the root type.
  TypeDescription schema = TypeDescription.createDecimal()
      .withScale(2).withPrecision(7);
  Writer writer = OrcFile.createWriter(testFilePath,
      OrcFile.writerOptions(conf).setSchema(schema).stripeSize(100000)
          .bufferSize(10000));

  // Write two rows, both negative.
  VectorizedRowBatch batch = new VectorizedRowBatch(1, 1024);
  DecimalColumnVector decimalColumnVector = new DecimalColumnVector(7, 2);
  batch.cols[0] = decimalColumnVector;
  batch.reset();
  batch.size = 2;
  decimalColumnVector.set(0, new HiveDecimalWritable("-99999.99"));
  decimalColumnVector.set(1, new HiveDecimalWritable("-88888.88"));
  writer.addRowBatch(batch);
  writer.close();

  // Read the file-level statistics back and check min and max.
  Reader reader = OrcFile.createReader(testFilePath,
      OrcFile.readerOptions(conf).filesystem(fs));
  DecimalColumnStatistics statistics = (DecimalColumnStatistics)
      reader.getStatistics()[0];
  assertEquals("Incorrect minimum value", new BigDecimal("-99999.99"),
      statistics.getMinimum().bigDecimalValue());
  // Going by the ORC tools output described above, the maximum is reported as 0.
  assertEquals("Incorrect maximum value", new BigDecimal("-88888.88"),
      statistics.getMaximum().bigDecimalValue());
}
{code}
Note that this test fails only on 1.5 and master, and passes on the 1.4 branch. Am
I doing something wrong here? If this is indeed a bug, I don't think it
causes correctness problems, but it might be a source of performance regression
when min-max stats are used with predicate pushdown.
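To make the performance concern concrete, here is a minimal sketch of how a min-max check could decide whether data can be skipped for a predicate of the form "col > bound". This is a simplified model and not ORC's actual predicate evaluation code. The class name MinMaxPruningSketch, the helper canSkipGreaterThan, and the bound -50000.00 are hypothetical and chosen only for illustration.

```java
import java.math.BigDecimal;

public class MinMaxPruningSketch {

    // Hypothetical helper: for the predicate "col > bound", the data can be
    // skipped when the recorded maximum is not greater than the bound,
    // because then no stored value can satisfy the predicate.
    public static boolean canSkipGreaterThan(BigDecimal recordedMax, BigDecimal bound) {
        return recordedMax.compareTo(bound) <= 0;
    }

    public static void main(String[] args) {
        BigDecimal bound = new BigDecimal("-50000.00");    // illustrative predicate bound
        BigDecimal actualMax = new BigDecimal("-88888.88"); // largest value written by the test
        BigDecimal reportedMax = BigDecimal.ZERO;           // maximum reported in the statistics

        // With the actual maximum the data can be skipped.
        System.out.println("skip with actual max:   " + canSkipGreaterThan(actualMax, bound));
        // With the reported maximum of 0 the data must be read.
        System.out.println("skip with reported max: " + canSkipGreaterThan(reportedMax, bound));
    }
}
```

Under this model, a maximum that is too large only makes the check more conservative: data that could have been skipped is read anyway, which costs time but does not drop matching rows.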