Quanlong Huang created ORC-1024:
-----------------------------------

             Summary: BloomFilter hash computation is inconsistent between Java 
and C++ clients
                 Key: ORC-1024
                 URL: https://issues.apache.org/jira/browse/ORC-1024
             Project: ORC
          Issue Type: Bug
          Components: C++
    Affects Versions: 1.6.11, 1.6.10, 1.6.9, 1.6.8, 1.6.7, 1.7.0, 1.6.6, 1.6.5, 
1.6.4, 1.6.3, 1.6.2, 1.6.1, 1.6.0
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang
         Attachments: id_name_with_bloom_filters.orc

[~drorke] found that the C++ reader can incorrectly filter out row groups when 
reading Hive-generated ORC files with a SearchArgument of the form "x = value" 
for certain values. It only happens when Hive writes bloom filters into these 
files.

I reproduced this by using the Java tool (with ORC-1023) to generate an ORC 
file with bloom filters and then reading it with the C++ reader. The file is 
attached (id_name_with_bloom_filters.orc). It contains 2 columns and 3 rows:
{code:java}
{"id": 0, "name": "Alice"}
{"id": 1, "name": "Bob"}
{"id": 18000000000, "name": "Mike"}
{code}
With SearchArgument "id = 18000000000", the C++ reader returns no rows.

Looking at the code, the Java code uses {{long}} as the hash key, while the C++ 
code uses {{uint64_t}}. {{long}} in Java is signed, so it should correspond to 
{{int64_t}} in C++. I think this causes the issue.

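In the Java code, the hash key of 18000000000 is -1097054448615658549. In the 
C++ code, it is 15298148493198126027. These are not the same bits viewed as 
signed versus unsigned (the Java value reinterpreted as {{uint64_t}} would be 
17349689625093893067), so testHash() gives different results.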
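
To illustrate how a signed versus unsigned key can change the hash, here is a 
minimal Java sketch. It assumes the hash is a Thomas Wang style 64-bit integer 
mix built from shifts and adds; the class and method names (HashShiftDemo, 
signedHash, unsignedHash) are hypothetical, and the exact steps should be 
checked against the linked sources. The only difference between the two 
methods is the right shift: {{>>}} on a Java {{long}} sign-extends, while 
{{>>>}} mimics what {{>>}} does on a C++ {{uint64_t}}.

```java
class HashShiftDemo {
    // Assumed hash shape: Thomas Wang's 64-bit integer mix.
    // On a signed long, ">>" is an arithmetic (sign-extending) shift.
    static long signedHash(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >> 28);
        key = key + (key << 31);
        return key;
    }

    // Same steps with ">>>" (logical shift), emulating an unsigned 64-bit key.
    static long unsignedHash(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >>> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >>> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >>> 28);
        key = key + (key << 31);
        return key;
    }

    public static void main(String[] args) {
        long id = 18000000000L;
        // Print both results so they can be compared with the values reported above.
        System.out.println("signed key:   " + signedHash(id));
        System.out.println("unsigned key: " + Long.toUnsignedString(unsignedHash(id)));
    }
}
```

In this sketch the two variants agree as long as no intermediate value has its 
top bit set at the moment it is shifted right, which would explain why only 
some values are affected.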

Java code:
[https://github.com/apache/orc/blob/93b7aa67830104d6bd7fc55399947ee938549f55/java/core/src/java/org/apache/orc/util/BloomFilter.java#L195-L204]

C++ code:
[https://github.com/apache/orc/blob/93b7aa67830104d6bd7fc55399947ee938549f55/c%2B%2B/src/BloomFilter.cc#L106-L115]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
