Hi all,
I was looking at HBase scan operator, and noticed one interesting behavior
regarding how Drill/HBase reader handling a column with nulll value.
1. Date prepare (HBase shell). A table with 3 rows. one row has
'addr.city' column only, while the other two rows have column 'order.id'
only.
create 'customer', {NAME=>'addr'}, {NAME=>'order'}
put 'customer', 'jsmith', 'addr:city', 'sanjose'
put 'customer', 'tom', 'order:id', '2'
put 'customer', 'frank', 'order:id', '3'
scan 'customer'
ROW COLUMN+CELL
frank column=order:id,
timestamp=1494969396032, value=3
jsmith column=addr:city,
timestamp=1494969355484, value=sanjose
tom column=order:id,
timestamp=1494969387941, value=2
3 row(s) in 0.0170 seconds
2. Query in Drill
Q1. check row count in Drill. Result looks good.
select count(*) from hbase.customer t;
+---------+
| EXPR$0 |
+---------+
| 3 |
+---------+
Q2. Get column 'addr.city' only, just return 1 row.
select convert_from(t.addr.city, 'UTF8') as city from hbase.customer t;
+----------+
| city |
+----------+
| sanjose |
+----------+
Q3. Get column 'addr.city' and 'order.id', return 3 rows
select convert_from(t.addr.city, 'UTF8') as city,
convert_from(t.`order`.id, 'UTF8') as id from hbase.customer t;
+----------+-------+
| city | id |
+----------+-------+
| null | 3 |
| sanjose | null |
| null | 2 |
+----------+-------+
Comparing Q2 and Q3, looks like Drill/Hbase scan will skip rows where all
the requested columns are null.
Is this the expected behavior? I understand that behavior comes from HBase
Scan specification. But it looks a bit hard to understand initially, from
SQL's perspective.
If this is expected behavior, will it make sense to document this ( I did
search drill doc and did not find doc related to this behavior) ?
Another point, if we disable project push-down in query planner, then Q2
would return 3 rows. In theory, project push-down would only impact query
performance, and should not impact query result.
Any thoughts?
Thanks,
Jinfeng