Jinfeng Ni created DRILL-1499:
---------------------------------
Summary: Different column order could appear in the result set for
a schema-less select * query, even when there are no changing schemas.
Key: DRILL-1499
URL: https://issues.apache.org/jira/browse/DRILL-1499
Project: Apache Drill
Issue Type: Bug
Reporter: Jinfeng Ni
Assignee: Jinfeng Ni
For a select * query referring to a schema-less table, Drill could return
the columns in a different order, depending on the physical operators the
query involves:
Q1:
{code}
select * from cp.`employee.json` limit 3;
+-------------+------------+------------+------------+-------------+----------------+------------+---------------+------------+------------+------------+---------------+-----------------+----------------+------------+-----------------+
| employee_id | full_name | first_name | last_name | position_id | position_title | store_id | department_id | birth_date | hire_date | salary | supervisor_id | education_level | marital_status | gender | management_role |
+-------------+------------+------------+------------+-------------+----------------+------------+---------------+------------+------------+------------+---------------+-----------------+----------------+------------+-----------------+
{code}
Q2:
{code}
select * from cp.`employee.json` order by last_name limit 3;
+------------+---------------+-----------------+-------------+------------+------------+------------+------------+------------+-----------------+----------------+-------------+----------------+------------+------------+---------------+
| birth_date | department_id | education_level | employee_id | first_name | full_name | gender | hire_date | last_name | management_role | marital_status | position_id | position_title | salary | store_id | supervisor_id |
+------------+---------------+-----------------+-------------+------------+------------+------------+------------+------------+-----------------+----------------+-------------+----------------+------------+------------+---------------+
{code}
The only difference between Q1 and Q2 is the order by clause. With the order by
clause in Q2, Drill sorts the column names alphabetically, whereas in Q1 the
column names keep the same order as in the data source.
The underlying cause of this difference is that the sort or sort-based merger
operator requires canonicalization, since the incoming batches could
contain different schemas.
However, it would be better to apply such canonicalization only when the
incoming batches have changing schemas. If all the incoming batches have
identical schemas, there is no need to sort the columns. With this fix, Drill
will present the same column order in the result set for a schema-less select
* query when the schemas of the incoming data do not change.
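The proposed behavior can be sketched roughly as follows. This is only an illustration in Python, not Drill's actual code: a batch schema is modeled as a plain list of column names, and the names schemas_identical and output_column_order are hypothetical helpers invented for this sketch.

```python
from typing import List, Sequence

# Assumption for illustration: a batch schema is just an ordered
# sequence of column names. Drill's real schema objects are richer.
Schema = Sequence[str]


def schemas_identical(batches: Sequence[Schema]) -> bool:
    """Return True when every batch has the same columns in the same order."""
    return all(list(b) == list(batches[0]) for b in batches[1:])


def output_column_order(batches: Sequence[Schema]) -> List[str]:
    """Choose the column order an operator would emit (hypothetical helper)."""
    if not batches:
        return []
    if schemas_identical(batches):
        # No schema change: keep the order from the data source.
        return list(batches[0])
    # Schemas differ: canonicalize by sorting the union of column names.
    return sorted({name for b in batches for name in b})
```

With identical incoming schemas, the source order survives; the alphabetical ordering seen in Q2 would only appear when the batches actually disagree.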