Mike Davis created DRILL-1408:
---------------------------------
Summary: SELECT column from CSV with JOIN returns null if not part
of JOIN condition
Key: DRILL-1408
URL: https://issues.apache.org/jira/browse/DRILL-1408
Project: Apache Drill
Issue Type: Bug
Components: Storage - Text & CSV
Affects Versions: 0.5.0
Environment: CentOS 6.5, MapR M7
Reporter: Mike Davis
A SELECT for a column from a CSV file with a JOIN condition always returns null
for columns not included in the JOIN condition. When querying each table
separately, the values are returned as expected.
Note that this works fine for any combination of JSON and Parquet, but fails
when at least one of the files is CSV.
Simple example with two small CSV files:
beatles.csv:
---------------------------
1,John,Lennon
2,Paul,McCartney
3,George,Harrison
4,Ringo,Starr
----------------------------
songs.csv:
----------------------------
1,Help
2,Yesterday
3,Blue Jay Way
4,Yellow Submarine
----------------------------
This queries returns values as expected:
SELECT columns[0] AS id, CONCAT(columns[1], ' ', columns[2]) AS singer FROM
dfs.`beatles.csv`;
This query returns 4 results, all with null values:
SELECT S.columns[1] AS song, CONCAT(B.columns[1], ' ', B.columns[2]) AS singer
FROM dfs.`beatles.csv` AS B
INNER JOIN dfs.`songs.csv` AS S ON B.columns[0] = S.columns[0];
The only columns that return non-null values are the ones from the JOIN
condition (B.columns[0] and S.columns[0] in the following query):
SELECT S.columns[1] AS song, CONCAT(B.columns[1], ' ', B.columns[2]) AS singer,
S.columns[0] AS beatles_id, B.columns[0] AS id
FROM dfs.`beatles.csv` AS B
INNER JOIN dfs.`songs.csv` AS S ON B.columns[0] = S.columns[0];
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)