Felix Bier created IMPALA-12254:
-----------------------------------
Summary: Inconsistent empty string handling in CSV parser
Key: IMPALA-12254
URL: https://issues.apache.org/jira/browse/IMPALA-12254
Project: IMPALA
Issue Type: Bug
Components: be
Reporter: Felix Bier
h3. Steps to reproduce:
CSV table:
{code:java}
,
,, {code}
DDL:
{code:java}
CREATE EXTERNAL TABLE tab1
(
col_1 VARCHAR(256),
col_2 VARCHAR(256),
col_3 VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1'; {code}
h3. Expected behavior:
Querying tab1 should consistently either give all nulls or all empty strings.
h3. Actual behavior:
Some fields are NULL and some fields are empty string.
{code:java}
[localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3
is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false | true | false |
| false | false | true |
| false | false | true |
+---------------+---------------+---------------+
{code}
h3. Analysis:
Problem 1:
DelimitedTextParser has function FillColumns for filling missing
columns. This function creates a local variable for representing the fill value:
{code:java}
char* dummy = NULL;
if (last_column == NULL) last_column = &dummy; {code}
The last_column pointer is passed to AddColumns, which will derefence the
pointer and perform increment. As last_column points to dummy, the increment is
applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each
further iteration in the loop in FillColumns will further increment dummy. This
results in non-null pointers being set as the column value, which results in
WriteSlot not entering the NULL case and writing an empty string instead.
Problem 2:
{color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0,
but non-null
column pointers. These are interpreted as empty string instead of SQL-NULL
in the result.
h3. Suggested fix:
Problem 1: The pointer increment in AddColumn could be guarded so that it is
only performed on non-NULL pointers:
{code:java}
diff --git a/be/src/exec/delimited-text-parser.inline.h
b/be/src/exec/delimited-text-parser.inline.h
index 266fa06dd..868f4e9bc 100644
--- a/be/src/exec/delimited-text-parser.inline.h
+++ b/be/src/exec/delimited-text-parser.inline.h
@@ -70,7 +70,9 @@ inline Status
DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
++(*num_fields);
}
if (PROCESS_ESCAPES) current_column_has_escape_ = false;
- *next_column_start += len + 1;
+ if (*next_column_start != NULL) {
+ *next_column_start += len + 1;
+ }
++column_idx_;
return Status::OK();
}
{code}
With this fix, querying the example tables gives:
{code:java}
[localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL,
col_3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false | true | true |
| false | false | true |
| false | false | true |
+---------------+---------------+---------------+
{code}
Problem 2: Here it is not clear to what degree that is expected behavior. If
this is not expected behavior, I would expect that some calls to AddColumn
would have to be changed such that they pass a nullptr for the current column
position if len=0.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)