[jira] [Created] (IMPALA-12254) Inconsistent empty string handling in CSV parser

Felix Bier (Jira) Wed, 28 Jun 2023 12:31:03 -0700

Felix Bier created IMPALA-12254:
-----------------------------------

             Summary: Inconsistent empty string handling in CSV parser
                 Key: IMPALA-12254
                 URL: https://issues.apache.org/jira/browse/IMPALA-12254
             Project: IMPALA
          Issue Type: Bug
          Components: be
            Reporter: Felix Bier



h3. Steps to reproduce:

CSV table:

 
{code:java}

,
,, {code}
DDL:

 

 
{code:java}
CREATE EXTERNAL TABLE tab1
(
   col_1 VARCHAR(256),
   col_2 VARCHAR(256),
   col_3 VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1'; {code}
h3. Expected behavior:

Querying tab1 should consistently either give all nulls or all empty strings.
h3. Actual behavior:

Some fields are NULL and some fields are empty string.

 
{code:java}
[localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3 
is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | false         |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
h3. Analysis:

Problem 1:

 
DelimitedTextParser has function FillColumns for filling missing
columns. This function creates a local variable for representing the fill value:
{code:java}
  char* dummy = NULL;
  if (last_column == NULL) last_column = &dummy; {code}
The last_column pointer is passed to AddColumns, which will derefence the 
pointer and perform increment. As last_column points to dummy, the increment is 
applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each 
further iteration in the loop in FillColumns will further increment dummy. This 
results in non-null pointers being set as the column value, which results in 
WriteSlot not entering the NULL case and writing an empty string instead.
 
Problem 2:
{color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0, 
but non-null
column pointers. These are interpreted as empty string instead of SQL-NULL
in the result.
 
h3. Suggested fix:
 
Problem 1:  The pointer increment in AddColumn could be guarded so that it is 
only performed on non-NULL pointers:
{code:java}
diff --git a/be/src/exec/delimited-text-parser.inline.h 
b/be/src/exec/delimited-text-parser.inline.h
index 266fa06dd..868f4e9bc 100644
--- a/be/src/exec/delimited-text-parser.inline.h
+++ b/be/src/exec/delimited-text-parser.inline.h
@@ -70,7 +70,9 @@ inline Status 
DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
     ++(*num_fields);
   }
   if (PROCESS_ESCAPES) current_column_has_escape_ = false;
-  *next_column_start += len + 1;
+  if (*next_column_start != NULL) {
+    *next_column_start += len + 1;
+  }
   ++column_idx_;
   return Status::OK();
 }
{code}
With this fix, querying the example tables gives:

{code:java}
[localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, 
col_3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | true          |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
Problem 2: Here it is not clear to what degree that is expected behavior. If 
this is not expected behavior, I would expect that some calls to AddColumn 
would have to be changed such that they pass a nullptr for the current column 
position if len=0.
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (IMPALA-12254) Inconsistent empty string handling in CSV parser

Reply via email to