[jira] [Updated] (IMPALA-12254) Inconsistent empty string handling in CSV parser

Felix Bier (Jira) Wed, 28 Jun 2023 12:48:42 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-12254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Felix Bier updated IMPALA-12254:
--------------------------------
    Description: 
h3. Steps to reproduce:

CSV table:

 
{code:java}
,
,, {code}
DDL:

 

 
{code:java}
CREATE EXTERNAL TABLE tab1
(
   col_1 VARCHAR(256),
   col_2 VARCHAR(256),
   col_3 VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1'; {code}
h3. Expected behavior:

Querying tab1 should consistently either give all nulls or all empty strings.
h3. Actual behavior:

Some fields are NULL and some fields are empty string.

 
{code:java}
[localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3 
is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | false         |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
h3. Analysis:

Problem 1:

 
DelimitedTextParser has function FillColumns for filling missing
columns. This function creates a local variable for representing the fill value:
{code:java}
  char* dummy = NULL;
  if (last_column == NULL) last_column = &dummy; {code}
The last_column pointer is passed to AddColumns, which will derefence the 
pointer and perform increment. As last_column points to dummy, the increment is 
applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each 
further iteration in the loop in FillColumns will further increment dummy. This 
results in non-null pointers being set as the column value, which results in 
WriteSlot not entering the NULL case and writing an empty string instead.
 
Problem 2:
{color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0, 
but non-null
column pointers. These are interpreted as empty string instead of SQL-NULL
in the result.
 
h3. Suggested fix:

 
Problem 1:  The pointer increment in AddColumn could be guarded so that it is 
only performed on non-NULL pointers:
{code:java}
diff --git a/be/src/exec/delimited-text-parser.inline.h 
b/be/src/exec/delimited-text-parser.inline.h
index 266fa06dd..868f4e9bc 100644
--- a/be/src/exec/delimited-text-parser.inline.h
+++ b/be/src/exec/delimited-text-parser.inline.h
@@ -70,7 +70,9 @@ inline Status 
DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
     ++(*num_fields);
   }
   if (PROCESS_ESCAPES) current_column_has_escape_ = false;
-  *next_column_start += len + 1;
+  if (*next_column_start != NULL) {
+    *next_column_start += len + 1;
+  }
   ++column_idx_;
   return Status::OK();
 }
{code}
With this fix, querying the example tables gives:
{code:java}
[localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, 
col_3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | true          |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
Problem 2: Here it is not clear to what degree that is expected behavior. If 
this is not expected behavior, I would expect that some calls to AddColumn 
would have to be changed such that they pass a nullptr for the current column 
position if len=0.

If the behavior is intended, the rule that is applied is not fully clear to me. 
For example, if the rule is that each empty token should be represented as 
empty string and each delimiter starts a new token, then the first two lines in 
the output (after applying fix for problem 1) make sense. First line in CSV 
file is empty, so it is interpreted as one empty token and then two NULL values 
are filled in. Second line contains "<empty token> <delimiter> <empty token>", 
so one NULL value gets filled in. But then the third line is not consistent, as 
it contains "<empty token> <delimiter> <empty token> <delimiter> <empty token> 
<eof>", so I would expect all three values to be non-NULL if this rule is 
followed.
 
 

  was:
h3. Steps to reproduce:

CSV table:

 
{code:java}

,
,, {code}
DDL:

 

 
{code:java}
CREATE EXTERNAL TABLE tab1
(
   col_1 VARCHAR(256),
   col_2 VARCHAR(256),
   col_3 VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/tab1'; {code}
h3. Expected behavior:

Querying tab1 should consistently either give all nulls or all empty strings.
h3. Actual behavior:

Some fields are NULL and some fields are empty string.

 
{code:java}
[localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, col3 
is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | false         |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
h3. Analysis:

Problem 1:

 
DelimitedTextParser has function FillColumns for filling missing
columns. This function creates a local variable for representing the fill value:
{code:java}
  char* dummy = NULL;
  if (last_column == NULL) last_column = &dummy; {code}
The last_column pointer is passed to AddColumns, which will derefence the 
pointer and perform increment. As last_column points to dummy, the increment is 
applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each 
further iteration in the loop in FillColumns will further increment dummy. This 
results in non-null pointers being set as the column value, which results in 
WriteSlot not entering the NULL case and writing an empty string instead.
 
Problem 2:
{color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with len=0, 
but non-null
column pointers. These are interpreted as empty string instead of SQL-NULL
in the result.
 
h3. Suggested fix:
 
Problem 1:  The pointer increment in AddColumn could be guarded so that it is 
only performed on non-NULL pointers:
{code:java}
diff --git a/be/src/exec/delimited-text-parser.inline.h 
b/be/src/exec/delimited-text-parser.inline.h
index 266fa06dd..868f4e9bc 100644
--- a/be/src/exec/delimited-text-parser.inline.h
+++ b/be/src/exec/delimited-text-parser.inline.h
@@ -70,7 +70,9 @@ inline Status 
DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
     ++(*num_fields);
   }
   if (PROCESS_ESCAPES) current_column_has_escape_ = false;
-  *next_column_start += len + 1;
+  if (*next_column_start != NULL) {
+    *next_column_start += len + 1;
+  }
   ++column_idx_;
   return Status::OK();
 }
{code}
With this fix, querying the example tables gives:

{code:java}
[localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, 
col_3 is NULL from tab1;
+---------------+---------------+---------------+
| col_1 is null | col_2 is null | col_3 is null |
+---------------+---------------+---------------+
| false         | true          | true          |
| false         | false         | true          |
| false         | false         | true          |
+---------------+---------------+---------------+
 {code}
Problem 2: Here it is not clear to what degree that is expected behavior. If 
this is not expected behavior, I would expect that some calls to AddColumn 
would have to be changed such that they pass a nullptr for the current column 
position if len=0.
 
 


> Inconsistent empty string handling in CSV parser
> ------------------------------------------------
>
>                 Key: IMPALA-12254
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12254
>             Project: IMPALA
>          Issue Type: Bug
>          Components: be
>            Reporter: Felix Bier
>            Priority: Minor
>
> h3. Steps to reproduce:
> CSV table:
>  
> {code:java}
> ,
> ,, {code}
> DDL:
>  
>  
> {code:java}
> CREATE EXTERNAL TABLE tab1
> (
>    col_1 VARCHAR(256),
>    col_2 VARCHAR(256),
>    col_3 VARCHAR(256)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/tmp/tab1'; {code}
> h3. Expected behavior:
> Querying tab1 should consistently either give all nulls or all empty strings.
> h3. Actual behavior:
> Some fields are NULL and some fields are empty string.
>  
> {code:java}
> [localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, 
> col3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false         | true          | false         |
> | false         | false         | true          |
> | false         | false         | true          |
> +---------------+---------------+---------------+
>  {code}
> h3. Analysis:
> Problem 1:
>  
> DelimitedTextParser has function FillColumns for filling missing
> columns. This function creates a local variable for representing the fill 
> value:
> {code:java}
>   char* dummy = NULL;
>   if (last_column == NULL) last_column = &dummy; {code}
> The last_column pointer is passed to AddColumns, which will derefence the 
> pointer and perform increment. As last_column points to dummy, the increment 
> is applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each 
> further iteration in the loop in FillColumns will further increment dummy. 
> This results in non-null pointers being set as the column value, which 
> results in WriteSlot not entering the NULL case and writing an empty string 
> instead.
>  
> Problem 2:
> {color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with 
> len=0, but non-null
> column pointers. These are interpreted as empty string instead of SQL-NULL
> in the result.
>  
> h3. Suggested fix:
>  
> Problem 1:  The pointer increment in AddColumn could be guarded so that it is 
> only performed on non-NULL pointers:
> {code:java}
> diff --git a/be/src/exec/delimited-text-parser.inline.h 
> b/be/src/exec/delimited-text-parser.inline.h
> index 266fa06dd..868f4e9bc 100644
> --- a/be/src/exec/delimited-text-parser.inline.h
> +++ b/be/src/exec/delimited-text-parser.inline.h
> @@ -70,7 +70,9 @@ inline Status 
> DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
>      ++(*num_fields);
>    }
>    if (PROCESS_ESCAPES) current_column_has_escape_ = false;
> -  *next_column_start += len + 1;
> +  if (*next_column_start != NULL) {
> +    *next_column_start += len + 1;
> +  }
>    ++column_idx_;
>    return Status::OK();
>  }
> {code}
> With this fix, querying the example tables gives:
> {code:java}
> [localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, 
> col_3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false         | true          | true          |
> | false         | false         | true          |
> | false         | false         | true          |
> +---------------+---------------+---------------+
>  {code}
> Problem 2: Here it is not clear to what degree that is expected behavior. If 
> this is not expected behavior, I would expect that some calls to AddColumn 
> would have to be changed such that they pass a nullptr for the current column 
> position if len=0.
> If the behavior is intended, the rule that is applied is not fully clear to 
> me. For example, if the rule is that each empty token should be represented 
> as empty string and each delimiter starts a new token, then the first two 
> lines in the output (after applying fix for problem 1) make sense. First line 
> in CSV file is empty, so it is interpreted as one empty token and then two 
> NULL values are filled in. Second line contains "<empty token> <delimiter> 
> <empty token>", so one NULL value gets filled in. But then the third line is 
> not consistent, as it contains "<empty token> <delimiter> <empty token> 
> <delimiter> <empty token> <eof>", so I would expect all three values to be 
> non-NULL if this rule is followed.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (IMPALA-12254) Inconsistent empty string handling in CSV parser

Reply via email to