[ 
https://issues.apache.org/jira/browse/DRILL-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-5548:
-------------------------------
    Description: 
Drill's CSV column reader supports two forms of files:

* Files with column headers as the first line of the file.
* Files without column headers.

The CSV storage plugin specifies which format to use for files accessed via 
that storage plugin config.

Suppose we have an empty file. When queried in the CSV configuration without 
headers, the query works. The schema returned is the {{columns}} Varchar array, 
and the results contain no rows. Good.

Now, query the same file with the CSV plugin configured to use headers.

{code}
    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';
    // Do not discard the first line of the file...
    csvFormat.skipFirstLine = false;
    // ...but read it as the column headers.
    csvFormat.extractHeader = true;
{code}

(The above can also be done using JSON when running Drill as a server.)

We get the following exception:

{code}
org.apache.drill.common.exceptions.UserRemoteException: 
SYSTEM ERROR: IllegalStateException: 
Incoming batch [#4, ProjectRecordBatch] has an empty schema. 
This is not allowed.
{code}

This particular case is a bit tricky. First, we want headers, but there are 
none. We can interpret this as an error (a file with headers must have 
headers). Or, we can treat it as a file that happens to have no columns. The 
latter choice is a bit more general.

The file also has no data rows. This could be an error, or it too could just be 
treated as a result set of zero rows.

Combined, the result set is one with no columns and no rows: an empty result 
set. This is actually a valid (if not very useful) result in SQL.

A conversation with Jinfeng suggested that, in such a scenario, the reader is 
supposed to make up a dummy column so that the schema is not empty. While this 
is a workaround, it seems to just push the problem from the Project operator 
into each of the many record readers.

Another alternative is to revert to the {{columns}} column: generate a result 
set with the {{columns}} array, but with no data. This solution avoids the 
empty batch problem.
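
The fallback to {{columns}} could look roughly like the sketch below. This is 
only an illustration of the idea, not Drill's reader code: the class 
{{EmptyCsvSchemaSketch}}, the method {{resolveSchema}}, and its parameters are 
hypothetical names, and the header split is deliberately naive (no quoting or 
escaping).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names are part of Drill's API.
final class EmptyCsvSchemaSketch {

    // Name of the implicit array column used for text files without headers.
    static final String COLUMNS = "columns";

    /**
     * Returns the column names to report for a file.
     *
     * @param headerLine    first line of the file, or null if the file is empty
     * @param extractHeader whether the format config asks for headers
     * @param delimiter     field delimiter from the format config
     */
    static List<String> resolveSchema(String headerLine, boolean extractHeader,
                                      char delimiter) {
        List<String> schema = new ArrayList<>();
        boolean noHeader = headerLine == null || headerLine.trim().isEmpty();
        if (!extractHeader || noHeader) {
            // Headers were not requested, or were requested but the file has
            // none: fall back to the single `columns` array so that the
            // schema is never empty.
            schema.add(COLUMNS);
            return schema;
        }
        // Naive split on the delimiter; real CSV parsing must handle quotes.
        String regex = Pattern.quote(String.valueOf(delimiter));
        for (String name : headerLine.split(regex, -1)) {
            schema.add(name.trim());
        }
        return schema;
    }

    public static void main(String[] args) {
        // Empty file queried with extractHeader = true.
        System.out.println(resolveSchema(null, true, ','));
        // Ordinary file with a header line.
        System.out.println(resolveSchema("a,b,c", true, ','));
    }
}
```

With this approach the empty file yields the same schema whether or not 
headers are requested, so downstream operators such as Project never see a 
batch with an empty schema.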


> SELECT * against an empty CSV file with headers produces error
> --------------------------------------------------------------
>
>                 Key: DRILL-5548
>                 URL: https://issues.apache.org/jira/browse/DRILL-5548
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
