I found that the csv format entry in the dfs storage configuration did not have the extractHeader setting in place on every node. Manually adding it on all four of my nodes may have resolved the issue.
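
For anyone who wants to check the same thing, here is a minimal sketch that asks each Drillbit for its storage plugin definition and reports whether the csv format has extractHeader enabled. It assumes the Drill REST interface is reachable on the default web port 8047 and that `/storage/dfs.json` returns the plugin as JSON with the formats under `config`. The host names are hypothetical placeholders, and the plugin name `dfs` may need to be changed for your setup.

```python
import json
from urllib.request import urlopen

# Hypothetical host names; replace with the Drillbit nodes in your cluster.
NODES = ["node1", "node2", "node3", "node4"]


def csv_extracts_header(plugin: dict) -> bool:
    """Return True if the plugin's csv format has extractHeader set to true."""
    formats = plugin.get("config", {}).get("formats", {})
    return formats.get("csv", {}).get("extractHeader") is True


def fetch_plugin(host: str, name: str = "dfs", port: int = 8047) -> dict:
    """Fetch one storage plugin definition from a Drillbit's REST interface.

    The URL layout and port are assumptions about the target cluster.
    """
    with urlopen(f"http://{host}:{port}/storage/{name}.json", timeout=10) as resp:
        return json.load(resp)


def report(nodes=NODES) -> None:
    """Print one line per node saying whether extractHeader is enabled."""
    for host in nodes:
        try:
            ok = csv_extracts_header(fetch_plugin(host))
            print(f"{host}: extractHeader={'true' if ok else 'missing or false'}")
        except (OSError, ValueError) as exc:
            # Network failures and malformed JSON are reported, not raised.
            print(f"{host}: could not fetch config ({exc})")
```

Calling `report()` prints one line per node, so a node whose csv entry lacks the setting stands out.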

In my vanilla Hadoop 2.7.0 setup on the same servers, I don't recall having to set it on all nodes.

Did I perhaps miss something in the MapR cluster setup?


On 15 Apr 2016, at 14:16, Abhishek Girish wrote:

Hello,

This is my format setting:

    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "extractHeader": true,
      "delimiter": ","
    }

I was able to extract the header and get expected results:


select * from mfs.tmp.`abcd.csv`;
+----+----+----+----+
| A  | B  | C  | D  |
+----+----+----+----+
| 1  | 2  | 3  | 4  |
| 2  | 3  | 4  | 5  |
| 3  | 4  | 5  | 6  |
+----+----+----+----+
3 rows selected (0.196 seconds)

select A from mfs.tmp.`abcd.csv`;
+----+
| A  |
+----+
| 1  |
| 2  |
| 3  |
+----+
3 rows selected (0.16 seconds)

I am using a MapR cluster with Drill 1.6.0. I had also enabled the new text
reader.

Note: My initial query failed to extract the header, similar to what you
reported. I had to set the "skipFirstLine" option to true for it to work.
Strangely, subsequent queries work even after removing or disabling
the "skipFirstLine" option. This could be a bug, but I'm not able to
reproduce it right now. I will file a JIRA once I have more clarity.



Regards,
Abhishek

On Fri, Apr 15, 2016 at 10:53 AM, Matt <bsg...@gmail.com> wrote:

With files in the local filesystem, and an embedded Drillbit from the download on drill.apache.org, I can successfully query csv data by column name with the extractHeader option on, as in SELECT customer_id FROM `file`;

But in a MapR cluster (v. 5.1.0.37549.GA) with the data in MapR-FS, the extractHeader option does not seem to be taking effect. A plain "SELECT *"
returns rows with the header as a data row, not in the column list.

I have verified that exec.storage.enable_new_text_reader is true, and in
both cases csv storage is defined as:

~~~
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
~~~

Of course with the csv reader not extracting the columns, an attempt to
reference columns by name results in:

Error: DATA_READ ERROR: Selected column 'customer_id' must have name
'columns' or must be plain '*'.

In trying to diagnose the issue, I noted that at times the file header row
was not part of the SELECT * results, but was also not being used to
detect column names.
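
To illustrate the difference between the two behaviours, here is a small Python sketch. It is not Drill's implementation, only a model of the semantics. With header extraction the first line names the columns. Without it every line, header included, is a data row reachable only through a single `columns` array, which is why a named column is rejected. The sample data is made up for illustration.

```python
import csv
import io

# Made-up sample data: one header line followed by three data lines.
SAMPLE = "A,B,C,D\n1,2,3,4\n2,3,4,5\n3,4,5,6\n"


def read_with_header(text):
    """First line supplies the column names; rows become name -> value maps."""
    return list(csv.DictReader(io.StringIO(text)))


def read_without_header(text):
    """Every line, header included, is a data row exposed as one 'columns' list."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]
```

With the header extracted, `read_with_header(SAMPLE)` yields three rows and `row["A"]` works. Without it, `read_without_header(SAMPLE)` yields four rows, the first of which is the header itself, and values are only reachable by position, as in `row["columns"][0]`.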

Both cases are Drill v1.6.0, but the MapR installed version has a
different commit than the standalone copy I am using:

MapR:

~~~

+----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
| version  |                 commit_id                 |                                              commit_message                                              |        commit_time         | build_email  |         build_time         |
+----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
| 1.6.0    | 2d532bd206d7ae9f3cb703ee7f51ae3764374d43  | MD-850: Treat the type of decimal literals as DOUBLE only when planner.enable_decimal_data_type is true  | 31.03.2016 @ 04:47:25 UTC  | Unknown      | 31.03.2016 @ 04:40:54 UTC  |
+----------+-------------------------------------------+----------------------------------------------------------------------------------------------------------+----------------------------+--------------+----------------------------+
~~~

Local:

~~~

+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
| version  |                 commit_id                 |                   commit_message                    |        commit_time         |    build_email     |         build_time         |
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
| 1.6.0    | d51f7fc14bd71d3e711ece0d02cdaa4d4c385eeb  | [maven-release-plugin] prepare release drill-1.6.0  | 10.03.2016 @ 16:34:37 PST  | par...@apache.org  | 10.03.2016 @ 17:45:29 PST  |
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+--------------------+----------------------------+
~~~
