Sean Hsuan-Yi Chu created DRILL-3808:
----------------------------------------
Summary: When reading TSV files, TextReader does not follow the standard
Key: DRILL-3808
URL: https://issues.apache.org/jira/browse/DRILL-3808
Project: Apache Drill
Issue Type: Bug
Components: Storage - Text & CSV
Reporter: Sean Hsuan-Yi Chu
Assignee: Sean Hsuan-Yi Chu
Priority: Critical
According to references [1], [2]:
In .csv, the double quote is a special character because it can optionally
enclose a text field. In .tsv it is not a special character: it can appear
anywhere, and when it does, it should be treated as a literal. The TSV format
specification also does not allow tab or CR/LF characters to appear
anywhere in text fields. However, Drill treats TSV the same way as CSV.
For example, given the data:
{code}
"test"\t"test"
{code}
For the query: select columns[0], columns[1] from `t.tsv`; Drill returns
{code}
test test
{code}
However, according to reference [2], the result should be
{code}
"test" "test"
{code}
Ideally, Drill should follow the standard (see [2]).
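The difference between the two behaviors can be illustrated outside Drill. The following is only a sketch using Python's standard csv module, not Drill's TextReader; the function names parse_csv_style and parse_tsv_literal are made up for this illustration. It parses the same tab-delimited line once with CSV-style quote handling and once with quotes treated as literals.

```python
import csv
import io

# The sample line from this report: two quoted fields separated by a tab.
line = '"test"\t"test"\n'


def parse_csv_style(text):
    """Split on tabs but apply CSV quote handling (quotes enclose a field)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def parse_tsv_literal(text):
    """Split on tabs and treat double quotes as ordinary characters."""
    return list(
        csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    )


print(parse_csv_style(line))    # [['test', 'test']]
print(parse_tsv_literal(line))  # [['"test"', '"test"']]
```

The first result corresponds to what Drill currently returns; the second corresponds to what reference [2] calls for.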
[1] CSV - https://tools.ietf.org/html/rfc4180
[2] TSV - http://www.iana.org/assignments/media-types/text/tab-separated-values