Yuanhao Luo has uploaded a new patch set (#5).

Change subject: IMPALA-2428: Support multiple-character string as the field 
delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit adds support for a multi-byte string as the field delimiter.
Meanwhile, other separators (e.g. escape char, line delimiter and key-map
delimiter) are still limited to a single byte.

There are some constraints on terminators for text files:
1. Field delimiter can't be empty
2. Tuple delimiter can't be the first byte of field delimiter
3. Escape character can't be the first byte of field delimiter
4. Terminators can't contain '\0' in text file
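
The four constraints above can be sketched as a small check. This is only
an illustration, not the code in this patch: validate_terminators and its
parameters are hypothetical names, terminators are modelled as Python
bytes, and an empty escape_char is assumed to mean "no escape character".

```python
def validate_terminators(field_delim, tuple_delim, escape_char=b""):
    """Return a list of violated constraints (empty list means valid).

    Hypothetical helper: names and the bytes-based modelling are
    assumptions made for illustration only.
    """
    errors = []
    if not field_delim:
        # Constraint 1: field delimiter can't be empty.
        errors.append("field delimiter is empty")
    else:
        first = field_delim[:1]  # first byte, kept as a bytes object
        # Constraint 2: tuple delimiter can't be the first byte.
        if tuple_delim and tuple_delim == first:
            errors.append("tuple delimiter is the first byte of the field delimiter")
        # Constraint 3: escape character can't be the first byte.
        if escape_char and escape_char == first:
            errors.append("escape character is the first byte of the field delimiter")
    # Constraint 4: no terminator may contain '\0'.
    if any(b"\0" in t for t in (field_delim, tuple_delim, escape_char)):
        errors.append("terminator contains '\\0'")
    return errors
```

For example, validate_terminators(b"##", b"\n", b"\\") returns an empty
list, while validate_terminators(b"\n#", b"\n") reports constraint 2.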

Warning: You can use a character or an octal escape in the field
terminator, but not unicode, decimal or hexadecimal. For example, to make
"###" the field delimiter, you can use fields terminated by '\043#', but
not '\u0023', '35' or '\x23' respectively. I didn't find a way to
unescape decimal and hexadecimal strings. There is also a bug in
SqlParser.parse() when parsing unicode strings; I have opened an issue at
https://issues.cloudera.org/browse/IMPALA-3777. Once that is fixed, we
can also use unicode strings.
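
To make the octal form concrete, here is a rough sketch of how an octal
escape such as '\043' maps to a character. unescape_octal is a
hypothetical helper, not the unescape routine Impala uses, and the rule
"a backslash followed by one to three octal digits" is an assumption.

```python
import re

# Hypothetical pattern: backslash followed by 1-3 octal digits.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def unescape_octal(text):
    """Replace octal escapes like \\043 with the character they encode.

    Other escape forms (unicode, hexadecimal) and plain decimal numbers
    are left untouched, mirroring the limitation described above.
    """
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), text)
```

With this sketch, unescape_octal(r"\043") yields "#", whereas r"\u0023"
and "35" pass through unchanged.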

Other one-byte terminators are still allowed to use a decimal value.

TODO: Since SSE4_2 doesn't support multi-byte matching, this commit
supports the multi-byte field delimiter via direct string matching.
As a result, performance would be poor if the multi-byte field
delimiter is relatively long. Maybe we can get better performance via
a better string matching algorithm such as KMP.
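
As a rough idea of what the KMP alternative mentioned in the TODO could
look like, the sketch below splits one line on a multi-byte delimiter
using a KMP failure table. It is not part of this patch: the function
names are hypothetical, and escape characters and tuple delimiters are
ignored for brevity.

```python
def build_failure_table(pattern):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def split_fields_kmp(line, delim):
    """Split `line` (bytes) on the multi-byte `delim` (bytes).

    Each input byte is examined once; on a mismatch the failure table
    tells us how much of the delimiter is still matched, so we never
    rescan earlier bytes.
    """
    if not delim:
        raise ValueError("field delimiter can't be empty")
    fail = build_failure_table(delim)
    fields = []
    start = 0  # index where the current field begins
    k = 0      # number of delimiter bytes matched so far
    for i, byte in enumerate(line):
        while k > 0 and byte != delim[k]:
            k = fail[k - 1]
        if byte == delim[k]:
            k += 1
        if k == len(delim):
            # Full delimiter matched, ending at index i.
            fields.append(line[start:i + 1 - len(delim)])
            start = i + 1
            k = 0  # reset so delimiter matches never overlap
    fields.append(line[start:])
    return fields
```

For instance, split_fields_kmp(b"a##b##c", b"##") gives
[b"a", b"b", b"c"]. Whether this beats direct matching in practice would
have to be measured, since typical delimiters are short.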

Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 390 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/5
-- 
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <luoyuan...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jbap...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <luoyuan...@software.ict.ac.cn>
