Yuanhao Luo has uploaded a new patch set (#5).

Change subject: IMPALA-2428: Support multiple-character string as the field delimiter
......................................................................

IMPALA-2428: Support multiple-character string as the field delimiter

This commit adds support for a multi-byte string as the field delimiter.
Meanwhile, the other separators (e.g. escape char, line delimiter and
key-map delimiter) are still only allowed to be one byte.

There are some constraints on terminators for text files:
1. The field delimiter can't be empty.
2. The tuple delimiter can't be the first byte of the field delimiter.
3. The escape character can't be the first byte of the field delimiter.
4. Terminators can't contain '\0' in text files.

Warning:
You can use literal characters or octal escapes in the field terminator,
but not Unicode, decimal or hexadecimal escapes. For example, to make "##"
the field delimiter, you can use fields terminated by '\043#', but you
can't write the '#' as '\u0023', '35' or '\x23'. I didn't find a way to
unescape decimal and hexadecimal strings, and there is a bug in
SqlParser.parse() when it parses Unicode strings. I have opened an issue at
https://issues.cloudera.org/browse/IMPALA-3777. Once that is fixed, Unicode
strings can be used as well. The other one-byte terminators are still
allowed to use decimal values.

TODO:
Since SSE4_2 doesn't support multi-byte matching, this commit supports
multi-byte field delimiters via direct string matching. As a result,
performance will be poor if the multi-byte field delimiter is relatively
long. A better string matching algorithm such as KMP might improve this.

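For reviewers, here is a minimal sketch of the four terminator constraints and of the direct matching mentioned in the TODO. It is not the code in this patch: CheckTerminators, FindFieldDelimiter and the has_escape flag are hypothetical names made up for illustration, and the matcher ignores escape handling entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not Impala's actual code): returns an empty string if
// the terminators satisfy the four constraints listed in the commit message,
// otherwise a short reason. 'has_escape' is an assumed flag that says whether
// an escape character was specified at all.
std::string CheckTerminators(const std::string& field_delim, char tuple_delim,
                             bool has_escape, char escape_char) {
  // 1. Field delimiter can't be empty.
  if (field_delim.empty()) return "field delimiter is empty";
  // 2. Tuple delimiter can't be the first byte of the field delimiter.
  if (tuple_delim == field_delim[0]) {
    return "tuple delimiter is the first byte of the field delimiter";
  }
  // 3. Escape character can't be the first byte of the field delimiter.
  if (has_escape && escape_char == field_delim[0]) {
    return "escape character is the first byte of the field delimiter";
  }
  // 4. Terminators can't contain '\0'.
  if (field_delim.find('\0') != std::string::npos || tuple_delim == '\0' ||
      (has_escape && escape_char == '\0')) {
    return "terminator contains '\\0'";
  }
  return "";
}

// Hypothetical direct matcher: index of the first occurrence of 'delim' in
// 'data' at or after 'start', or std::string::npos if there is none.
// Worst case is O(data.size() * delim.size()), which is why a long delimiter
// is expected to be slow compared with something like KMP.
std::size_t FindFieldDelimiter(const std::string& data,
                               const std::string& delim,
                               std::size_t start = 0) {
  assert(!delim.empty());  // constraint 1 above
  for (std::size_t i = start; i + delim.size() <= data.size(); ++i) {
    if (data.compare(i, delim.size(), delim) == 0) return i;
  }
  return std::string::npos;
}
```

With the "##" delimiter from the warning above, FindFieldDelimiter("a##b##c", "##") finds the first field boundary, and calling it again with start set just past that match finds the next one.
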
Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
---
M be/src/exec/delimited-text-parser-test.cc
M be/src/exec/delimited-text-parser.cc
M be/src/exec/delimited-text-parser.h
M be/src/exec/delimited-text-parser.inline.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M be/src/runtime/descriptors.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/java/com/cloudera/impala/analysis/CreateTableStmt.java
M fe/src/main/java/com/cloudera/impala/catalog/HdfsStorageDescriptor.java
A testdata/data/text-commacomma-backslash-newline.txt
A testdata/data/text-dollarhash-hash-pipe.txt
A testdata/data/text-hashathash-ecirc-newline.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/delimited-latin-text.test
M testdata/workloads/functional-query/queries/QueryTest/delimited-text.test
M tests/query_test/test_delimited_text.py
21 files changed, 390 insertions(+), 77 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/14/3314/5
--
To view, visit http://gerrit.cloudera.org:8080/3314
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id1437ca35dc4fdc58a7db1c2c70d4da30adf0c3e
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Yuanhao Luo <luoyuan...@software.ict.ac.cn>
Gerrit-Reviewer: Jim Apple <jbap...@cloudera.com>
Gerrit-Reviewer: Yuanhao Luo <luoyuan...@software.ict.ac.cn>