GitHub user sirpkt opened a pull request:
https://github.com/apache/tajo/pull/377
TAJO-1339: Incorrect handling of tables with custom delimiter when their
data contain '|'
Meta options of scanned tables are passed between Stages through DataChannel.
- DataChannel contains the meta options of the source Execution block, collected by
visiting all the ScanNodes in the block.
-- When multiple ScanNodes set different values for the same key,
the options of the last visited ScanNode overwrite the earlier ones.
- Each Execution block reflects incoming DataChannel in buildInputExecutor.
- The meta options of a Stage are recalculated before materializing the
result, so that they reflect all the options from the incoming DataChannels as
well as any new ScanNodes in the given Stage.
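The last-visited-wins rule described above can be sketched as follows. This is only an illustration, not the code in the patch: the class name ScanOptionMerge, the helper mergeScanOptions(), and the key 'example.option' are hypothetical, and plain maps stand in for Tajo's option containers.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScanOptionMerge {
    // Hypothetical helper (not a Tajo API): merges the option maps of the
    // scans in one Execution block, in the order the scans were visited.
    static Map<String, String> mergeScanOptions(List<Map<String, String>> optionsInVisitOrder) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> options : optionsInVisitOrder) {
            // putAll replaces existing values, so a later scan overwrites
            // an earlier one whenever both define the same key.
            merged.putAll(options);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> firstScan = Map.of("csvfile.delimiter", ";", "example.option", "a");
        Map<String, String> secondScan = Map.of("csvfile.delimiter", ",");

        Map<String, String> merged = mergeScanOptions(List.of(firstScan, secondScan));
        // The second scan was visited last, so its delimiter is the one kept.
        System.out.println(merged.get("csvfile.delimiter"));
        // Keys defined by only one scan survive the merge unchanged.
        System.out.println(merged.get("example.option"));
    }
}
```

Because the result depends on visit order, two scans that disagree on a key give an outcome the user cannot easily control.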
Changes:
- DataChannel contains all the meta options from the source Execution
block, which means the protocol buffer definition is also modified.
- PlannerUtil provides a new static method getScanOptions(), which visits all
the ScanNodes in the Execution block and collects their options.
- buildInputExecutor() is modified to reflect the meta options of the incoming
DataChannel of the given Execution block.
- The 'csvfile.delimiter' option sets CSVFILE_DELIMITER (not TEXT_DELIMITER).
TEXT_DELIMITER is set to '|' even when there is no explicit
'csvfile.delimiter', so an explicit 'csvfile.delimiter' must take precedence over
the default delimiter.
-- Without this, when two tables are used in the same query, one with
'csvfile.delimiter'=';' and the other with no option, one is set to
TEXT_DELIMITER = ';' and the other to TEXT_DELIMITER = '|', and the two compete
for the meta option of the Execution block.
-- If we give higher priority to the explicit meta setting, it always wins
over the default value, which I think is desirable.
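The intended precedence between the explicit and the default delimiter can be sketched as follows. This is a minimal illustration under assumptions, not the patch itself: the class DelimiterResolution and the method resolveDelimiter() are hypothetical, the constant names follow the description above, and the key string used for TEXT_DELIMITER is assumed.

```java
import java.util.HashMap;
import java.util.Map;

public class DelimiterResolution {
    // Constant names follow the PR description; the TEXT_DELIMITER key string
    // is an assumption made for this sketch.
    static final String CSVFILE_DELIMITER = "csvfile.delimiter";
    static final String TEXT_DELIMITER = "text.delimiter";
    static final String DEFAULT_DELIMITER = "|";

    // Hypothetical resolver: an explicit 'csvfile.delimiter' always wins;
    // otherwise fall back to TEXT_DELIMITER, then to the built-in default.
    static String resolveDelimiter(Map<String, String> blockOptions) {
        String explicit = blockOptions.get(CSVFILE_DELIMITER);
        if (explicit != null) {
            return explicit;
        }
        return blockOptions.getOrDefault(TEXT_DELIMITER, DEFAULT_DELIMITER);
    }

    public static void main(String[] args) {
        // Merged options of an Execution block that scans two tables:
        // one declares 'csvfile.delimiter'=';', the other has no option
        // and therefore only contributes the default TEXT_DELIMITER.
        Map<String, String> blockOptions = new HashMap<>();
        blockOptions.put(CSVFILE_DELIMITER, ";");
        blockOptions.put(TEXT_DELIMITER, "|");

        System.out.println(resolveDelimiter(blockOptions)); // prints ;
    }
}
```

Keeping the explicit setting under its own key means the default value can never overwrite it, regardless of the order in which the scans are visited.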
I think some changes may be controversial.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sirpkt/tajo TAJO-1339
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/377.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #377
----
commit 873e2a81dcf14080ec3c53c76a7137cbdcd3fd4a
Author: Keuntae Park <[email protected]>
Date: 2015-02-06T04:58:52Z
Meta options of scanned tables are passed between Stages through DataChannel.
- DataChannel contains the meta options of the source Execution block,
collected by visiting all the ScanNodes in the block.
When multiple ScanNodes set different values for the same key,
the options of the last visited ScanNode overwrite the earlier ones.
- Each Execution block reflects incoming DataChannel in buildInputExecutor.
- The meta options of a Stage are recalculated before materializing the
result,
so that they reflect all the options from the incoming DataChannels as well
as any new ScanNodes in the given Stage.
Changes:
- DataChannel contains all the meta options from the source Execution block,
which means the protocol buffer definition is also modified.
- PlannerUtil provides a new static method getScanOptions(),
which visits all the ScanNodes in the Execution block and collects their options.
- buildInputExecutor() is modified to reflect the meta options of the incoming
DataChannel of the given Execution block.
- The 'csvfile.delimiter' option sets CSVFILE_DELIMITER (not TEXT_DELIMITER).
TEXT_DELIMITER is set to '|' even when there is no explicit
'csvfile.delimiter',
so an explicit 'csvfile.delimiter' must take precedence over the default delimiter.
Without this, when two tables are used in the same query, one with
'csvfile.delimiter'=';' and the other with no option,
one is set to TEXT_DELIMITER = ';' and the other to TEXT_DELIMITER = '|',
and the two compete for the meta option of the Execution block.
If we give higher priority to the explicit meta setting, it always wins
over the default value, which I think is desirable.
Arguable points:
- If multiple meta options conflict on the same key,
the last one to arrive is simply selected. Users SHOULD be aware of that.
- (Slightly off topic) The default delimiter '|' is too common a character,
isn't it?
commit 95bd0557a3ed6af21aac60892cd07bb47708d4d4
Author: Keuntae Park <[email protected]>
Date: 2015-02-06T05:30:54Z
test case is modified to also cover join, group by, and order by operations
commit 4152a9d319edc643d8e7aae90d1a191087662066
Author: Keuntae Park <[email protected]>
Date: 2015-02-06T05:32:29Z
Merge branch 'master' into TAJO-1339
----