David Mollitor created HIVE-23172:
-------------------------------------
Summary: Quoted Backtick Columns Are Not Parsing Correctly
Key: HIVE-23172
URL: https://issues.apache.org/jira/browse/HIVE-23172
Project: Hive
Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor
I recently came across some weird behavior while examining failures of
{{special_character_in_tabnames_2.q}} as part of my work on HIVE-23150. I was
surprised to see it fail because I couldn't see any reason why it should: it
runs pretty standard SQL statements just like every other test. For some
reason this test is just a *little bit* different from most others, and it
brought this issue to light.
It turns out the parsing of table names is pretty much wrong across the board.
The statement that caught my attention was this:
{code:sql}
DROP TABLE IF EXISTS `s/c`;
{code}
And here is the relevant grammar:
{code:none}
fragment
RegexComponent
    : 'a'..'z' | 'A'..'Z' | '0'..'9' | '_'
    | PLUS | STAR | QUESTION | MINUS | DOT
    | LPAREN | RPAREN | LSQUARE | RSQUARE | LCURLY | RCURLY
    | BITWISEXOR | BITWISEOR | DOLLAR | '!'
    ;

Identifier
    :
    (Letter | Digit) (Letter | Digit | '_')*
    | {allowQuotedId()}? QuotedIdentifier /* though at the language level we
                                             allow all Identifiers to be QuotedIdentifiers;
                                             at the API level only columns are
                                             allowed to be of this form */
    | '`' RegexComponent+ '`'
    ;

fragment
QuotedIdentifier
    :
    '`' ( '``' | ~('`') )* '`' {
        setText(StringUtils.replace(getText().substring(1, getText().length() -1 ),
                                    "``", "`")); }
    ;
{code}
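To make the action attached to {{QuotedIdentifier}} concrete, here is a rough Python equivalent of its {{setText(...)}} call. This is only an illustrative sketch; the function name {{quoted_identifier_text}} is hypothetical and not part of Hive.

```python
def quoted_identifier_text(token: str) -> str:
    """Mimic the setText(...) action of the QuotedIdentifier rule."""
    # Drop the opening and closing back tick, like substring(1, length - 1).
    inner = token[1:-1]
    # Collapse each doubled back tick into a single one,
    # like StringUtils.replace(inner, "``", "`").
    return inner.replace("``", "`")


print(quoted_identifier_text("`s/c`"))   # s/c
print(quoted_identifier_text("`a``b`"))  # a`b
```

The point is that any token matched by this rule reaches the Java code without its surrounding back ticks.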
The mystery for me was that, for some reason, the string {{`s/c`}} was being
stripped of its back ticks. No other test I investigated showed this behavior;
the back ticks were always preserved around the table name, and the main Hive
Java code base would see the back ticks and deal with them internally. For
HIVE-23150, I introduced some sanity checks, and they were failing because
they expected the back ticks to be present.
With the help of HIVE-23171 I finally figured it out. What I discovered is
that pretty much every table name hits the {{RegexComponent}} rule, and the
back ticks are carried forward. In {{`s/c`}}, however, the forward slash {{/}}
is not allowed by {{RegexComponent}}, so the name hits the
{{QuotedIdentifier}} rule instead, which trims the back ticks.
I validated this by disabling {{QuotedIdentifier}}. When I did this, {{`s/c`}}
failed with an error but {{`sc`}} parsed successfully, because {{`sc`}} is
being treated as a {{RegexComponent}}.
So, if you have {{allowQuotedId}} disabled, table names can only use the
characters defined in the {{RegexComponent}} rule (anything else is an error),
and the back ticks are *not* stripped. If you have {{allowQuotedId}} enabled,
the outcome depends on the name: if it contains a character not specified in
{{RegexComponent}}, it is matched as a {{QuotedIdentifier}} and the back ticks
*are* stripped; if all of its characters are part of {{RegexComponent}}, the
back ticks are *not* stripped.
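
The cases above can be sketched as a small Python model. It only mirrors the behavior described in this report; it is not how ANTLR actually resolves the alternatives. The function name {{lex_backticked}} is hypothetical, and the characters behind the operator tokens ({{PLUS}}, {{STAR}}, {{BITWISEXOR}}, and so on) are assumed to be the usual single characters, since the grammar excerpt does not spell them out.

```python
import re

# '`' RegexComponent+ '`'  (operator tokens assumed to be + * ? - . ( ) [ ] { } ^ | $)
REGEX_COMPONENT_ID = re.compile(r"`[A-Za-z0-9_+*?\-.()\[\]{}^|$!]+`")
# '`' ( '``' | ~('`') )* '`'
QUOTED_IDENTIFIER = re.compile(r"`(?:``|[^`])*`")


def lex_backticked(token: str, allow_quoted_id: bool) -> str:
    """Return the token text as the described lexer behavior would produce it."""
    if REGEX_COMPONENT_ID.fullmatch(token):
        # Only RegexComponent characters: back ticks are carried forward.
        return token
    if allow_quoted_id and QUOTED_IDENTIFIER.fullmatch(token):
        # Falls through to QuotedIdentifier: back ticks are stripped.
        return token[1:-1].replace("``", "`")
    raise ValueError(f"cannot tokenize {token!r}")


print(lex_backticked("`sc`", allow_quoted_id=True))    # `sc`
print(lex_backticked("`s/c`", allow_quoted_id=True))   # s/c
print(lex_backticked("`sc`", allow_quoted_id=False))   # `sc`
```

With {{allow_quoted_id=False}}, the call for {{`s/c`}} raises an error in this model, matching the failure seen when {{QuotedIdentifier}} was disabled.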
--
This message was sent by Atlassian Jira
(v8.3.4#803005)