[Impala-ASF-CR] IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1

Gabor Kaszab (Code Review) Tue, 09 Jul 2019 02:58:12 -0700

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13722 )


Change subject: IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1
......................................................................


Patch Set 3:

(94 comments)

Thanks for taking a look Attila! I managed to cover your comments except the 
ones referring to an existing implementation of "day of year" calculation as I 
don't have it locally. I'll do a rebase and tackle those as well.

Note, I did some performance measurements an apparently the 'new' pattern 
handling is slower than the 'old' handling. This is also reflected in running 
the benchmarks and also executing some casts through impala-shell. I'm 
investigating where the difference is.

http://gerrit.cloudera.org:8080/#/c/13722/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13722/2//COMMIT_MSG@49
PS2, Line 49: In a string type to timestamp conversion the timezone offset 
tokens
            :   are parsed, expected to match with the input but they don't 
adjust
            :   the result as the input is already expected to be in UTC format.
> Is this behavior consistent with how other SQL systems work?
Not really since e.g. Oracle have different types for timestamp with timezone 
and timestamp without timezone. Impala tries to use a single type for both 
purposes. Consulted with Zoltan Ivanfi and according to him the best solution 
here is to simply omit the timezone part of the income.
Note, CAST() without FORMAT would also omit the timezone part if such an input 
is provided.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/benchmarks/parse-timestamp-benchmark.cc
File be/src/benchmarks/parse-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/benchmarks/parse-timestamp-benchmark.cc@47
PS2, Line 47: //                                                                
         (relative) (relative) (relative)
            : 
//---------------------------------------------------------------------------------------------------------
            : //                    BoostStringDate              0.667    0.691 
   0.731         1X         1X         1X
            : //                          BoostDate              0.642    0.679 
   0.704     0.962X     0.982X     0.963X
            : //                             Impala               6.67     6.92 
    7.25        10X        10X      9.93X
            : //
            : //ParseTimestamp:            Function  iters/ms   10%ile   50%ile 
  90%ile     10%ile     50%ile     90%ile
            : //                                                                
         (relative) (relative) (relative)
            : 
//---------------------------------------------------------------------------------------------------------
            : //                          BoostTime               0.48      0.5 
    0.52         1X         1X         1X
            : //                             Impala               5.73        6 
    6.21      11.9X        12X      11.9X
            : //
            : //ParseTimestampWithFormat:  Function  iters/ms   10%ile   50%ile 
  90%ile     10%ile     50%ile     90%ile
            : //                                                                
         (relative) (relative) (relative)
            : 
//---------------------------------------------------------------------------------------------------------
            : //                      BoostDateTime              0.241    0.255 
    0.26         1X         1X         1X
            : //    ImpalaSimpleDateFormatTimeStamp                 16     16.5 
    17.1      66.6X      64.7X      65.6X
            : //  ImpalaSimpleDateFormatTZTimeStamp               16.2     16.6 
    17.
> Maybe it wold make sense to add the new parsing functions to these benchmar
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/common/init.cc
File be/src/common/init.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/common/init.cc@312
PS2, Line 312: DateTimeSimp
> I think this should be renamed to InitSimpleDateParseCtx() to make it clear
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-expr.h
File be/src/exprs/cast-expr.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-expr.h@29
PS2, Line 29: CastForm
> If I understand this correctly, this class is used only for the new cast op
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc
File be/src/exprs/cast-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@172
PS2, Line 172: const DateTimeFormatCo
> const DateTimeFormatContext*
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@180
PS2, Line 180: int ret_val = tv.Format(*format_ctx,
> Check the return value, like you do in L199-200.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@191
PS2, Line 191: if (UNLIKELY(!dv.IsVa
> This should be const too.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@308
PS2, Line 308:  Times
> Rename to 'format_ctx' for consistency.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@308
PS2, Line 308: if (val.is_null) retu
> This should be const too.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@312
PS2, Line 312:
> const char*
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@315
PS2, Line 315:
> const char*
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@342
PS2, Line 342: teVal CastFunctions::C
> Should be const.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@342
PS2, Line 342: stToDa
> Rename to 'format_ctx'
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@346
PS2, Line 346:
> const char*
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@348
PS2, Line 348: erpre
> const char*
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc@76
PS2, Line 76: DECLARE_bool(abort_on_config_error);
> Why the extra new line?
Some leftover. Removed.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc@3214
PS2, Line 3214: mpVa
> Any reason this was changed to 2002?
Leftover change from debugging. Reverted.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc@3322
PS2, Line 3322: 20
> Any reason for these timestamp changes in L3322-3337?
The following tests had the same expected output and when any f them failed it 
was hard to spot exactly which one was the problematic. So I changed them a bit 
so that I could find the failing test searching for the expected value in the 
output.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h
File be/src/runtime/date-parse-util.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@36
PS2, Line 36: Parse
> Probably this should be renamed to ParseSimpleDateFormat(), right?
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@47
PS2, Line 47: Parse
> Same here.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@84
PS2, Line 84: output parameter
> "set output parameter to an invalid DateValue"
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@86
PS2, Line 86: tatic bool IndicateDateParseFailure(DateValue* date);
> I think this would be better placed in the .cc file only.
I prefer not having functions hanging around without belonging to a class and 
I'm not a big fan of macros. As this function is used only in this class I'd 
leave it here as a private function.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc
File be/src/runtime/date-parse-util.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc@40
PS2, Line 40: Parse
> rename to ParseSimpleDateFormat()?
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc@90
PS2, Line 90: te) {
> Rename to ParseSimpleDateFormat()?
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc@190
PS2, Line 190: bool DateParser::IndicateDateParseFailure(DateValue* date) {
             :   *date = DateValue();
             :   return false;
             : }
> A macro instead would be easier to use
See my comment in the header. (Thanks for the suggestion, though)


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-value.h
File be/src/runtime/date-value.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-value.h@126
PS2, Line 126: Parse
> Rename this and Parse() functions above to ParseSimpleDateFormat()
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.h
File be/src/runtime/datetime-iso-sql-format-parser.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.h@24
PS2, Line 24: namespace impala
> Probably not unnecessary.
Dropped.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.h@39
PS2, Line 39: SED_RESULT;
> I could be mistaken but I couldn't find any BE-tests for this function.
I covered this with E2E tests in 
https://gerrit.cloudera.org/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test
I didn't feel the need to test this function directly in BE tests to cover the 
same use cases.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc
File be/src/runtime/datetime-iso-sql-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@54
PS2, Line 54:
            :       while (i < dt_ctx.toks.size() && dt_ctx
> current_pos should be validated first before calling IsSeparator() on it.
Nice improvement indeed. Done.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@44
PS2, Line 44:     const DateTimeFormatToken* tok = &dt_ctx.toks[i];
            :     if (tok->type == SEPARATOR) {
            :       if 
(!DateTimeIsoSqlFormatTokenizer::IsSeparator(*current_pos)) return false;
            :       // Advance to the end of the separator sequence.
            :       ++current_pos;
            :       while (current_pos - input_str < input_len
            :           && 
DateTimeIsoSqlFormatTokenizer::IsSeparator(*current_pos)) {
            :         ++current_pos;
            :       }
            :       // Advance to the end of the separator sequence in the 
expected tokens list.
            :       ++i;
            :       while (i < dt_ctx.toks.size() && dt_ctx.toks[i].type == 
SEPARATOR) ++i;
            :
            :       // If we reached the end of input or the end of token 
sequence, we can return.
            :       if (current_pos >= end_pos || i >= dt_ctx.toks.size()) {
            :         return (current_pos >= end_pos && i >= 
dt_ctx.toks.size());
            :       }
            :
            :       // Next token, following the separator sequence.
            :       tok = &dt_ctx.toks[i];
            :
            :       // The last '-' of a separator sequence might be taken as a 
sign for timezone hour.
            :       if (*(current_pos - 1) == '-' && tok->type == 
TIMEZONE_HOUR) {
            :         --current_pos;
            :       }
            :     }
            :
            :     const char* group_end_pos = 
GetNextTokenGroupFromInput(current_pos,
            :         (int)(end_pos - current_pos), *tok);
            :     i
> I think this could be refactored a bit:
Looks way cleaner than the mess I had. Thanks!


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@76
PS2, Line 76: switch(tok->type) {
            :       case YEAR: {
            :         if (!ParseAndValidate(current_pos, group_le
> I think it would be better if GetNextTokenGroupFromInput() returned a bool
I chose to return end_pos (or nullptr). Done.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@87
PS2, Line 87: }
> I'm not sure if this is correct.
Good catch. This parses only timestamps properly. The timestamp creation fails 
anyway later on for years smaller than 1400 so I can safely remove this check. 
Wrote some boundary tests for year to cover this.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@96
PS2, Line 96: break;
> Same as L87
Same as above. Done.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@111
PS2, Line 111:
> day_count < 1 ?
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@141
PS2, Line 141:         if (!ParseFractionTokenGroup(current_pos, group_len, 
result)) return false;
             :         break;
             :       }
             :       case MERIDIEM_INDICATOR: {
             :         string indicator(current_pos, group_len);
             :         boost::to_upper(indicator);
             :         if (indicator == "PM" || indicator == "P.M.") 
result->hour += 12;
             :         break;
             :       }
> Block can be moved to a separate functon, here and elsewhere in the switch
The idea was to only move logic to functions that are used both by 
SimpleDateFormat and IsoSqlFormat parsing.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@184
PS2, Line 184: n_year -= month_leng
> (result->year >= 0) ?
Indeed, thx. Done.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@185
PS2, Line 185:
> Introducing a new variable 'month' is not really necessary.
renamed 'i' to 'month_id' for more readibility and dropped 'month'


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@206
PS2, Line 206:
             :   const char* end_pos = input_str;
> Return bool to signal success/failure. Or return end_pos instead of passing
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@208
PS2, Line 208: int len = 0;
> We could just set *end_pos in the GetNextTokenGroupFromInput() function, in
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@219
PS2, Line 219:   const char* input_str, int input_len) {
> I think, this should be:
No need to check if 'input_len' > 2 as the length of a  TIMEZONE_HOUR is by 
default 3. This reduces it to 2 in case it doesn't start with a sign.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@225
PS2, Line 225: if (DateTimeIsoSqlFormatTokenizer::GetTokenType(token_str, &token
> Isn't it a parsing error if (len < tok.len) but *end_pos is a separator?
It's not an error. E.g. we have to parse inputs like this:
select cast("2019-1-1" as timestamp format "YYYY-MM-DD")
Here for the month and day token groups 'len'=1 while 'tok.len'=2 but still 
valid.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@229
PS2, Line 229:   }
             :   return nullptr;
> Same as L206. Consider returning a bool or end_pos.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@231
PS2, Line 231:
> ParseMeridiemIndicatorFromInput() function should set end_pos instead of ex
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@241
PS2, Line 241: }
             :
             : void DateTimeIsoSqlFormatParser::GetRoundYear(const 
TimestampValue* now,
             :     DateTimeParseResult* result) {
             :   DCHECK(now != nullptr);
             :   DCHECK(result != nullptr);
             :   DCHECK(result->year >= 0 && result->year < 100);
             :   i
> Duplicate code.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@253
PS2, Line 253:
> actual_token_len > 0 && actual_token-len < 4
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@261
PS2, Line 261:
> Isn;t this called realigned year?
Oracle and Postgre refers to this pattern as round year. Realigned year might 
be some internal naming in Impala for this pattern in SimpleDateFormat.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h
File be/src/runtime/datetime-iso-sql-format-tokenizer.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@42
PS2, Line 42: boo
> bool?
No idea why I made it int :) Done.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@45
PS2, Line 45:     accept_time_toks_ = time_toks;
> Shouldn't 'used_tokens_' be cleared when tokenizer is reset?
It's cleared at the end of Tokenize() after TokenizeImpl() was finished.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@51
PS2, Line 51: Tokenize
> I could be mistaken but I couldn't find any BE-tests for this and the other
I didn't write BE tests for this specific function as this is basically tested 
with each E2E test where cast has a format clause. Didn't see the point to 
cover the same use cases with a different level of tests.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc
File be/src/runtime/datetime-iso-sql-format-tokenizer.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@87
PS2, Line 87: used_tokens_.clear();
> Any reason 'used_tokens_' is cleared here and not in Reset()?
Nothing serious: Doing it here the user can run Tokenize() multiple times on 
the same input without calling Reset().Not sure if it's a valid scenario, 
though.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@93
PS2, Line 93: current_pos
> In some functions this is called 'current' while in others it is called 'cu
Used current_pos. Done.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@94
PS2, Line 94: current_pos < str_end)
> current_pos < str_end
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@96
PS2, Line 96: break;
> This should be break, otherwise the check in L101 won't be executed.
Good finding. Fixed and also wrote a test to cover this.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@104
PS2, Line 104: void DateTimeIsoSqlFormatTokenizer::ProcessSeparators(const 
char** current_pos) {
> DCHECK(*current != nullptr);
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@107
PS2, Line 107: har* str_end = dt_ctx_->fmt + dt_ctx_->fmt_l
> I think the condition should be:
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@114
PS2, Line 114:
> Mention in a comment that this function is doing greedy-matching and finds
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@115
PS2, Line 115: FormatTokenizationResult 
DateTimeIsoSqlFormatTokenizer::ProcessNextToken(
> DCHECK(*current_pos != nullptr);
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@120
PS2, Line 120: unsigned curr_token_size = min((long)MAX_TOKEN_SIZE, str_end - 
*current_pos);
             :   DCHECK(curr_token_size > 0);
> This should be done before the loop starts. No need to convert every prefix
Done, though won't have much impact on query performance as tokenization runs 
once per query and a format string won't have more than 10-15 tokens.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@122
PS2, Line 122: ring
> const auto
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@140
PS2, Line 140: retu
> const auto
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@204
PS2, Line 204:     return SECOND_IN_DAY_CONFLICT;
> I think, you should also check that no more than one fractional tokens were
I've put these check here in CheckIncompatibilities() and also added test to 
cover this.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@218
PS2, Line 218: (const string
> We can pass the char directly. No need for a pointer.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h
File be/src/runtime/datetime-parser-common.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@128
PS2, Line 128: length of the input format string. However, there ar
> nit: produces output that is longer than the format string.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@129
PS2, Line 129:
> nit: from
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@157
PS2, Line 157:
> nit: 1 space
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc
File be/src/runtime/datetime-parser-common.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@46
PS2, Line 46:   DCHECK(!century_break_ptime.is_special());
> Add DCHECK(!century_break_time.is_special())
Done, however, this is basically copy-pasting this code to a common place.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@61
PS2, Line 61:     const StringVal& format, bool is_error) {
> DCHECK(context != nullptr);
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@118
PS2, Line 118: lse {
> Maybe "Parse" instead of "Construct" would be better in function names here
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120
PS2, Line 120:   }
> Here and in the functions below add:
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120
PS2, Line 120:   }
             : }
             :
             : bool ParseAndValidate(const char* token_group, int token_len, 
int min, int max,
             :     int* resul
> All the Construct.. functions are very similar, only the valid range differ
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@181
PS2, Line 181:
> Can you use std::pow() to simplify this?
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.h
File be/src/runtime/datetime-simple-date-format-parser.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.h@90
PS2, Line 90:  Parse the d
> rename to InitSimpleDateParseCtx().
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc
File be/src/runtime/datetime-simple-date-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@372
PS2, Line 372: DateTimeSimpleDateForma
> Rename to GetDefaultSimpleDateFormatContext() ?
For ISO SQL parsing there is nothing like default context as there is always a 
format given by the user. So calling this GetDefaultFormatContext() wouldn't 
lead to confusion between the two parsing.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@488
PS2, Line 488:      }
             :       case MONTH_IN_YEAR_SLT: {
             :         char raw_buff[tok.len];
             :         std::transform(tok_val, tok_val + tok.len, raw_buff, 
::tolower);
             :         StringValue buff(raw_buff, tok.len);
             :         boost::unordered_map<StringValue, int>::const_iterator 
iter =
             :             REV_MONTH_INDEX.find(buff);
             :         if (UNLIKELY(iter == REV_MONTH_INDEX.end())) return 
false;
             :         dt_res
> This should go to a separate function as well.
I could reply the same here as I had for the same in the ISO version: I wanted 
to move only the commonly used functionality to separate function. I prefer 
keeping the ones that are used only by one parsing (ISO SQL or 
SimpleDateFormat) untouched.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.h
File be/src/runtime/timestamp-parse-util.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.h@67
PS2, Line 67: Otherwise it sets
> nit: Otherwise
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc
File be/src/runtime/timestamp-parse-util.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@43
PS2, Line 43: ou
> nit:double space.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@43
PS2, Line 43:  t
> nit: Double space.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@130
PS2, Line 130:
> No need for a fully qualified name, 'time_duration' is available in this na
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@150
PS2, Line 150:   day_offset = AdjustWithTimezone(t, dt_result.tz_offset);
> Before this change, timezone-adjustment was done only when dt_ctx.has_time_
If has_time_toks=false then there won't be any time tokens to be adjusted with 
the TZ offset. But it's true that we can ignore this function call in this case 
so I'll move it back into the if.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@231
PS2, Line 231: val = d.year
> tok.len < 4
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@232
PS2, Line 232:        if (tok.len < 4) {
             :           int adjust_factor = std::
> Isn't this a breaking change. "% 100" vs "% adjust_factor" ?
I would call it a bugfix as previously round year didn't work well for all 
input year lengths.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/service/impala-server.cc@1119
PS2, Line 1119: // For testing purposes
> Is this query option used for testing only? Please add a comment to explain
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/common/thrift/Exprs.thrift
File common/thrift/Exprs.thrift:

http://gerrit.cloudera.org:8080/#/c/13722/2/common/thrift/Exprs.thrift@144
PS2, Line 144: TCastExpr
> Consider renaming to TCastFormatExpr or something similar to emphasize that
I would leave it as it is to be consistent with the other elements of TExprNode.


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/cup/sql-parser.cup@2937
PS2, Line 2937: cast_format_val
> Consider renaming this to 'cast_format_val'
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@51
PS2, Line 51: // Stores the value of the FORMAT
> Add a comment to describe the new member.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@208
PS2, Line 208: tFormat_.is
> What if 'castFormat_' includes an apostrophe?
Will put the value within quotation marks as that is not allowed within a 
format.


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@226
PS2, Line 226:     msg.node_type = TExprNodeType.FUNCTION_CALL;
             :     // Sets cast_expr in case FORMAT clause was provided and 
this is not an intermediate
             :     // step between a char and a datetime.
> The naming of this variable is a bit confusing. 'isIntermediateStep' is set
It's a bit difficult to understand the intention when I simply add the 
condition to the if. Did so and wrote a short comment to clarify.


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@280
PS2, Line 280:
             :     if (null != castFormat_ && !isIntermediateStepNeeded) {
             :       if (!(type_.isDateOrTimeType() && 
getChild(0).getType().isStringType()) &&
             :           !(type_.isStringType() && 
getChild(0).getType().isDateOrTimeType())) {
             :         // FORMAT clause works only for casting between date 
types and string types
             :         throw new AnalysisException("FORMAT clause is not 
applicable from " +
             :             getChild(0).getType() + " to " + type_);
             :       }
             :       if (castFormat_.isEmpty()) {
             :         throw new AnalysisException("FORMAT clause can't be 
empty");
             :
> This block can be used as the else branch of he previous 'if' statement. An
The purpose of that variable is to enhance code readability.

It won't be simple an else branch as I also have to check if castFormat_ is 
set. I found it more readable this way.


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java:

http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java@3088
PS2, Line 3088:  public void TestCastFormatClauseFromDatetime() throws 
AnalysisException {
              :     RunCastFormatTestOnType("TIMESTAMP");
              :     RunCastFormatTestOnType("DATE");
              :   }
              :
              :   private void RunCastFormatTestOnType(String type) {
              :     String to_timestamp_cast = "cast('05-01-2017' as " + type + 
")";
              :     AnalysisError(
              :         "select cast(" + to_timestamp_cast + " as DATETIME 
FORMAT 'MM-dd-yyyy')",
              :         "Unsupported data type: DATETIME");
              :     if (!type.equals("TIMESTAMP")) {
              :       AnalysisError(
              :           "select cast(" + to_timestamp_cast + " as TIMESTAMP 
FORMAT 'MM-dd-yyyy')",
              :           "FORMAT clause is not applicable from " + type + " to 
TIMESTAMP");
              :     }
              :     if (!type.equals("DATE")) {
              :       AnalysisError("select cast(" + to_timestamp_cast + " as 
DATE FORMAT 'MM-dd-yyyy')",
              :
> Would it make sense to add similar tests for DATE instead of TIMESTAMP?
It absolutely would. Done.


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java@3109
PS2, Line 3109:     AnalysisError("select cast(" + to_timestamp_cast + " AS 
BOOLEAN FORMAT 'MM-dd-yyyy')",
              :         "FORMAT clause is not applicable from " + type + " to 
BOOLEAN");
              :     AnalysisError("select cast(" + to_timestamp_cast + " AS 
DOUBLE FORMAT 'MM-
> Please add similar test for DATE instead of TIMESTAMP.
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/ParserTest.java
File fe/src/test/java/org/apache/impala/analysis/ParserTest.java:

http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1458
PS2, Line 1458:     ParsesOk("select cast('05-01-2017' as timestamp format 
'MM-dd-yyyy')");
              :     ParsesOk("select cast('2017.01.02' as date format 
'YYYY-MM-DD')");
              :     ParserError("select cast(a + 5.0 as badtype) from t");
              :     ParserError("select cast(a + 5.0, string) from t");
              :     ParserError("select cast('05-01-2017' as timestamp format 
1234
> Please add similar tests for DATE.
from parsing perspective it doesn't really make a difference to do the same for 
dates as well. I re-wrote one of these to use date instead of TS.


http://gerrit.cloudera.org:8080/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test
File testdata/workloads/functional-query/queries/QueryTest/date.test:

http://gerrit.cloudera.org:8080/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test@629
PS2, Line 629: ====
             : ---- QUERY
             : select cast("2014/- ,11'/:-05" as date format "YYYY-MM-DD");
             : ---- RESULTS
             : 2014-11-05
             : ---- TYPES
             : DATE
> This should work right? Any reason it is commented out?
Yes, removed the comments. Note, the testing framework for some reason fails on 
';' char so I removed it from the input.


http://gerrit.cloudera.org:8080/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test@841
PS2, Line 841:
> Extra space
Done


http://gerrit.cloudera.org:8080/#/c/13722/2/tests/query_test/test_cast_with_format.py
File tests/query_test/test_cast_with_format.py:

http://gerrit.cloudera.org:8080/#/c/13722/2/tests/query_test/test_cast_with_format.py@22
PS2, Line 22: TestCastWithFormat
> Could the tests in this script be moved to cast_format.test? Or to another
There was one reason I couldn't do that: The tests in cast_format.test are run 
for both beeswax and HS2 and apparently those differ in how many digits they 
pad the fractional second part with.

select cast("2019-10-10 12:13:14.12345" as timestamp)
Beeswax: 2019-10-10 12:13:14.123450000
HS2:     2019-10-10 12:13:14.123450

As most of the tests contain fractional seconds I couldn't simply move them to 
that file. I could remove either Beeswax or HS2 from the test matrix but in 
that case we would lose some valuable coverage.

On a sidenote: It's a good question why these interfaces pad fractional seconds 
differently and if it's only related to the tests or to the interfaces 
themselves. I haven't investigated this part further but this characteristic is 
also present without this change.



--
To view, visit http://gerrit.cloudera.org:8080/13722
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I19d8d097a45ae6f103b6cd1b2d81aad38dfd9e23
Gerrit-Change-Number: 13722
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Jul 2019 09:57:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1

Reply via email to