Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/13722 )
Change subject: IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1 ...................................................................... Patch Set 3: (94 comments) Thanks for taking a look Attila! I managed to cover your comments except the ones referring to an existing implementation of "day of year" calculation as I don't have it locally. I'll do a rebase and tackle those as well. Note, I did some performance measurements an apparently the 'new' pattern handling is slower than the 'old' handling. This is also reflected in running the benchmarks and also executing some casts through impala-shell. I'm investigating where the difference is. http://gerrit.cloudera.org:8080/#/c/13722/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13722/2//COMMIT_MSG@49 PS2, Line 49: In a string type to timestamp conversion the timezone offset tokens : are parsed, expected to match with the input but they don't adjust : the result as the input is already expected to be in UTC format. > Is this behavior consistent with how other SQL systems work? Not really since e.g. Oracle have different types for timestamp with timezone and timestamp without timezone. Impala tries to use a single type for both purposes. Consulted with Zoltan Ivanfi and according to him the best solution here is to simply omit the timezone part of the income. Note, CAST() without FORMAT would also omit the timezone part if such an input is provided. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/benchmarks/parse-timestamp-benchmark.cc File be/src/benchmarks/parse-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/benchmarks/parse-timestamp-benchmark.cc@47 PS2, Line 47: // (relative) (relative) (relative) : //--------------------------------------------------------------------------------------------------------- : // BoostStringDate 0.667 0.691 0.731 1X 1X 1X : // BoostDate 0.642 0.679 0.704 0.962X 0.982X 0.963X : // Impala 6.67 6.92 7.25 10X 10X 9.93X : // : //ParseTimestamp: Function iters/ms 10%ile 50%ile 90%ile 10%ile 50%ile 90%ile : // (relative) (relative) (relative) : //--------------------------------------------------------------------------------------------------------- : // BoostTime 0.48 0.5 0.52 1X 1X 1X : // Impala 5.73 6 6.21 11.9X 12X 11.9X : // : //ParseTimestampWithFormat: Function iters/ms 10%ile 50%ile 90%ile 10%ile 50%ile 90%ile : // (relative) (relative) (relative) : //--------------------------------------------------------------------------------------------------------- : // BoostDateTime 0.241 0.255 0.26 1X 1X 1X : // ImpalaSimpleDateFormatTimeStamp 16 16.5 17.1 66.6X 64.7X 65.6X : // ImpalaSimpleDateFormatTZTimeStamp 16.2 16.6 17. > Maybe it wold make sense to add the new parsing functions to these benchmar Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/common/init.cc File be/src/common/init.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/common/init.cc@312 PS2, Line 312: DateTimeSimp > I think this should be renamed to InitSimpleDateParseCtx() to make it clear Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-expr.h File be/src/exprs/cast-expr.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-expr.h@29 PS2, Line 29: CastForm > If I understand this correctly, this class is used only for the new cast op Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc File be/src/exprs/cast-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@172 PS2, Line 172: const DateTimeFormatCo > const DateTimeFormatContext* Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@180 PS2, Line 180: int ret_val = tv.Format(*format_ctx, > Check the return value, like you do in L199-200. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@191 PS2, Line 191: if (UNLIKELY(!dv.IsVa > This should be const too. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@308 PS2, Line 308: Times > Rename to 'format_ctx' for consistency. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@308 PS2, Line 308: if (val.is_null) retu > This should be const too. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@312 PS2, Line 312: > const char* Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@315 PS2, Line 315: > const char* Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@342 PS2, Line 342: teVal CastFunctions::C > Should be const. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@342 PS2, Line 342: stToDa > Rename to 'format_ctx' Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@346 PS2, Line 346: > const char* Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/cast-functions-ir.cc@348 PS2, Line 348: erpre > const char* Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc@76 PS2, Line 76: DECLARE_bool(abort_on_config_error); > Why the extra new line? Some leftover. Removed. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc@3214 PS2, Line 3214: mpVa > Any reason this was changed to 2002? Leftover change from debugging. Reverted. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/exprs/expr-test.cc@3322 PS2, Line 3322: 20 > Any reason for these timestamp changes in L3322-3337? The following tests had the same expected output and when any f them failed it was hard to spot exactly which one was the problematic. So I changed them a bit so that I could find the failing test searching for the expected value in the output. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h File be/src/runtime/date-parse-util.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@36 PS2, Line 36: Parse > Probably this should be renamed to ParseSimpleDateFormat(), right? Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@47 PS2, Line 47: Parse > Same here. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@84 PS2, Line 84: output parameter > "set output parameter to an invalid DateValue" Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.h@86 PS2, Line 86: tatic bool IndicateDateParseFailure(DateValue* date); > I think this would be better placed in the .cc file only. I prefer not having functions hanging around without belonging to a class and I'm not a big fan of macros. As this function is used only in this class I'd leave it here as a private function. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc File be/src/runtime/date-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc@40 PS2, Line 40: Parse > rename to ParseSimpleDateFormat()? Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc@90 PS2, Line 90: te) { > Rename to ParseSimpleDateFormat()? Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-parse-util.cc@190 PS2, Line 190: bool DateParser::IndicateDateParseFailure(DateValue* date) { : *date = DateValue(); : return false; : } > A macro instead would be easier to use See my comment in the header. (Thanks for the suggestion, though) http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-value.h File be/src/runtime/date-value.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/date-value.h@126 PS2, Line 126: Parse > Rename this and Parse() functions above to ParseSimpleDateFormat() Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.h File be/src/runtime/datetime-iso-sql-format-parser.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.h@24 PS2, Line 24: namespace impala > Probably not unnecessary. Dropped. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.h@39 PS2, Line 39: SED_RESULT; > I could be mistaken but I couldn't find any BE-tests for this function. I covered this with E2E tests in https://gerrit.cloudera.org/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test I didn't feel the need to test this function directly in BE tests to cover the same use cases. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc File be/src/runtime/datetime-iso-sql-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@54 PS2, Line 54: : while (i < dt_ctx.toks.size() && dt_ctx > current_pos should be validated first before calling IsSeparator() on it. Nice improvement indeed. Done. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@44 PS2, Line 44: const DateTimeFormatToken* tok = &dt_ctx.toks[i]; : if (tok->type == SEPARATOR) { : if (!DateTimeIsoSqlFormatTokenizer::IsSeparator(*current_pos)) return false; : // Advance to the end of the separator sequence. : ++current_pos; : while (current_pos - input_str < input_len : && DateTimeIsoSqlFormatTokenizer::IsSeparator(*current_pos)) { : ++current_pos; : } : // Advance to the end of the separator sequence in the expected tokens list. : ++i; : while (i < dt_ctx.toks.size() && dt_ctx.toks[i].type == SEPARATOR) ++i; : : // If we reached the end of input or the end of token sequence, we can return. : if (current_pos >= end_pos || i >= dt_ctx.toks.size()) { : return (current_pos >= end_pos && i >= dt_ctx.toks.size()); : } : : // Next token, following the separator sequence. : tok = &dt_ctx.toks[i]; : : // The last '-' of a separator sequence might be taken as a sign for timezone hour. : if (*(current_pos - 1) == '-' && tok->type == TIMEZONE_HOUR) { : --current_pos; : } : } : : const char* group_end_pos = GetNextTokenGroupFromInput(current_pos, : (int)(end_pos - current_pos), *tok); : i > I think this could be refactored a bit: Looks way cleaner than the mess I had. Thanks! http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@76 PS2, Line 76: switch(tok->type) { : case YEAR: { : if (!ParseAndValidate(current_pos, group_le > I think it would be better if GetNextTokenGroupFromInput() returned a bool I chose to return end_pos (or nullptr). Done. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@87 PS2, Line 87: } > I'm not sure if this is correct. Good catch. This parses only timestamps properly. The timestamp creation fails anyway later on for years smaller than 1400 so I can safely remove this check. Wrote some boundary tests for year to cover this. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@96 PS2, Line 96: break; > Same as L87 Same as above. Done. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@111 PS2, Line 111: > day_count < 1 ? Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@141 PS2, Line 141: if (!ParseFractionTokenGroup(current_pos, group_len, result)) return false; : break; : } : case MERIDIEM_INDICATOR: { : string indicator(current_pos, group_len); : boost::to_upper(indicator); : if (indicator == "PM" || indicator == "P.M.") result->hour += 12; : break; : } > Block can be moved to a separate functon, here and elsewhere in the switch The idea was to only move logic to functions that are used both by SimpleDateFormat and IsoSqlFormat parsing. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@184 PS2, Line 184: n_year -= month_leng > (result->year >= 0) ? Indeed, thx. Done. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@185 PS2, Line 185: > Introducing a new variable 'month' is not really necessary. renamed 'i' to 'month_id' for more readibility and dropped 'month' http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@206 PS2, Line 206: : const char* end_pos = input_str; > Return bool to signal success/failure. Or return end_pos instead of passing Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@208 PS2, Line 208: int len = 0; > We could just set *end_pos in the GetNextTokenGroupFromInput() function, in Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@219 PS2, Line 219: const char* input_str, int input_len) { > I think, this should be: No need to check if 'input_len' > 2 as the length of a TIMEZONE_HOUR is by default 3. This reduces it to 2 in case it doesn't start with a sign. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@225 PS2, Line 225: if (DateTimeIsoSqlFormatTokenizer::GetTokenType(token_str, &token > Isn't it a parsing error if (len < tok.len) but *end_pos is a separator? It's not an error. E.g. we have to parse inputs like this: select cast("2019-1-1" as timestamp format "YYYY-MM-DD") Here for the month and day token groups 'len'=1 while 'tok.len'=2 but still valid. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@229 PS2, Line 229: } : return nullptr; > Same as L206. Consider returning a bool or end_pos. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@231 PS2, Line 231: > ParseMeridiemIndicatorFromInput() function should set end_pos instead of ex Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@241 PS2, Line 241: } : : void DateTimeIsoSqlFormatParser::GetRoundYear(const TimestampValue* now, : DateTimeParseResult* result) { : DCHECK(now != nullptr); : DCHECK(result != nullptr); : DCHECK(result->year >= 0 && result->year < 100); : i > Duplicate code. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@253 PS2, Line 253: > actual_token_len > 0 && actual_token-len < 4 Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@261 PS2, Line 261: > Isn;t this called realigned year? Oracle and Postgre refers to this pattern as round year. Realigned year might be some internal naming in Impala for this pattern in SimpleDateFormat. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h File be/src/runtime/datetime-iso-sql-format-tokenizer.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@42 PS2, Line 42: boo > bool? No idea why I made it int :) Done. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@45 PS2, Line 45: accept_time_toks_ = time_toks; > Shouldn't 'used_tokens_' be cleared when tokenizer is reset? It's cleared at the end of Tokenize() after TokenizeImpl() was finished. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@51 PS2, Line 51: Tokenize > I could be mistaken but I couldn't find any BE-tests for this and the other I didn't write BE tests for this specific function as this is basically tested with each E2E test where cast has a format clause. Didn't see the point to cover the same use cases with a different level of tests. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc File be/src/runtime/datetime-iso-sql-format-tokenizer.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@87 PS2, Line 87: used_tokens_.clear(); > Any reason 'used_tokens_' is cleared here and not in Reset()? Nothing serious: Doing it here the user can run Tokenize() multiple times on the same input without calling Reset().Not sure if it's a valid scenario, though. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@93 PS2, Line 93: current_pos > In some functions this is called 'current' while in others it is called 'cu Used current_pos. Done. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@94 PS2, Line 94: current_pos < str_end) > current_pos < str_end Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@96 PS2, Line 96: break; > This should be break, otherwise the check in L101 won't be executed. Good finding. Fixed and also wrote a test to cover this. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@104 PS2, Line 104: void DateTimeIsoSqlFormatTokenizer::ProcessSeparators(const char** current_pos) { > DCHECK(*current != nullptr); Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@107 PS2, Line 107: har* str_end = dt_ctx_->fmt + dt_ctx_->fmt_l > I think the condition should be: Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@114 PS2, Line 114: > Mention in a comment that this function is doing greedy-matching and finds Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@115 PS2, Line 115: FormatTokenizationResult DateTimeIsoSqlFormatTokenizer::ProcessNextToken( > DCHECK(*current_pos != nullptr); Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@120 PS2, Line 120: unsigned curr_token_size = min((long)MAX_TOKEN_SIZE, str_end - *current_pos); : DCHECK(curr_token_size > 0); > This should be done before the loop starts. No need to convert every prefix Done, though won't have much impact on query performance as tokenization runs once per query and a format string won't have more than 10-15 tokens. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@122 PS2, Line 122: ring > const auto Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@140 PS2, Line 140: retu > const auto Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@204 PS2, Line 204: return SECOND_IN_DAY_CONFLICT; > I think, you should also check that no more than one fractional tokens were I've put these check here in CheckIncompatibilities() and also added test to cover this. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@218 PS2, Line 218: (const string > We can pass the char directly. No need for a pointer. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h File be/src/runtime/datetime-parser-common.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@128 PS2, Line 128: length of the input format string. However, there ar > nit: produces output that is longer than the format string. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@129 PS2, Line 129: > nit: from Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@157 PS2, Line 157: > nit: 1 space Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc File be/src/runtime/datetime-parser-common.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@46 PS2, Line 46: DCHECK(!century_break_ptime.is_special()); > Add DCHECK(!century_break_time.is_special()) Done, however, this is basically copy-pasting this code to a common place. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@61 PS2, Line 61: const StringVal& format, bool is_error) { > DCHECK(context != nullptr); Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@118 PS2, Line 118: lse { > Maybe "Parse" instead of "Construct" would be better in function names here Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120 PS2, Line 120: } > Here and in the functions below add: Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120 PS2, Line 120: } : } : : bool ParseAndValidate(const char* token_group, int token_len, int min, int max, : int* resul > All the Construct.. functions are very similar, only the valid range differ Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@181 PS2, Line 181: > Can you use std::pow() to simplify this? Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.h File be/src/runtime/datetime-simple-date-format-parser.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.h@90 PS2, Line 90: Parse the d > rename to InitSimpleDateParseCtx(). Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc File be/src/runtime/datetime-simple-date-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@372 PS2, Line 372: DateTimeSimpleDateForma > Rename to GetDefaultSimpleDateFormatContext() ? For ISO SQL parsing there is nothing like default context as there is always a format given by the user. So calling this GetDefaultFormatContext() wouldn't lead to confusion between the two parsing. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@488 PS2, Line 488: } : case MONTH_IN_YEAR_SLT: { : char raw_buff[tok.len]; : std::transform(tok_val, tok_val + tok.len, raw_buff, ::tolower); : StringValue buff(raw_buff, tok.len); : boost::unordered_map<StringValue, int>::const_iterator iter = : REV_MONTH_INDEX.find(buff); : if (UNLIKELY(iter == REV_MONTH_INDEX.end())) return false; : dt_res > This should go to a separate function as well. I could reply the same here as I had for the same in the ISO version: I wanted to move only the commonly used functionality to separate function. I prefer keeping the ones that are used only by one parsing (ISO SQL or SimpleDateFormat) untouched. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.h File be/src/runtime/timestamp-parse-util.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.h@67 PS2, Line 67: Otherwise it sets > nit: Otherwise Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc File be/src/runtime/timestamp-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@43 PS2, Line 43: ou > nit:double space. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@43 PS2, Line 43: t > nit: Double space. Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@130 PS2, Line 130: > No need for a fully qualified name, 'time_duration' is available in this na Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@150 PS2, Line 150: day_offset = AdjustWithTimezone(t, dt_result.tz_offset); > Before this change, timezone-adjustment was done only when dt_ctx.has_time_ If has_time_toks=false then there won't be any time tokens to be adjusted with the TZ offset. But it's true that we can ignore this function call in this case so I'll move it back into the if. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@231 PS2, Line 231: val = d.year > tok.len < 4 Done http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/timestamp-parse-util.cc@232 PS2, Line 232: if (tok.len < 4) { : int adjust_factor = std:: > Isn't this a breaking change. "% 100" vs "% adjust_factor" ? I would call it a bugfix as previously round year didn't work well for all input year lengths. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/service/impala-server.cc@1119 PS2, Line 1119: // For testing purposes > Is this query option used for testing only? Please add a comment to explain Done http://gerrit.cloudera.org:8080/#/c/13722/2/common/thrift/Exprs.thrift File common/thrift/Exprs.thrift: http://gerrit.cloudera.org:8080/#/c/13722/2/common/thrift/Exprs.thrift@144 PS2, Line 144: TCastExpr > Consider renaming to TCastFormatExpr or something similar to emphasize that I would leave it as it is to be consistent with the other elements of TExprNode. http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/cup/sql-parser.cup@2937 PS2, Line 2937: cast_format_val > Consider renaming this to 'cast_format_val' Done http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@51 PS2, Line 51: // Stores the value of the FORMAT > Add a comment to describe the new member. Done http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@208 PS2, Line 208: tFormat_.is > What if 'castFormat_' includes an apostrophe? Will put the value within quotation marks as that is not allowed within a format. http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@226 PS2, Line 226: msg.node_type = TExprNodeType.FUNCTION_CALL; : // Sets cast_expr in case FORMAT clause was provided and this is not an intermediate : // step between a char and a datetime. > The naming of this variable is a bit confusing. 'isIntermediateStep' is set It's a bit difficult to understand the intention when I simply add the condition to the if. Did so and wrote a short comment to clarify. http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@280 PS2, Line 280: : if (null != castFormat_ && !isIntermediateStepNeeded) { : if (!(type_.isDateOrTimeType() && getChild(0).getType().isStringType()) && : !(type_.isStringType() && getChild(0).getType().isDateOrTimeType())) { : // FORMAT clause works only for casting between date types and string types : throw new AnalysisException("FORMAT clause is not applicable from " + : getChild(0).getType() + " to " + type_); : } : if (castFormat_.isEmpty()) { : throw new AnalysisException("FORMAT clause can't be empty"); : > This block can be used as the else branch of he previous 'if' statement. An The purpose of that variable is to enhance code readability. It won't be simple an else branch as I also have to check if castFormat_ is set. I found it more readable this way. http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java: http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java@3088 PS2, Line 3088: public void TestCastFormatClauseFromDatetime() throws AnalysisException { : RunCastFormatTestOnType("TIMESTAMP"); : RunCastFormatTestOnType("DATE"); : } : : private void RunCastFormatTestOnType(String type) { : String to_timestamp_cast = "cast('05-01-2017' as " + type + ")"; : AnalysisError( : "select cast(" + to_timestamp_cast + " as DATETIME FORMAT 'MM-dd-yyyy')", : "Unsupported data type: DATETIME"); : if (!type.equals("TIMESTAMP")) { : AnalysisError( : "select cast(" + to_timestamp_cast + " as TIMESTAMP FORMAT 'MM-dd-yyyy')", : "FORMAT clause is not applicable from " + type + " to TIMESTAMP"); : } : if (!type.equals("DATE")) { : AnalysisError("select cast(" + to_timestamp_cast + " as DATE FORMAT 'MM-dd-yyyy')", : > Would it make sense to add similar tests for DATE instead of TIMESTAMP? It absolutely would. Done. http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java@3109 PS2, Line 3109: AnalysisError("select cast(" + to_timestamp_cast + " AS BOOLEAN FORMAT 'MM-dd-yyyy')", : "FORMAT clause is not applicable from " + type + " to BOOLEAN"); : AnalysisError("select cast(" + to_timestamp_cast + " AS DOUBLE FORMAT 'MM- > Please add similar test for DATE instead of TIMESTAMP. Done http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/ParserTest.java File fe/src/test/java/org/apache/impala/analysis/ParserTest.java: http://gerrit.cloudera.org:8080/#/c/13722/2/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1458 PS2, Line 1458: ParsesOk("select cast('05-01-2017' as timestamp format 'MM-dd-yyyy')"); : ParsesOk("select cast('2017.01.02' as date format 'YYYY-MM-DD')"); : ParserError("select cast(a + 5.0 as badtype) from t"); : ParserError("select cast(a + 5.0, string) from t"); : ParserError("select cast('05-01-2017' as timestamp format 1234 > Please add similar tests for DATE. from parsing perspective it doesn't really make a difference to do the same for dates as well. I re-wrote one of these to use date instead of TS. http://gerrit.cloudera.org:8080/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test File testdata/workloads/functional-query/queries/QueryTest/date.test: http://gerrit.cloudera.org:8080/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test@629 PS2, Line 629: ==== : ---- QUERY : select cast("2014/- ,11'/:-05" as date format "YYYY-MM-DD"); : ---- RESULTS : 2014-11-05 : ---- TYPES : DATE > This should work right? Any reason it is commented out? Yes, removed the comments. Note, the testing framework for some reason fails on ';' char so I removed it from the input. http://gerrit.cloudera.org:8080/#/c/13722/2/testdata/workloads/functional-query/queries/QueryTest/date.test@841 PS2, Line 841: > Extra space Done http://gerrit.cloudera.org:8080/#/c/13722/2/tests/query_test/test_cast_with_format.py File tests/query_test/test_cast_with_format.py: http://gerrit.cloudera.org:8080/#/c/13722/2/tests/query_test/test_cast_with_format.py@22 PS2, Line 22: TestCastWithFormat > Could the tests in this script be moved to cast_format.test? Or to another There was one reason I couldn't do that: The tests in cast_format.test are run for both beeswax and HS2 and apparently those differ in how many digits they pad the fractional second part with. select cast("2019-10-10 12:13:14.12345" as timestamp) Beeswax: 2019-10-10 12:13:14.123450000 HS2: 2019-10-10 12:13:14.123450 As most of the tests contain fractional seconds I couldn't simply move them to that file. I could remove either Beeswax or HS2 from the test matrix but in that case we would lose some valuable coverage. On a sidenote: It's a good question why these interfaces pad fractional seconds differently and if it's only related to the tests or to the interfaces themselves. I haven't investigated this part further but this characteristic is also present without this change. -- To view, visit http://gerrit.cloudera.org:8080/13722 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I19d8d097a45ae6f103b6cd1b2d81aad38dfd9e23 Gerrit-Change-Number: 13722 Gerrit-PatchSet: 3 Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Comment-Date: Tue, 09 Jul 2019 09:57:50 +0000 Gerrit-HasComments: Yes