Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12481 )

Change subject: IMPALA-7368: Add initial support for DATE type
......................................................................


Patch Set 11:

(32 comments)

nice work, pretty impressive test coverage. Mostly small items here.

http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@29
PS11, Line 29:     - from STRING to DATE if the source string value is used in a
             :       context where a DATE value is expected.
it's interesting that this differs from the behavior of TIMESTAMP today, at 
least for timestamp arithmetic expressions: SELECT '2019-01-01 12;12:12' + 
interval 1 day does not do an implicit conversion. (interestingly it does in 
MySQL).

Does the ANSI standard give rules for this?


http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@38
PS11, Line 38: E.g: year('2019-02-15') must resolve to
             :        year(TIMESTAMP) instead of year(DATE). Note, that 
year(DATE) is
             :        not implemented yet, so this is not an issue at the 
moment but
             :        it will be in the future.
wouldn't the result be the same either way for this example?


http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@42
PS11, Line 42: better fit
how is this defined, specifically?


http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@70
PS11, Line 70: complete DATE type implementation
should we consider turning it off by default and only enabling it once we're 
sure it's stable and correct?


http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@72
PS11, Line 72: - Add date support to the random query generator.
should we transition TPC-DS test tables over to 'date' instead of 'string' for 
the appropriate columns?


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc
File be/src/codegen/codegen-anyval.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc@68
PS11, Line 68:       return cg->i64_type();
why do we lower to i64 instead of i32 given the slot size is only 4 bytes? 
seems like we're using int32 below


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc@500
PS11, Line 500: stringstream error_msg;
              :             error_msg << "Cannot write DATE column to a PARQUET 
table.";
              :             return Status(error_msg.str());
nit: why not just pass that string directly?

Could we make this better by including the column name from the table desc?


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc@381
PS11, Line 381: val_struct->sum / val_struct->count
Is this correct behavior when the dates are negative? Dividing as signed 
integers means we'll "truncate towards zero" which means that average of 
(1969-01-01, 1969-01-02) would be 1969-01-02 whereas average of 1971-01-01 and 
1971-01-02 would be 1971-01-01. I think that's surprising because it exposes 
the epoch reference point in the semantics of the operation.

Does the ANSI standard have anything to say here?

Checking briefly on sqlfiddle:
- postgres 9.6 doesn't support AVG(date) at all. Neither does Oracle 11g R2 nor 
SQL Server 2017.
- mysql 5.6 implicitly casts the dates to an DECIMIAL and return something 
weird like 19690101.5. Not sure what's going on under the covers.
- hive 3.1 doesn't support AVG(date) either

Maybe we should remove this functionality for now and consider adding it back 
based on user demand?


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc
File be/src/exprs/conditional-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc@37
PS11, Line 37: ITERATE_OVER_SEQ
per comment elsewhere, I found it clearer in the old version since I didn't 
need to understand boost magic


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3386
PS11, Line 3386:
               : TEST_F(ExprTest, CastDateExprs) {
add some tests for invalid dates like feb 30


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3475
PS11, Line 3475:   TestDateValue("cast(cast('1400-01-01 00:00:00' as timestamp) 
as date)",
can you test a before-the-epoch timestamp that includes a time portion? eg 
'1960-01-01 23:59' and make sure it truncates "down" and not truncate "towards 
the epoch"?


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h
File be/src/runtime/date-value.h:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h@40
PS11, Line 40: /// - Proleptic Gregorian calendar is used to calculate the 
number of days since epoch,
             : ///   which can lead to different representation of historical 
dates compared to Hive.
             : ///   
(https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar)
I know this was from the original change a few months back, but should we be 
concerned about this? Should we have a compatibility mode even if there is a 
performance hit?

Doing some research online it seems like hive might have had different behavior 
previously but then switched to proleptic gregorian as of 3.1?


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h@36
PS11, Line 36:   return TimestampValue(EPOCH + 
boost::gregorian::date_duration(days), t);
is there any chance of exceptions in this boost API? I seem to recall there 
were some unexpected exceptions thrown by boost::date_time stuff


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc@638
PS11, Line 638:     jclass caller_class, jstring date) {
I see 'caller_class' was copied from above, but that's a somewhat inaccurate 
name. Really the class here is the FeSupport class itself, given this is a 
static method, right?


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h
File be/src/udf/udf.h:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h@599
PS11, Line 599:   DateVal(underlying_type_t val = 0) : val(val) { }
should be marked explicit?


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc
File be/src/util/string-parser-test.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@534
PS11, Line 534:   // Test bad formats.
worth a check where there is a valid leading date followed by junk ("2010-01-01 
foo") as well as one where it's a full valid timestamp ("2010-01-01 12:22:22") 
to make sure we get the expected results


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@544
PS11, Line 544:   // Test invalid month/day values.
what about special days like February 29th? Worth testing that in a year when 
it doesn't exist and a year when it exists, including perhaps some "exception" 
leap years? ie 2000-02-29 does exist but 2100-02-29 doesn't


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc
File be/src/util/symbols-util.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc@129
PS11, Line 129:     ITERATE_OVER_SEQ2(CASE_TYPE_APPEND_MANGLED_TOKEN,
I dunno that using the ITERATE_OVER is really any clearer than just repeating 
the macro N times and not even using BOOST_PP_STRINGIZE:

CASE_TYPE_APPEND_MANGLED_TOKEN(TYPE_BOOLEAN, "BooleanVal");
CASE_TYPE_APPEND_MANGLED_TOKEN(TYPE_TINYINT, "TinyIntVal");
...

it's only slightly more typing and I think less cognitive overhead for someone 
to go figure out what this weird boost macro magic is doing


http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift
File common/thrift/Exprs.thrift:

http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift@58
PS11, Line 58:   // String representation
             :   2: required string date_string
why is it necessary to include the string representation here too? I don't see 
a similar field in other literal expr types that have to be "parsed" and may 
not round-trip to exactly the same string the user input (timestamp, float, etc)


http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift@745
PS11, Line 745: sinxce
typo


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java
File fe/src/main/java/org/apache/impala/analysis/DateLiteral.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@61
PS11, Line 61:       throw new AnalysisException("Invalid date literal: " + 
e.getMessage(), e);
would this actually be due to an invalid date literal? or would this be truly a 
bug? maybe makes sense to have a different error message here due to an 
internal error?


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@65
PS11, Line 65:       throw new AnalysisException("Invalid date literal: " + 
strDate);
I think it's worth quoting the date literal they passed in so if they did 
something like DATE ' ' then they will get a readable error message


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@66
PS11, Line 66: else
nit: no need for else here since you threw above


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@103
PS11, Line 103: return
nit: missing space


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java
File fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java@83
PS11, Line 83: This is necessary as a DATE literal can have different string
             :     //       representations, but all these representations 
refer to the same partition
hrm... this part is spooking me a little bit. What happens if you have multiple 
partitions with different representations for the same date? Also, are you sure 
we aren't relying on immutability of the partitionSpec_ list anywhere? Is there 
any other way we can accomplish this?


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1148
PS11, Line 1148:     // Avg(Date)
is this capability part of the SQL standard?


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@293
PS11, Line 293: LongMath.pow(2, Integer.SIZE)
should we instead cap at the number of unique dates in 0000-00-00 to 
9999-12-31? I think this is smaller than 2^32 right?


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java
File fe/src/main/java/org/apache/impala/catalog/Function.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java@191
PS11, Line 191:   public boolean compare(Function other, CompareMode mode) {
Is there an ANSI standard for function resolution with implicit casts? Have you 
checked what Hive does here? It would be no good if we were inconsistent


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1248
PS11, Line 1248:               "Scanning DATE values is not supported for 
non-text fileformats ");
Can we include the table name here?


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@732
PS11, Line 732: TestResolveFunction
nit: here and elsewhere, the function names should start with a lower case 
letter in java


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
File fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java@67
PS11, Line 67:     testLiteralExprNegative("ABC", Type.DATE);
worth a negative test here for an invalid date like february 31?


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java
File fe/src/test/java/org/apache/impala/analysis/ParserTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1215
PS11, Line 1215:     ParserError("select date '2011--01'");
same, check for feb 31



--
To view, visit http://gerrit.cloudera.org:8080/12481
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Gerrit-Change-Number: 12481
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Thu, 14 Mar 2019 04:22:12 +0000
Gerrit-HasComments: Yes

Reply via email to