[Impala-ASF-CR] IMPALA-3504: [DOCS] Document utc timestamp()
Zoltan Ivanfi has posted comments on this change. ( http://gerrit.cloudera.org:8080/8190 ) Change subject: IMPALA-3504: [DOCS] Document utc_timestamp() .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/8190/2/docs/topics/impala_datetime_functions.xml File docs/topics/impala_datetime_functions.xml: http://gerrit.cloudera.org:8080/#/c/8190/2/docs/topics/impala_datetime_functions.xml@2543 PS2, Line 2543: For working with date/time values represented as integer values, you can convert The most typical integer timestamp representation is the number of seconds (not microsecs) elapsed since the Unix epoch. It may be worth including specific instructions for dealing with these. If there are no conversion functions for these, then I suggest providing examples incorporating multiplication and division. http://gerrit.cloudera.org:8080/#/c/8190/2/docs/topics/impala_datetime_functions.xml@2545 PS2, Line 2545: with the unix_micros_to_utc_timestamp() and Shouldn't the unix_micros_to_utc_timestamp and utc_to_unix_micros functions be briefly documented separately as well, just like the from_unixtime and unix_timestamp functions are? At least an indexterm should be added in my opinion. http://gerrit.cloudera.org:8080/#/c/8190/2/docs/topics/impala_datetime_functions.xml@2557 PS2, Line 2557: select now(), utc_timestamp(); (nit, optional) This pairing of functions for the queries seems quite arbitrary to me and makes it hard to compare all three functions. I would suggest examples showing either each function separately (3 queries) or all of them in a single query: select now(), current_timestamp(), utc_timestamp(); -- To view, visit http://gerrit.cloudera.org:8080/8190 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia2662fc79d588f22a24a5067429a57b3c0d0f0f0 Gerrit-Change-Number: 8190 Gerrit-PatchSet: 2 Gerrit-Owner: John Russell Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: John Russell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Comment-Date: Mon, 09 Oct 2017 10:22:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG Commit Message: Line 23: global flag --prevent_parquet_mr_zone_adjustment is set to true. > Not sure if the name of this flag was already decided upon, but imo, it's n It is also important that it is only set on new tables and that it is set to UTC. I would suggest --set_parquet_mr_int96_write_zone_to_utc_on_new_tables http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 123: "Impala, Hive, and Spark to not apply timestamp timezone adjustments for parquet " > Mention that this property also makes Hive write in UTC. Hive always writes in UTC (that's the problem). It makes Hive write as if it was located in UTC instead of the actual local timezone, which makes the written value match the original value. I think this effect is so complex that we can not describe it in a few lines. I would replace the last sentence with "This changes the behavior of recent versions of Hive and Spark as well. The writing of timestamps is affected in Hive and Spark but not in Impala. The reading of timestamps that were written by Hive, Spark, or any other parquet-mr writer is affected in Hive, Spark and Impala. You can find details in the documentation." -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet .. Patch Set 2: (11 comments) Thanks for tackling this issue! I raised a few concerns but in general it looks good. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 237: timezone_ = TimezoneDatabase::FindTimezone( We should first check for TimezoneDatabase::IsTimestampDependentTimezone, otherwise there is no point in stroing a timezone_ that we won't use. Line 240: timezone_ = NULL; This exception should not be swallowed. If we can not convert to the desired time zone properly, we should show an error message to the user. Line 543: /// If not NULL, TIMESTAMP column readers require conversion to this time zone. Although technically correct, this comment is a bit misleading. One may easily interpret the if as an iff ("if and only if"). I would suggest to document its purpose instead: to cache the return value of TimezoneDatabase::FindTimezone in order to avoid repeated calls to it. Line 593: if (LIKELY(dst->HasDateAndTime())) { DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC") Line 597: dst->FromUtc(parent_->scan_node_->parquet_mr_time_zone()); This is somewhat confusing, as it seems as if the first two cases did the same thing: convert from UTC to the parent_->scan_node_->parquet_mr_time_zone() with the only exception that in the second case we use the timezone that we saved into the timezone_ field earlier, but it's value came from the same parent_->scan_node_->parquet_mr_time_zone() value. In reality though, the parameter of the first call is a string, while the parameter of the second one is a timezone, which leads to different behaviors. Since its very hard to notice this difference, please add comments to describe what's happening. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 189: time_zone_ptr TimezoneDatabase::FindTimezone(const string& tz) { This should be a private helper function with two callers: - FindTimezone(const string& tz, const TimestampValue& tv) as you already do it - FindTimezone(const string& tz) which should first ensure that !TimezoneDatabase::IsTimestampDependentTimezone(tz) http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 109: // If a table has an invalid timezone, it should not have been loaded. How does this comment apply here? Was it supposed to be before the catch block? Line 112: *this = ptime(not_a_date_time); Again, swallowing this exception can be a problem as end-users may not notice that they don't see all data correctly. I understand that there is a check higher in the hierarchy, but then the try-catch is unnecessary. If we have it regardless, we should handle it according to the desired output: an error presented to the user. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 201: /// Converts from UTC to local time in the given timezone. Please document the difference that one can be timestamp-dependent while the other not. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 352: extern "C" Please document the purpose of this call (allowing the Java code to check whether a timezone is valid in the C code). http://gerrit.cloudera.org:8080/#/c/5939/2/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java File fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java: Line 182: "'parquet.mr.int96.write.zone' is only supported for HDFS tables.")); Why? Parquet files should be handled this way regardless of using HDFS or not. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer .. Patch Set 2: > > (1 comment) > > Apologies for the delayed reply. Hive writes timestamps using 12 > bytes using little endian. Then it passes them to parquet-mr as a > BINARY string, which means it is hitting PARQUET-251. This explains > why I saw the odd values for min/max in my tests. > > Internally parquet-mr orders BINARY values using byte comparison, > potentially leading to a min/max value not being the semantically > smallest/largest value of a set of values. I am inclined to call > this a bug in hive, but I'm curious to hear what you think about > this. I don't think it's a bug that the min/max corresponds to the binary ordering, since at Parquet's level timestamps are just meaningless bytes. If we were using a proper Parquet logical type then it would be different, but when saving 12 bytes, I think the proper order is the binary ordering. In any case, I think we should aim for Hive-compatibility in this. The bug that causes the last row to be both the min and max values is a major pain though that will make column statistics for byte arrays totally useless. I don't see how we could handle that other than ignoring any such min/max values written by affected Hive versions. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 39: /// TIMESTAMP values are written in the in-memory format used by Impala, relative to UTC, > That's concerning if Hive and parquet-mr are inconsistent in how they defin It's not that Hive and parquet-mr do it differently, it's simply that there is no such type as our timestamp in Parquet (there is a separate timestamp_millis type), so parquet-mr doesn't handle this type at all, while Hive does. From Parquet's perspective it's just 96 bits without meaning, so probably the min and max should be calculated accordingly as well and not based on the base type. On the other hand, if Hive already has some implementation, we should probably use the same logic. But if I remember correctly, Lars looked at Hive's min/max values for timestamps and found that they were not the min/max values in any way, neither as instants on a timeline, neither as the 96 bits in any byte order. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 39: /// TIMESTAMP values are written in the in-memory format used by Impala, relative to UTC, > I think we convert them to INT96 for parquet. I think the INT96 order is eq Lars, if I remember correctly you found that Hive does not write statistics according to these rules. Did you find out what logic they follow? Is it different than this implementation? If so, will the parsing logic contain a condition that checks whether the stats were written by Hive? -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3973: optional 2nd and 3rd arguments for instr().
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3973: optional 2nd and 3rd arguments for instr(). .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/5589/2/docs/topics/impala_string_functions.xml File docs/topics/impala_string_functions.xml: PS2, Line 395: negative position argument > That particular circumstance causes an error instead of a zero return value That's intentional. In fact, initially I returned 0 in this case as well, but then acting on Dan Hecht's recommendation I checked this behavior in Oracle and found that Oracle gives an error in this case, so we do the same. Related discussion can be found on https://gerrit.cloudera.org/#/c/4094/13/be/src/exprs/string-functions-ir.cc@292 -- To view, visit http://gerrit.cloudera.org:8080/5589 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: John Russell Gerrit-Reviewer: Zoltan Ivanfi Gerrit-Reviewer: zi+z...@cloudera.com Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer .. Patch Set 2: > Please see https://issues.cloudera.org/browse/IMPALA-4371 It seems that Gerrit auto-linkifies JIRA entries even inside URLs, which results in invalid links. I should just probably refer to it as IMPALA-4371. -- To view, visit http://gerrit.cloudera.org:8080/4835 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer .. Patch Set 2: > Oh, oops, I wasn't noticing this was the writier. In that case, > why are these DCHECKs not valid? Why does our code create pages > with zero values (other than the last)? Or were we hitting the > non-zero compressed size dcheck (and if so, that also doesn't seem > to make sense). Originally I explained the reason in the commit message, but when I was asked to open a JIRA about it, I moved the explanation there. Please see https://issues.cloudera.org/browse/IMPALA-4371 -- To view, visit http://gerrit.cloudera.org:8080/4835 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer .. Patch Set 2: > Do you have access to the file to repro'ed this in the wild? And > given that DCHECKs are compiled out of release builds, how did this > cause a problem in the wild? In the wild, some `insert into ... select from ...` queries have hit this check when converting CSV to Parquet, but I do not have access to those CSV files. A debug build was used while investigating an unrelated issue, that's how the DCHECK failure was noticed. -- To view, visit http://gerrit.cloudera.org:8080/4835 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer .. Patch Set 2: > Is it reasonably possible to create a file that demonstrates this > and can be added to the regression test suite? Dan, it has just occurred to me that since this condition happens while writing a file and not while reading it, I don't see how we could use a test file to test it. Additionally, I don't see a real risk of regression, as I don't expect anyone to put in DCHECK-s based on the same false assumptions again. So considering the value of such a test and the difficulty of reproducing the exact conditions triggering the DCHECK-s, I suggest not creating a test file for the regression suite. -- To view, visit http://gerrit.cloudera.org:8080/4835 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer .. Patch Set 2: Hi Tim, Sorry for being unresponsive, I was on vacation for a week. I would vote for opening a follow-up JIRA for test coverage, as creating such a test file will be more complicated than identifying and fixing this issue was. However, we have seen this happening in the wild, so this is certainly not only a theoretical, but a real issue. -- To view, visit http://gerrit.cloudera.org:8080/4835 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer
Zoltan Ivanfi has uploaded a new patch set (#2). Change subject: IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer .. IMPALA-4371: Incorrect DCHECK-s in hdfs-parquet-table-writer Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 --- M be/src/exec/hdfs-parquet-table-writer.cc 1 file changed, 1 insertion(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/4835/2 -- To view, visit http://gerrit.cloudera.org:8080/4835 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] Remove seemingly incorrect DCHECK-s.
Zoltan Ivanfi has uploaded a new change for review. http://gerrit.cloudera.org:8080/4835 Change subject: Remove seemingly incorrect DCHECK-s. .. Remove seemingly incorrect DCHECK-s. The first conditional DCHECK means that if a page's size is 0 then it's compressed size is also 0. This, however, seems to be a false assumption, as the compressed output may include metadata, such as length or checksum. The GZIP compressor, for example, states that an input of 0 bytes requires 23 bytes when compressed. The Snappy compressor also mentions storing length information in the compressed output. The compressed size estimation in the LZ4 compressor also contains a constant part. The "Last page might be empty" comment and the second DCHECK also seems to be based on a false assumption. If a value doesn't fit on the current page, AppendRow creates a possible bigger new page and tries writing the data in the new page instead. This means that if the data is bigger than the page size, then the current page is finalized and a new page is added, even if the original page was empty. In other words, empty pages can occur in the middle of the pages_ array as well, not only at the end of it. Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 --- M be/src/exec/hdfs-parquet-table-writer.cc 1 file changed, 1 insertion(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/4835/1 -- To view, visit http://gerrit.cloudera.org:8080/4835 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I52e1b1354e9ea056b49331e75e53759952a81b76 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has abandoned this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Abandoned Abandoning the review until I can work on it again. -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4527/2/fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java: Line 76: return Joiner.on(".").join(getStandaloneIdentSqlList((Splitter.on('.').split(ident; > 1. It would be good talk about a concrete example. From my point of view, I Hi Alex, Thanks for the offer, I would certainly like to discuss the topic. Unfortuntely it looks like I won't have time to finish this change in the near future, as I was assigned to File Formats recently and there is a lot to do. Let's hibernate this review for a while and get back to it later. Thanks, Zoltan -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4527/2/fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java: Line 76: return Joiner.on(".").join(getStandaloneIdentSqlList((Splitter.on('.').split(ident; > I think we can still make progress here, I'm not sure I completely follow y If I remember correctly, I encountered a problem with the com.cloudera.impala.analysis.TableName class. It stores the table name as a String, but the table name itself can be qualified in case of complex types, thus it can not be quoted by simply putting it between backticks. It also seems that com.cloudera.impala.analysis.InsertStmt and com.cloudera.impala.catalog.View constructors take a List parameter for columns and store it in the object. It seems to me that in order to properly quote these, we would need to refactor the code to use a data structure capable of representing hierarchical names unambiguously to pass the table and column references around and to store them. I thought that the ongoing effort you mentioned is about addressing this. -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4527/2/fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java: Line 76: return Joiner.on(".").join(getStandaloneIdentSqlList((Splitter.on('.').split(ident; > Those are two separate issues. In your example, we are guarding against a l I added splitting because when I just put the whole string between backticks then Impala ended up escaping qualified identifiers of complex types incorrectly. I noticed that these references were already in a string form by the time getIdentSql gets called, so there is no way to properly quote them without refactoring to pass them in a structured format to the affected functions. The "Invalid column/field name" error message mislead me to believe that dots are not allowed in any identifier, which lead me to come up with my approach which would work if identifiers really couldn't contain dots. Since this turned out to be a false assumption, it seems that I have to discard this change as this task can only be implemented properly after IMPALA-2287 and probably some other refactorings. -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/4527/2/fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java: Line 69: public static String getIdentSql(String ident) { > So we are in agreement then? I agree that it's slighly less pleasing to the eye but I think correctness/unambiguity is more important than aesthetics. Also Hive does the same, so I think people are used to it. Line 76: return Joiner.on(".").join(getStandaloneIdentSqlList((Splitter.on('.').split(ident; > select `a.b.c`.`x.y.z` from (select 1 `x.y.z`) `a.b.c` Hmmm. Interesting. On the other hand: > create view test as select `a.b.c`.`x.y.z` from (select 1 `x.y.z`) `a.b.c`; ERROR: AnalysisException: Invalid column/field name: x.y.z This is a serious problem, because it means that identifier names are ambigous. I checked that in case of a struct, functions get "column_name.field_name" as the column name. I haven't checked your example yet, but my guess is that in this case the column name parameter will become "x.y.z". Still, the former should be quoted as `column_name`.`field_name` and latter as `x.y.z`. It seems that it's too late to fix this problem by the time getIdentSql gets called. -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/4527/2/fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java: Line 69: public static String getIdentSql(String ident) { > The main issue with this approach is that all invocations of toSql() will n Initially I was also worried that identifiers become "ugly" and considered making this behavior optional. I was thinking of a query option that the user can control using the SET statement. Then I checked how Hive works and I found that it already quotes every identifier (I checked the table and view definitions if I remember correctly), so I came to the conclusion that quoting everything should be okay. Of course I expected some suggestions to how it should work, that's why I didn't fix the styling issues in the test. Line 76: return Joiner.on(".").join(getStandaloneIdentSqlList((Splitter.on('.').split(ident; > I don't think splitting here is the right fix. An identifier itself could c I checked that identifiers can't contain dots, or at least users can't create such identifiers. If you try specifying `a.b` as an identifier, Impala will complain that dots are not allowed in identifier names. Do you know of a code piece that gets around this mechanism and creates identifiers with dots in some other way? Qualified table or column references are exactly the reason why I only made this version public and the other one private, as only this is safe to use to quote table or column references. In case of complex types, dots can become a part of either the table reference or the column reference. In case of a struct, a column reference might look like column_name.field_name, which should be quoted as `column_name`.`field_name`. In case of a map or array, table references might look like table_name.map_or_array_name, which should be quoted `table_name`.`map_or_array_name`. This is my understanding based on the short amount of time I spent on Impala, so please correct me if I'm wrong. Line 88: if (ident.equals("*")) { > This doesn't seem right or necessary. First, "*" is not an identifier (it i Without this check a SELECT * became SELECT `*` which is invalid. -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has uploaded a new patch set (#2). Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. IMPALA-784: Use `-s in SHOW CREATE TABLE output Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 --- M fe/src/main/java/com/cloudera/impala/analysis/ColumnDef.java M fe/src/main/java/com/cloudera/impala/analysis/DeleteStmt.java M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/TableName.java M fe/src/main/java/com/cloudera/impala/analysis/TableRef.java M fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java M fe/src/main/java/com/cloudera/impala/analysis/UpdateStmt.java M fe/src/test/java/com/cloudera/impala/analysis/ToSqlTest.java 8 files changed, 339 insertions(+), 338 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/4527/2 -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Patch Set 1: (1 comment) > How did you determine all the places that needed to be changed? I modified the expected results in ToSqlTest, then I checked for failures and fixed them. http://gerrit.cloudera.org:8080/#/c/4527/1/fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java File fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java: PS1, Line 72: reassamble > reassemble Done -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. Patch Set 1: The functional tests still fail with false positives like - CREATE TABLE functional_kudu.dimtbl (id BIGINT, name STRING, zip INT) + CREATE TABLE `functional_kudu`.`dimtbl` (`id` BIGINT, `name` STRING, `zip` INT) I haven't fixed these yet, because I still have to figure out where the expected strings come from, but first I wanted to make sure that we agree on the approach to be taken. The same applies to the overly long lines in the unit test. That test is fixed functionally but not stylistically, as I didn't want to spend too much time on it until I get feedback about whether this is the right approach. Thanks, Zoltan -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-784: Use `-s in SHOW CREATE TABLE output
Zoltan Ivanfi has uploaded a new change for review. http://gerrit.cloudera.org:8080/4527 Change subject: IMPALA-784: Use `-s in SHOW CREATE TABLE output .. IMPALA-784: Use `-s in SHOW CREATE TABLE output Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 --- M fe/src/main/java/com/cloudera/impala/analysis/ColumnDef.java M fe/src/main/java/com/cloudera/impala/analysis/DeleteStmt.java M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java M fe/src/main/java/com/cloudera/impala/analysis/TableName.java M fe/src/main/java/com/cloudera/impala/analysis/TableRef.java M fe/src/main/java/com/cloudera/impala/analysis/ToSqlUtils.java M fe/src/main/java/com/cloudera/impala/analysis/UpdateStmt.java M fe/src/test/java/com/cloudera/impala/analysis/ToSqlTest.java 8 files changed, 339 insertions(+), 338 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/4527/1 -- To view, visit http://gerrit.cloudera.org:8080/4527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I8e66439daa35d2e3725e7277c16f3cc5fdb6f128 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-3973: add position and occurrence to instr()
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3973: add position and occurrence to instr() .. Patch Set 17: Dan, could you please +2 again? I added the missing test case and I also had to rebase to allow jenkins to verify the change. Thanks. > Please see comment about possible missing test case, but if nothing > else changes, this looks good. -- To view, visit http://gerrit.cloudera.org:8080/4094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd Gerrit-PatchSet: 17 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3973: add position and occurrence to instr()
Hello Lars Volker, Matthew Jacobs, Internal Jenkins, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4094 to look at the new patch set (#16). Change subject: IMPALA-3973: add position and occurrence to instr() .. IMPALA-3973: add position and occurrence to instr() Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd --- M be/src/experiments/CMakeLists.txt R be/src/experiments/string-search-sse-test.cc M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc M be/src/exprs/string-functions.h M be/src/runtime/CMakeLists.txt A be/src/runtime/string-search-test.cc M be/src/runtime/string-search.h M common/function-registry/impala_functions.py 9 files changed, 343 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4094/16 -- To view, visit http://gerrit.cloudera.org:8080/4094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-3973: add position and occurrence to instr()
Hello Lars Volker, Matthew Jacobs, Internal Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4094 to look at the new patch set (#15). Change subject: IMPALA-3973: add position and occurrence to instr() .. IMPALA-3973: add position and occurrence to instr() Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd --- M be/src/experiments/CMakeLists.txt R be/src/experiments/string-search-sse-test.cc M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc M be/src/exprs/string-functions.h M be/src/runtime/CMakeLists.txt A be/src/runtime/string-search-test.cc M be/src/runtime/string-search.h M common/function-registry/impala_functions.py 9 files changed, 341 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4094/15 -- To view, visit http://gerrit.cloudera.org:8080/4094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-3973: add position and occurrence to instr()
Hello Lars Volker, Matthew Jacobs, Internal Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4094 to look at the new patch set (#14). Change subject: IMPALA-3973: add position and occurrence to instr() .. IMPALA-3973: add position and occurrence to instr() Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd --- M be/src/experiments/CMakeLists.txt R be/src/experiments/string-search-sse-test.cc M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc M be/src/exprs/string-functions.h M be/src/runtime/CMakeLists.txt A be/src/runtime/string-search-test.cc M be/src/runtime/string-search.h M common/function-registry/impala_functions.py 9 files changed, 339 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4094/14 -- To view, visit http://gerrit.cloudera.org:8080/4094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Zoltan Ivanfi
[Impala-ASF-CR] IMPALA-3973: add position and occurrence to instr()
Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-3973: add position and occurrence to instr() .. Patch Set 13: (5 comments) We should also update the documentation for instr(). Where can I do that? > FYI we should try to get query generator support for this, I filed: > https://issues.cloudera.org/browse/IMPALA-4115 http://gerrit.cloudera.org:8080/#/c/4094/13/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: Line 2064: TestValue("instr('', '')", TYPE_INT, 0); > would probably be good to add this test case for backwards search. Done http://gerrit.cloudera.org:8080/#/c/4094/13/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: Line 292: if (occurrence.val <= 0 || start_position.val == 0) return IntVal(0); > is this what other systems do? i.e. do they silently return 0 or do they ra Thanks for spotting this. An invalid value for start position is handled by silently returning a 0, therefore I assumed that an invalid occurrence value is handled in the same way. But in reality, the latter gives an error: [Oracle][ODBC][Ora]ORA-01428: argument '0' is out of range (SQL-HY000) PS13, Line 302: DCHECK > DCHECK_LT. Done PS13, Line 317: DCHECK > DCHECK_GT. but what is the meaning of this DCHECK? search_start_pos is the Done PS13, Line 319: std::min > This is pretty confusing. I think this would be clearer if we did the min() Done -- To view, visit http://gerrit.cloudera.org:8080/4094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie9648de458d243306fa14adc5e7f7002bf6f67fd Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zoltan Ivanfi Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Zoltan Ivanfi Gerrit-HasComments: Yes