Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/20548 )
Change subject: IMPALA-12308: DIRECTED distribution mode for V2 Iceberg tables ...................................................................... Patch Set 6: (43 comments) Thanks for all the feedback, People! I think I managed to address all of them except the following 2: - Make another perf measurement where I scale from 10 executor nodes to 20 or even more. Will update the commit message once done. - Ignore the fact when a delete file record is referring to a data file that is not part of the snapshot. Will make this change in a follow-up PS. http://gerrit.cloudera.org:8080/#/c/20548/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20548/4//COMMIT_MSG@26 PS4, Line 26: Hence, : V2 read optimization is turned off for table sampling. > Can we just drop such delete records instead? Compaction might also remove I also considered just silently dropping the delete rows when I don't find the associated data file. I decided be on the safe side and to fail the query instead because if we have a bug somewhere in the 'filename to hosts' mapping creation then we'd end up in a data correctness issue if we just ignored that a delete row that doesn't have corresponding entry in the map. I don't really have insight into how compaction works TBH. Is it a valid scenario that let's assume we have a delete file containing rows for 2 different data files, and when we do compaction where we get rid of one of the data files but leaving the delete file intact referring to both data files? If this scenario is in fact valid, then indeed we have to ignore the 'non-matching' delete rows. I don't get the other example where we mix MoR and CoW. In that case there won't be any delete files referring non-existing data files. Update: I believe compaction can cause a situation where delete file rows referring to non-existing data files. So I guess the only option I have here is to ignore such delete rows. Fingers crossed then not to have any bugs in the filename to hosts mapping mechanism now that remain under the radar now :) Will adjust this logic in a follow-up patch set. http://gerrit.cloudera.org:8080/#/c/20548/4//COMMIT_MSG@45 PS4, Line 45: 10-node cluster wi > These are not known upstream. Maybe you could just specify the characterist Ooop, I usually don't confuse the open source and the downstream world. Thanks for pointing that out! I'll do the measurement for more executors later on and will update the commit msg with the results. http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@330 PS4, Line 330: CopyFilenameToHostsMappingToRequest(&request); > Could extract this to a function. Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@331 PS4, Line 331: > Could we use the concrete types instead of auto here and on L337? It would Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@332 PS4, Line 332: VLOG_FILE << "making rpc: ExecQueryFInstances" > nit: could be range-based for loops for less verbosity Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@350 PS4, Line 350: google::protobuf::Map<int32, FileNameToHostsMapPB>* by_node_filename_to_hosts = > For better readability, can we put the above code to a separate function? Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc File be/src/runtime/fragment-state.cc: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@108 PS4, Line 108: DCHECK(tuple_desc != nullptr); > nit: could be nullptr Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@112 PS4, Line 112: > Could write out the type to make it easier to understand. well, it's this: std::unordered_map<std::string, std::pair<std::vector<NetworkAddressPB>, bool>> I didn't want to write it out as I found it too long. If you insists it'd increase readability, I can do so. http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@113 PS4, Line 113: for (auto file_to_hosts : node_to_files_it->second) { > nit: can we use range-based for loop? Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@116 PS4, Line 116: is_relative_path) { > Could extract this to a variable named "path_is_relative" or something simi Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.h File be/src/runtime/krpc-data-stream-sender.h: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.h@113 PS4, Line 113: M > nit: please add comma before 'and' it's not required to use a comma before an 'and' when you list things. http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.h@297 PS4, Line 297: d > Could it be a reference? Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc File be/src/runtime/krpc-data-stream-sender.cc: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1085 PS4, Line 1085: row > Shouldn't we call it 'tuple' instead of 'row'? Here one row has one tuple so kind of identical. I prefer row, for me it is easier to understand what it contains. http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1090 PS4, Line 1090: then > nit: then Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1091 PS4, Line 1091: channel > 'channels'? It seems there may be multiple channels we send to. Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1097 PS4, Line 1097: r, filename > I'm a bit unsure if using DebugString() is good, the name suggests it shoul Under the hood it's just "return string(ptr, len);". I'll replace this http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1142 PS4, Line 1142: content > Nit: contents. Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1143 PS4, Line 1143: empty()) { > It should never be nullptr right? It could be a DCHECK. I changed it from ptr to const ref so this is not relevant anymore. http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.h File be/src/runtime/query-state.h: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.h@123 PS4, Line 123: exact > nit: exact Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.h@126 PS4, Line 126: std::pair<std::vector<NetworkAddressPB>, bool> > For readability, it would probably make sense to use a struct with named fi Good point, but one thing that prevented me from introducing a struct is that the bool here says if we have relative or absolute path as the key of the map one level above. So a struct here would contain a vector of addresses and a bool that is not relevant for that vector. Couldn't really figure out a more sophisticated way of doing this. http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.cc@271 PS4, Line 271: // Making a copy of the "filename to fragment instance" mapping into std library types. > Could extract this into a function. Usually I'd do that but here the whole Init() function is about traversing data structures and assigning the values, so I figured moving this to a function doesn't really help much. http://gerrit.cloudera.org:8080/#/c/20548/5/be/src/scheduling/scheduler.h File be/src/scheduling/scheduler.h: http://gerrit.cloudera.org:8080/#/c/20548/5/be/src/scheduling/scheduler.h@86 PS5, Line 86: /// Map from filenames to hosts where those files are scheduled, grouped by scan node > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.h File be/src/scheduling/scheduler.h: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.h@86 PS4, Line 86: thos > Nit: those. Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.h@425 PS4, Line 425: y added items a > Mention in the comment what 'duplicate_check' is used for. Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@944 PS4, Line 944: const TPlanNode& node = state->GetNode(per_node_ranges.first); > nit: indent Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@952 PS4, Line 952: const HdfsFileSplitPB& hdfs_file_split = scan_ranges.scan_range().hdfs_file_split(); > scan_ranges.scan_range().hdfs_file_split() could be extracted Done http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@962 PS4, Line 962: std::unordered_set<NetworkAddressPB>& current_hosts = > Or it meant to be 'unordered_set<NetworkAddressPB>&'? yes, thx http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@962 PS4, Line 962: std::unordered_set<NetworkAddressPB>& current_hosts = > This copy assignment is unneeded I think, it ruins the duplicate check. This was meant to be a reference. Nice catch! http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@964 PS4, Line 964: if (current_hosts.find(finst.host) != current_hosts.end()) continue; > unordered_set handles the already-existing-element insertion and returns it In fact it returns std::pair<iterator,bool>. So I could do the following: if (!current_hosts.insert(finst.host).second)) continue; I think here getting the second part of the pair isn't that straightforward, so I'd prefer as it is now. http://gerrit.cloudera.org:8080/#/c/20548/4/common/thrift/Partitions.thrift File common/thrift/Partitions.thrift: http://gerrit.cloudera.org:8080/#/c/20548/4/common/thrift/Partitions.thrift@43 PS4, Line 43: content > Nit: contents? Done http://gerrit.cloudera.org:8080/#/c/20548/4/common/thrift/Partitions.thrift@44 PS4, Line 44: s that a > Can it be multiple instances if the file is big? If so, we should mention i Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java: http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@684 PS4, Line 684: on > Nit: on. Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java File fe/src/main/java/org/apache/impala/planner/ExchangeNode.java: http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@24 PS4, Line 24: import org.apache.impala.common.ImpalaException; > Do we use this import? Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@77 PS4, Line 77: nge( > nit: return Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@93 PS4, Line 93: > Is there a reason why you use 'fragment_' here and 'getFragment()' on L87? Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java: http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@68 PS4, Line 68: hav > Nit: have. Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@74 PS4, Line 74: MultiAggregateInfo aggInfo, List<FileDescriptor> fileDescs, : List<Expr> nonIdent > nit: indentation Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@272 PS4, Line 272: tblRef_.getSampleParams() != null > A comment could be useful explaining why we have this condition here. I kno Done http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java File fe/src/main/java/org/apache/impala/planner/JoinNode.java: http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java@219 PS4, Line 219: dis > nit: indentation Done http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test File testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test: http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test@339 PS4, Line 339: # filename to hosts mapping wouldn't contain all the data files, hence the DIRECTED > Optional: could mention that we can't use ICEBERG DELETE JOIN for sampling Done http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test: http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test@753 PS4, Line 753: ---- TYPES > Can we assert that there is indeed a separate fragment here (and that there Done http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test@753 PS4, Line 753: ---- TYPES > Maybe it's possible with an extra RUNTIME_PROFILE section and looking at so Done http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/set.test File testdata/workloads/functional-query/queries/QueryTest/set.test: http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/set.test@135 PS4, Line 135: > This can be confusing, because setting this has no effect on the distributi You're right, this is minimum confusing/misleading. I made some adjustments so that users won't be able to provide DIRECTED here and the hint won't give it as a valid option. -- To view, visit http://gerrit.cloudera.org:8080/20548 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I212afd7c9e94551a1c50a40ccb0e3c1f7ecdf3d2 Gerrit-Change-Number: 20548 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 25 Oct 2023 15:07:20 +0000 Gerrit-HasComments: Yes