[Impala-ASF-CR] IMPALA-12308: DIRECTED distribution mode for V2 Iceberg tables

Gabor Kaszab (Code Review) Wed, 25 Oct 2023 08:08:24 -0700

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20548 )


Change subject: IMPALA-12308: DIRECTED distribution mode for V2 Iceberg tables
......................................................................


Patch Set 6:

(43 comments)

Thanks for all the feedback, People!
I think I managed to address all of them except the following 2:
  - Make another perf measurement where I scale from 10 executor nodes to 20 or 
even more. Will update the commit message once done.
  - Ignore the fact when a delete file record is referring to a data file that 
is not part of the snapshot. Will make this change in a follow-up PS.

http://gerrit.cloudera.org:8080/#/c/20548/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20548/4//COMMIT_MSG@26
PS4, Line 26: Hence,
            : V2 read optimization is turned off for table sampling.
> Can we just drop such delete records instead? Compaction might also remove
I also considered just silently dropping the delete rows when I don't find the 
associated data file. I decided be on the safe side and to fail the query 
instead because if we have a bug somewhere in the 'filename to hosts' mapping 
creation then we'd end up in a data correctness issue if we just ignored that a 
delete row that doesn't have corresponding entry in the map.
I don't really have insight into how compaction works TBH. Is it a valid 
scenario that let's assume we have a delete file containing rows for 2 
different data files, and when we do compaction where we get rid of one of the 
data files but leaving the delete file intact referring to both data files? If 
this scenario is in fact valid, then indeed we have to ignore the 
'non-matching' delete rows.
I don't get the other example where we mix MoR and CoW. In that case there 
won't be any delete files referring non-existing data files.

Update: I believe compaction can cause a situation where delete file rows 
referring to non-existing data files. So I guess the only option I have here is 
to ignore such delete rows. Fingers crossed then not to have any bugs in the 
filename to hosts mapping mechanism now that remain under the radar now :)
Will adjust this logic in a follow-up patch set.


http://gerrit.cloudera.org:8080/#/c/20548/4//COMMIT_MSG@45
PS4, Line 45: 10-node cluster wi
> These are not known upstream. Maybe you could just specify the characterist
Ooop, I usually don't confuse the open source and the downstream world. Thanks 
for pointing that out!

I'll do the measurement for more executors later on and will update the commit 
msg with the results.


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@330
PS4, Line 330:     CopyFilenameToHostsMappingToRequest(&request);
> Could extract this to a function.
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@331
PS4, Line 331:
> Could we use the concrete types instead of auto here and on L337? It would
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@332
PS4, Line 332:     VLOG_FILE << "making rpc: ExecQueryFInstances"
> nit: could be range-based for loops for less verbosity
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/coordinator-backend-state.cc@350
PS4, Line 350:   google::protobuf::Map<int32, FileNameToHostsMapPB>* 
by_node_filename_to_hosts =
> For better readability, can we put the above code to a separate function?
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc
File be/src/runtime/fragment-state.cc:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@108
PS4, Line 108:       DCHECK(tuple_desc != nullptr);
> nit: could be nullptr
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@112
PS4, Line 112:
> Could write out the type to make it easier to understand.
well, it's this:
std::unordered_map<std::string, std::pair<std::vector<NetworkAddressPB>, bool>>
I didn't want to write it out as I found it too long. If you insists it'd 
increase readability, I can do so.


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@113
PS4, Line 113:       for (auto file_to_hosts : node_to_files_it->second) {
> nit: can we use range-based for loop?
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/fragment-state.cc@116
PS4, Line 116: is_relative_path) {
> Could extract this to a variable named "path_is_relative" or something simi
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.h
File be/src/runtime/krpc-data-stream-sender.h:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.h@113
PS4, Line 113: M
> nit: please add comma before 'and'
it's not required to use a comma before an 'and' when you list things.


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.h@297
PS4, Line 297: d
> Could it be a reference?
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc
File be/src/runtime/krpc-data-stream-sender.cc:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1085
PS4, Line 1085: row
> Shouldn't we call it 'tuple' instead of 'row'?
Here one row has one tuple so kind of identical. I prefer row, for me it is 
easier to understand what it contains.


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1090
PS4, Line 1090: then
> nit: then
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1091
PS4, Line 1091: channel
> 'channels'? It seems there may be multiple channels we send to.
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1097
PS4, Line 1097: r, filename
> I'm a bit unsure if using DebugString() is good, the name suggests it shoul
Under the hood it's just "return string(ptr, len);". I'll replace this


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1142
PS4, Line 1142: content
> Nit: contents.
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/krpc-data-stream-sender.cc@1143
PS4, Line 1143: empty()) {
> It should never be nullptr right? It could be a DCHECK.
I changed it from ptr to const ref so this is not relevant anymore.


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.h@123
PS4, Line 123: exact
> nit: exact
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.h@126
PS4, Line 126: std::pair<std::vector<NetworkAddressPB>, bool>
> For readability, it would probably make sense to use a struct with named fi
Good point, but one thing that prevented me from introducing a struct is that 
the bool here says if we have relative or absolute path as the key of the map 
one level above. So a struct here would contain a vector of addresses and a 
bool that is not relevant for that vector.
Couldn't really figure out a more sophisticated way of doing this.


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/runtime/query-state.cc@271
PS4, Line 271:   // Making a copy of the "filename to fragment instance" 
mapping into std library types.
> Could extract this into a function.
Usually I'd do that but here the whole Init() function is about traversing data 
structures and assigning the values, so I figured moving this to a function 
doesn't really help much.


http://gerrit.cloudera.org:8080/#/c/20548/5/be/src/scheduling/scheduler.h
File be/src/scheduling/scheduler.h:

http://gerrit.cloudera.org:8080/#/c/20548/5/be/src/scheduling/scheduler.h@86
PS5, Line 86:   /// Map from filenames to hosts where those files are 
scheduled, grouped by scan node
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.h
File be/src/scheduling/scheduler.h:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.h@86
PS4, Line 86: thos
> Nit: those.
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.h@425
PS4, Line 425: y added items a
> Mention in the comment what 'duplicate_check' is used for.
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@944
PS4, Line 944:     const TPlanNode& node = 
state->GetNode(per_node_ranges.first);
> nit: indent
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@952
PS4, Line 952:       const HdfsFileSplitPB& hdfs_file_split = 
scan_ranges.scan_range().hdfs_file_split();
> scan_ranges.scan_range().hdfs_file_split() could be extracted
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@962
PS4, Line 962:       std::unordered_set<NetworkAddressPB>& current_hosts =
> Or it meant to be 'unordered_set<NetworkAddressPB>&'?
yes, thx


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@962
PS4, Line 962:       std::unordered_set<NetworkAddressPB>& current_hosts =
> This copy assignment is unneeded I think, it ruins the duplicate check.
This was meant to be a reference. Nice catch!


http://gerrit.cloudera.org:8080/#/c/20548/4/be/src/scheduling/scheduler.cc@964
PS4, Line 964:       if (current_hosts.find(finst.host) != current_hosts.end()) 
continue;
> unordered_set handles the already-existing-element insertion and returns it
In fact it returns std::pair<iterator,bool>.
So I could do the following:
if (!current_hosts.insert(finst.host).second)) continue;

I think here getting the second part of the pair isn't that straightforward, so 
I'd prefer as it is now.


http://gerrit.cloudera.org:8080/#/c/20548/4/common/thrift/Partitions.thrift
File common/thrift/Partitions.thrift:

http://gerrit.cloudera.org:8080/#/c/20548/4/common/thrift/Partitions.thrift@43
PS4, Line 43: content
> Nit: contents?
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/common/thrift/Partitions.thrift@44
PS4, Line 44: s that a
> Can it be multiple instances if the file is big? If so, we should mention i
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@684
PS4, Line 684: on
> Nit: on.
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java
File fe/src/main/java/org/apache/impala/planner/ExchangeNode.java:

http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@24
PS4, Line 24: import org.apache.impala.common.ImpalaException;
> Do we use this import?
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@77
PS4, Line 77: nge(
> nit: return
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@93
PS4, Line 93:
> Is there a reason why you use 'fragment_' here and 'getFragment()' on L87?
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@68
PS4, Line 68: hav
> Nit: have.
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@74
PS4, Line 74:       MultiAggregateInfo aggInfo, List<FileDescriptor> fileDescs,
            :       List<Expr> nonIdent
> nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@272
PS4, Line 272: tblRef_.getSampleParams() != null
> A comment could be useful explaining why we have this condition here. I kno
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/20548/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java@219
PS4, Line 219:         dis
> nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
File testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test:

http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test@339
PS4, Line 339: # filename to hosts mapping wouldn't contain all the data files, 
hence the DIRECTED
> Optional: could mention that we can't use ICEBERG DELETE JOIN for sampling
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test:

http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test@753
PS4, Line 753: ---- TYPES
> Can we assert that there is indeed a separate fragment here (and that there
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test@753
PS4, Line 753: ---- TYPES
> Maybe it's possible with an extra RUNTIME_PROFILE section and looking at so
Done


http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/set.test
File testdata/workloads/functional-query/queries/QueryTest/set.test:

http://gerrit.cloudera.org:8080/#/c/20548/4/testdata/workloads/functional-query/queries/QueryTest/set.test@135
PS4, Line 135:
> This can be confusing, because setting this has no effect on the distributi
You're right, this is minimum confusing/misleading. I made some adjustments so 
that users won't be able to provide DIRECTED here and the hint won't give it as 
a valid option.



--
To view, visit http://gerrit.cloudera.org:8080/20548
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I212afd7c9e94551a1c50a40ccb0e3c1f7ecdf3d2
Gerrit-Change-Number: 20548
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Oct 2023 15:07:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-12308: DIRECTED distribution mode for V2 Iceberg tables

Reply via email to