Lars Volker has posted comments on this change. Change subject: IMPALA-2979: Fix scheduling on remote hosts ......................................................................
Patch Set 24: (55 comments) Thanks for the review. Please see PS25. http://gerrit.cloudera.org:8080/#/c/2200/24/be/src/scheduling/simple-scheduler.cc File be/src/scheduling/simple-scheduler.cc: Line 360: return backend_maps; > does this create another copy? For debug builds I'm not sure if the compiler will be able to elide it. In release builds, it will elide it. Line 439: AssignmentCtx assignment_ctx; > this contains the assignment heap. so this isn't load-balanced across the q Yes, this follows the original approach of the scheduler to only track assigned bytes per node, which is also what we test. Switching to query-wide bookkeeping looks like an isolated change to me, so should we do that in a new change (including new tests)? Line 440: BackendCtx backend_ctx(AtomicReadBackendMapsPtr()); > can we get the backend config once per query? See previous comment. Line 451: TNetworkAddress backend_hostport; > this looks like a write-only variable (except you also read it in l518, but Done Line 453: backend_hostport = MakeNetworkAddress(FLAGS_hostname, FLAGS_be_port); > where is this recorded? It was not, I added a test for that. Line 508: // optimal OS buffer cache utilization. > i'd say "there's no OS buffer cache to worry about" Done Line 519: scan_range_locations, backend_ctx, &assignment_ctx, &byte_counters, assignment); > since assignment_ctx records the per-backend assignment state, do the count Moved them into assignment_ctx Line 525: } // End of backend host selection. > you're not using backend_hostport here Done Line 534: RecordScanRangeAssignment( backend_hostport, node_id, host_list, > formatting Done Line 542: if (VLOG_FILE_IS_ON) { > move this into a member function of the relevant struct (this function is a Done Line 565: void SimpleScheduler::RecordScanRangeAssignment( const TNetworkAddress& backend_hostport, > formatting Done Line 568: AssignmentCtx* assignment_ctx, ByteCounters* byte_counters, > backend_ctx, assignment_ctx, and byte_counters always show up together. is Done Line 596: for (const TScanRangeLocation& location: scan_range_locations.locations) { > why are we iterating over these again? it feels like we already selected a We need to figure out if the read is remote. When aiming to pick a remote replica (e.g. replica_preference = REMOTE) we might still pick a local one by chance, so we need to re-check to get the statistics right. The comment in L589 is meant to explain this, should I elaborate more on it? Line 1058: // TODO: Do we have rules how to format and indent lambdas? > maybe like a function? Done Line 1060: return backend_ctx.GetBackendRank(data_locations[a]) > which would require extra indentation here and the closing brace on a separ Done Line 1070: { > formatting Done Line 1075: DCHECK(backend_ctx->NumBackends() == assignment_ctx.backend_heap().size()); > this is an invariant between two data structures which you didn't point out This is actually only the case if HasUnusedBackends() is false. I added a comment. PS24, Line 1089: std::mt19937 g(rand()) > do we want some degree of determinism here, in order to make tests determin Yes, I used rand() here to get deterministic randomness. It will return random values, but will work fine for tests. Getting the seed from the scheduler will move the problem there. To solve this, we should inject the random number generator as a dependency into the scheduler. There is a TODO for this at the class level comment. Can we leave it as is and address this in a subsequent change? PS24, Line 1089: std::mt19937 g(rand()) > You'd probably want to use a std::random_device() to seed the generator her See comment below Line 1090: std::shuffle(random_backend_order_.begin(), random_backend_order_.end(), g); > hm, you're doing this once per plan node. i'd be interesting to see the tim Do you think this will be prohibitively slow and we should perform a benchmark now? Shuffle has linear complexity, so I don't expect this to be an issue and if you're ok with it, I'd include this in the benchmarks to be done in the future. Line 1110: bool SimpleScheduler::BackendCtx::LookUpBackendIp(const Hostname& hostname, > member fn of BackendMaps? Done http://gerrit.cloudera.org:8080/#/c/2200/24/be/src/scheduling/simple-scheduler.h File be/src/scheduling/simple-scheduler.h: Line 93: /// Map from a backend's IP address to a list of backends running on that node. > a host's ip address? Done Line 96: /// Map from a backend's IP address to the next position in its corresponding list of > a host's ip address? Done Line 100: /// Map from a backend's hostname to its IP address to support hostname based backend > why "backend's hostname" instead of just "hostname"? Done Line 107: struct BackendMaps { > this is the semi-static configuration that is external to the scheduling pr Done Line 114: /// TODO: Better name for BackendAssignmentInfo? AssignmentInfoElement? > what's wrong with BackendAssignmentInfo? looks like that's exactly what it Done Line 117: struct BackendAssignmentInfo { > add comments for fields (who sets it, invariants) Done Line 123: return std::tie(assigned_bytes, random_rank) > > i had to look this up. i think it's easier to read when converted to compar Done Line 128: /// Heap to compute candidates for scan range assignments. By default boost implements > you never spell out how exactly that heap is organized (i know what it is, Done Line 138: /// Struct to store backend information in an addressable heap. > add information important to the understanding of this data structure: wher Done Line 148: AssignmentHeap backend_heap_; > confusing to name the type and the variable differently Done, I renamed both the type and the variable, they're still different but in a less confusing way (I hope). Line 153: /// TODO: Better name for BackendCtx? > what does this provide context for? Done Line 157: class BackendCtx { > what is the logical division between this struct and AssignmentCtx? is ther Done Line 161: bool HasUnusedBackends() const; > comment Done Line 181: const BackendMap& backend_map() const { return backend_maps_->backend_map; } > too many similar sounding names also make the code harder to follow Renamed BackendMaps to BackendConfig. PS24, Line 192: int next_unused_backend_idx_; > Shouldn't this be an atomicInt? These objects are created per invocation of ComputeScanRangeAssignment and thus don't need to be thread-safe. I added this to the class comment. PS24, Line 192: int next_unused_backend_idx_; > describe somewhere how you're keeping track of unused backends Done Line 194: /// Store a random permutation of backend hosts to select backends from. > provide class comment that gives context: why random permutations? etc. Done Line 201: /// A struct to track various counts of assigned bytes during scheduling. > AssignedBytesCounters then? provides more information than ByteCounters Done Line 202: /// TODO: Move this into AssignmentCtx? > good question, but definitely worth answering Done Line 211: /// and atomically swap the contents of this pointer. > provide an *outline* of the main data structures and the scheduling process This is covered in the comment of ComputeScanRangeAssignment. The whole SimpleScheduler class does much more than that, so I'm reluctant to put it there. Let me know if you think that's the better place though. We should also consider breaking SimpleScheduler into several classes, but IMO that's better done in a subsequent refactoring change. Line 216: mutable boost::mutex backend_maps_lock_; > leave todo, maybe in class comment, to switch to kudu's rw locks, this is a There is already a TODO to replace backend_maps_ (now backend_config_) with atomic_shared_ptr once compilers support them. I added a note to that comment to try Kudu's locks in comparison. Line 288: /// protecting the access by backend_maps_lock_. > "access with" Done Line 292: BackendMapsPtr AtomicReadBackendMapsPtr() const; > or simply Get-/SetBackendMapsPtr()? (or Get-/SetBackendConfig?) Done Line 333: /// of scan ranges. > i don't understand the meaning of the last sentence. Done, see if it is more clear now. assignment is a twofold indirection host -> plan_node_id -> list of scan ranges. Line 339: /// processed first. If multiple replicas also run a backend, then the 'memory distance' > what does "if multiple replicas also run a backend" mean? Done Line 340: /// of each backend is considered. The concept of memory distance reflects the cost of > good to point out: memory distance should roughly reflect the effort involv Done Line 349: /// the amount of work each backend has been assigned so far. > you could also paraphrase the last sentence as "they will be load-balanced Done Line 353: /// replica_preference: This value is used as a minimum memory distance for all > feel free to insert break after ':' and then use the full line width, to re Done Line 364: /// schedule_random_replica: When equivalent backends are found for a scan range (same > is this called node_random_replica below? No, this describes the query option, which is called "schedule_random_replica". The method's parameter "node_random_replica" is the optional query hint passed in the SQL statement. I added comments for all method parameters. PS24, Line 353: replica_preference: This value is used as a minimum memory distance for all : /// replicas. For example, by setting this to DISK_LOCAL, all : /// cached replicas will be treated as if they were not cached, : /// but local disk replicas. This can help prevent hot-spots by : /// spreading the assignments over more replicas. Allowed values : /// are CACHE_LOCAL, DISK_LOCAL and REMOTE. : /// : /// disable_cached_reads: Setting this value to true is equivalent to setting : /// replica_preference to DISK_LOCAL and takes precedence over : /// replica_preference. : /// : /// schedule_random_replica: When equivalent backends are found for a scan range (same : /// memory distance, same amount of assigned work), then : /// the first one will be picked deterministically. This aims : /// to make better use of OS buffer caches, but can lead to : /// performance bottlenecks on single hosts. Setting this : /// option to true will randomly change the order in which : /// equivalent replicas are picked for different plan nodes. : /// This helps to compute a more even assignment, on the : /// downside of increased memory usage for OS buffer caches. > "with the downside being" Done PS24, Line 353: replica_preference: This value is used as a minimum memory distance for all : /// replicas. For example, by setting this to DISK_LOCAL, all : /// cached replicas will be treated as if they were not cached, : /// but local disk replicas. This can help prevent hot-spots by : /// spreading the assignments over more replicas. Allowed values : /// are CACHE_LOCAL, DISK_LOCAL and REMOTE. : /// : /// disable_cached_reads: Setting this value to true is equivalent to setting : /// replica_preference to DISK_LOCAL and takes precedence over : /// replica_preference. : /// : /// schedule_random_replica: When equivalent backends are found for a scan range (same : /// memory distance, same amount of assigned work), then : /// the first one will be picked deterministically. This aims : /// to make better use of OS buffer caches, but can lead to : /// performance bottlenecks on single hosts. Setting this : /// option to true will randomly change the order in which : /// equivalent replicas are picked for different plan nodes. : /// This helps to compute a more even assignment, on the : /// downside of increased memory usage for OS buffer caches. > It be useful to mention the defaults of these query options here. Done Line 380: /// Add a single assignment to 'assignment'. > provide complete comment (at least point out how output vars are updated an Done. Being a class member method it will also affect member variables of the class (such as assignment_heap_ and assignment_byte_counters_), but I'll assume that those are side effects we generally don't document. Line 434: /// backend_ctx: Provides information on backends and the hosts they run on. > at this point i already forgot whether this was config information about ba Done Line 445: const IpAddr* SelectRemoteBackendHost(const AssignmentCtx& assignment_ctx, BackendCtx* backend_ctx); > long line Done -- To view, visit http://gerrit.cloudera.org:8080/2200 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I044f83806fcde820fcb38047cf6b8e780d803858 Gerrit-PatchSet: 24 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-Reviewer: anujphadke <apha...@cloudera.com> Gerrit-HasComments: Yes