Lars Volker has posted comments on this change.

Change subject: IMPALA-2979: Fix scheduling on remote hosts
......................................................................


Patch Set 24:

(55 comments)

Thanks for the review. Please see PS25.

http://gerrit.cloudera.org:8080/#/c/2200/24/be/src/scheduling/simple-scheduler.cc
File be/src/scheduling/simple-scheduler.cc:

Line 360:   return backend_maps;
> does this create another copy?
For debug builds I'm not sure whether the compiler will be able to elide the 
copy; in release builds it will be elided (NRVO).


Line 439:   AssignmentCtx assignment_ctx;
> this contains the assignment heap. so this isn't load-balanced across the q
Yes, this follows the original approach of the scheduler to only track assigned 
bytes per node, which is also what we test. Switching to query-wide bookkeeping 
looks like an isolated change to me, so should we do that in a new change 
(including new tests)?


Line 440:   BackendCtx backend_ctx(AtomicReadBackendMapsPtr());
> can we get the backend config once per query?
See previous comment.


Line 451:     TNetworkAddress backend_hostport;
> this looks like a write-only variable (except you also read it in l518, but
Done


Line 453:       backend_hostport = MakeNetworkAddress(FLAGS_hostname, FLAGS_be_port);
> where is this recorded?
It was not, I added a test for that.


Line 508:       //   optimal OS buffer cache utilization.
> i'd say "there's no OS buffer cache to worry about"
Done


Line 519:           scan_range_locations, backend_ctx, &assignment_ctx, &byte_counters, assignment);
> since assignment_ctx records the per-backend assignment state, do the count
Moved them into assignment_ctx


Line 525:     }  // End of backend host selection.
> you're not using backend_hostport here
Done


Line 534:     RecordScanRangeAssignment( backend_hostport, node_id, host_list,
> formatting
Done


Line 542:   if (VLOG_FILE_IS_ON) {
> move this into a member function of the relevant struct (this function is a
Done


Line 565: void SimpleScheduler::RecordScanRangeAssignment( const TNetworkAddress& backend_hostport,
> formatting
Done


Line 568:     AssignmentCtx* assignment_ctx, ByteCounters* byte_counters,
> backend_ctx, assignment_ctx, and byte_counters always show up together. is 
Done


Line 596:   for (const TScanRangeLocation& location: scan_range_locations.locations) {
> why are we iterating over these again? it feels like we already selected a 
We need to figure out whether the read is remote. When aiming to pick a remote 
replica (e.g. replica_preference = REMOTE) we might still pick a local one by 
chance, so we need to re-check to get the statistics right. The comment in L589 
is meant to explain this; should I elaborate more on it?


Line 1058:         // TODO: Do we have rules how to format and indent lambdas?
> maybe like a function?
Done


Line 1060:         return backend_ctx.GetBackendRank(data_locations[a])
> which would require extra indentation here and the closing brace on a separ
Done


Line 1070:   {
> formatting
Done


Line 1075:     DCHECK(backend_ctx->NumBackends() == assignment_ctx.backend_heap().size());
> this is an invariant between two data structures which you didn't point out
This is actually only the case if HasUnusedBackends() is false. I added a 
comment.


PS24, Line 1089: std::mt19937 g(rand())
> do we want some degree of determinism here, in order to make tests determin
Yes, I seeded with rand() here to get deterministic pseudo-randomness: the 
values look random but are reproducible, which works fine for tests. Getting 
the seed from the scheduler would only move the problem there. To solve this 
properly, we should inject the random number generator as a dependency into 
the scheduler; there is a TODO for this in the class-level comment. Can we 
leave it as is and address this in a subsequent change?


PS24, Line 1089: std::mt19937 g(rand())
> You'd probably want to use a std::random_device() to seed the generator her
See comment below


Line 1090:   std::shuffle(random_backend_order_.begin(), random_backend_order_.end(), g);
> hm, you're doing this once per plan node. i'd be interesting to see the tim
Do you think this will be prohibitively slow and that we should benchmark it 
now? std::shuffle has linear complexity, so I don't expect it to be an issue; 
if you're ok with it, I'd include this in the benchmarks to be done in the 
future.


Line 1110: bool SimpleScheduler::BackendCtx::LookUpBackendIp(const Hostname& hostname,
> member fn of BackendMaps?
Done


http://gerrit.cloudera.org:8080/#/c/2200/24/be/src/scheduling/simple-scheduler.h
File be/src/scheduling/simple-scheduler.h:

Line 93:   /// Map from a backend's IP address to a list of backends running on that node.
> a host's ip address?
Done


Line 96:   /// Map from a backend's IP address to the next position in its corresponding list of
> a host's ip address?
Done


Line 100:   /// Map from a backend's hostname to its IP address to support hostname based backend
> why "backend's hostname" instead of just "hostname"?
Done


Line 107:   struct BackendMaps {
> this is the semi-static configuration that is external to the scheduling pr
Done


Line 114:   /// TODO: Better name for BackendAssignmentInfo? AssignmentInfoElement?
> what's wrong with BackendAssignmentInfo? looks like that's exactly what it 
Done


Line 117:   struct BackendAssignmentInfo {
> add comments for fields (who sets it, invariants)
Done


Line 123:       return std::tie(assigned_bytes, random_rank) >
> i had to look this up. i think it's easier to read when converted to compar
Done


Line 128:   /// Heap to compute candidates for scan range assignments. By default boost implements
> you never spell out how exactly that heap is organized (i know what it is, 
Done


Line 138:   /// Struct to store backend information in an addressable heap.
> add information important to the understanding of this data structure: wher
Done


Line 148:     AssignmentHeap backend_heap_;
> confusing to name the type and the variable differently
Done. I renamed both the type and the variable; they're still different, but 
in a less confusing way (I hope).


Line 153:   /// TODO: Better name for BackendCtx?
> what does this provide context for?
Done


Line 157:   class BackendCtx {
> what is the logical division between this struct and AssignmentCtx? is ther
Done


Line 161:     bool HasUnusedBackends() const;
> comment
Done


Line 181:     const BackendMap& backend_map() const { return backend_maps_->backend_map; }
> too many similar sounding names also make the code harder to follow
Renamed BackendMaps to BackendConfig.


PS24, Line 192: int next_unused_backend_idx_;
> Shouldn't this be an atomicInt?
These objects are created per invocation of ComputeScanRangeAssignment and thus 
don't need to be thread-safe. I added this to the class comment.


PS24, Line 192: int next_unused_backend_idx_;
> describe somewhere how you're keeping track of unused backends
Done


Line 194:     /// Store a random permutation of backend hosts to select backends from.
> provide class comment that gives context: why random permutations? etc.
Done


Line 201:   /// A struct to track various counts of assigned bytes during scheduling.
> AssignedBytesCounters then? provides more information than ByteCounters
Done


Line 202:   /// TODO: Move this into AssignmentCtx?
> good question, but definitely worth answering
Done


Line 211:   /// and atomically swap the contents of this pointer.
> provide an *outline* of the main data structures and the scheduling process
This is covered in the comment of ComputeScanRangeAssignment. The whole 
SimpleScheduler class does much more than that, so I'm reluctant to put it 
there. Let me know if you think that's the better place though. We should also 
consider breaking SimpleScheduler into several classes, but IMO that's better 
done in a subsequent refactoring change.


Line 216:   mutable boost::mutex backend_maps_lock_;
> leave todo, maybe in class comment, to switch to kudu's rw locks, this is a
There is already a TODO to replace backend_maps_ (now backend_config_) with 
atomic_shared_ptr once compilers support them. I added a note to that comment 
to try Kudu's locks in comparison.


Line 288:   /// protecting the access by backend_maps_lock_.
> "access with"
Done


Line 292:   BackendMapsPtr AtomicReadBackendMapsPtr() const;
> or simply Get-/SetBackendMapsPtr()? (or Get-/SetBackendConfig?)
Done


Line 333:   /// of scan ranges.
> i don't understand the meaning of the last sentence.
Done, see if it is clearer now. 'assignment' is a twofold indirection: host -> 
plan_node_id -> list of scan ranges.


Line 339:   /// processed first. If multiple replicas also run a backend, then the 'memory distance'
> what does "if multiple replicas also run a backend" mean?
Done


Line 340:   /// of each backend is considered. The concept of memory distance reflects the cost of
> good to point out: memory distance should roughly reflect the effort involv
Done


Line 349:   /// the amount of work each backend has been assigned so far.
> you could also paraphrase the last sentence as "they will be load-balanced 
Done


Line 353:   /// replica_preference: This value is used as a minimum memory distance for all
> feel free to insert break after ':' and then use the full line width, to re
Done


Line 364:   /// schedule_random_replica: When equivalent backends are found for a scan range (same
> is this called node_random_replica below?
No, this describes the query option, which is called "schedule_random_replica". 
The method's parameter "node_random_replica" is the optional query hint passed 
in the SQL statement. I added comments for all method parameters.


PS24, Line 353: replica_preference: This value is used as a minimum memory distance for all
              :   ///                     replicas. For example, by setting this to DISK_LOCAL, all
              :   ///                     cached replicas will be treated as if they were not cached,
              :   ///                     but local disk replicas. This can help prevent hot-spots by
              :   ///                     spreading the assignments over more replicas. Allowed values
              :   ///                     are CACHE_LOCAL, DISK_LOCAL and REMOTE.
              :   ///
              :   /// disable_cached_reads: Setting this value to true is equivalent to setting
              :   ///                       replica_preference to DISK_LOCAL and takes precedence over
              :   ///                       replica_preference.
              :   ///
              :   /// schedule_random_replica: When equivalent backends are found for a scan range (same
              :   ///                          memory distance, same amount of assigned work), then
              :   ///                          the first one will be picked deterministically. This aims
              :   ///                          to make better use of OS buffer caches, but can lead to
              :   ///                          performance bottlenecks on single hosts. Setting this
              :   ///                          option to true will randomly change the order in which
              :   ///                          equivalent replicas are picked for different plan nodes.
              :   ///                          This helps to compute a more even assignment, on the
              :   ///                          downside of increased memory usage for OS buffer caches.
> "with the downside being"
Done


PS24, Line 353: replica_preference: This value is used as a minimum memory distance for all
              :   ///                     replicas. For example, by setting this to DISK_LOCAL, all
              :   ///                     cached replicas will be treated as if they were not cached,
              :   ///                     but local disk replicas. This can help prevent hot-spots by
              :   ///                     spreading the assignments over more replicas. Allowed values
              :   ///                     are CACHE_LOCAL, DISK_LOCAL and REMOTE.
              :   ///
              :   /// disable_cached_reads: Setting this value to true is equivalent to setting
              :   ///                       replica_preference to DISK_LOCAL and takes precedence over
              :   ///                       replica_preference.
              :   ///
              :   /// schedule_random_replica: When equivalent backends are found for a scan range (same
              :   ///                          memory distance, same amount of assigned work), then
              :   ///                          the first one will be picked deterministically. This aims
              :   ///                          to make better use of OS buffer caches, but can lead to
              :   ///                          performance bottlenecks on single hosts. Setting this
              :   ///                          option to true will randomly change the order in which
              :   ///                          equivalent replicas are picked for different plan nodes.
              :   ///                          This helps to compute a more even assignment, on the
              :   ///                          downside of increased memory usage for OS buffer caches.
> It be useful to mention the defaults of these query options here.
Done


Line 380:   /// Add a single assignment to 'assignment'.
> provide complete comment (at least point out how output vars are updated an
Done. As a class member method it will also affect member variables of the 
class (such as assignment_heap_ and assignment_byte_counters_), but I'll 
assume those are side effects we generally don't document.


Line 434:   /// backend_ctx:        Provides information on backends and the hosts they run on.
> at this point i already forgot whether this was config information about ba
Done


Line 445:   const IpAddr* SelectRemoteBackendHost(const AssignmentCtx& assignment_ctx, BackendCtx* backend_ctx);
> long line
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2200
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I044f83806fcde820fcb38047cf6b8e780d803858
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-Reviewer: anujphadke <apha...@cloudera.com>
Gerrit-HasComments: Yes
