Alex Behm has posted comments on this change.

Change subject: IMPALA-3905: Add single-threaded scan node.
......................................................................
Patch Set 1:

(28 comments)

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-avro-scanner.h
File be/src/exec/hdfs-avro-scanner.h:

PS1, Line 93: bool add_batches_to_queue
> This seems somewhat redundant with the 'scan_node' parameter. Can we add a
Good idea. That helped me also get rid of the add_batches_to_queue flag in the
scanners.

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 678: DCHECK(value_ctx->is_clone());
> This DCHECK needs a little explanation.
Done

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 69: HdfsFileDesc(const std::string& filename)
> Space before function?
Not following the first part (space where?). Moved the c'tor up.

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-mt.cc
File be/src/exec/hdfs-scan-node-mt.cc:

Line 57:
> We aren't accounting time in Open() in the same way we do in the non-mt sca
Done

Line 61: if (!initial_ranges_issued_) IssueInitialScanRanges(state);
> Add a brief comment about why we don't do this in Open() - I see there's a
Done

Line 63: if (ReachedLimit()) {
> Is it a problem that we don't call StopAndFinalizeCounters() on this code p
It will eventually be called in Close() in any case, so I don't think it's
necessary to make sure it gets called everywhere before returning. For
consistency I added the call for all *eos==true paths.

Line 72: scanner_->Close(row_batch);
> We can reset 'scanner_' at this point, right?
Yes. Done.

Line 75: RETURN_IF_ERROR(runtime_state_->io_mgr()->GetNextRange(reader_context_, &scan_range_));
> Long line.
Done

PS1, Line 81: reinterpret_cast
> Not your change but I think these casts of meta_data() here and should prob
Fixed everywhere.
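To make the single-threaded GetNext() control flow discussed above concrete (issuing initial scan ranges lazily, resetting the scanner as soon as it hits eos, and finishing on limit or range exhaustion), here is a simplified, self-contained sketch. All types and names here are illustrative stand-ins, not the actual Impala classes, and the toy `Scanner` just counts down rows per range:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for Impala's types; names are illustrative only.
struct RowBatch { int num_rows = 0; };

struct Scanner {
  int remaining;  // rows left in this scanner's range (toy model)
  void GetNext(RowBatch* batch, bool* eos) {
    batch->num_rows = remaining > 10 ? 10 : remaining;
    remaining -= batch->num_rows;
    *eos = (remaining == 0);
  }
};

// Models the control flow from the review: ranges are issued from GetNext()
// rather than Open(), one scanner is driven at a time and reset on its eos,
// and the node stops once the limit is reached or no ranges remain.
class ScanNodeMt {
 public:
  ScanNodeMt(std::vector<int> ranges, int limit)
      : ranges_(std::move(ranges)), limit_(limit) {}

  void GetNext(RowBatch* batch, bool* eos) {
    if (!initial_ranges_issued_) {
      // Deferred from Open() to GetNext(), as in the patch under review.
      initial_ranges_issued_ = true;
    }
    *eos = false;
    while (true) {
      if (scanner_ == nullptr) {
        if (next_range_ == ranges_.size()) { *eos = true; return; }
        scanner_.reset(new Scanner{ranges_[next_range_++]});
      }
      bool scanner_eos = false;
      scanner_->GetNext(batch, &scanner_eos);
      if (scanner_eos) scanner_.reset();  // close the scanner as soon as it is done
      num_rows_returned_ += batch->num_rows;
      if (limit_ >= 0 && num_rows_returned_ >= limit_) {
        batch->num_rows -= num_rows_returned_ - limit_;  // cap at the limit
        num_rows_returned_ = limit_;
        *eos = true;
        return;
      }
      if (batch->num_rows > 0) return;
    }
  }

  int num_rows_returned() const { return num_rows_returned_; }

 private:
  std::vector<int> ranges_;  // row counts per scan range (toy model)
  int limit_;
  std::size_t next_range_ = 0;
  bool initial_ranges_issued_ = false;
  std::unique_ptr<Scanner> scanner_;
  int num_rows_returned_ = 0;
};
```

The real code additionally does I/O-manager calls, counter finalization on the *eos==true paths, and error propagation, which this sketch omits.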
Line 93: num_rows_returned_ += row_batch->num_rows();
> I think we should rework the control flow to avoid executing this big block
Done

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-mt.h
File be/src/exec/hdfs-scan-node-mt.h:

Line 52: bool scanner_eos_;
> This is the pattern we use elsewhere in the code, but IMO this would be sim
Done

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 194: SCOPED_TIMER(runtime_profile_->total_time_counter());
> I guess we didn't track this before.
Looks like it. Should be fixed now, but please double-check.

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

Line 60: /// An HdfsScanNode may expect to receive runtime filters produced elsewhere in the plan
> It seems like some of this comment (at least about the filters) should go i
Good point. Done.

Line 108: friend class ScannerContext; // for accessing done_
> Any reason we can't just add an accessor?
Done

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

Line 80: // Check valid combinations of add_batches_to_queue_ and scan node type.
> I made a related comment earlier, but would it work to determine 'add_batch
Done. Much better, thanks.

PS1, Line 119: iter
> It's a map entry, not an iterator.
Done

PS1, Line 143: iter
> This is a map entry, not an iterator. 'entry'?
Done

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

Line 208: /// Holds memory for template tuples.
> It would be helpful to document the lifetime of memory in the pool.
Commented on the lifetime. I'm open to renaming the pool. Seems like we'd have
to call it tuple_data_pool_ or something like that, so it may be confusing
with the pool in the row batch. Other ideas?
Line 209: boost::scoped_ptr<MemPool> template_tuple_pool_;
> It's also a little unfortunate we have to use a whole MemPool for a single
It's not just a single tuple; it can contain several template tuples (nested
types), and it also contains default values, e.g., for Avro, where every file
may have different default values. Still, I agree it's unfortunate we need to
use a new pool, but I could not find an existing pool in this class whose
memory has the same lifetime as what we need. We can still go back to having a
single pool in the scan node and protect it with a lock, if you prefer that.

Line 219: /// file. They are owned by the HDFS scan node, although each scanner has its own
> This ownership has changed, right?
Done

Line 228: int tuple_byte_size_;
> const?
Done

Line 231: Tuple* tuple_;
> This is only used in a couple of subclasses. Maybe move duplicate the membe
A scanner can be used either by a single-threaded or a multi-threaded scan
node. Depending on which mode it is used in, some members are used and some
are not. The tuple_, batch_, tuple_mem_, etc. members fall into that category.
In other words, if you call ProcessSplit() on the scanner, then those members
are used. If you drive the scanner via GetNext(), then those members are not
interesting. But the same scanner may support both interfaces, so I don't see
why we'd put in the effort to separate the single-threaded stuff from the
multi-threaded stuff (since the same scanner needs to support both
interfaces).

Line 238: RowBatch* batch_;
> This is only used for non-MT scanners right?
Yes.

Line 247: int32_t num_null_bytes_;
> const? And move next to tuple_byte_size_?
Done

Line 256: boost::scoped_ptr<Codec> decompressor_;
> It looks like this decompressor plus the data_buffer_pool_ is only used by
As far as I can tell, this is used by pretty much all scanners except Parquet.
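The dual-interface point above (one scanner class serving both the multi-threaded ProcessSplit() path and the single-threaded GetNext() path, with members like batch_ only meaningful on one path) can be sketched as follows. This is a toy model with made-up names, not the actual Impala scanner API:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for Impala's RowBatch; rows are plain ints here.
struct RowBatch { std::vector<int> rows; };

class Scanner {
 public:
  explicit Scanner(std::vector<int> data) : data_(std::move(data)) {}

  // Multi-threaded path: the scanner materializes into batches it owns and
  // pushes completed batches onto the scan node's queue. In the real code,
  // members like batch_ / tuple_ / tuple_mem_ are only used on this path.
  void ProcessSplit(std::queue<RowBatch>* queue) {
    RowBatch batch;
    for (int row : data_) {
      batch.rows.push_back(row);
      if (batch.rows.size() == kBatchSize) {
        queue->push(std::move(batch));
        batch = RowBatch();
      }
    }
    if (!batch.rows.empty()) queue->push(std::move(batch));
  }

  // Single-threaded path: the scan node supplies the output batch and pulls
  // one batch at a time; the scanner-owned batch members go unused.
  void GetNext(RowBatch* batch, bool* eos) {
    while (pos_ < data_.size() && batch->rows.size() < kBatchSize) {
      batch->rows.push_back(data_[pos_++]);
    }
    *eos = (pos_ == data_.size());
  }

 private:
  static constexpr std::size_t kBatchSize = 4;
  std::vector<int> data_;
  std::size_t pos_ = 0;  // only advanced by the GetNext() interface
};
```

The same instance is never driven through both interfaces at once; which set of members is live depends on which scan node type constructed the scanner.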
Line 270: typedef int (*WriteTuplesFn)(HdfsScanner*, MemPool*, TupleRow*, int, FieldLocation*,
> Similarly this WriteTuple stuff is only used by some subclasses, would be n
Added a TODO, hope that's ok for now. I agree that it might be worth another
pass over all members, but that seems kind of unrelated to this change (I did
not change the scanner class hierarchy).

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 316: if (hdfs_scan_node == NULL) return false;
> To check my understanding - this is no longer relevant because in the singl
That's correct. In the multi-threaded world this was used to synchronize the
cancellation across the threads, i.e., to stop scanning if the scanner context
has been cancelled. I did forget an explicit check for cancellation in the
scan node, though. Added it together with QueryMaintenance() etc.

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

Line 298: /// If true, the ScanNode has been cancelled and the scanner thread should finish up
> Update this comment to explain single-threaded behaviour?
Done

--
To view, visit http://gerrit.cloudera.org:8080/4113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I98cc7f970e1575dd83875609985e1877ada3d5e0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes