[kudu-CR](gh-pages) Blogpost describing index skip scan optimization.

Andrew Wong (Code Review) Wed, 29 Aug 2018 12:54:51 -0700

Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11263 )


Change subject: Blogpost describing index skip scan optimization.
......................................................................


Patch Set 4:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/11263/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11263/4//COMMIT_MSG@6
PS4, Line 6:
           : Blogpost describing index skip scan optimization.
           :
In reviewing blogposts, it's generally helpful to post a link to a rendered 
version, e.g. posting to your own github, which will automatically render the 
*.md


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md
File _posts/2018-08-17-index-skip-scan-optimization-in-kudu.md:

http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@9
PS4, Line 9: does not contain the first column of the composite (multi-column) 
primary key.
This already seems like it's going a bit too far into implementation details. 
Maybe instead note something like: 'I optimized the Kudu scan-path by 
implementing a technique called an "index-skip scan."'


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@11
PS4, Line 11: Example
            : ==========
Probably don't need this.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@14
PS4, Line 14: <!--more-->
What do these do?


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@31
PS4, Line 31: as B-Tree
nit: "a B-tree", and no need to capitalize "Tree" below


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@33
PS4, Line 33: ("host")
nit: perhaps using ``s would be more reasonable here (ie. `host`). Then it'd be 
formatted as monospace font. Here and elsewhere


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@32
PS4, Line 32: The data is sorted lexicographically starting from the leftmost 
primary key column and stored in the B-Tree leaf nodes.
            : Therefore, when the user query contains the first key column 
("host"), Kudu uses the primary key range push down
            : operation to optimize the scan time.
IMO this doesn't convey the idea that the data is sorted by the composite of 
all primary key columns. Also not sure what you mean by "primary key range push 
down operation".

Also, overall for this project, I think it's always been helpful to 
think/reason about it with some example data. I think having an dummy dataset 
of a handful of rows with a decent number of prefix keys would make this 
blogpost more understandable to the layperson. It'd also serve as a concrete 
example of why we can't use the PK index if the predicate doesn't contain the 
first key.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@39
PS4, Line 39:
nit: here and elsewhere, no need for spaces before punctuation marks


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@46
PS4, Line 46: form a prefix.
The crux of this is the prefixes are also sorted, and all rows of a given 
prefix are also sorted by the remaining PK columns. A prefix with no other 
properties isn't necessarily useful, so without calling that out, it might be 
hard to see why having these prefixes are helpful.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@51
PS4, Line 51: For example, consider the query :
            : {% highlight SQL %}
            : select clusterid from metrics where tstamp = 100;
            : {% endhighlight %}
            :
            : ![png]({{ site.github.url 
}}/img/index-skip-scan/skip-scan-example-table.png){:height="500px" 
width="500px" .img-responsive}
            : *Sample rows of Table "metrics" (sorted by key columns for 
simplicity).*
Ah, so you _do_ have an example! I think it'd be helpful setting this up up 
front, saying, here is how data is organized in Kudu today, and based on that, 
why it's not straightforward to use the index when there aren't predicates on 
the first primary key, etc.

Also isn't the data _actually_ stored like this? I.e. not for simplicity, but 
this actually represents how Kudu would see the data, doesn't it?


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@61
PS4, Line 61: host = "helium"
nit: backticks here too and everywhere else that has code snippets


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@62
PS4, Line 62: popularly known as index skip scan optimization can skip all the 
rows for which host = "helium" and tstamp != 100
nit: it's great to get to the point that we call this a "skip scan". To drive 
the point home, it might help to say what the "skip" and "scan" parts of this 
are.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@68
PS4, Line 68: (upto 13x)
It's not up to 13x, but it should be arbitrarily optimizable based on the 
cardinality, no? Also "up to", here and elsewhere


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@69
PS4, Line 69: prefix column(s)
nit: similar to my comments in the patch, it might help defining these up front 
where you're describing the approach; that way it's clear what this is 
referring to.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@73
PS4, Line 73: Based on experiments on upto 10 million rows per tablet, we 
decided to disable skip scan when the number of seeks
            : for distinct prefix column values exceeds 
![](https://latex.codecogs.com/gif.latex?%5Csqrt%7B%5C%23total%20rows%7D).
This could use some explanation as to why sqrt(total_num_rows) was chosen. Also 
it's not the total number of rows in the table, right?


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@82
PS4, Line 82: Conclusion
            : ==========
I generally see the conclusion as a place to summarize, in addition to talking 
about next steps, closing thoughts, etx.



--
To view, visit http://gerrit.cloudera.org:8080/11263
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: gh-pages
Gerrit-MessageType: comment
Gerrit-Change-Id: I2250652dcba3d1b0a06f1ffb7f23c11bf533d35e
Gerrit-Change-Number: 11263
Gerrit-PatchSet: 4
Gerrit-Owner: Anupama Gupta <ag3...@columbia.edu>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Aug 2018 19:54:32 +0000
Gerrit-HasComments: Yes

[kudu-CR](gh-pages) Blogpost describing index skip scan optimization.

Reply via email to