Todd Lipcon has posted comments on this change.

Change subject: Initial scan tokens design doc
......................................................................

Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/2443/3//COMMIT_MSG
Commit Message:

Line 7: Initial scan tokens design doc
mind throwing the JIRA number in here?

http://gerrit.cloudera.org:8080/#/c/2443/3/docs/design-docs/scan-tokens.md
File docs/design-docs/scan-tokens.md:

Line 20: split Kudu tables into logical sections, so that computation can be distributed
physical sections, not logical, right?

Line 36: defined serialization format so that tokens may be serialized and deserialized
well defined (but opaque to the caller)

Line 43: location hint, or a hint for every replica?
I think multiple hints, but with a preference for the current leader? probably depends on the consistency mode

Line 47: 2) How should scan tokens handle going stale WRT tablet location changes and
perhaps we can provide an API on a scanner like 'IsLocal()'? that's also useful for metrics (e.g. Impala and MR like to expose counters of how many bytes were read locally vs remote, etc). The API might be slightly subtle since it could change as the scanner moves cross-tablet, but I think that's the best we can do.

Another thought: we could offer a 'refresh' API or a 'check current' type API which would re-contact the master and verify that things haven't changed? though I still think some indication of "isLocal" is useful

Line 54: point, but it will be an important consideration once that feature lands.
I vote for the partition key range, since that will support splits at some point in the future without any changes

Line 60: client could.
yes I think this is a very useful API -- right now we ask people to split into many tablets per TS to get scan parallelism, but if we could subdivide our scan ranges in Impala/Spark/etc, then this wouldn't be nearly as important. I don't think we should implement it right off the bat, but working it into our thinking is a good idea.

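For illustration only, a minimal Python sketch of what naive client-side subdivision of a scan range could look like. It assumes integer keys in a half-open range and an even key distribution; `subdivide_range` is a hypothetical helper, not part of any Kudu API.

```python
from typing import List, Tuple


def subdivide_range(lo: int, hi: int, parts: int) -> List[Tuple[int, int]]:
    """Split the half-open key range [lo, hi) into contiguous sub-ranges.

    Hypothetical helper: returns at most `parts` non-empty, equal-width
    (to within one key) sub-ranges that together cover [lo, hi).
    """
    if parts < 1:
        raise ValueError("parts must be >= 1")
    if hi <= lo:
        return []  # empty input range, nothing to scan
    # Never create more sub-ranges than there are distinct keys.
    parts = min(parts, hi - lo)
    # The i-th boundary sits at lo + i/parts of the total width, rounded down.
    bounds = [lo + (hi - lo) * i // parts for i in range(parts + 1)]
    # Pair consecutive boundaries into [start, end) sub-ranges.
    return list(zip(bounds[:-1], bounds[1:]))
```

Each sub-range could then back its own scanner. Equal-width ranges only balance the work if keys are evenly distributed, which a client generally cannot know.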
As for whether Kudu can do better than the client -- yes, I think the tablet server has enough data to suggest subdivisions - it can look at the current rowset min/max boundaries and sizes to get a reasonable estimate of PK distribution for example.

-- 
To view, visit http://gerrit.cloudera.org:8080/2443
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id208cecababf15e1671a01a219d4599adfcd4163
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
