Alex Behm has posted comments on this change.

Change subject: IMPALA-5309: Adds TABLESAMPLE clause for HDFS table refs.
......................................................................

Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6868/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 1961:     List<Pair<Long, FileDescriptor>> allFiles =
> this is an expensive list, and i don't think you need it. you can achieve t

1. I'm happy to try this proposal, but a key element is missing. How do you propose to avoid selecting the same file twice? A retry loop? The purpose of this list is to efficiently avoid selecting the same file twice regardless of the sample percent. I agree the object generation is probably bad. We can avoid that by using two arrays.

2. As for returning a map instead: I'm happy to do that, but I don't really see why the indirection over ids/indexes makes sense. What do we gain from this indirection? We would generate more and bigger objects (a map plus sets versus a list), and we would need to modify computeScanRanges() to probe that map (or change computeScanRanges() entirely).

-- 
To view, visit http://gerrit.cloudera.org:8080/6868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-HasComments: Yes
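
For illustration only, the two-array idea in point 1 could look roughly like the sketch below. It is not the code in the patch: the class name FileSampleSketch, the method sampleFiles(), and the long[][] input (file lengths per partition) are all hypothetical stand-ins for the real catalog structures. The sketch keeps two parallel int arrays (partition id and file index) and, after each random draw, overwrites the drawn slot with the last live entry, so a file cannot be picked twice and no retry loop is needed, whatever the sample percent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; names and input shape are assumptions, not Impala's API.
class FileSampleSketch {
  // Returns {partitionId, fileIdx} pairs whose total size reaches at least
  // percentBytes percent of all bytes. Each file is selected at most once.
  static List<int[]> sampleFiles(long[][] fileLengths, int percentBytes, long seed) {
    int totalFiles = 0;
    long totalBytes = 0;
    for (long[] partition : fileLengths) {
      totalFiles += partition.length;
      for (long len : partition) totalBytes += len;
    }

    // Two parallel primitive arrays instead of a list of Pair objects.
    int[] partitionIds = new int[totalFiles];
    int[] fileIdxs = new int[totalFiles];
    int pos = 0;
    for (int p = 0; p < fileLengths.length; ++p) {
      for (int f = 0; f < fileLengths[p].length; ++f) {
        partitionIds[pos] = p;
        fileIdxs[pos] = f;
        ++pos;
      }
    }

    long targetBytes = (long) Math.ceil(totalBytes * (percentBytes / 100.0));
    Random rnd = new Random(seed);
    List<int[]> result = new ArrayList<>();
    long selectedBytes = 0;
    int remaining = totalFiles;  // entries [0, remaining) are still selectable
    while (selectedBytes < targetBytes && remaining > 0) {
      int i = rnd.nextInt(remaining);
      int p = partitionIds[i];
      int f = fileIdxs[i];
      result.add(new int[] {p, f});
      selectedBytes += fileLengths[p][f];
      // Move the last live entry into the drawn slot and shrink the live
      // range, so the drawn file is never considered again.
      partitionIds[i] = partitionIds[remaining - 1];
      fileIdxs[i] = fileIdxs[remaining - 1];
      --remaining;
    }
    return result;
  }
}
```

Each draw costs O(1) and the only per-file allocations are for the files actually selected, which is the object-generation concern raised above. Whether the result is better returned as a flat list or as a map keyed by partition is the open question in point 2.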