Thanks for all the useful information, Dan.

You are correct; this is just a poor man's auto-increment primary key. Fortunately, in my case my code is the only writer to this table, so I can, for instance, synchronize the primary key value among the many threads writing to it (though for the same reason I had to rule out Sqoop or any MapReduce-based approach, because of the performance impact of synchronizing a value among many hosts).
The main issue I am trying to solve here is to quickly get the next sequence number when starting a new batch (i.e. not for each row), so while having an efficient algorithm was not really a 'must have' in my case, I couldn't tolerate something that is O(N) (see below) as opposed to an O(log N) solution.
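
Just to make the thread-synchronization part concrete, this is roughly all it takes inside the process - one shared counter seeded once at startup with max(PK)+1, and each batch reserving a contiguous block of values (a sketch; the names are mine, not from any library):

import java.util.concurrent.atomic.AtomicLong;

public class SequenceBlock {
    // Shared across all writer threads; seeded once at startup with
    // max(PK) + 1, where maxPk comes from whatever max(PK) lookup is
    // used (e.g. the bisection algorithm discussed below).
    private final AtomicLong next;

    public SequenceBlock(long maxPk) {
        this.next = new AtomicLong(maxPk + 1);
    }

    // Each thread reserves a contiguous block of sequence numbers per
    // batch, so the shared counter is touched once per batch, not once
    // per row; returns the first PK of the reserved block.
    public long reserve(long batchSize) {
        return next.getAndAdd(batchSize);
    }
}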

My post yesterday didn't make it clear, but yesterday morning I did implement the whole bisection algorithm I described; it's about 60 lines of Java code (that's why I said it seemed more complicated than it should be) and it does seem to behave as one would expect.
I still don't have the whole table loaded (due to an unrelated problem with the way Impala stores timestamps in Kudu, but that's a story for another time...), but I ran a couple of tests on the very same table, computing min(PK) and max(PK) first when it had about 100M rows and again when it had twice as many (about 200M) - these numbers are for a Kudu installation with 12 tablet servers.
In the results below, PKMax1 uses the standard O(N) full table scan algorithm, while PKMax3 uses the bisection+lowerBound() algorithm (PKMin is there as a reference for the basic 'setup' time):

- with 107M rows:

./run PKMin -t testtable
PK min: 1
Elapsed time (ms): 284

./run PKMax1 -t testtable
PK max: 107038068
Elapsed time (ms): 3381

./run PKMax3 -t testtable
PK max: 107038068
Elapsed time (ms): 462

- with 214M rows:

./run PKMin -t testtable
PK min: 1
Elapsed time (ms): 292

./run PKMax1 -t testtable
PK max: 214076136
Elapsed time (ms): 6757

./run PKMax3 -t testtable
PK max: 214076136
Elapsed time (ms): 631

As you can see, the PKMax1 (standard full scan) results show the linear behavior of that algorithm, while PKMax3 (bisection) is much faster. Doubling the row count did increase the PKMax3 time by about 37% (462 ms to 631 ms), but with elapsed times under a second it is hard to tell whether that is due to other factors; I would have to run another couple of tests with, say, 400M and 800M rows to get a more accurate answer.

Last but not least, I'll look into the kudu-ts system - it definitely seems interesting.

Franco 

----- Original Message -----

From: "Dan Burkert" <danburk...@apache.org> 
To: user@kudu.apache.org 
Sent: Thursday, December 14, 2017 5:02:01 PM 
Subject: Re: Efficient way of computing max(PK) in Kudu 

Hi Franco, 

Great question, and I think this gets at a deeper use case that Kudu could really excel at but currently doesn't have the full set of required features to support. To your original question: you've pretty much covered all of the bases. Kudu currently doesn't have an efficient way to search backwards. I agree with all of the conclusions you drew regarding manual binary search; it's probably feasible, but obviously not scalable if you have to do it for every write, or maybe even every batch. If you only have to do it every time you restart your single writer process, then maybe it's not so bad.

As far as the bigger picture, I think this is a symptom of the fact that Kudu doesn't currently make it easy to work with data that doesn't have a natural primary key. It sounds like for your use case an auto-increment column would be ideal. Auto-increment is tricky when combined with sharded/partitioned tables, but I think it's solvable. One thing we've come up against as well is data sets where many columns need to be shoved into the PK in order to make it unique. There's definitely overhead in having big multi-column string PKs in Kudu, so an auto-increment feature would be a big win for these datasets as well. KUDU-1945 is more or less our tracking ticket for this idea.

One thing I'll mention, mostly as a 'food for thought' type idea, is how the experimental kudu-ts system works. kudu-ts supports multi-attribute indexing, and in order to do so it builds an index table which is effectively a linear-probing, open-addressed hash map inside a Kudu table. The indexed attributes are hashed, and then inserted into the 'tagsets' table with the hash value as the PK. If there's a collision on insert, it does linear probing. Even though this is slow on writes, it works out really well for Kudu's read patterns; you just do a scan with a limit of 10 or 20, and it's virtually guaranteed that the row you're looking for will be in that result set. Now, I only bring this up because it's an interesting thought experiment; we haven't done any scale tests, ymmv, etc etc.
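
A rough sketch of what that read path looks like with the Java client (hypothetical schema with an INT64 'id' PK and a BINARY 'tagset' column; not the actual kudu-ts code):

import java.util.Arrays;
import org.apache.kudu.client.*;

public class TagsetLookup {
    // Look up a serialized tagset by starting a scan at its hash value
    // and linearly probing forward through a bounded window of rows.
    static long findTagsetId(KuduClient client, KuduTable tagsets,
                             byte[] serializedTagset) throws Exception {
        long hash = Arrays.hashCode(serializedTagset) & 0x7fffffffL;
        PartialRow lower = tagsets.getSchema().newPartialRow();
        lower.addLong("id", hash);
        KuduScanner scanner = client.newScannerBuilder(tagsets)
            .lowerBound(lower)
            .limit(20)   // probe window; virtually always enough
            .build();
        try {
            while (scanner.hasMoreRows()) {
                for (RowResult row : scanner.nextRows()) {
                    if (Arrays.equals(row.getBinaryCopy("tagset"),
                                      serializedTagset)) {
                        return row.getLong("id");   // found the matching slot
                    }
                }
            }
            return -1;   // not in the probe window; the writer would insert it
        } finally {
            scanner.close();
        }
    }
}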

Sorry I don't have a great solution for your use case, but hopefully with some additional features Kudu will eventually get there. I'd be very interested to hear your results if you pursue the binary search solution.

- Dan 

On Wed, Dec 13, 2017 at 6:54 PM, Franco Venturi < fvent...@comcast.net > wrote: 

Scenario: Kudu table with a primary key (PK) made of just one column of type 
INT64 (i.e. Java long) 

Problem: Compute max(primary key) in an efficient way 

Discussion:
At work we have a Kudu table with billions of rows; its primary key is made of just one column of type INT64, and this column contains a sequence number increasing from 1 to N (the number of rows).
For this discussion I am going to assume that the design of this table is fixed and can't be changed.

I'd like to find an efficient way to compute the maximum of the primary key, so 
I know which sequence number I can start from when inserting another batch of 
rows in this table. 

Notice that the dual problem, i.e. computing min(PK), can be solved trivially in O(1) time, because in Kudu all the rows within a tablet are kept in primary key sorted order (see here:
https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_schema_design.html#concept_f4d_jsy_1z
) - a simple way to do it is to get a list of KuduScanTokens from a KuduScanTokenBuilder (with setProjectedColumnIndexes set to return just the 0th column, and limit set to 1), read the first row returned by each tablet in the list, and compute the minimum of these values.
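
A minimal sketch of that approach with the Java client (the master address and table name are placeholders):

import java.util.Collections;
import java.util.List;
import org.apache.kudu.client.*;

public class PKMin {
    public static void main(String[] args) throws Exception {
        try (KuduClient client = new KuduClient.KuduClientBuilder(
                 "kudu-master:7051").build()) {   // placeholder master address
            KuduTable table = client.openTable("testtable");

            // One scan token per tablet, projecting only the PK column
            // (index 0) and stopping after a single row per tablet.
            List<KuduScanToken> tokens = client.newScanTokenBuilder(table)
                .setProjectedColumnIndexes(Collections.singletonList(0))
                .limit(1)
                .build();

            long min = Long.MAX_VALUE;
            for (KuduScanToken token : tokens) {
                KuduScanner scanner = token.intoScanner(client);
                // Rows within a tablet are PK-sorted, so the first (and
                // only) row returned by each tablet is that tablet's minimum.
                while (scanner.hasMoreRows()) {
                    for (RowResult row : scanner.nextRows()) {
                        min = Math.min(min, row.getLong(0));
                    }
                }
                scanner.close();
            }
            System.out.println("PK min: " + min);
        }
    }
}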

In the case of finding the maximum value instead, the simplest approach would be to run a full table scan (using a KuduScanner, or a set of KuduScanTokens in parallel, with setProjectedColumnIndexes set to return just the primary key column), and find the maximum among all the values (or the maximum among the last values returned by each tablet). This approach however scales as O(N) and therefore takes a while to run when the table has several billion rows.
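
(For contrast, a single-scanner sketch of this baseline; client and table are obtained as in the min(PK) sketch above:)

// Baseline O(N) max(PK): scan the whole table, projecting only the
// PK column, and keep the largest value seen.
static long maxPkFullScan(KuduClient client, KuduTable table) throws Exception {
    KuduScanner scanner = client.newScannerBuilder(table)
        .setProjectedColumnIndexes(java.util.Collections.singletonList(0))
        .build();
    long max = Long.MIN_VALUE;
    while (scanner.hasMoreRows()) {
        for (RowResult row : scanner.nextRows()) {
            max = Math.max(max, row.getLong(0));
        }
    }
    scanner.close();
    return max;
}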

I also read the API documentation and the code to see if Kudu offered a way to scan the rows of a table backwards (i.e. in decreasing order), but I couldn't find one (I would be glad to be proven wrong on this).

After some thinking I came up with an algorithm that uses the lowerBound() method in KuduScannerBuilder together with bisection: given an interval of possible values where the primary key maximum could be, I create a KuduScannerBuilder with a lowerBound set to the value halfway between the two ends of the interval; if that KuduScanner returns at least one row (which can be checked in O(1) time), then the maximum must be somewhere between the halfway point and the upper end; on the other hand, if that KuduScanner returns no rows, then the maximum must be in the lower half.
I then repeat the same bisection on the selected half interval, and so on, as in the standard bisection algorithm. As you can easily see, this algorithm is about O(log2(N)).
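In Java client terms, the core of the idea looks something like the following sketch (same assumptions as the sketches above - an INT64 PK at column index 0; this is the shape of it, not my exact 60 lines):

// Returns true iff the table has at least one row with PK >= bound;
// the inclusive lowerBound() plus limit(1) makes this an O(1)-row check.
static boolean hasRowAtOrAbove(KuduClient client, KuduTable table, long bound)
        throws Exception {
    PartialRow lower = table.getSchema().newPartialRow();
    lower.addLong(0, bound);
    KuduScanner scanner = client.newScannerBuilder(table)
        .setProjectedColumnIndexes(java.util.Collections.singletonList(0))
        .lowerBound(lower)
        .limit(1)
        .build();
    try {
        while (scanner.hasMoreRows()) {
            if (scanner.nextRows().hasNext()) {
                return true;
            }
        }
        return false;
    } finally {
        scanner.close();
    }
}

// Invariant: some row has PK >= lo, and no row has PK >= hi,
// so max(PK) is always in [lo, hi). Terminates with lo == max(PK).
static long maxPkBisection(KuduClient client, KuduTable table,
                           long lo, long hi) throws Exception {
    while (hi - lo > 1) {
        long mid = lo + (hi - lo) / 2;
        if (hasRowAtOrAbove(client, table, mid)) {
            lo = mid;   // rows exist at or above mid
        } else {
            hi = mid;   // max must be below mid
        }
    }
    return lo;
}

(The initial [lo, hi) interval has to bracket the maximum; the tricks below are about finding it cheaply and about finishing the search with a plain scan.)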
I also added a couple of additional tricks to my code (both sketched below):
- since it is very unlikely that my maximum is in the range of the trillions or quadrillions (it is just a sequential number, i.e. it is the same as the number of rows I have), I run a first lowerBound() probing loop to determine the highest '1' bit in the maximum (i.e. I start with 1<<32, see if there's any row above that lower bound, and if there's none, try again with 1<<16, and so on), so that the bisection starts from a tight interval;
- since I imagine that creating KuduScanners (and closing them afterwards) is an "expensive" operation compared to just scanning a few rows, when the bisection interval shrinks below some predefined size (128 rows in my tests), I switch to a regular scan of this final interval and find the maximum among the values in it, in the same way as the standard approach described above.
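A sketch of both refinements on top of maxPkBisection() above (the bit probe here walks the bit positions upward one at a time, the simpler cousin of what I described; either way it is only a handful of cheap probes):

// Trick 1 (sketch): find a tight power-of-two interval before bisecting,
// instead of starting from [1, Long.MAX_VALUE). Assumes max(PK) < 1L<<62,
// which is plenty for a sequential row counter.
static long findUpperBound(KuduClient client, KuduTable table) throws Exception {
    int k = 0;
    while (k < 61 && hasRowAtOrAbove(client, table, 1L << (k + 1))) {
        k++;   // rows exist at or above 1<<(k+1); keep raising the bound
    }
    return 1L << (k + 1);   // no row has PK >= this value
}

// Trick 2 (sketch): when the interval is small, replace the remaining
// bisection steps with one ordinary bounded scan of [lo, hi).
static long maxPkInRange(KuduClient client, KuduTable table, long lo, long hi)
        throws Exception {
    PartialRow lower = table.getSchema().newPartialRow();
    lower.addLong(0, lo);
    PartialRow upper = table.getSchema().newPartialRow();
    upper.addLong(0, hi);
    KuduScanner scanner = client.newScannerBuilder(table)
        .setProjectedColumnIndexes(java.util.Collections.singletonList(0))
        .lowerBound(lower)
        .exclusiveUpperBound(upper)
        .build();
    long max = Long.MIN_VALUE;
    while (scanner.hasMoreRows()) {
        for (RowResult row : scanner.nextRows()) {
            max = Math.max(max, row.getLong(0));
        }
    }
    scanner.close();
    return max;
}

With these in place, the bisection loop's condition becomes 'while (hi - lo > 128)' followed by a final maxPkInRange(client, table, lo, hi), and the initial interval is [hi0 / 2, hi0) with hi0 = findUpperBound(client, table).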

In conclusion, the lowerBound()+bisection algorithm is definitely efficient (a few tests I ran confirmed that), but it seems more complicated than it perhaps should be, so I was wondering: am I missing something obvious? If any of you had to solve a similar problem in the past, how did you do it?

Also, I haven't looked at the source code for Impala, but I would like to know whether Impala uses any trick (or undocumented RPC call/parameter) when computing something like this, or for scanning the rows of a tablet backwards.

Franco 