[ https://issues.apache.org/jira/browse/LUCENE-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604279#comment-16604279 ]
Ignacio Vera commented on LUCENE-7862: -------------------------------------- I have been playing with this approach to see if we can make the BKD tree more efficient. I have done some benchmarks that confirm a big boost in performance whenever there is correlation between the dimensions so it is particularly good for ranges. In addition this boost quite a bit the newly introduced approach for indexing shapes in LUCENE-8396. Maybe we should only ad this extra information to the index when number of dimensions > 1. Here are a few benchmarks: 1) Ranges: Using the data points from GeoBench, the ranges have been created as following: {code:java} double lat = Double.parseDouble(parts[1]); double lon = Double.parseDouble(parts[2]); double latEnd = lat + 0.1 * Math.abs(lon); double lonEnd = lon + 0.1 * Math.abs(lat);{code} Where latitude is used in even dimension and longitude in uneven ones. IT = Index time (sec) FM = Force merge time (sec) IS = Index size (GB) RH = Reader heap (MB || Approach|| Dimensions||IT Dev||IT Base||IT Diff||IFM Dev||FM Base||FM Diff||IS Dev||IS Base||IS Diff||RH Dev||RH Base||IRH Diff|| |ranges|1|102.7s|101.5s|1%|0.0s|0.0s|0%|0.81|0.81|0%|0.80|0.80|-0%| |ranges|2|96.0s|94.4s|2%|0.0s|0.0s|0%|1.59|1.58|0%|1.02|1.03|-1%| |ranges|3|93.1s|84.6s|10%|0.0s|0.0s|0%|2.29|2.29|0%|1.18|1.01|17%| |ranges|4|101.0s|91.0s|11%|0.0s|0.0s|0%|3.07|3.05|0%|1.04|1.12|-7%| HS = Hits per second QPS= Queries per second Hits = Total hits ||Approach|| Dimensions||HS Dev||HS Base||HS Diff||QPS Dev||QPS Base||QPS Diff||Hits Dev||Hits Base||Hits Diff|| |ranges|1|131.17|116.20|13%|10.81|9.58|13%|2730282630|2730282630|0%| |ranges|2|34.07|10.83|215%|16.93|5.38|215%|452860046|452860046|0%| |ranges|3|34.41|3.83|799%|17.10|1.90|799%|452860046|452860046|0%| |ranges|4|26.01|3.27|696%|12.92|1.62|696%|452860046|452860046|0%|. | 3) GeoBench: comparison of Geo benchmark using points (LatLonPoint), geo3d (Geo3DPoint) and shapes (LaLonShape) IT = Index time (sec) FM = Force merge time (sec) IS = Index size (GB) RH = Reader heap (MB || Approach||IT Dev||IT Base||IT Diff||IFM Dev||FM Base||FM Diff||IS Dev||IS Base||IS Diff||RH Dev||RH Base||IRH Diff|| |points|103.3s|100.0s|3%|0.0s|0.0s|0%|0.51|0.51|0%|0.61|0.61|0%| |geo3d|105.6s|102.9s|3%|0.0s|0.0s|0%|0.72|0.72|0%|0.62|0.62|-0%| |shapes|108.3s|100.6s|8%|0.0s|0.0s|0%|1.31|1.30|0%|0.87|0.87|-0%| HS = Hits per second QPS= Queries per second Hits = Total hits ||Approach|| Shape||HS Dev||HS Base||HS Diff||QPS Dev||QPS Base||QPS Diff||Hits Dev||Hits Base||Hits Diff|| |points|polyRussia|17.19|17.19|0%|4.90|4.90|0%|3508846|3508846|0%| |points|polyMedium|9.51|9.26|3%|116.51|113.40|3%|2693559|2693559|0%| |points|distance|77.56|77.33|0%|45.57|45.43|0%|382961957|382961957|0%| |points|nearest 10|0.05|0.05|-0%|4651.95|4664.57|-0%|60844404|60844404|0%| |points|box|81.95|83.07|-1%|83.39|84.53|-1%|221118844|221118844|0%| |points|poly 10|80.37|79.78|1%|50.83|50.45|1%|355809475|355809475|0%| |points|sort|33.25|31.26|6%|33.83|31.81|6%|221118844|221118844|0%| |geo3d|polyRussia|0.51|0.50|2%|0.15|0.14|2%|3508671|3508671|0%| |geo3d|polyMedium|0.42|0.43|-0%|5.20|5.23|-0%|2693545|2693545|0%| |geo3d|distance|64.36|63.47|1%|37.77|37.25|1%|383371884|383371884|0%| |geo3d|box|49.29|51.60|-4%|50.15|52.50|-4%|221118844|221118844|0%| |geo3d|poly 10|39.40|39.43|-0%|24.91|24.93|-0%|355855227|355855227|0%| |shapes|polyRussia|2.81|0.31|819%|0.80|0.09|819%|3508846|3508846|0%| |shapes|polyMedium|0.52|0.08|539%|6.38|1.00|539%|2693559|2693559|0%| |shapes|box|11.07|1.46|661%|11.27|1.48|661%|221118844|221118844|0%| |shapes|poly 10|16.83|1.46|1052%|10.64|0.92|1052%|355809475|355809475|0%| > Should BKD cells store their min/max packed values? > --------------------------------------------------- > > Key: LUCENE-7862 > URL: https://issues.apache.org/jira/browse/LUCENE-7862 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Priority: Minor > Attachments: LUCENE-7862.patch, LUCENE-7862.patch > > > The index of the BKD tree already allows to know lower and upper bounds of > values in a given dimension. However the actual range of values might be more > narrow than what the index tells us, especially if splitting on one dimension > reduces the range of values in at least one other dimension. For instance > this tends to be the case with range fields: since we enforce that lower > bounds are less than upper bounds, splitting on one dimension will also > affect the range of values in the other dimension. > So I'm wondering whether we should store the actual range of values for each > dimension in leaf blocks, this will hopefully allow to figure out that either > none or all values match in a block without having to check them all. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org