[
https://issues.apache.org/jira/browse/LUCENE-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612826#comment-16612826
]
Nicholas Knize commented on LUCENE-8496:
----------------------------------------
Initial patch provided:
The lion's share of the changes are made to {{FieldType}}, {{BKDWriter}}, and
{{BKDReader}}.
* {{FieldType}} - split {{pointDimensionCount}} into two new integers that
define {{pointDataDimensionCount}} and {{pointIndexDimensionCount}}.
{{pointIndexDimensionCount}} must be <= {{pointDataDimensionCount}} and defines
the first {{n}} dimensions that will be used to build the index. The remaining
{{pointDataDimensionCount}} - {{pointIndexDimensionCount}} dimensions are
ignored while building (e.g., split/merge) the index. Getter and Setter utility
methods are added.
* {{BKDWriter}} - change {{writeIndex}} to encode and write {{numIndexDims}} in
the 2 most significant bytes of the integer that formerly stored {{numDims}}.
This provides simple backwards compatibility without requiring a change to
{{FieldInfoFormat}}. Indexing methods are updated to only use the first
{{numIndexDims}} dimensions while building the tree. Leaf nodes still use
{{numDataDims}} for efficiently packing and compressing the leaf level data
(data nodes).
* {{BKDReader}} - add version checking in the constructor to decode
{{numIndexDims}} and {{numDataDims}} from the packed dimension integer. Update
index reading methods to only look at the first {{numIndexDims}} while
traversing the tree. {{numDataDims}} are still used for decoding leaf level
data.
* API Changes - all instances of {{pointDimensionCount}} have been updated to
{{pointDataDimensionCount}} and {{pointIndexDimensionCount}} to reflect total
number of dimensions, and number of dimensions used for creating the index,
respectively.
* All POINT Tests and POINT based Fields have been updated to use the API
changes.
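As a rough illustration of the packed dimension integer described for
{{BKDWriter}} and {{BKDReader}} above, here is a minimal sketch. It is not the
code from the patch: the class name {{PackedDims}} and its methods are
hypothetical, and it assumes a legacy value (upper two bytes zero) means every
dimension is indexed, whereas the patch uses a version check in the reader.

```java
// Hypothetical sketch, not the patch itself: pack numIndexDims into the
// upper 2 bytes of the int that used to hold only numDims.
final class PackedDims {

  /** Packs both counts into one int: index dims in the high 16 bits, data dims in the low 16 bits. */
  static int pack(int numDataDims, int numIndexDims) {
    if (numDataDims < 1 || numDataDims > 0xFFFF) {
      throw new IllegalArgumentException("numDataDims out of range: " + numDataDims);
    }
    if (numIndexDims < 1 || numIndexDims > numDataDims) {
      throw new IllegalArgumentException("numIndexDims must be in [1, numDataDims]: " + numIndexDims);
    }
    return (numIndexDims << 16) | numDataDims;
  }

  /** Total number of dimensions stored per point. */
  static int dataDims(int packed) {
    return packed & 0xFFFF;
  }

  /**
   * Number of leading dimensions used to build and traverse the tree.
   * Assumption: an older index wrote plain numDims, so the high bytes are
   * zero and all dimensions count as indexed.
   */
  static int indexDims(int packed) {
    int high = packed >>> 16;
    return high == 0 ? (packed & 0xFFFF) : high;
  }
}
```

For example, {{PackedDims.pack(7, 4)}} round-trips to 7 data dimensions and 4
index dimensions, while a legacy value of 6 decodes as 6 and 6.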
Benchmarking
---
To benchmark the changes I updated {{LatLonShape}} (not included in this patch)
and ran benchmark tests both with and without selective indexing. The results
are below:
6 dimension encoded {{LatLonShape}} w/o selective indexing
------
INDEX SIZE: 1.2795778876170516 GB
READER MB: 1.7928361892700195
BEST M hits/sec: 11.67378231920028
BEST QPS: 6.8635445274291715 for 225 queries, totHits=382688713
7 dimension encoded {{LatLonShape}} w/ 4 dimension selective indexing
-------
INDEX SIZE: 2.1509012933820486 GB
READER MB: 1.8154268264770508
BEST M hits/sec: 17.018094815004627
BEST QPS: 10.005707519719927 for 225 queries, totHits=382688713
Throughput improves by roughly 46% (both M hits/sec and QPS), which is a little
better than the difference between searching a 4d range vs a 6d range. The
index size increased by about 68%, partly from using 7 dimensions instead of 6
and partly because I also switched over to a somewhat larger encoding size.
> Explore selective dimension indexing in BKDReader/Writer
> --------------------------------------------------------
>
> Key: LUCENE-8496
> URL: https://issues.apache.org/jira/browse/LUCENE-8496
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Nicholas Knize
> Priority: Major
> Attachments: LUCENE-8496.patch
>
>
> This issue explores adding a new feature to BKDReader/Writer that enables
> users to select fewer dimensions to be used for creating the BKD
> index than the total number of dimensions specified for field encoding. This
> is useful for encoding dimensional data that is used for interpreting the
> encoded field data but unnecessary (or not efficient) for creating the index
> structure. One such example is {{LatLonShape}} encoding. The first 4
> dimensions may be used to efficiently search/index the triangle using its
> precomputed bounding box as a 4D point, and the remaining dimensions can be
> used to encode the vertices of the tessellated triangle. This causes BKD to
> act much like an R-Tree for shape data where search is distilled into a 4D
> point (instead of a more expensive 6D point) and the triangle is encoded
> using a portion of the remaining (non-indexed) dimensions. Fields that use
> the full data range for indexing are not impacted and behave as they normally
> would.
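To make the {{LatLonShape}} example in the description concrete, here is a
small sketch of the idea. It is an assumption-laden illustration, not the
actual {{LatLonShape}} encoding: the class {{TriangleEncodingSketch}} is
hypothetical, it uses plain doubles, and it stores one value per vertex
coordinate (10 values in total) for readability rather than the compact
7 dimension layout used in the benchmark.

```java
// Hypothetical sketch: the first 4 values are the triangle's bounding box
// (the "index" dimensions); the rest carry the vertices (data only).
final class TriangleEncodingSketch {

  /** Returns [minLat, minLon, maxLat, maxLon, lat0, lon0, lat1, lon1, lat2, lon2]. */
  static double[] encode(double[] lats, double[] lons) {
    if (lats.length != 3 || lons.length != 3) {
      throw new IllegalArgumentException("expected exactly 3 vertices");
    }
    double minLat = Math.min(lats[0], Math.min(lats[1], lats[2]));
    double maxLat = Math.max(lats[0], Math.max(lats[1], lats[2]));
    double minLon = Math.min(lons[0], Math.min(lons[1], lons[2]));
    double maxLon = Math.max(lons[0], Math.max(lons[1], lons[2]));
    return new double[] {
      minLat, minLon, maxLat, maxLon,   // indexed: bounding box as a 4D point
      lats[0], lons[0], lats[1], lons[1], lats[2], lons[2] // not indexed: vertices
    };
  }

  /** Coarse filter that inspects only the 4 bounding-box values, never the vertex values. */
  static boolean bboxIntersects(double[] encoded,
                                double qMinLat, double qMinLon,
                                double qMaxLat, double qMaxLon) {
    return encoded[0] <= qMaxLat && encoded[2] >= qMinLat
        && encoded[1] <= qMaxLon && encoded[3] >= qMinLon;
  }
}
```

A query would first apply the bounding-box filter and only decode the vertex
values for candidates that pass, which is what makes the tree behave like an
R-Tree for shape data.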