[ 
https://issues.apache.org/jira/browse/LUCENE-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612826#comment-16612826
 ] 

Nicholas Knize commented on LUCENE-8496:
----------------------------------------

Initial patch provided:

The lion's share of the changes are made to {{FieldType}}, {{BKDWriter}}, and 
{{BKDReader}}.

* {{FieldType}} - split {{pointDimensionCount}} into two new integers that 
define {{pointDataDimensionCount}} and {{pointIndexDimensionCount}}. 
{{pointIndexDimensionCount}} must be <= {{pointDataDimensionCount}} and defines 
the first {{n}} dimensions that will be used to build the index. The remaining 
{{pointDataDimensionCount}} - {{pointIndexDimensionCount}} dimensions are 
ignored while building (e.g., split/merge) the index. Getter and Setter utility 
methods are added.

* {{BKDWriter}} - change {{writeIndex}} to encode and write {{numIndexDims}} in 
the 2 most significant bytes of the integer that formerly stored {{numDims}}. 
This provides simple backwards compatibility without requiring a change to 
{{FieldInfoFormat}}. Indexing methods are updated to use only the first 
{{numIndexDims}} dimensions while building the tree. Leaf nodes still use 
{{numDataDims}} for efficiently packing and compressing the leaf level data 
(data nodes).

* {{BKDReader}} - add version checking in the constructor to decode 
{{numIndexDims}} and {{numDataDims}} from the packed dimension integer. Update 
index reading methods to look only at the first {{numIndexDims}} dimensions 
while traversing the tree. All {{numDataDims}} dimensions are still used for 
decoding leaf level data.

* API Changes - all instances of {{pointDimensionCount}} have been updated to 
{{pointDataDimensionCount}} and {{pointIndexDimensionCount}} to reflect total 
number of dimensions, and number of dimensions used for creating the index, 
respectively.

* All POINT Tests and POINT based Fields have been updated to use the API 
changes.
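
As a rough illustration of the packed dimension integer described for {{BKDWriter}} and {{BKDReader}} above, here is a minimal sketch. It is not code from the patch: the class {{PackedDims}} and its method names are hypothetical, and it assumes {{numDataDims}} stays in the 2 least significant bytes. It also infers an older index from zero upper bytes, which stands in for the explicit version check the reader actually performs.

```java
// Hypothetical helper for illustration only; names and exact bit layout are
// assumptions, not taken from the LUCENE-8496 patch.
final class PackedDims {
    private PackedDims() {}

    /** Packs numIndexDims into the upper 2 bytes and numDataDims into the lower 2 bytes. */
    static int pack(int numDataDims, int numIndexDims) {
        if (numDataDims < 1 || numDataDims > 0xFFFF) {
            throw new IllegalArgumentException("numDataDims out of range: " + numDataDims);
        }
        // The index dimensions are the first n data dimensions, so n <= numDataDims.
        if (numIndexDims < 1 || numIndexDims > numDataDims) {
            throw new IllegalArgumentException(
                "numIndexDims must be in [1, numDataDims]; got " + numIndexDims);
        }
        return (numIndexDims << 16) | numDataDims;
    }

    /** Total number of dimensions stored per point. */
    static int dataDims(int packed) {
        return packed & 0xFFFF;
    }

    /**
     * Number of leading dimensions used to build the tree. A value written
     * before the change has zero upper bytes, so every dimension is indexed.
     * (Assumption: this fallback replaces the reader's version check.)
     */
    static int indexDims(int packed) {
        int upper = packed >>> 16;
        return upper == 0 ? dataDims(packed) : upper;
    }

    public static void main(String[] args) {
        int packed = pack(7, 4);
        System.out.println(dataDims(packed) + " data dims, " + indexDims(packed) + " index dims");
        System.out.println("legacy value 6 indexes " + indexDims(6) + " dims");
    }
}
```

Because an integer written before the change carries only {{numDims}} in its low bytes, a reader can still recover the old meaning, which is why no change to {{FieldInfoFormat}} is needed.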

Benchmarking
---

To benchmark the changes I updated {{LatLonShape}} (not included in this patch) 
and ran benchmark tests both with and without selective indexing. The results 
are below: 

6 dimension encoded {{LatLonShape}} w/o selective indexing
------
INDEX SIZE: 1.2795778876170516 GB
READER MB: 1.7928361892700195
BEST M hits/sec: 11.67378231920028
BEST QPS: 6.8635445274291715 for 225 queries, totHits=382688713

7 dimension LatLonShape encoding w/ 4 dimension selective indexing
-------
INDEX SIZE: 2.1509012933820486 GB
READER MB: 1.8154268264770508
BEST M hits/sec: 17.018094815004627
BEST QPS: 10.005707519719927 for 225 queries, totHits=382688713

The gains are a little better than the differences between searching a 4D range 
vs a 6D range. The index size increased due to using 7 dimensions instead of 6, 
but I also switched over to a slightly larger encoding size.
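
To put the numbers above side by side, the following sketch computes the relative change between the two runs. The class and method names are hypothetical and the inputs are copied from the results reported above. It works out to roughly a 1.46x gain in QPS against roughly a 1.68x larger index.

```java
// Hypothetical helper for illustration; the inputs are the figures reported above.
final class BenchmarkRatios {
    private BenchmarkRatios() {}

    /** Ratio of the selective-indexing measurement to the baseline measurement. */
    static double ratio(double selective, double baseline) {
        return selective / baseline;
    }

    public static void main(String[] args) {
        double qps = ratio(10.005707519719927, 6.8635445274291715);
        double hitsPerSec = ratio(17.018094815004627, 11.67378231920028);
        double indexSize = ratio(2.1509012933820486, 1.2795778876170516);
        System.out.println(String.format("QPS ratio: %.2f", qps));
        System.out.println(String.format("M hits/sec ratio: %.2f", hitsPerSec));
        System.out.println(String.format("Index size ratio: %.2f", indexSize));
    }
}
```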

> Explore selective dimension indexing in BKDReader/Writer
> --------------------------------------------------------
>
>                 Key: LUCENE-8496
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8496
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Nicholas Knize
>            Priority: Major
>         Attachments: LUCENE-8496.patch
>
>
> This issue explores adding a new feature to BKDReader/Writer that enables 
> users to select fewer dimensions to be used for creating the BKD 
> index than the total number of dimensions specified for field encoding. This 
> is useful for encoding dimensional data that is used for interpreting the 
> encoded field data but unnecessary (or not efficient) for creating the index 
> structure. One such example is {{LatLonShape}} encoding. The first 4 
> dimensions may be used to efficiently search/index the triangle using its 
> precomputed bounding box as a 4D point, and the remaining dimensions can be 
> used to encode the vertices of the tessellated triangle. This causes BKD to 
> act much like an R-Tree for shape data where search is distilled into a 4D 
> point (instead of a more expensive 6D point) and the triangle is encoded 
> using a portion of the remaining (non-indexed) dimensions. Fields that use 
> the full data range for indexing are not impacted and behave as they normally 
> would.
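
The R-Tree-like behavior described in the issue can be sketched as follows. This is only an illustration under stated assumptions: the class {{TriangleBox}}, its method names, and the {{minY, minX, maxY, maxX}} ordering of the 4 indexed dimensions are hypothetical, and the encoding of the vertices in the remaining dimensions is not shown.

```java
// Hypothetical sketch; the dimension ordering and names are assumptions and
// do not come from the LatLonShape implementation.
final class TriangleBox {
    private TriangleBox() {}

    /** Returns the triangle's bounding box as a 4D point: {minY, minX, maxY, maxX}. */
    static int[] boundingBox(int ax, int ay, int bx, int by, int cx, int cy) {
        int minY = Math.min(ay, Math.min(by, cy));
        int minX = Math.min(ax, Math.min(bx, cx));
        int maxY = Math.max(ay, Math.max(by, cy));
        int maxX = Math.max(ax, Math.max(bx, cx));
        return new int[] {minY, minX, maxY, maxX};
    }

    /**
     * Tree-level test using only the 4 indexed dimensions. A true result means
     * the bounding boxes overlap; the triangle itself must still be checked
     * against the query using the non-indexed vertex data.
     */
    static boolean mayIntersect(int[] box, int qMinY, int qMinX, int qMaxY, int qMaxX) {
        return box[0] <= qMaxY && box[2] >= qMinY && box[1] <= qMaxX && box[3] >= qMinX;
    }

    public static void main(String[] args) {
        int[] box = boundingBox(0, 0, 10, 0, 0, 5);
        System.out.println(mayIntersect(box, 4, 8, 6, 12));
        System.out.println(mayIntersect(box, 6, 0, 8, 10));
    }
}
```

A bounding-box hit is only a candidate: in the first query of {{main}} the boxes overlap even though the query lies outside the triangle, which is why the vertex data stored in the non-indexed dimensions is still needed at the leaf level.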



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
