[jira] [Commented] (LUCENE-5609) Should we revisit the default numeric precision step?

Uwe Schindler (JIRA) Sat, 19 Apr 2014 07:01:46 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974866#comment-13974866
 ]


Uwe Schindler commented on LUCENE-5609:
---------------------------------------

Just to explain why you might have multiple numeric fields and multiple 
queries: I have a customer with date ranges (they use precStep 8; 16 hurt 
a lot with ElasticSearch on a 100 GB index), and we also do geo search here 
in-house (PANGAEA). If you have something like overlapping ranges, you need 2 
queries with half-open ranges. For example, each document has a date range 
(start/end date of validity). The query on the index is also a date 
range, and you want to find all documents with overlapping ranges (the 
validity range of the document overlaps the date range of the query). In that 
case you need 2 half-open queries (which are expensive with large precision 
steps). For stuff like bounding boxes in geo, you might need to check whether 
the bounding box of the document overlaps the bounding box of the query (a 
Google Maps-like query). Here you have 4 half-open ranges, each of which 
almost always hits half of all your documents. With large precSteps this 
takes a very long time. So 8 is a good default; for my customer 16 took about 
4 times as long as 8 (because of the half-open ranges). With smaller precSteps 
half-open ranges are very simple.
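
A minimal sketch of the date-range overlap test described above, in plain 
Java with no Lucene dependency. The class and method names are made up for 
illustration. The assumption is that each of the two comparisons would map to 
one half-open numeric range query on the index, with the two combined as a 
conjunction.

```java
// Hypothetical illustration only; none of these names come from Lucene.
class ValidityOverlap {
    // A document valid over [docStart, docEnd] overlaps the query range
    // [queryStart, queryEnd] iff both half-open conditions hold:
    //   docStart <= queryEnd    (range on the start field, no lower bound)
    //   docEnd   >= queryStart  (range on the end field, no upper bound)
    static boolean overlaps(long docStart, long docEnd,
                            long queryStart, long queryEnd) {
        boolean startsBeforeQueryEnds = docStart <= queryEnd;
        boolean endsAfterQueryStarts = docEnd >= queryStart;
        return startsBeforeQueryEnds && endsAfterQueryStarts;
    }

    public static void main(String[] args) {
        System.out.println(overlaps(10, 20, 15, 30)); // ranges intersect
        System.out.println(overlaps(10, 20, 21, 30)); // query starts after doc ends
    }
}
```

Each condition on its own matches a large share of the index, which is why 
the cost of the individual half-open ranges matters so much.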

You can check this with Geonames: most Geonames entries have bounding boxes 
assigned, and you want to search with bounding boxes, too. This is my example 
above. And those ranges (unless you want to find only documents completely 
inside the query range) are always 4 half-open ones, each hitting half of all 
documents. By ANDing them together, you then get the real results 
(ConjunctionScorer).
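
As a rough sketch of the bounding-box case, the four half-open conditions can 
be written out in plain Java. All names here are hypothetical, and the sketch 
assumes boxes that do not cross the antimeridian. The assumption is that each 
comparison would become one half-open range query on the corresponding field, 
with all four combined as a conjunction.

```java
// Hypothetical illustration only; none of these names come from Lucene.
class BoundingBoxOverlap {
    // The document box overlaps the query box iff all four half-open
    // conditions hold. Assumes minLat <= maxLat and minLon <= maxLon for
    // both boxes (no antimeridian wrap-around).
    static boolean overlaps(double docMinLat, double docMaxLat,
                            double docMinLon, double docMaxLon,
                            double qMinLat, double qMaxLat,
                            double qMinLon, double qMaxLon) {
        return docMinLat <= qMaxLat   // half-open range on docMinLat
            && docMaxLat >= qMinLat   // half-open range on docMaxLat
            && docMinLon <= qMaxLon   // half-open range on docMinLon
            && docMaxLon >= qMinLon;  // half-open range on docMaxLon
    }

    public static void main(String[] args) {
        // Document box [0,10] x [0,10] against query box [5,15] x [5,15].
        System.out.println(overlaps(0, 10, 0, 10, 5, 15, 5, 15));
    }
}
```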

> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache).  And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
> PrecStep        Size        IndexTime
>        4   1812.7 MB        651.4 sec
>        8   1203.0 MB        443.2 sec
>       16    894.3 MB        361.6 sec
> searching:
>      Field  PrecStep   QueryTime   TermCount
>  geoNameID         4   2872.5 ms       20306
>  geoNameID         8   2903.3 ms      104856
>  geoNameID        16   3371.9 ms     5871427
>   latitude         4   2160.1 ms       36805
>   latitude         8   2249.0 ms      240655
>   latitude        16   2725.9 ms     4649273
>   modified         4   2038.3 ms       13311
>   modified         8   2029.6 ms       58344
>   modified        16   2060.5 ms       77763
>  longitude         4   3468.5 ms       33818
>  longitude         8   3629.9 ms      214863
>  longitude        16   4060.9 ms     4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations.  TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16?  Or both to 16?
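
The "8 or 16 terms" figure in the quoted description can be reproduced with a 
small calculation. This is a sketch under the assumption that one term is 
indexed per precision level, i.e. one for each shift 0, precStep, 2*precStep, 
... below the bit width of the value. The class and method names are 
hypothetical.

```java
// Hypothetical illustration only; none of these names come from Lucene.
class TermsPerValue {
    // Assumption: one indexed term per shift value below valueBits,
    // which works out to ceil(valueBits / precisionStep).
    static int termsPerValue(int valueBits, int precisionStep) {
        return (valueBits + precisionStep - 1) / precisionStep; // ceiling division
    }

    public static void main(String[] args) {
        for (int step : new int[] {4, 8, 16}) {
            System.out.println("precStep " + step
                + ": int/float=" + termsPerValue(32, step)
                + ", long/double=" + termsPerValue(64, step));
        }
    }
}
```

Under that assumption, precStep 4 gives 8 terms for int/float and 16 for 
long/double, matching the numbers in the description, while precStep 16 drops 
this to 2 and 4.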



--
This message was sent by Atlassian JIRA
(v6.2#6252)

