[ 
https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975204#comment-13975204
 ] 

Robert Muir commented on LUCENE-5609:
-------------------------------------

{quote}
I think the main problem of this issue is that we only have one default. 
Somebody who never does any range queries does not need the additional terms 
at all. That's the main problem. Solr is better here, as it provides 2 
predefined field types, but Lucene only has one - and that is the bug.
{quote}

Well, I kind of agree, but in a different way. 

In my opinion the default numeric types (intfield, longfield, floatfield, 
doublefield) should have good defaults for general-purpose use. This includes 
range queries: they should work reasonably well out of the box. Users who don't 
need range queries can optimize by changing precisionStep to Infinity. Along 
the same lines, the defaults also don't need to be super-optimized for 
hardcore, esoteric uses of range queries. That's what defaults are: the right 
tradeoffs for out-of-the-box use. 

I would not be happy if these fields defaulted to precisionStep=Infinity 
either, because that's also a bad default for general-purpose use, just in the 
opposite direction from precisionStep=4.

I am fine with precisionStep=8 as the new default for both, but I don't think 
it's the best idea. I think 16 for the 64-bit types is nicer because it's easy 
to understand: 4 terms for each value. Today it's 8 terms for each value 
(32-bit fields) and 16 terms for each value (64-bit fields). 
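The arithmetic behind those counts is just the field's bit width divided by the precision step, rounded up; a quick sketch (the class and method names are mine, not Lucene's):

```java
// Lucene's trie encoding indexes one term per precisionStep-sized shift
// of the value, so a field of bitWidth bits produces
// ceil(bitWidth / precisionStep) terms per value.
public class TermsPerValue {
    static int termsPerValue(int bitWidth, int precisionStep) {
        // ceiling division
        return (bitWidth + precisionStep - 1) / precisionStep;
    }

    public static void main(String[] args) {
        System.out.println(termsPerValue(32, 4));  // 8 terms  (int/float today)
        System.out.println(termsPerValue(64, 4));  // 16 terms (long/double today)
        System.out.println(termsPerValue(64, 16)); // 4 terms  (64-bit, proposed)
    }
}
```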

I also think we should be able to add new types in the future (e.g. 16-bit 
short and half-float) and give them different defaults too. So, I don't 
understand the need for a "one-size-fits-all" default.
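Whatever the default ends up being, users can already pick their own tradeoff per field; a sketch against the Lucene 4.x FieldType API (the field name and timestamp values are illustrative, not from the benchmark):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongField;
import org.apache.lucene.search.NumericRangeQuery;

// Sketch (Lucene 4.x): opting into precisionStep=16 on a long field,
// trading some range-query speed for 4 indexed terms per value
// instead of the default 16.
FieldType ft = new FieldType(LongField.TYPE_NOT_STORED);
ft.setNumericPrecisionStep(16);
ft.freeze();

Document doc = new Document();
doc.add(new LongField("modified", 1398211200000L, ft));

// The same precisionStep must be passed at query time, or the range
// query will look for terms that were never indexed:
NumericRangeQuery<Long> q = NumericRangeQuery.newLongRange(
    "modified", 16, 1398124800000L, 1398211200000L, true, true);
```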


> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8-byte (long/double) and 4-byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache).  And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
> PrecStep        Size        IndexTime
>        4   1812.7 MB        651.4 sec
>        8   1203.0 MB        443.2 sec
>       16    894.3 MB        361.6 sec
> searching:
>      Field  PrecStep   QueryTime   TermCount
>  geoNameID         4   2872.5 ms       20306
>  geoNameID         8   2903.3 ms      104856
>  geoNameID        16   3371.9 ms     5871427
>   latitude         4   2160.1 ms       36805
>   latitude         8   2249.0 ms      240655
>   latitude        16   2725.9 ms     4649273
>   modified         4   2038.3 ms       13311
>   modified         8   2029.6 ms       58344
>   modified        16   2060.5 ms       77763
>  longitude         4   3468.5 ms       33818
>  longitude         8   3629.9 ms      214863
>  longitude        16   4060.9 ms     4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations.  TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16?  Or both to 16?


