[ https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974899#comment-13974899 ]
Michael McCandless commented on LUCENE-5609:
--------------------------------------------

+1 for 8/16.

> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
>
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
>
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache). And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
>
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
>
> I tested 4, 8 and 16 precision steps:
>
> {noformat}
> indexing:
> PrecStep        Size    IndexTime
>        4   1812.7 MB    651.4 sec
>        8   1203.0 MB    443.2 sec
>       16    894.3 MB    361.6 sec
>
> searching:
>      Field  PrecStep   QueryTime   TermCount
>  geoNameID         4   2872.5 ms       20306
>  geoNameID         8   2903.3 ms      104856
>  geoNameID        16   3371.9 ms     5871427
>   latitude         4   2160.1 ms       36805
>   latitude         8   2249.0 ms      240655
>   latitude        16   2725.9 ms     4649273
>   modified         4   2038.3 ms       13311
>   modified         8   2029.6 ms       58344
>   modified        16   2060.5 ms       77763
>  longitude         4   3468.5 ms       33818
>  longitude         8   3629.9 ms      214863
>  longitude        16   4060.9 ms     4532032
> {noformat}
>
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations. TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
>
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16? Or both to 16?
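
For anyone following the "8 or 16 terms for each value" arithmetic in the description, here is a rough sketch of where those numbers come from. It assumes one indexed term per shift (0, step, 2*step, ...) below the value's bit width, i.e. ceil(bits / precisionStep) terms per value. The class and method names are made up for illustration and nothing here calls Lucene. It only counts the six numeric fields listed in the description; geoNameID is left out because its type is not stated there.

```java
public class PrecisionStepTerms {

    // Terms indexed for one numeric value, assuming one term per shift
    // 0, step, 2*step, ... while shift < valueBits: ceil(valueBits / precisionStep).
    static int termsPerValue(int valueBits, int precisionStep) {
        if (valueBits <= 0 || precisionStep <= 0) {
            throw new IllegalArgumentException("valueBits and precisionStep must be positive");
        }
        return (valueBits + precisionStep - 1) / precisionStep;
    }

    // Terms per document for a mix of 64-bit (long/double) and 32-bit (int/float)
    // fields, with a separate precision step for each width.
    static int termsPerDoc(int fields64, int fields32, int step64, int step32) {
        return fields64 * termsPerValue(64, step64)
             + fields32 * termsPerValue(32, step32);
    }

    public static void main(String[] args) {
        // Fields listed in the description: lat, lng (double),
        // modified time, elevation, population (long), dem (int).
        int fields64 = 5;
        int fields32 = 1;

        for (int step : new int[] {4, 8, 16}) {
            System.out.printf("precStep=%d for both widths: %d terms/doc%n",
                    step, termsPerDoc(fields64, fields32, step, step));
        }
        // The "8/16" option: 8 for int/float, 16 for long/double.
        System.out.printf("precStep=8 (int/float), 16 (long/double): %d terms/doc%n",
                termsPerDoc(fields64, fields32, 16, 8));
    }
}
```

Under that assumption the six fields produce 88 terms per doc at step 4, 44 at step 8, 22 at step 16, and 24 with the 8/16 split. This counts terms only and says nothing about bytes on disk or query cost, which is what the benchmark tables measure.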