[ https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975204#comment-13975204 ]
Robert Muir commented on LUCENE-5609:
-------------------------------------

{quote}
I think the main problem of this issue is, that we only have one default. Somebody never doing any ranges does not need the additional terms at all. That's the main problem. Solr is better here, as it provided 2 predefined field types, but Lucene only has one - and that is the bug.
{quote}

Well, I kind of agree, but in a different way.

In my opinion the default numeric types (IntField, LongField, FloatField, DoubleField) should have good defaults for general-purpose use. This includes range queries: they should work "reasonably" well out of the box. Users that don't need range queries can optimize by changing precisionStep to Infinity.

Along the same lines, they also don't need to be super-optimized for "hardcore" esoteric uses of range queries. That's what defaults are: just making the right tradeoffs for out-of-box use.

I would not be happy if these fields default to precisionStep=Infinity either, because that's also a bad default for general-purpose use, just in the opposite direction of precisionStep=4.

I am fine with precisionStep=8 as the new default for both, but I don't think it's the best idea. I think 16 for the 64-bit types is nice because it's easy to understand: "4 terms for each value". Today it's 8 terms for each value (32-bit field), and 16 terms for each value (64-bit field).

I also think we should be able to add new types in the future (e.g. 16-bit short and half-float) and give them different defaults too. So, I don't understand the need for a "one-size-fits-all" default.

> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
>
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
>
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache). And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
>
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
>
> I tested 4, 8 and 16 precision steps:
>
> {noformat}
> indexing:
>
> PrecStep        Size   IndexTime
>        4   1812.7 MB   651.4 sec
>        8   1203.0 MB   443.2 sec
>       16    894.3 MB   361.6 sec
>
> searching:
>
>     Field  PrecStep   QueryTime   TermCount
> geoNameID         4   2872.5 ms       20306
> geoNameID         8   2903.3 ms      104856
> geoNameID        16   3371.9 ms     5871427
>  latitude         4   2160.1 ms       36805
>  latitude         8   2249.0 ms      240655
>  latitude        16   2725.9 ms     4649273
>  modified         4   2038.3 ms       13311
>  modified         8   2029.6 ms       58344
>  modified        16   2060.5 ms       77763
> longitude         4   3468.5 ms       33818
> longitude         8   3629.9 ms      214863
> longitude        16   4060.9 ms     4532032
> {noformat}
>
> Index time is with 1 thread (for identical index structure).
>
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations. TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
>
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16? Or both to 16?
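
The per-value term counts quoted in this thread (8 and 16 terms at precisionStep=4, and "4 terms for each value" for a 64-bit field at 16) are consistent with one indexed term per precisionStep-bit shift of the value. A minimal sketch of that arithmetic follows; `terms_per_value` is a hypothetical helper written for illustration, not a Lucene API, and it assumes the count is simply the value width divided by the step, rounded up.

```python
def terms_per_value(value_bits: int, precision_step: int) -> int:
    """Estimate how many terms one numeric value is indexed as.

    Assumption: one term is emitted for each shift 0, step, 2*step, ...
    that is still below the value width, i.e. ceil(value_bits / step).
    """
    if value_bits < 1 or precision_step < 1:
        raise ValueError("value_bits and precision_step must be >= 1")
    # Integer ceiling division, avoiding floating point.
    return -(-value_bits // precision_step)


if __name__ == "__main__":
    # Compare the step values discussed in the issue for 32- and 64-bit fields.
    for step in (4, 8, 16):
        print(f"precisionStep={step}: "
              f"32-bit -> {terms_per_value(32, step)} terms, "
              f"64-bit -> {terms_per_value(64, step)} terms")
```

Under this assumption, any step at least as large as the value width gives a single term per value, which corresponds to the "Infinity" setting mentioned above for users who never run range queries.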