[ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036284#comment-13036284 ]
Michael McCandless commented on LUCENE-3108:
--------------------------------------------

{quote}
bq. How come codecID changed from String to int on the branch?

Due to DocValues I need to compare the ID to certain fields to see for what field I stored and need to open docValues. I always had to parse the given string, which is kind of odd. I think it's more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing IMO.
{quote}

OK that sounds great.

{quote}
bq. Can SortField somehow detect whether the needed field was stored in FC vs DV

This is tricky though. You can have a DV field that is indexed too, so it's hard to tell if we can reliably do it. If we can't make it reliable I think we should not do it at all.
{quote}

It is tricky... but, eg, when someone does SortField("title", SortField.STRING), which cache (DV or FC) should we populate?

{quote}
bq. Should we rename oal.index.values.Type -> .ValueType?

Agreed. I also think we should rename Source but I don't have a good name yet. Any idea?
{quote}

ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to it as DocValues.Source, then it's clear?

{quote}
bq. Since we dynamically reserve a value to mean "unset", does that mean there are some datasets we cannot index?

Again, tricky! The quick answer is yes, but we can't do that anyway since I have not normalized the range to be 0 based, since PackedInts doesn't allow negative values. So the range we can store is (2^63)-1. So essentially with the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert for this, which is needed I think, but to get around this we need to have a different impl I think, or am I missing something?
{quote}

OK, but I think if we make a "straight longs" impl (ie no packed ints at all) then we can handle all long values?
But in that case we'd require the app to pick a sentinel to mean "unset"?

> Land DocValues on trunk
> -----------------------
>
>                 Key: LUCENE-3108
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3108
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: core/index, core/search, core/store
>    Affects Versions: CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>
>         Attachments: LUCENE-3108.patch
>
>
> It's time to move another feature from branch to trunk. I want to start this
> process now while still a couple of issues remain on the branch. Currently I
> am down to a single nocommit (javadocs on DocValues.java) and a couple of
> testing TODOs (explicit multithreaded tests and unoptimized with deletions)
> but I think those are not worth separate issues so we can resolve them as we
> go.
> The already created issues (LUCENE-3075 and LUCENE-3074) should not block
> this process here IMO, we can fix them once we are on trunk.
> Here is a quick feature overview of what has been implemented:
> * DocValues implementations for Ints (based on PackedInts), Float 32 / 64,
> Bytes (fixed / variable size each in sorted, straight and deref variations)
> * Integration into Flex-API, Codec provides a
> PerDocConsumer->DocValuesConsumer (write) / PerDocValues->DocValues (read)
> * By-Default enabled in all codecs except PreFlex
> * Follows other flex-API patterns like non-segment reader throw UOE forcing
> MultiPerDocValues if on DirReader etc.
> * Integration into IndexWriter, FieldInfos etc.
> * Random-testing enabled via RandomIW - injecting random DocValues into
> documents
> * Basic checks in CheckIndex (which runs after each test)
> * FieldComparator for int and float variants (Sorting, currently directly
> integrated into SortField, this might go into a separate DocValuesSortField
> eventually)
> * Extended TestSort for DocValues
> * RAM-Resident random access API plus on-disk DocValuesEnum (currently only
> sequential access) -> Source.java / DocValuesEnum.java
> * Extensible Cache implementation for RAM-Resident DocValues (by-default
> loaded into RAM only once and freed once IR is closed) -> SourceCache.java
>
> PS: Currently the RAM resident API is named Source (Source.java) which seems
> too generic. I think we should rename it into RamDocValues or something like
> that, suggestions welcome!
> Any comments, questions (rants :)) are very much appreciated.
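
For reference, the "unset" sentinel question from the comment can be illustrated with a small standalone sketch. This is not the branch's actual code: the class and method names (UnsetSentinelSketch, reserveUnset, encode) are hypothetical, and it simply assumes non-negative values with max + 1 reserved as the "unset" marker, which is why Long.MAX_VALUE itself cannot be stored under such a scheme.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names are Lucene API.
final class UnsetSentinelSketch {

    /**
     * Reserves max + 1 as the "unset" marker for a set of non-negative values.
     * Fails when the data already uses Long.MAX_VALUE, because no larger
     * value is left to reserve.
     */
    static long reserveUnset(long[] values) {
        long max = 0L;
        for (long v : values) {
            if (v < 0L) {
                // Assumption: no normalization to a 0-based range, so negatives are rejected.
                throw new IllegalArgumentException("negative value not supported: " + v);
            }
            max = Math.max(max, v);
        }
        if (max == Long.MAX_VALUE) {
            throw new IllegalArgumentException("no value left to reserve for \"unset\"");
        }
        return max + 1L;
    }

    /** Fills a per-document array; documents without a value get the sentinel. */
    static long[] encode(Long[] perDoc, long unset) {
        long[] out = new long[perDoc.length];
        for (int i = 0; i < perDoc.length; i++) {
            out[i] = perDoc[i] == null ? unset : perDoc[i];
        }
        return out;
    }

    public static void main(String[] args) {
        Long[] docs = {7L, null, 42L};              // second document has no value
        long unset = reserveUnset(new long[] {7L, 42L});
        System.out.println(Arrays.toString(encode(docs, unset)));
    }
}
```

A "straight longs" variant as suggested in the comment would skip reserveUnset and instead take the sentinel from the application, trading the automatic reservation for the ability to store every long except the one the app gives up.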