Re: Iterating TermsEnum for Long field produces zero values at the end

Michael McCandless Mon, 17 Nov 2014 12:40:07 -0800

It's better to use doc values than field cache, if you can.

Mike McCandless


http://blog.mikemccandless.com

On Mon, Nov 17, 2014 at 2:55 PM, Barry Coughlan <b.coughl...@gmail.com> wrote:
> Makes sense, thanks. I switched the implementation to a FieldCache with no
> noticeable performance difference:
>
> private Longs cacheDocIds() throws IOException {
>     AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
>     Longs vals = FieldCache.DEFAULT.getLongs(wrapped, "id", false);
>     return vals;
> }
>
> Regards,
> Barry
>
> On Mon, Nov 17, 2014 at 6:50 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>
>> Hi,
>>
>> > It is expected: those are the "prefix" terms, which come after all the
>> > full-precision numeric terms.
>> >
>> > But I'm not sure why you see 0s ... the bytes should be unique for every
>> > term you get back from the TermsEnum.
>>
>> That's easy to explain:
>>
>> The lower-precision terms at the end each have more than one doc in the
>> DocsEnum, but you only ever read the first one (Lucene docid 0) and never
>> iterate over the remaining entries. Each prefix-coded term also has a shift
>> value > 0, and because those bits are stripped from the right, small long
>> values decode to 0L.
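
To make the bit-stripping concrete, here is a minimal plain-Java sketch. It is a simplified model, not Lucene's actual encoding: it ignores the byte layout and sign handling of NumericUtils, and it assumes a precision step of 16, which I believe is the LongField default in these versions. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PrefixShiftDemo {
    // Assumption: precision step of 16 (believed to be the LongField default here).
    static final int PRECISION_STEP = 16;

    /** Clears the lowest `shift` bits, i.e. the detail a prefix term no longer carries. */
    public static long stripLowBits(long value, int shift) {
        return (value >>> shift) << shift;
    }

    /** Decoded value of every term indexed for one long: full precision first, then prefixes. */
    public static List<Long> termValues(long value) {
        List<Long> out = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += PRECISION_STEP) {
            out.add(stripLowBits(value, shift));
        }
        return out;
    }

    /** Number of distinct prefix terms (shift > 0) shared by all the given values. */
    public static int distinctPrefixTerms(long[] values) {
        Set<String> terms = new LinkedHashSet<>();
        for (long v : values) {
            for (int shift = PRECISION_STEP; shift < 64; shift += PRECISION_STEP) {
                terms.add(shift + ":" + stripLowBits(v, shift));
            }
        }
        return terms.size();
    }

    public static void main(String[] args) {
        for (long id = 1; id <= 5; id++) {
            System.out.println(id + " -> " + termValues(id));
        }
        System.out.println("distinct prefix terms: "
                + distinctPrefixTerms(new long[] {1, 2, 3, 4, 5}));
    }
}
```

Under that assumption, each of the ids 1 to 5 produces its exact value plus prefixes at shifts 16, 32 and 48, all of which decode to 0. Those prefixes are shared by every document, so the model yields three extra terms regardless of how many small ids are indexed, which is consistent with the three trailing "0 0" lines in the output quoted below.
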
>>
>> In general to have such a type of cache, I would not use terms and instead
>> use numeric docvalues. An alternative is to use FieldCache, which does the
>> right thing automatically. Relying on the internal implementation of
>> numeric terms is not a good idea.
>>
>> Uwe
>>
>> > On Mon, Nov 17, 2014 at 10:39 AM, Barry Coughlan
>> > <b.coughl...@gmail.com> wrote:
>> > > Hi all,
>> > >
>> > > I'm using 4.10.2. I have a Long "id" field. Each document has one "id"
>> > > value. I am creating a look-up between Lucene's internal document id
>> > > and my "id" values by enumerating the inverted index:
>> > >
>> > >     private long[] cacheDocIds() throws IOException {
>> > >         long[] ourIds = new long[reader.maxDoc()];
>> > >
>> > >         Bits liveDocs = MultiFields.getLiveDocs(reader);
>> > >         Fields fields = MultiFields.getFields(reader);
>> > >         Terms terms = fields.terms("id");
>> > >
>> > >         TermsEnum iterator = terms.iterator(null);
>> > >         BytesRef bytesRef = null;
>> > >         while ((bytesRef = iterator.next()) != null) {
>> > >             DocsEnum docsEnum = iterator.docs(liveDocs, null, DocsEnum.FLAG_NONE);
>> > >
>> > >             int luceneId = docsEnum.nextDoc();
>> > >             long ourId = NumericUtils.prefixCodedToLong(bytesRef);
>> > >             System.out.println(luceneId + " " + ourId);
>> > >             ourIds[luceneId] = ourId;
>> > >         }
>> > >
>> > >         return ourIds;
>> > >     }
>> > >
>> > > With 5 documents (1, 2, 3, 4, 5) I get this output from the above code:
>> > >
>> > > 0 1
>> > > 1 2
>> > > 2 3
>> > > 3 4
>> > > 4 5
>> > > 0 0
>> > > 0 0
>> > > 0 0
>> > >
>> > > I don't understand why there are three zeroes at the end.
>> > >
>> > > - reader.maxDoc is 5 and no documents have been deleted.
>> > > - I have tried this with a varying number of documents and there are
>> > > always three zeroes at the end.
>> > > - I tried changing version to Lucene 4.10.0 and Lucene 4.9 and the
>> > > same behavior occurs.
>> > >
>> > > I can work around this, but I'm just curious whether this behavior is
>> > > expected.
>> > >
>> > > Regards,
>> > > Barry
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
