Re: Iterating TermsEnum for Long field produces zero values at the end

Barry Coughlan Tue, 18 Nov 2014 04:07:32 -0800

Hi Michael,

Indexing:


    private NumericDocValuesField idField = new NumericDocValuesField("id", 0);

Reading:

    private NumericDocValues cacheDocIds() throws IOException {
        AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
        return DocValues.getNumeric(wrapped, "id");
    }


I'm just putting this here for others because it's hard to find up-to-date
examples of using DocValues.
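
For anyone who lands here wondering about the zeroes themselves, the explanation quoted further down (prefix terms have their low bits stripped, so small values decode to 0) can be reproduced without Lucene. This is only a minimal sketch in plain Java, not Lucene's actual code: the class and method names are made up for illustration, and it assumes the field was indexed with a precision step of 16, which would give one full-precision term plus three coarser prefix levels for a 64-bit long.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class; it does not call Lucene.
final class PrefixTermSketch {

    // Assumption: precision step of 16, i.e. shifts 0, 16, 32 and 48.
    static final int PRECISION_STEP = 16;

    /** Value you get back when a term stored at the given shift is decoded. */
    static long decodeAtShift(long value, int shift) {
        long sortable = value ^ 0x8000000000000000L;    // flip the sign bit
        long truncated = (sortable >>> shift) << shift; // low bits are lost
        return truncated ^ 0x8000000000000000L;         // flip the sign bit back
    }

    /** Decoded terms, full-precision terms first, then each coarser level. */
    static List<Long> decodedTerms(long[] values) {
        List<Long> out = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += PRECISION_STEP) {
            // Each distinct prefix appears once per level, however many docs share it.
            TreeSet<Long> distinct = new TreeSet<>();
            for (long v : values) {
                distinct.add(decodeAtShift(v, shift));
            }
            out.addAll(distinct);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 4, 5, 0, 0, 0]
        System.out.println(decodedTerms(new long[] {1, 2, 3, 4, 5}));
    }
}
```

Under that assumption, the ids 1 to 5 all collapse to the same prefix at shifts 16, 32 and 48, which would account for exactly three trailing entries that decode to 0 and whose first document is Lucene docid 0.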

Two quick questions:

1. Do you suggest I use DocValues because it is intended to eventually
replace FieldCache?
2. Is it preferable to use reader.leaves() instead of
SlowCompositeReaderWrapper here and somehow merge the segments?

Thanks for all your help.

Barry




On Mon, Nov 17, 2014 at 8:37 PM, Michael McCandless <
[email protected]> wrote:

> It's better to use doc values than field cache, if you can.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Nov 17, 2014 at 2:55 PM, Barry Coughlan <[email protected]>
> wrote:
> > Makes sense, thanks. I switched the implementation to a FieldCache with no
> > noticeable performance difference:
> >
> > private Longs cacheDocIds() throws IOException {
> >     AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
> >     Longs vals = FieldCache.DEFAULT.getLongs(wrapped, "id", false);
> >     return vals;
> > }
> >
> > Regards,
> > Barry
> >
> > On Mon, Nov 17, 2014 at 6:50 PM, Uwe Schindler <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> > It is expected: those are the "prefix" terms, which come after all the
> >> > full-precision numeric terms.
> >> >
> >> > But I'm not sure why you see 0s ... the bytes should be unique for every
> >> > term you get back from the TermsEnum.
> >>
> >> That's easy to explain:
> >>
> >> The lower precision terms at the end have more than one doc in the
> >> DocsEnum, you always return only the first (Lucene docid 0, you never
> >> list all other entries in DocsEnum). The prefixcoded term has a shift
> >> value > 0 and because bits are stripped from the right, the small long
> >> values will therefore return 0L after decoding.
> >>
> >> In general to have such a type of cache, I would not use terms and
> >> instead use numeric docvalues. An alternative is to use FieldCache, which
> >> does the right thing automatically. Relying on the internal
> >> implementation of numeric terms is not a good idea.
> >>
> >> Uwe
> >>
> >> > On Mon, Nov 17, 2014 at 10:39 AM, Barry Coughlan
> >> > <[email protected]> wrote:
> >> > > Hi all,
> >> > >
> >> > > I'm using 4.10.2. I have a Long "id" field. Each document has one "id"
> >> > > value. I am creating a look-up between Lucene's internal document id
> >> > > and my "id" values by enumerating the inverted index:
> >> > >
> >> > >     private long[] cacheDocIds() throws IOException {
> >> > >         long[] ourIds = new long[reader.maxDoc()];
> >> > >
> >> > >         Bits liveDocs = MultiFields.getLiveDocs(reader);
> >> > >         Fields fields = MultiFields.getFields(reader);
> >> > >         Terms terms = fields.terms("id");
> >> > >
> >> > >         TermsEnum iterator = terms.iterator(null);
> >> > >         BytesRef bytesRef = null;
> >> > >         while ((bytesRef = iterator.next()) != null) {
> >> > >             DocsEnum docsEnum = iterator.docs(liveDocs, null,
> >> > > DocsEnum.FLAG_NONE);
> >> > >
> >> > >             int luceneId = docsEnum.nextDoc();
> >> > >             long ourId = NumericUtils.prefixCodedToLong(bytesRef);
> >> > >             System.out.println(luceneId + " " + ourId);
> >> > >             ourIds[luceneId] = ourId;
> >> > >         }
> >> > >
> >> > >         return ourIds;
> >> > >     }
> >> > >
> >> > > With 5 documents (1, 2, 3, 4, 5) I get this output from the above code:
> >> > >
> >> > > 0 1
> >> > > 1 2
> >> > > 2 3
> >> > > 3 4
> >> > > 4 5
> >> > > 0 0
> >> > > 0 0
> >> > > 0 0
> >> > >
> >> > > I don't understand why there are three zeroes at the end.
> >> > >
> >> > > - reader.maxDoc is 5 and no documents have been deleted.
> >> > > - I have tried this with a varying number of documents and there are
> >> > > always three zeroes at the end.
> >> > > - I tried changing version to Lucene 4.10.0 and Lucene 4.9 and the
> >> > > same behavior occurs.
> >> > >
> >> > > I can work around this, but I'm just curious: is this behavior
> >> > > expected?
> >> > >
> >> > > Regards,
> >> > > Barry
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [email protected]
> >> > For additional commands, e-mail: [email protected]
> >>
> >>
>
