Re: Iterating TermsEnum for Long field produces zero values at the end

Barry Coughlan Tue, 18 Nov 2014 04:57:52 -0800

Never mind, I got it: MultiDocValues.getNumericValues(final IndexReader r,
final String field)


Barry
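
For anyone wondering what a merged, index-wide view over per-segment values involves: as far as I understand it, the idea is to map a global doc id to a segment plus a segment-local doc id using each segment's docBase. The sketch below is plain Java with no Lucene dependency. The class name, the arrays standing in for segments, and the helper methods are all hypothetical, and it assumes no segment is empty.

```java
import java.util.Arrays;

public class SegmentLookupSketch {

    // docBase of a segment = number of docs in all segments before it.
    public static int[] docBases(long[][] segments) {
        int[] bases = new int[segments.length];
        int next = 0;
        for (int i = 0; i < segments.length; i++) {
            bases[i] = next;
            next += segments[i].length;
        }
        return bases;
    }

    // Resolve a global doc id to (segment, local doc id) and read the value.
    // Assumes every segment holds at least one document.
    public static long valueFor(long[][] segments, int[] bases, int globalDoc) {
        int idx = Arrays.binarySearch(bases, globalDoc);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one: last base <= globalDoc
        }
        return segments[idx][globalDoc - bases[idx]];
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for three segments holding the "id" values 1..5.
        long[][] segments = { {1L, 2L}, {3L}, {4L, 5L} };
        int[] bases = docBases(segments);
        for (int doc = 0; doc < 5; doc++) {
            System.out.println(doc + " " + valueFor(segments, bases, doc));
        }
    }
}
```

The per-segment alternative skips the lookup entirely: iterate the segments and add each segment's docBase to its local doc ids.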

On Tue, Nov 18, 2014 at 12:05 PM, Barry Coughlan <b.coughl...@gmail.com>
wrote:

> Hi Michael,
>
> Indexing:
>
>     private NumericDocValuesField idField = new NumericDocValuesField("id", 0);
>
> Reading:
>
>     private NumericDocValues cacheDocIds() throws IOException {
>         AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
>         return DocValues.getNumeric(wrapped, "id");
>     }
>
>
> I'm just putting this here for others because it's hard to find up-to-date
> examples of using DocValues.
>
> Two quick questions:
>
> 1. Do you suggest I use DocValues because it is intended to eventually
> replace FieldCache?
> 2. Is it preferable to use reader.leaves() instead of
> SlowCompositeReaderWrapper here and somehow merge the segments?
>
> Thanks for all your help.
>
> Barry
>
>
>
>
> On Mon, Nov 17, 2014 at 8:37 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> It's better to use doc values than field cache, if you can.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Nov 17, 2014 at 2:55 PM, Barry Coughlan <b.coughl...@gmail.com>
>> wrote:
>> > Makes sense, thanks. I switched the implementation to a FieldCache
>> > with no noticeable performance difference:
>> >
>> > private Longs cacheDocIds() throws IOException {
>> >     AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
>> >     Longs vals = FieldCache.DEFAULT.getLongs(wrapped, "id", false);
>> >     return vals;
>> > }
>> >
>> > Regards,
>> > Barry
>> >
>> > On Mon, Nov 17, 2014 at 6:50 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>> >
>> >> Hi,
>> >>
>> >> > It is expected: those are the "prefix" terms, which come after all
>> >> > the full-precision numeric terms.
>> >> >
>> >> > But I'm not sure why you see 0s ... the bytes should be unique for
>> >> > every term you get back from the TermsEnum.
>> >>
>> >> That's easy to explain:
>> >>
>> >> The lower-precision terms at the end have more than one doc in the
>> >> DocsEnum, but you always return only the first (Lucene docid 0; you
>> >> never list the other entries in the DocsEnum). The prefix-coded term
>> >> has a shift value > 0, and because bits are stripped from the right,
>> >> the small long values therefore decode to 0L.
>> >>
>> >> In general, to build this type of cache I would not use terms and
>> >> would instead use numeric docvalues. An alternative is to use
>> >> FieldCache, which does the right thing automatically. Relying on the
>> >> internal implementation of numeric terms is not a good idea.
>> >>
>> >> Uwe
>> >>
>> >> > On Mon, Nov 17, 2014 at 10:39 AM, Barry Coughlan
>> >> > <b.coughl...@gmail.com> wrote:
>> >> > > Hi all,
>> >> > >
>> >> > > I'm using 4.10.2. I have a Long "id" field. Each document has one
>> >> > > "id" value. I am creating a look-up between Lucene's internal
>> >> > > document id and my "id" values by enumerating the inverted index:
>> >> > >
>> >> > >     private long[] cacheDocIds() throws IOException {
>> >> > >         long[] ourIds = new long[reader.maxDoc()];
>> >> > >
>> >> > >         Bits liveDocs = MultiFields.getLiveDocs(reader);
>> >> > >         Fields fields = MultiFields.getFields(reader);
>> >> > >         Terms terms = fields.terms("id");
>> >> > >
>> >> > >         TermsEnum iterator = terms.iterator(null);
>> >> > >         BytesRef bytesRef = null;
>> >> > >         while ((bytesRef = iterator.next()) != null) {
>> >> > >             DocsEnum docsEnum = iterator.docs(liveDocs, null, DocsEnum.FLAG_NONE);
>> >> > >
>> >> > >             int luceneId = docsEnum.nextDoc();
>> >> > >             long ourId = NumericUtils.prefixCodedToLong(bytesRef);
>> >> > >             System.out.println(luceneId + " " + ourId);
>> >> > >             ourIds[luceneId] = ourId;
>> >> > >         }
>> >> > >
>> >> > >         return ourIds;
>> >> > >     }
>> >> > >
>> >> > > With 5 documents (1, 2, 3, 4, 5) I get this output from the above
>> >> > > code:
>> >> > >
>> >> > > 0 1
>> >> > > 1 2
>> >> > > 2 3
>> >> > > 3 4
>> >> > > 4 5
>> >> > > 0 0
>> >> > > 0 0
>> >> > > 0 0
>> >> > >
>> >> > > I don't understand why there are three zeroes at the end.
>> >> > >
>> >> > > - reader.maxDoc is 5 and no documents have been deleted.
>> >> > > - I have tried this with a varying number of documents and there
>> >> > > are always three zeroes at the end.
>> >> > > - I tried changing version to Lucene 4.10.0 and Lucene 4.9 and the
>> >> > > same behavior occurs.
>> >> > >
>> >> > > I can work around this, but I'm just curious: is this behavior
>> >> > > expected?
>> >> > >
>> >> > > Regards,
>> >> > > Barry
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>
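
The explanation in the thread (lower-precision prefix terms with a shift > 0 decode to 0 for small values) can be reproduced without Lucene. The sketch below only emulates the arithmetic: flip the sign bit, drop the low `shift` bits, then decode with those bits read back as zero. It is not the NumericUtils implementation. The class and method names are hypothetical, and the precision step of 16 is an assumption about the field's configuration.

```java
import java.util.HashSet;
import java.util.Set;

public class PrefixShiftSketch {

    private static final long SIGN_BIT = 0x8000000000000000L;

    // Emulates the prefix kept for a term at the given shift:
    // flip the sign bit, then strip `shift` bits from the right.
    public static long encode(long value, int shift) {
        return (value ^ SIGN_BIT) >>> shift;
    }

    // Emulates decoding: the stripped bits come back as zeros.
    public static long decode(long prefix, int shift) {
        return (prefix << shift) ^ SIGN_BIT;
    }

    public static long roundTrip(long value, int shift) {
        return decode(encode(value, shift), shift);
    }

    // Counts distinct (shift, prefix) pairs, i.e. distinct terms in this model.
    public static int countTerms(long[] ids, int precisionStep) {
        Set<String> terms = new HashSet<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            for (long id : ids) {
                terms.add(shift + ":" + encode(id, shift));
            }
        }
        return terms.size();
    }

    public static void main(String[] args) {
        long[] ids = {1L, 2L, 3L, 4L, 5L};
        int precisionStep = 16; // assumption, see the note above
        for (int shift = 0; shift < 64; shift += precisionStep) {
            for (long id : ids) {
                System.out.println("shift=" + shift + " id=" + id
                        + " decodes to " + roundTrip(id, shift));
            }
        }
        System.out.println("distinct terms: " + countTerms(ids, precisionStep));
    }
}
```

Under that assumption, ids 1 to 5 give five full-precision terms plus one shared term at each of the shifts 16, 32 and 48, and each of those three decodes to 0. That matches the eight lines of output in the original question, the last three being "0 0".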
