Re: Iterating BinaryDocValues

Mikhail Khludnev Fri, 10 Jan 2014 01:27:11 -0800

Adrian,

Thanks for your answer! You highlighted the point which I completely
missed! Again, I see that noob questions are quite useful.


On Fri, Jan 10, 2014 at 10:09 AM, Adrien Grand <jpou...@gmail.com> wrote:

> Hi Mikhail,
>
> These new byte[] are done on purpose because the contract on the
> BytesRef class is that it is only a reference to bytes stored
> somewhere, not a buffer. So it is always legal to change offset,
> length and the bytes reference, but the content of the byte[] array
> may only be changed if you own the BytesRef object.

This aspect of usage isn't obvious to me at all. At least it's worth to
clarify this rule in javadoc!


> So for example, it
> is legal for a BytesRefIterator to use a private BytesRef as a mutable
> buffer and to expose it in calls to next() but it is illegal to modify
> the content of the byte[] in BinaryDocValues.get since the bytes are
> provided by consumers of the API.
>
> If we would like to be able to remove these byte[] creations, I think
> we would either need:
>  1. to use a byte buffer instead of a BytesRef object,
>  2. or to make the BytesRef objects private to the BinaryDocValues
> instances and have a "BytesRef get(int docId)" method instead of "void
> get(int docId, BytesRef ref)"
>
> Although option 1 would avoid these byte[] creations, it would make
> the paged-bytes-based implementations a bit slower (eg.
> MemoryDocValuesFormat exposes binary doc values through paged bytes)
> because they would need to copy bytes instead of just changing
> pointers. With option 2, the BinaryDocValues object wouldn't be
> thread-safe by nature anymore, but I don't think this is an issue as
> we already store them in thread-locals.
>

Agree 1 is not fancy, but I feel uncomfortable relying on ThreadLocals. I
started from what Joel suggested - from column reading API, but lately I
rejected it as unnecessary. It might make sense to look on it again.

public abstract class BinaryDocValues {
  .....
  /**
   * the purpose is an efficient sequential memory access
   * @throws IOException if anything sad happens
   * */
  public BytesRefIterator getColumn(final DocIdSetIterator docIds) throws
IOException{
    return new BytesRefIterator (){

      private BytesRef ref = new BytesRef();

      @Override
      public BytesRef next() throws IOException {
        int doc = docIds.nextDoc();
        if(doc==DocIdSetIterator.NO_MORE_DOCS){
          return null;
        } else {
          get(doc, ref);
        }
        return ref;
      }
    };
  }
 }

It's a concrete impl which backed on current method. We don't need to amend
all currently existing implementations. Certain impls can provide an
efficient Iterator.

Anyway, whether you like getColumn() or not, if you raise an issue I can
work on patch.


>
> On Thu, Jan 9, 2014 at 5:27 PM, Mikhail Khludnev
> <mkhlud...@griddynamics.com> wrote:
> > Don't you think it's worth to raise a jira regarding those 'new bytes[]'
> ?
> > I'm able to provide a patch if you wish.
> >
> >
> > On Wed, Jan 8, 2014 at 2:02 PM, Mikhail Khludnev
> > <mkhlud...@griddynamics.com> wrote:
> >>
> >> FWIW,
> >>
> >> Micro benchmark shows 4% gain on reusing incoming ByteRef.bytes in short
> >> binary docvalues Test2BBinaryDocValues.testVariableBinary() with mmap
> >> directory.
> >> I wonder why it doesn't reads into incoming bytes
> >>
> https://github.com/apache/lucene-solr/blame/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java#L401
> >>
> >>
> >>
> >> On Wed, Jan 8, 2014 at 12:53 AM, Michael McCandless
> >> <luc...@mikemccandless.com> wrote:
> >>>
> >>> Going sequentially should help, if the pages are not hot (in the OS's
> IO
> >>> cache).
> >>>
> >>> You can also use a different DVFormat, e.g. Direct, but this holds all
> >>> bytes in RAM.
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>>
> >>> On Tue, Jan 7, 2014 at 1:09 PM, Mikhail Khludnev
> >>> <mkhlud...@griddynamics.com> wrote:
> >>> > Joel,
> >>> >
> >>> > I tried to hack it straightforwardly, but found no free gain there.
> The
> >>> > only
> >>> > attempt I can suggest is to try to reuse bytes in
> >>> >
> >>> >
> https://github.com/apache/lucene-solr/blame/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java#L401
> >>> > right now it allocates bytes every time, which beside of GC can also
> >>> > impact
> >>> > memory access locality. Could you try fix memory waste and repeat
> >>> > performance test?
> >>> >
> >>> > Have a good hack!
> >>> >
> >>> >
> >>> > On Mon, Dec 23, 2013 at 9:51 PM, Joel Bernstein <joels...@gmail.com>
> >>> > wrote:
> >>> >>
> >>> >>
> >>> >> Hi,
> >>> >>
> >>> >> I'm looking for a faster way to perform large scale docId ->
> bytesRef
> >>> >> lookups for BinaryDocValues.
> >>> >>
> >>> >> I'm finding that I can't get the performance that I need from the
> >>> >> random
> >>> >> access seek in the BinaryDocValues interface.
> >>> >>
> >>> >> I'm wondering if sequentially scanning the docValues would be a
> faster
> >>> >> approach. I have a BitSet of matching docs, so if I sequentially
> moved
> >>> >> through the docValues I could test each one against that bitset.
> >>> >>
> >>> >> Wondering if that approach would be faster for bulk extracts and how
> >>> >> tricky it would be to add an iterator to the BinaryDocValues
> >>> >> interface?
> >>> >>
> >>> >> Thanks,
> >>> >> Joel
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > Sincerely yours
> >>> > Mikhail Khludnev
> >>> > Principal Engineer,
> >>> > Grid Dynamics
> >>> >
> >>> >
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>>
> >>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> Principal Engineer,
> >> Grid Dynamics
> >>
> >>
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> >
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>

Re: Iterating BinaryDocValues

Reply via email to