Re: prefix compression implementation

Matt Corgan Fri, 16 Sep 2011 18:47:45 -0700

I'm a little confused over the direction of the DBBs in general, hence the
lack of clarity in my code.


I see value in doing fine-grained parsing of the DBB if you're going to have
a large block of data and only want to retrieve a small KV from the middle
of it.  With this trie design, you can navigate your way through the DBB
without copying hardly anything to the heap.  It would be a shame blow away
your entire L1 cache by loading a whole 256KB block onto heap if you only
want to read 200 bytes out of the middle... it can be done
ultra-efficiently.

The problem is if you're going to iterate through an entire block made of
5000 small KV's doing thousands of DBB.get(index) calls.  Those are like 10x
slower than byte[index] calls.  In that case, if it's a DBB, you want to
copy the full block on-heap and access it through the byte[] interface.  If
it's a HeapBB, then you already have access to the underlying byte[].

So there's possibly value in implementing both methods.  The main problem i
see is a lack of interfaces in the current code base.  I'll throw one
suggestion out there as food for thought.  Create a new interface:

interface HCell{
  byte[] getRow();
  byte[] getFamily();
  byte[] getQualifier();
  long getTimestamp();
  byte getType();
  byte[] getValue();

  //plus an endless list of convenience methods:
  int getKeyLength();
  KeyValue getKeyValue();
  boolean isDelete();
  //etc, etc (or put these in sub-interface)
}

We could start by making KeyValue implement that interface and then slowly
change pieces of the code base to use HCell.  That will allow us to start
elegantly working in different implementations.
PtKeyValue<https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/keyvalue/trie/compact/read/PtKeyValue.java>would
be one of them.  During the transition, you can always call
PtKeyValue.getCopiedKeyValue() which will instantiate a new byte[] in the
traditional KeyValue format.

We'd also want an interface for HFileBlock, and a few others...

Some of this stuff is overwhelming to think about in parallel with the
existing hbase code, but it's actually not very complicated from a
standalone perspective.  If you can isolate it into a module behind an
interface, then it's just a bunch of converting things to bytes and back.
 There are (hopefully) no exceptions, gc pauses, cascading failures, etc...
all the things that are hard to handle to begin with and especially time
consuming to debug, emulate, and write tests for.  There's not even
multi-threading!  It's pretty easy to write tests for it and then never look
at it again.

Matt

On Fri, Sep 16, 2011 at 6:08 PM, Ryan Rawson <[email protected]> wrote:

> Hey this stuff looks really interesting!
>
> On the ByteBuffer, the 'array' byte[] access to the underlying data is
> totally incompatible with the 'off heap' features that are implemented
> by DirectByteBuffer.  While people talk about DBB in terms of nio
> performance, if you have to roundtrip the data thru java code, I'm not
> sure it buys you much - you still need to move data in and out of the
> main Java heap.  Typically this is geared more towards apps which read
> and write from/to socket/files with minimal processing.
>
> While in the past I have been pretty bullish on off-heap caching for
> HBase, I have since changed my mind due to the poor API (ByteBuffer is
> a sucky way to access data structures in ram), and other reasons (ping
> me off list if you want).  The KeyValue code pretty much presumes that
> data is in byte[] anyways, and I had thought that even with off-heap
> caching, we'd still have to copy KeyValues into main-heap during
> scanning anyways.
>
> Given the minimal size of the hfile blocks, I really dont see an issue
> with buffering a block output - especially if the savings is fairly
> substantial.
>
> Thanks,
> -ryan
>
> On Fri, Sep 16, 2011 at 5:48 PM, Matt Corgan <[email protected]> wrote:
> > Jacek,
> >
> > Thanks for helping out with this.  I implemented most of the DeltaEncoder
> > and DeltaEncoderSeeker.  I haven't taken the time to generate a good set
> of
> > test data for any of this, but it does pass on some very small input data
> > that aims to cover the edge cases i can think of.  Perhaps you have full
> > HFiles you can run through it.
> >
> >
> https://github.com/hotpads/hbase-prefix-trie/tree/master/src/org/apache/hadoop/hbase/keyvalue/trie/deltaencoder
> >
> > I also put some notes on the PtDeltaEncoder regarding how the prefix trie
> > should be optimally used.  I can't think of a situation where you'd want
> to
> > blow it up into the full uncompressed KeyValue ByteBuffer, so
> implementing
> > the DeltaEncoder interface is a mismatch, but I realize it's only a
> starting
> > point.
> >
> > You also would never really have a full ByteBuffer of KeyValues to pass
> to
> > it for compression.  Typically, you'd be passing individual KeyValues
> from
> > the memstore flush or from a collection of HFiles being merged through a
> > PriorityQueue.
> >
> > The end goal is to operate on the encoded trie without decompressing it.
> >  Long term, and in certain circumstances, it may even be possible to pass
> > the compressed trie over the wire to the client who can then decode it.
> >
> > Let me know if I implemented that the way you had in mind.  I haven't
> done
> > the seekTo method yet, but will try to do that next week.
> >
> > Matt
> >
> > On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal <[email protected]> wrote:
> >
> >> Matt,
> >>
> >> Thanks a lot for the code. Great job!
> >>
> >> As I mentioned in JIRA I work full time on the delta encoding [1]. Right
> >> now the code and integration is almost done. Most of the parts are under
> >> review. Since it is a big change will plan to test it very carefully.
> >> After that, It will be ported to trunk and open sourced.
> >>
> >> I have a quick glimpse I have taken the different approach. I
> implemented
> >> a few different algorithms which are simpler. They also aims mostly to
> >> save space while having fast decompress/compress code. However the
> access
> >> is still sequential. The goal of my project is to save some RAM by
> having
> >> compressed BlockCache in memory.
> >>
> >> On the other hand, it seems that you are most concerned about seeking
> >> performance.
> >>
> >> I will read your code more carefully. A quick glimpse: we both
> implemented
> >> some routines (like vint), but expect that there is no overlap.
> >>
> >> I also seen that you spend some time investigating ByteBuffer vs.
> Byte[].
> >> I experienced significant negative performance impact when I switched to
> >> ByteBuffer. However I postpone this optimization.
> >>
> >> Right now I think the easiest way to go would be that you will implement
> >> DeltaEncoder interface after my change:
> >> http://pastebin.com/Y8UxUByG
> >> (note, there might be some minor changes)
> >>
> >> That way, you will reuse my integration with existing code for "free".
> >>
> >> Jacek
> >>
> >> [1] - I prefer to call it that way. Prefix is one of the algorithm, but
> >> there are also different approach.
> >>
> >> On 9/13/11 1:36 AM, "Ted Yu" <[email protected]> wrote:
> >>
> >> >Matt:
> >> >Thanks for the update.
> >> >Cacheable interface is defined in:
> >> >src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java
> >> >
> >> >You can find the implementation at:
> >> >src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
> >> >
> >> >I will browse your code later.
> >> >
> >> >On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan <[email protected]>
> >> wrote:
> >> >
> >> >> Hi devs,
> >> >>
> >> >> I put a "developer preview" of a prefix compression algorithm on
> github.
> >> >>  It
> >> >> still needs some details worked out, a full set of iterators, about
> 200
> >> >> optimizations, and a bunch of other stuff...  but, it successfully
> >> >>passes
> >> >> some preliminary tests so I thought I'd get it in front of more
> eyeballs
> >> >> sooner than later.
> >> >>
> >> >> https://github.com/hotpads/hbase-prefix-trie
> >> >>
> >> >> It depends on HBase's Bytes.java and KeyValue.java, which depends on
> >> >> hadoop.
> >> >>  Those jars are in there, along with some others for full HFile
> testing
> >> >>in
> >> >> the near future.
> >> >>
> >> >> A good place to start looking at the code
> >> >> is org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder.
> >> >>It's
> >> >> the main driver of the compaction side of things, taking KeyValues
> (in
> >> >> sorted order), and generates a byte[] to be saved as a disk block.
>  Then
> >> >> for
> >> >> reading it back in, there is trie.compact.read.PtIterator which takes
> >> >>the
> >> >> byte[] and spits out KeyValues.  The main test class is
> >> >> trie.test.row.RowBuilderTests which round-trips a bunch of KeyValues
> to
> >> >> make
> >> >> sure the outputs are the same as the inputs.
> >> >>  trie.compact.row.node.PtRowNode is the format of a single trie node
> in
> >> >>the
> >> >> underlying byte[].
> >> >>
> >> >> The current version generates a lot of little objects (garbage) while
> >> >> writing and reading.  I plan to optimize it to the point where most
> >> >> variables are primitives on the stack (at least when reading), but I
> >> >>think
> >> >> these intermediate classes are important for debugging.  I'll
> probably
> >> >>try
> >> >> to keep them around going forward and develop a more compact version
> in
> >> >> parallel.
> >> >>
> >> >> It uses trie style compression for the row keys and column
> qualifiers,
> >> >> where
> >> >> pointers between nodes are compacted ints.  It keeps a list of
> >> >>compacted,
> >> >> de-duped deltas for timestamps, and if they're all the same, it
> stores
> >> >>only
> >> >> one (long) per block.  If all KeyValue.TYPE operations are the same,
> >> >>then
> >> >> it
> >> >> only stores one (byte) per block.
> >> >>
> >> >> It's designed around efficient cpu cache usage and elimination of 8
> byte
> >> >> pointers, so should be fast.  Get calls can traverse the trie nodes
> to
> >> >>dig
> >> >> up a row key while barely loading anything from memory to cache, as
> >> >>opposed
> >> >> to current hbase which may load the better part of a block into
> cache.
> >> >>  Scanners/Filters/Comparators can all be designed to be trie-aware so
> >> >>they
> >> >> can iterate through 20 columnQualifiers in the same row without
> >> >>constantly
> >> >> re-scanning the rowKey bytes... etc...
> >> >>
> >> >>
> >> >> Here are a few questions we'll need to answer at some point:
> >> >>
> >> >> * where should this code go?
> >> >>    - i'd vote for keeping it isolated and limiting references back to
> >> >>the
> >> >> main hbase project.  sort of like the gzip/lzo algorithms.
> >> >>    - if it's strictly isolated, it'll be easier to keep it well
> tested
> >> >>for
> >> >> correctness/performance and let more people tinker with it to make it
> >> >> faster.  it'll also force us to come up with the right interfaces to
> >> >>allow
> >> >> other compression implementations.
> >> >>
> >> >> * current HFileWriter sends KeyValues to the OutputStream as soon as
> >> >> they're
> >> >> processed, but would it be ok to queue up a whole block in memory and
> >> >>write
> >> >> it all at once?
> >> >>    - i'd vote for yes.  it makes it easier to arrange the data to be
> >> >>more
> >> >> read-friendly.  also, we're only talking about one block of data
> which
> >> >>is
> >> >> presumably a fraction of the block cache's size
> >> >>
> >> >> * should the block bytes be accessed as a byte[] of ByteBuffer.  i
> know
> >> >> there's been work on off-heap cache, but i've read the blocks are
> pulled
> >> >> on-heap before they're parsed (??)
> >> >>    - see
> >> >>org.apache.hadoop.hbase.keyvalue.trie.profile.MemoryAccessProfiler
> >> >> for a comparison of byte[] vs ByteBuffer speed tests.  ByteBuffer
> looks
> >> >> ~2-10x slower, but some people need the off-heap cache
> >> >>    - i'll propose maintaining separate reading algorithms for each,
> >> >>given
> >> >> that the underlying bytes are in the exact same format.  should be
> >> >>possible
> >> >> to copy/paste the code and replace bytes[i] with ByteBuffer.get(i),
> and
> >> >> create parallel versions of static util methods
> >> >>
> >> >> * each block has some metadata wrapped in a class called PtBlockMeta.
> >> >>does
> >> >> HBase currently have a way to store its values as java primitives on
> the
> >> >> heap rather than parsing them out of the byte[]/ByteBuffer on every
> >> >>access?
> >> >>  if they have to be parsed out on every block access, that could take
> >> >>more
> >> >> time than the Get query it's trying to perform.
> >> >>    - I know there's a new Cachable interface or something like that.
> >> >>maybe
> >> >> that already supports it or could be enhanced
> >> >>
> >> >>
> >> >> What jiras do you think i should make?
> >> >>
> >> >> Look forward to hearing people's feedback,
> >> >> Matt
> >> >>
> >>
> >>
> >
>

Re: prefix compression implementation

Reply via email to