> i agree about starting with the block index since it's a relatively
> isolated piece of code.  Could possibly be done without KeyValue
> modifications

Right.  I don't think any KeyValue changes will be required.  I'm
trying to get more info about whether we can mmap the FST, then store
all the keys into it on disk/HDFS.  This could drastically reduce the
cost of row keys: the time to load them into heap, their heap usage,
and the overall size of HFiles.  It would also let users choose much
larger keys without paying a penalty.

On Thu, Jun 2, 2011 at 8:17 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> What about turning KeyValue into an interface with only the essential
> getXX() methods?  Whatever is using the KeyValue shouldn't *have* to know
> what sort of structure it's backed by, but could figure out the
> implementation in cases where it wants to optimize performance.  Seems like
> this is warranted right now just for the sake of breaking up such a huge
> class... Possible implementations include:
>
> new SimpleKeyValue(final byte[] row, final byte[] family,
>      final byte[] qualifier, final long timestamp, Type type,
>      final byte[] value) (separate fields, low performance, useful for
> debugging?)
>
> new ArrayKeyValue(byte[] bytes)
>
> new SubArrayKeyValue(byte[] bytes, int offset, int length)
>
> new BlockIndexKeyValue(BlockIndex blockIndex, int kvPosition)
>
> new HFileKeyValue(HFileScanner scanner)
>
> etc... where each implementation knows how to fish out the underlying data.
>  Later we add more implementations:
>
> new CssTreeBlockIndexKeyValue(CssTreeBlockIndex blockIndex, int kvPosition)
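>
> To make that concrete, here's a minimal sketch of the interface and the
> simplest implementation (names illustrative only, not a concrete API
> proposal):
>
> public interface KeyValueView {
>     byte[] getRow();
>     byte[] getFamily();
>     byte[] getQualifier();
>     long getTimestamp();
>     byte[] getValue();
> }
>
> // The "separate fields" SimpleKeyValue from above: low performance, but
> // handy for tests and debugging. Trie-backed implementations would
> // materialize these byte[]s lazily instead of holding them directly.
> class SimpleKeyValue implements KeyValueView {
>     private final byte[] row, family, qualifier, value;
>     private final long timestamp;
>
>     SimpleKeyValue(byte[] row, byte[] family, byte[] qualifier,
>             long timestamp, byte[] value) {
>         this.row = row;
>         this.family = family;
>         this.qualifier = qualifier;
>         this.timestamp = timestamp;
>         this.value = value;
>     }
>
>     public byte[] getRow() { return row; }
>     public byte[] getFamily() { return family; }
>     public byte[] getQualifier() { return qualifier; }
>     public long getTimestamp() { return timestamp; }
>     public byte[] getValue() { return value; }
> }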
>
>
> @Jason - i agree about starting with the block index since it's a relatively
> isolated piece of code.  Could possibly be done without KeyValue
> modifications.  Also, i think it could work without changing the HFile if
> people are willing to let it compact the index when the HFile is read rather
> than when it's written.  Would make for slightly slower startup
> times but would save a lot of memory.  Could be configurable.
>
>
> On Thu, Jun 2, 2011 at 4:44 PM, Jonathan Gray <jg...@fb.com> wrote:
>
>> I'm here!
>>
>> Still parsing through the stuff in this e-mail but I agree that many
>> approaches will require rejiggering of KeyValue in a significant way.  I
>> have made some attempts at this but nothing is very far along.  It's definitely
>> something that we are hoping to put some resources into over the summer.
>>
>> JG
>>
>> > -----Original Message-----
>> > From: Todd Lipcon [mailto:t...@cloudera.com]
>> > Sent: Thursday, June 02, 2011 3:37 PM
>> > To: dev@hbase.apache.org
>> > Subject: Re: prefix compression
>> >
>> > Hey Matt,
>> >
>> > Interesting email, and also something I've been thinking about recently.
>> >
>> > Unfortunately, I think one of the big prerequisites before we can start
>> > thinking about actual compression algorithms is some refactoring around
>> > how KeyValue is used. Currently, KeyValue exposes byte arrays in a lot
>> > of places - these byte arrays are expected to have discrete sub-ranges
>> > that represent row key, column qualifier, etc. Various parts of the
>> > code directly grab these byte arrays and offsets and do stuff with them
>> > (eg comparison, serialization, etc).
>> >
>> > In order to split the different components up, or do true prefix
>> > compression, we'd need KeyValue to do some smarter things for
>> > comparison, and never call functions like "getBuffer(),
>> > getRowKeyOffset(), getRowKeyLength()" expecting them to show us a byte
>> > array range with a row key in it.
>> >
>> > I think Jonathan Gray has some code hacked together for this, but last I
>> > heard it wasn't in shippable state... Jonathan, you out there?
>> >
>> > As for the particular compression algorithms, a nice first step would
>> > be a very simplistic one: the first row key of any HFile block would be
>> > given in full. Following that, each key uses a single-byte token which
>> > identifies the number of key components shared with the previous key
>> > (ie same row, same col, different ts). This isn't quite as good as true
>> > prefix compression, but for common cases where each cell has three
>> > versions and each row has many columns, it's probably a huge savings.
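>> >
>> > A minimal sketch of that encoder, with made-up token values and a
>> > made-up Key holder class (none of this is existing code):
>> >
>> > import java.io.DataOutputStream;
>> > import java.io.IOException;
>> > import java.util.Arrays;
>> >
>> > class PrefixTokenEncoder {
>> >   static final byte NEW_ROW = 0;           // nothing shared; full key follows
>> >   static final byte SAME_ROW = 1;          // row shared; qualifier + ts follow
>> >   static final byte SAME_ROW_AND_COL = 2;  // row + qualifier shared; ts only
>> >
>> >   static class Key {
>> >     byte[] row, qualifier;
>> >     long timestamp;
>> >   }
>> >
>> >   // prev is null for the first key of a block, which is written in full.
>> >   static void writeKey(DataOutputStream out, Key prev, Key cur)
>> >       throws IOException {
>> >     if (prev != null && Arrays.equals(prev.row, cur.row)) {
>> >       if (Arrays.equals(prev.qualifier, cur.qualifier)) {
>> >         out.writeByte(SAME_ROW_AND_COL);
>> >         out.writeLong(cur.timestamp);      // only the timestamp differs
>> >       } else {
>> >         out.writeByte(SAME_ROW);
>> >         writeBytes(out, cur.qualifier);
>> >         out.writeLong(cur.timestamp);
>> >       }
>> >     } else {
>> >       out.writeByte(NEW_ROW);
>> >       writeBytes(out, cur.row);
>> >       writeBytes(out, cur.qualifier);
>> >       out.writeLong(cur.timestamp);
>> >     }
>> >   }
>> >
>> >   static void writeBytes(DataOutputStream out, byte[] b) throws IOException {
>> >     out.writeShort(b.length);              // a varint would be better here
>> >     out.write(b);
>> >   }
>> > }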
>> >
>> > -Todd
>> >
>> > On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcor...@hotpads.com>
>> > wrote:
>> >
>> > > I've been reading up on prefix compression for a few weeks trying to
>> > > figure out the best approach.  Maybe some of you have noticed that 90%
>> > > of my emails to the user list have something to do with it...  Now
>> > > that people are working on HFile v2 and the topic of Lucene's FST has
>> > > come up, I thought it would be a good time to share some thoughts.  My
>> > > goal here is definitely not asking anyone to implement this, but just
>> > > to start a discussion that gets the proper jiras in place.  I have
>> > > spent a little time coding an initial rowKey trie builder that could
>> > > be the start of a pluggable implementation.
>> > >
>> > > This is a big topic to address in one email, but I don't think the
>> > > pieces make sense on their own.  Here's an outline with some thoughts:
>> > >
>> > > * refer to prefix compression as "compaction" to avoid confusion
>> > > with traditional "compression"
>> > >
>> > > * main goal is to reduce size of data in memory to increase effective
>> > > capacity of block index, block cache, and memstores
>> > >    * disk access takes thousands of times longer than memory access,
>> > > especially over hdfs/network
>> > >
>> > > * secondary goal is to speed up the processing of in memory data
>> > >    * memory access takes 200+ cpu cycles, so must use cache friendly
>> > > algorithms
>> > >    * structure trie node sizes around typical 64B cache line size
>> > >    * investigate java's memory prefetching capabilities, possibly
>> > > increasing to ~256B nodes
>> > >        * research paper about modern memory databases and prefetching:
>> > >          http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.4872&rep=rep1&type=pdf
>> > >    * almost always use variable length integer/long encoding
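>> > >
>> > > For reference, a minimal sketch of the usual base-128 varint encoding
>> > > (protobuf-style, simplified; not tied to any existing class):
>> > >
>> > > import java.io.ByteArrayOutputStream;
>> > >
>> > > class VarLong {
>> > >   // 7 data bits per byte; high bit set on every byte except the last.
>> > >   static void write(ByteArrayOutputStream out, long v) {
>> > >     while ((v & ~0x7FL) != 0) {
>> > >       out.write((int) ((v & 0x7F) | 0x80));
>> > >       v >>>= 7;
>> > >     }
>> > >     out.write((int) v);
>> > >   }
>> > >
>> > >   // Decode starting at offset; caller tracks how many bytes were used.
>> > >   static long read(byte[] in, int offset) {
>> > >     long v = 0;
>> > >     int shift = 0;
>> > >     byte b;
>> > >     do {
>> > >       b = in[offset++];
>> > >       v |= (long) (b & 0x7F) << shift;
>> > >       shift += 7;
>> > >     } while ((b & 0x80) != 0);
>> > >     return v;
>> > >   }
>> > > }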
>> > >
>> > > * need separate compacted structures/algorithms for rowKey,
>> > > colQualifier, and timestamp
>> > >    * structures will need to interact with varInt pointers to link
>> > > rowKey<->colQualifier<->timestamp
>> > >
>> > > * i think in the medium to long term (and maybe even short term) it
>> > > would be best to code this from scratch
>> > >
>> > >
>> > > Static row key compaction for block cache and data blocks:
>> > > * avoid schemes where each row is a diff of the previous (expensive
>> > > cpu)
>> > > * utilize the fact that row keys are always sorted (unlike column
>> > > qualifiers)
>> > >    * this makes for fast tree construction during flushing and
>> > > compaction
>> > > * low hanging fruit: (rowKeyCompaction=simple)
>> > >    * pull off the common prefix of all rows
>> > >    * have a flag indicating if row key is repeat of the previous
>> > > * more aggressive: use modified Cache Sensitive Search Trees
>> > > (rowKeyCompaction=cssTree64, cssTree256, etc):
>> > >   http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf
>> > >    * modified to accept full byte range 0-255, store full byte strings
>> > > in the tree, remove all pointers, optimize for cache line size, etc,
>> > > but similar idea
>> > >    * CSS trees are considered sub-par because they are not easily
>> > > modified, but that is ok in hbase's block cache and data blocks
>> > > * input is a sorted list of byte[], output is a byte[] containing the
>> > > entire prefix trie
>> > > * possibly put all keys in the front of the block, and all values at
>> > > the end, with value offsets/lengths in the keys
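>> > >
>> > > A rough sketch of the "low hanging fruit" encoder (single-byte lengths
>> > > and flags to keep it short; real code would use varints, and a
>> > > non-empty sorted input is assumed):
>> > >
>> > > import java.io.ByteArrayOutputStream;
>> > > import java.util.Arrays;
>> > > import java.util.List;
>> > >
>> > > class SimpleRowKeyCompactor {
>> > >   // Store the prefix common to all rows once, then per row either a
>> > >   // "repeat of previous" flag or the suffix after the common prefix.
>> > >   static byte[] encode(List<byte[]> sortedRows) {
>> > >     ByteArrayOutputStream out = new ByteArrayOutputStream();
>> > >     byte[] prefix = commonPrefix(sortedRows);
>> > >     out.write(prefix.length);             // assumes length < 128
>> > >     out.write(prefix, 0, prefix.length);
>> > >     byte[] prev = null;
>> > >     for (byte[] row : sortedRows) {
>> > >       if (Arrays.equals(row, prev)) {
>> > >         out.write(1);                     // flag: repeat of previous row
>> > >       } else {
>> > >         out.write(0);
>> > >         int suffixLen = row.length - prefix.length;
>> > >         out.write(suffixLen);             // assumes length < 128
>> > >         out.write(row, prefix.length, suffixLen);
>> > >       }
>> > >       prev = row;
>> > >     }
>> > >     return out.toByteArray();
>> > >   }
>> > >
>> > >   // Sorted input means the prefix shared by all rows is exactly the
>> > >   // prefix shared by the first and last row.
>> > >   static byte[] commonPrefix(List<byte[]> rows) {
>> > >     byte[] first = rows.get(0), last = rows.get(rows.size() - 1);
>> > >     int i = 0;
>> > >     while (i < first.length && i < last.length && first[i] == last[i]) i++;
>> > >     return Arrays.copyOf(first, i);
>> > >   }
>> > > }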
>> > >
>> > > Backwards compatibility
>> > > * block cache could quickly compute the trie when loaded, therefore
>> > > working with HFile v1
>> > >    * eventually want to store the computed trie on disk
>> > > * HFile v2 will need to store information about the pluggable data
>> > > block compaction schemes
>> > >
>> > >
>> > > Memstore dynamic row key compaction:
>> > > * this will be much more difficult than the static compaction because
>> > > it needs to accept unsorted byte[]'s and support row locking
>> > > * use modified C-Burstsort for dynamic unsorted data:
>> > >   http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf
>> > > * fast and compact (best dynamic string sorting algorithm i could
>> > > find)
>> > > * fragmentation friendly (not original goal, but ends up working like
>> > > MSLAB)
>> > >    * allocates large (customizable) byte[] for holding multiple leaf
>> > > values
>> > >    * "bursts" the byte[] when they fill up, creating multiple child
>> > > arrays
>> > > * will need to add some sort of lockable object to the leaf of the
>> > > rowKey trie
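>> > >
>> > > A very rough sketch of the burst behavior (a List stands in for the
>> > > packed byte[] buckets of real C-Burstsort, the threshold is made up,
>> > > and the lockable object would hang off the leaf nodes):
>> > >
>> > > import java.util.ArrayList;
>> > > import java.util.List;
>> > >
>> > > class BurstNode {
>> > >   static final int BURST_THRESHOLD = 8192;        // tunable
>> > >   BurstNode[] children;                           // non-null once burst
>> > >   List<byte[]> bucket = new ArrayList<byte[]>();  // keys at this node
>> > >   final int depth;
>> > >
>> > >   BurstNode(int depth) { this.depth = depth; }
>> > >
>> > >   void insert(byte[] key) {
>> > >     if (children != null && key.length > depth) {
>> > >       int b = key[depth] & 0xFF;                  // branch on next byte
>> > >       if (children[b] == null) children[b] = new BurstNode(depth + 1);
>> > >       children[b].insert(key);
>> > >     } else {
>> > >       bucket.add(key);
>> > >       if (children == null && bucket.size() > BURST_THRESHOLD) burst();
>> > >     }
>> > >   }
>> > >
>> > >   // Split an overfull bucket into per-byte children; keys that end
>> > >   // exactly at this depth stay behind in this node's bucket.
>> > >   void burst() {
>> > >     children = new BurstNode[256];
>> > >     List<byte[]> old = bucket;
>> > >     bucket = new ArrayList<byte[]>();
>> > >     for (byte[] key : old) insert(key);
>> > >   }
>> > > }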
>> > >
>> > >
>> > > Column qualifier compaction:
>> > > * config param colQualifierCompaction=(none, lookupTable, cssTree64,
>> > > etc)
>> > > * use lookupTable when column qualifiers are a small bounded set
>> > >    * store the lookup table in memory (similar to blockIndex or
>> > > fixedFileTrailer)
>> > >    * on second thought, this could be difficult without 2-pass
>> > > compaction, but maybe still possible
>> > > * use cssTree when a large or unbounded set, like when appending a
>> > > numerical suffix in wide rows
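>> > >
>> > > The lookupTable option is basically dictionary encoding; a minimal
>> > > sketch, assuming the id gets written as a varint next to each cell and
>> > > the table itself is serialized near the blockIndex:
>> > >
>> > > import java.nio.charset.Charset;
>> > > import java.util.ArrayList;
>> > > import java.util.HashMap;
>> > > import java.util.List;
>> > > import java.util.Map;
>> > >
>> > > class QualifierDictionary {
>> > >   // ISO-8859-1 round-trips arbitrary bytes, so the String is just a
>> > >   // hashable wrapper around the qualifier's byte[].
>> > >   private static final Charset RAW = Charset.forName("ISO-8859-1");
>> > >   private final Map<String, Integer> ids = new HashMap<String, Integer>();
>> > >   private final List<byte[]> qualifiers = new ArrayList<byte[]>();
>> > >
>> > >   // Return the id for a qualifier, assigning the next one if unseen.
>> > >   int idFor(byte[] qualifier) {
>> > >     String key = new String(qualifier, RAW);
>> > >     Integer id = ids.get(key);
>> > >     if (id == null) {
>> > >       id = qualifiers.size();
>> > >       ids.put(key, id);
>> > >       qualifiers.add(qualifier);
>> > >     }
>> > >     return id;
>> > >   }
>> > >
>> > >   byte[] qualifierFor(int id) { return qualifiers.get(id); }
>> > > }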
>> > >
>> > >
>> > > Timestamp compaction:
>> > > * timestampCompaction=(none, hfileDiff, cssTree, flatten)
>> > > * each hfile needs a lowestTimestamp stored in the trailer
>> > > * hfileDiff would store a varLong relative to the lowestTimestamp
>> > > * cssTree would do the same as it does for rowKeys, but the trie
>> > > overhead might wipe out the space savings with such small values
>> > > * flatten would not store a timestamp in each record, rather use the
>> > > hfile's lowestTimestamp for every record
>> > >    * many use cases don't care about the timestamp
>> > >    * i think this works for maxVersions=1.  may need more elaborate
>> > > scheme for multiple versions
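>> > >
>> > > hfileDiff would be tiny on top of the varLong sketch earlier; roughly:
>> > >
>> > > class TimestampDiff {
>> > >   // Deltas from the trailer's lowestTimestamp are small non-negative
>> > >   // numbers, so they encode to short varLongs; "flatten" would skip
>> > >   // storing the delta entirely.
>> > >   static long encode(long timestamp, long lowestTimestamp) {
>> > >     return timestamp - lowestTimestamp;
>> > >   }
>> > >   static long decode(long delta, long lowestTimestamp) {
>> > >     return lowestTimestamp + delta;
>> > >   }
>> > > }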
>> > >
>> > >
>> > > I'll stop there...  sorry for the huge email!  I haven't seen much
>> > > discussion on email or jira about how prefix compression should be
>> > > done, so I hope this provides somewhat of a starting point.
>> > >
>> > > Matt
>> > >
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>>
>
