-- sorry, resending it as I don't know what happens to the layout of the
previous one
Hi Paul,
This is a good question. The two methods, i.e., VSE and AFOR, are very
similar. The two methods can be considered as an extension of FOR to
make it less sensitive to outliers by adapting the encoding to the value
distribution. To achieve this, the two methods are encoding a list of
values by
- partitioning it into "frames" (or sequence of consecutive integers) of
variable lengths,
- encoding each frame using a different "bit frame" (the minimum number
of bits required to encode any integer in the frame, and still be able
to distinguish them)
- relying on algorithms to automatically find a good list partitioning.
Apart from the minor differences in the implementation design (that I
will discuss later), the main difference is that VSE is optimised for
achieving a high compression rate and a fast decompression but
disregards the efficiency of compression, while AFOR is optimised for
achieving a high compression rate, a fast decompression but also a fast
compression speed. VSE is using a Dynamic Programming method to find the
*optimal partitioning* of a list (optimal in term of compression rate).
While this approach provides a higher compression rate than the one
proposed in AFOR, the complexity of such a partitioning algorithm is O(n
* k), with the term n being the number of values and the term k the size
of the larger frame, which might greatly impact the compression
performance. In AFOR, we use instead a local optimisation algorithm that
is less effective in term of compression rate but faster to compute.
In term of implementation details, there is a few differences.
1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The current
implementation of AFOR restrict the length of a frame to be a multiple
of 8 to to be aligned with the start and end of a byte boundary (and
also to minimise the number of loop-unrolled highly-optimised routines).
More precisely, AFOR-2 use three frame lengths: 8, 16 and 32.
2) To allow the *optimal partitioning* of a list, the original
implementation of VSE needs to operate on the full list. On the
contrary, AFOR has been developed to operate on small subsets of the
list, so that AFOR can be applied during incremental construction of the
compressed list (it does not require the full list, but works on small
block of 32 or more integers). However, we can think of applying VSE on
small subset, as in AFOR. In this case, VSE does not compute the optimal
partition of a list, but only the optimal partition of the subset of the
list.
VSE and AFOR encodes a frame in a similar way: first, a header (1 byte)
which provides the bit frame and the frames length, then the encoded frame.
So, as you can see, in essence, the two models are very similar. For the
background, I know well Fabrizio Silvestri (co-author of VSE), and he
was my PhD thesis examiner (the AFOR compression scheme is a chapter of
my thesis). The funny thing is that we come up with these two models at
the same time, this summer, without knowing we were working on something
similar ;o). However, he was more lucky than I am to publish his
findings before me.
I hope this answers to your question.
Feel free to ask if you have any other questions,
Regards,
--
Renaud Delbru
On 25/01/11 12:24, Renaud Delbru wrote:
Hi Paul,
This is a good question. The two methods, i.e., VSE and AFOR, are very
similar. The two methods can be considered as an extension of FOR to
make it less sensitive to outliers by adapting the encoding to the
value distribution. To achieve this, the two methods are encoding a
list of values by
- partitioning it into "frames" (or sequence of consecutive integers)
of variable lengths,
- encoding each frame using a different "bit frame" (the minimum
number of bits required to encode any integer in the frame, and still
be able to distinguish them)
- relying on algorithms to automatically find a good list partitioning.
Apart from the minor differences in the implementation design (that I
will discuss later), the main difference is that VSE is optimised for
achieving a high compression rate and a fast decompression but
disregards the efficiency of compression, while AFOR is optimised for
achieving a high compression rate, a fast decompression but also a
fast compression speed. VSE is using a Dynamic Programming method to
find the *optimal partitioning* of a list (optimal in term of
compression rate). While this approach provides a higher compression
rate than the one proposed in AFOR, the complexity of such a
partitioning algorithm is O(n * k), with the term n being the number
of values and the term k the size of the larger frame, which might
greatly impact the compression performance. In AFOR, we use instead a
local optimisation algorithm that is less effective in term of
compression rate but faster to compute.
In term of implementation details, there is a few differences.
1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The
current implementation of AFOR restrict the length of a frame to be a
multiple of 8 to to be aligned with the start and end of a byte
boundary (and also to minimise the number of loop-unrolled
highly-optimised routines). More precisely, AFOR-2 use three frame
lengths: 8, 16 and 32.
2) To allow the *optimal partitioning* of a list, the original
implementation of VSE needs to operate on the full list. On the
contrary, AFOR has been developed to operate on small subsets of the
list, so that AFOR can be applied during incremental construction of
the compressed list (it does not require the full list, but works on
small block of 32 or more integers). However, we can think of applying
VSE on small subset, as in AFOR. In this case, VSE does not compute
the optimal partition of a list, but only the optimal partition of the
subset of the list.
VSE and AFOR encodes a frame in a similar way: first, a header (1
byte) which provides the bit frame and the frames length, then the
encoded frame.
So, as you can see, in essence, the two models are very similar. For
the background, I know well Fabrizio Silvestri (co-author of VSE), and
he was my PhD thesis examiner (the AFOR compression scheme is a
chapter of my thesis). The funny thing is that we come up with these
two models at the same time, this summer, without knowing we were
working on something similar ;o). However, he was more lucky than I am
to publish his findings before me.
I hope this answers to your question.
Feel free to ask if you have any other questions,
Regards,
--
Renaud Delbru
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org