[
https://issues.apache.org/jira/browse/KYLIN-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900832#comment-14900832
]
Daniel Lemire commented on KYLIN-1034:
--------------------------------------
Apache Spark, Druid and many other projects have come to rely on off-heap
memory (e.g., memory-file mapping). So it is definitively reliable and not too
bad performance-wise.
The speed penalty due to the fact that you go through the ByteBuffer interface
can be anywhere from none to, say, two times slower. This comes down to the
fact that Java is very good at optimizing array accesses, but not so good at
optimizing accesses through a ByteBuffer. When the computation is fairly
intensive, the difference between the memory-mapped version and the regular one
can be insignificant... but if the performance bottleneck is data access, then
going through the ByteBuffer can be more of a penalty. I don't think I have
ever seen a memory-mapped version being, say, 10x slower however... simply
because there is a limit at how inefficient ByteBuffers can be. On the other
side of the equation, going through a ByteBuffer can possibly reduce the stress
on the garbage collector, simply because you allocate less memory through Java.
So there is a gain there that compensates the penalty.
This being said, if you need to load the bitmaps, that is, deserialize them,
each time you do a computation, then you are definitively much better off with
memory-file mapping. That's because "mapping" a bitmap from an existing
location in a file is almost free (compared to the burden of deserialization).
So, I think that if you can afford to leave the bitmaps in (Java's) RAM and the
deserialization is a rare occurence (that does not happen with most queries),
then memory-file mapping is going to slow things down. If you deserialize
possibly with each query, then memory-file mapping is probably the way to go.
Let me take a concrete example. We can consider one first data set
(census1881). It is made of 200 bitmaps randomly picked from the index of a
census database (from 1881). In one test, we take the successive ORs between
bitmaps (so bitmap 1 is ORed with bitmap 2, then bitmap 2 is ORed with bitmap 3
and so forth). We report the average time elapsed:
With memory-file mapping (where the bulk of the data does not reside in Java's
RAM), we get the following times...
Conciset 228496.931
Roaring 957.803
And with the conventional in-Java-RAM, you get...
ConciseSet 25200.886
Roaring 690.160
In this case, Roaring is slightly penalized by the memory-file mapping.
(You should not take from these numbers that Roaring can be expected to be
systematically 20x faster than Concise. This is just one case, for
illustration.)
I should point out what we use for memory-file mapping in the case of
Concise... we use the extendedset library produced by the metamx team for the
purposes of Druid, e.g., see
https://github.com/metamx/extendedset/blob/master/src/main/java/it/uniroma3/mat/extendedset/intset/ImmutableConciseSet.java
Note that other bitmap libraries, like JavaEWAH also support memory-file
mapping. (JavaEWAH is used as part of Apache Hive.) This has become a widely
available feature.
> Faster bitmap indexes with Roaring bitmaps
> ------------------------------------------
>
> Key: KYLIN-1034
> URL: https://issues.apache.org/jira/browse/KYLIN-1034
> Project: Kylin
> Issue Type: Improvement
> Components: General
> Affects Versions: Future
> Reporter: Daniel Lemire
> Labels: performance
>
> Kylin is using ConciseSet for bitmap indexing. It was found that Roaring
> bitmaps are often much better than ConciseSet (e.g., see experimental section
> in http://arxiv.org/pdf/1402.6407.pdf ). The compression is often better and
> the speed difference can be substantial. This is even more so with version
> 0.5 of Roaring.
> There is a high quality Java implementation that is used by Apache Spark and
> Druid.io. The Druid people found that switching to Roaring bitmaps could
> improve real-world performance by 30% or more.
> Source code:
> https://github.com/lemire/RoaringBitmap/
> Importing from Maven:
> http://central.maven.org/maven2/org/roaringbitmap/RoaringBitmap/0.5.3/
> <dependencies>
> <dependency>
> <groupId>org.roaringbitmap</groupId>
> <artifactId>RoaringBitmap</artifactId>
> <version>[0.5,)</version>
> </dependency>
> </dependencies>
> Online API :
> http://lemire.me/docs/RoaringBitmap/
> JavaDoc Jar:
> http://central.maven.org/maven2/org/roaringbitmap/RoaringBitmap/0.5.3/RoaringBitmap-0.5.3-javadoc.jar
> When desired, the library supports memory file mapping, so that out-of-JVM
> heap memory is used instead. This can greatly improve IO issues. The library
> is available under the Apache license and is patent-free.
> There is an extensive real-data benchmark framework which you can run for
> yourself to compare Roaring with competitive alternatives such as ConciseSet :
> https://github.com/lemire/RoaringBitmap/tree/master/jmh
> Running such a benchmark can be as simple as launching a script.
> For Druid, the bitmap format was made "pluggable" so you can switch from one
> format to the other using a configuration flag. This is implemented through
> simple wrappers, e.g., see
> https://github.com/metamx/bytebuffer-collections/tree/master/src/main/java/com/metamx/collections/bitmap
> So it can be really easy to make it possible to switch the format while
> preserving backward compatibility if needed...
> It is probably not difficult work to integrate Roaring in Kylin (maybe a day
> or so of programming) and it could potentially improve performance while
> reducing memory storage.
> Note: Roaring bitmaps were also adopted in Apache Lucene, though they have
> their own implementation, see
> https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)