[ 
https://issues.apache.org/jira/browse/KYLIN-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933428#comment-14933428
 ] 

Daniel Lemire commented on KYLIN-1034:
--------------------------------------

 It does look to me like the bitmaps are serialized to streams of bytes. From 
there, *immutable* bitmaps are reloaded on demand, then possibly copied and 
modified. 

The Roaring library has a class ideally suited for this purpose, called 
ImmutableRoaringBitmap...  From any ByteBuffer, you can map directly a bitmap : 

https://github.com/lemire/RoaringBitmap/blob/master/examples/ImmutableRoaringBitmapExample.java

Compared to deserializing a bitmap from a stream of bytes, this approach avoids 
copying and parsing the data: constructing an ImmutableRoaringBitmap is very 
fast and uses very  little memory. Because they are formally immutable, you 
only need one instance in your entire application, irrespective of the number 
of cores. The data is accessed only when the ImmutableRoaringBitmap is actually 
queried, and what is accessed is the original stream of bytes (no unnecessary 
copy is made). So it uses less memory. 

Making use of ImmutableRoaringBitmap would not be difficult, programming-wise, 
in kylin.

> Faster bitmap indexes with Roaring bitmaps
> ------------------------------------------
>
>                 Key: KYLIN-1034
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1034
>             Project: Kylin
>          Issue Type: Improvement
>          Components: General
>    Affects Versions: Future
>            Reporter: Daniel Lemire
>              Labels: performance
>         Attachments: roaring-0001.patch
>
>
> Kylin is using ConciseSet for bitmap indexing. It was found that Roaring 
> bitmaps are often much better than ConciseSet (e.g., see experimental section 
> in http://arxiv.org/pdf/1402.6407.pdf ). The compression is often better and 
> the speed difference can be substantial. This is even more so with version 
> 0.5 of Roaring.
> There is a high quality Java implementation that is used by Apache Spark and 
> Druid.io. The Druid people found that switching to Roaring bitmaps could 
> improve real-world performance by 30% or more. 
> Source code: 
> https://github.com/lemire/RoaringBitmap/
> Importing from Maven:
> http://central.maven.org/maven2/org/roaringbitmap/RoaringBitmap/0.5.3/
> <dependencies>
>     <dependency>
>       <groupId>org.roaringbitmap</groupId>
>       <artifactId>RoaringBitmap</artifactId>
>       <version>[0.5,)</version>
>     </dependency>
>  </dependencies>
> Online API :
> http://lemire.me/docs/RoaringBitmap/
> JavaDoc Jar: 
> http://central.maven.org/maven2/org/roaringbitmap/RoaringBitmap/0.5.3/RoaringBitmap-0.5.3-javadoc.jar
> When desired, the library supports memory file mapping, so that  out-of-JVM 
> heap memory is used instead. This can greatly improve IO issues. The library 
> is available under the Apache license and is patent-free.
> There is an extensive real-data benchmark framework which you can run for 
> yourself to compare Roaring with competitive alternatives such as ConciseSet :
> https://github.com/lemire/RoaringBitmap/tree/master/jmh
> Running such a benchmark can be as simple as launching a script.
> For Druid, the bitmap format was made "pluggable" so you can switch from one 
> format to the other using a configuration flag. This is implemented through 
> simple wrappers, e.g., see
> https://github.com/metamx/bytebuffer-collections/tree/master/src/main/java/com/metamx/collections/bitmap
> So it can be really easy to make it possible to switch the format while 
> preserving backward compatibility if needed... 
> It is probably not difficult work to integrate Roaring in Kylin (maybe a day 
> or so of programming) and it could potentially improve performance while 
> reducing memory storage.
> Note: Roaring bitmaps were also adopted in Apache Lucene, though they have 
> their own implementation, see 
> https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to