[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698701#comment-13698701
 ] 

Terje Marthinussen commented on CASSANDRA-4175:
-----------------------------------------------

Hi, 

Sorry for the late update.

Yes, we have a cluster with some 20-30 billion columns (maybe even closer to 40 
billion by now) which implements a column name map and has been in production 
for about 2 years.

I was actually looking at committing this 2 years ago together with a fairly 
large number of other changes which were implemented in the column/supercolumn 
serializer code, but I never got around to implementing a good way to push the 
sstable version numbers into the serializer to make things backwards compatible 
before focus moved resources elsewhere.

As mentioned above by others, while not benchmarked and proven, I had a very 
good feeling the total change helped quite a bit on GC issues, memtables and a 
bit on performance in general, but in terms of disk space, the benefit was 
somewhat limited after sstable compression was implemented as the repeating 
column names are compressed pretty well.

This was already 2 years ago (the cluster still runs by the way), but if memory 
serves me right:
30-40% reduction in disk space without compression
10% reduction on top of compression (I did a test after it was implemented).

In my case, the implementation is actually hardcoded due to time constraints: a 
static map which is global for the entire cassandra installation.

If committing this into cassandra, I believe my plan was split in 3 parts, 
possibly as 3 different implementation stages:

1. A simple config option (as a config file or as a columnfamily) where users 
themselves can assign repeating column names. Sure, it is not as fancy as many 
other options, but maybe we could open up to cover some strange corner case 
usages here with things like substrings as well.

Think options to cover complex versions of patterns like date/times such as 
20130701202020 where a large chunk of the column name repeats, but not all of 
it.

In the current implementation, if there is a mapping entry, it converts the 
string to a variable length integer which becomes the new column name. If there 
is no mapping entry, it stores the raw data.

In our case, we have <40 repeating column names so I never need more than 1 
byte, but the implementation would handle more if I did.

I modified the sstable to add a bitmap at the start of each column to be able 
to turn on/off mapping entries, timestamps not used, TTLs and other things. 
There are a bunch of 64-bit numbers in the column format which only have the 
default value in 99.999% of all cases, and very often your column value is just 
an 8 byte int, a boolean or a short text entry. 

I think in 99% of the columns in this cassandra store, the column timestamp 
takes up more space than the column value.

This would have been my first implementation, mostly because I have a working 
implementation of it already and the mapping table would be very easy to move 
to a config file read at start of a column family, similar to what we have for 
CF config. But also here, it is a bit of work to push such config data down to 
the serializer as the code was organized 2 years ago.

Notice again, you do not need atomic handling of the updates to the map in any 
way in this implementation. You can add map entries at any time. The result 
after deserializing is always the same as column names can have a mix of raw 
and map id values thanks to the "column feature bitmap" that was introduced.
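
To make stage 1 concrete, here is a minimal sketch of the idea: one flag byte 
per column name, followed by either a varint map id or the raw name. The flag 
layout, the function names and the byte format are assumptions made up for this 
illustration. They are not the format used in my implementation or in any 
Cassandra sstable version.

```python
from __future__ import annotations

# Hypothetical wire format, for illustration only:
#   [flags: 1 byte][varint id]                 when the name is in the map
#   [flags: 1 byte][varint length][raw bytes]  when it is not
FLAG_NAME_MAPPED = 0x01  # assumed bit in the "column feature bitmap"


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit means 'more follows'."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, position of the first byte after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_name(name: bytes, name_to_id: dict[bytes, int]) -> bytes:
    """Store the map id if there is a mapping entry, else the raw name."""
    col_id = name_to_id.get(name)
    if col_id is not None:
        return bytes([FLAG_NAME_MAPPED]) + encode_varint(col_id)
    return bytes([0]) + encode_varint(len(name)) + name


def decode_name(buf: bytes, id_to_name: dict[int, bytes],
                pos: int = 0) -> tuple[bytes, int]:
    """Return (column name, position after it); the flag byte picks the path."""
    flags = buf[pos]
    value, pos = decode_varint(buf, pos + 1)
    if flags & FLAG_NAME_MAPPED:
        return id_to_name[value], pos
    return buf[pos:pos + value], pos + value
```

Because the flag byte travels with every column, a name written raw before its 
map entry existed and the same name written as an id afterwards both decode to 
the same bytes. With fewer than 128 entries the id always fits in one byte.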

2. Auto learning feature with mapping table per sstable. 
This would be stage 2 of the implementation.

When starting to create a new SSTable, build a sampling of the most frequently 
occurring column names and gradually start mapping them to IDs.

Add the mapping table to the end of the SSTable or in a separate .map file 
(similar to index files) at the completion of sstable generation.

The initial id mapping could be further improved by maintaining a global map of 
column names. This "global map" would not be used for 
serialization/deserialization. It would be used to pre-populate the value for a 
sstable and would only be statistics to optimize things further by reducing the 
number of mapping variances between sstables and reducing the number of raw 
values getting stored a bit more.

The id map would still be local to each sstable in terms of storage, but having 
such statistics would allow you to dramatically reduce the size of a 
potentially shared id cache across sstables where a lot of mapping entries 
would be identical.

Some may feel that we would run out of memory quickly or use a lot of extra 
disk with maps per sstable, but I guess that we only really need to deal with 
the top few thousand entries in each sstable and this would not be a problem to 
keep in an idmap cache in terms of size.

This is really just the top X recurring column names or column name 
sub-patterns.

If you have more unique column entries than this in a sstable, this will 
probably not be the feature that will save the day anyway, as the benefit per 
column entry will be quite small vs. the overhead, and the entire feature should 
potentially disable itself automagically if there are no frequently repeating 
patterns.
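
A rough sketch of how the per-sstable map in stage 2 could be chosen. This is 
a simplification: it counts a finished sample in one pass rather than mapping 
gradually while the sstable is written. The function name, parameters and 
thresholds are hypothetical, and nothing like this exists in my code today.

```python
from collections import Counter


def build_sstable_name_map(sampled_names, seed_names=(),
                           max_entries=4096, min_count=2):
    """Assign ids to the most frequent column names of one sstable.

    sampled_names: column names sampled while writing the sstable.
    seed_names:    names from the (assumed) global statistics map; they get
                   the lowest ids so ids tend to agree between sstables.
    Returns an empty dict when nothing repeats often enough, which is the
    signal for the feature to disable itself for this sstable.
    """
    counts = Counter(sampled_names)
    name_to_id = {}

    # Seeded names first, but only those that actually repeat in this sstable.
    for name in seed_names:
        if len(name_to_id) >= max_entries:
            break
        if counts[name] >= min_count and name not in name_to_id:
            name_to_id[name] = len(name_to_id)

    # Then the remaining names, most frequent first (top X only).
    for name, count in counts.most_common():
        if count < min_count or len(name_to_id) >= max_entries:
            break
        if name not in name_to_id:
            name_to_id[name] = len(name_to_id)

    return name_to_id
```

The returned dict is what would be written to the end of the sstable or to the 
separate .map file. Names that did not make the cut are simply stored raw.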

3. I had some ideas for moving the mapping up from the serializer to allow 
things like streaming entries including id maps between nodes, but things do 
indeed quickly get ugly and I do not remember clearly how I had planned to do 
this.

---
The reason I isolated the mapping function to the serializer is that it looked 
incredibly messy to move this further "up" in the stack. Column sorts, range 
scans, lookups... 

Not fun at all and if the memtable is serialized anyway the memory consumption 
there and in disk cache is dramatically reduced.

Also... with a global static map here at startup time, I actually share the 
mapped strings across most columns in memory anyway, as I believe they all 
become pointers to my static compiled-in map (again, this gets a lot more 
trivial to make work very well if this is a startup config, but yes, a bit less 
user friendly).

I haven't looked at the cassandra code for way too long now.

Has it become easier to get to know sstable version numbers in the serializer 
class now?

I could maybe check if someone in the team here would like to take a stab at 
moving this to latest cassandra and commit it if the above implementation seems 
interesting. 

Part of it should be really easy to port as long as we can get a bit more info 
into the serializer/deserializer.

                
> Reduce memory, disk space, and cpu usage with a column name/id map
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-4175
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>             Fix For: 2.1
>
>
> We spend a lot of memory on column names, both transiently (during reads) and 
> more permanently (in the row cache).  Compression mitigates this on disk but 
> not on the heap.
> The overhead is significant for typical small column values, e.g., ints.
> Even though we intern once we get to the memtable, this affects writes too 
> via very high allocation rates in the young generation, hence more GC 
> activity.
> Now that CQL3 provides us some guarantees that column names must be defined 
> before they are inserted, we could create a map of (say) 32-bit int column 
> id, to names, and use that internally right up until we return a resultset to 
> the client.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
