> Hi all,
> As a way of gaining familiarity with Cassandra I am migrating a table
> that is currently stored in a relational database and mapping it into
> a Cassandra column family. We add about 700,000 new rows a day to this
> table, and the average disk space used per row is ~ 300 bytes
> including indexes.
> The mapping from table to column family is straightforward - there is
> a one-to-one relationship between table columns and column family column
> names. The relational table has 19 columns. The length of the names of
> the columns is nearly 200 bytes whereas the average amount of data per
> row is only 130 bytes.
> Initially I used the identity map for this translation - i.e. my
> Cassandra column names were the same as the relational column names. I
> then found out I could save a lot of disk space by using single letter
> column names instead of the original relational names. I.e. use 'L'
> instead of 'LINK_IDENTIFIER' for a column name.
> The procedure I use to determine space used is:
> 1. rm -rf the cassandra var-lib directory
> 2. start cassandra, create keyspace, column families, etc.
> 3. insert records
> 4. stop cassandra
> 5. re-start cassandra
> 6. measure disk space with du -s the cassandra var-lib directory
> This seems to replace the commit logs with .db files.
> My questions are:
> 1. Is this a common practice (i.e. making the client responsible for
> shortening the column names) when dealing with a large number of fixed
> column names and a high volume of inserts? Is there any way that
> Cassandra can help out here?

Yes. We're working on a new, compressed format; see CASSANDRA-674.
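
In the meantime, the client-side shortening you describe can be kept in one place so the rest of the application still sees the relational names. A minimal sketch in Python, assuming a hand-maintained lookup table; only the LINK_IDENTIFIER -> 'L' pair comes from your message, and the other column names and the shorten/expand helpers are hypothetical:

```python
# Hypothetical name table. Only LINK_IDENTIFIER -> "L" is from the thread;
# the other entries are made-up columns for illustration.
LONG_TO_SHORT = {
    "LINK_IDENTIFIER": "L",
    "CREATED_AT": "C",
    "STATUS": "S",
}

# Short names must be unique, otherwise expanding a row is ambiguous.
assert len(set(LONG_TO_SHORT.values())) == len(LONG_TO_SHORT)

SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}


def shorten(row):
    """Translate relational column names to short names before an insert."""
    return {LONG_TO_SHORT[name]: value for name, value in row.items()}


def expand(columns):
    """Translate short names back to relational names after a read."""
    return {SHORT_TO_LONG[name]: value for name, value in columns.items()}


row = {"LINK_IDENTIFIER": "abc123", "STATUS": "ok"}
stored = shorten(row)      # what would be written to the column family
restored = expand(stored)  # what the application sees after a read
```

An unknown column name raises KeyError here rather than being stored under its long name, which makes a missing table entry visible early.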

> 2. Is there another way to transform the commit logs into .db files
> without stopping and starting the server?

Use nodetool flush. It writes the in-memory data out to .db files without a restart.


