On Thu, Jun 16, 2011 at 3:41 PM, E R <pc88m...@gmail.com> wrote:
> Hi all,
>
> As a way of gaining familiarity with Cassandra I am migrating a table
> that is currently stored in a relational database and mapping it into
> a Cassandra column family. We add about 700,000 new rows a day to this
> table, and the average disk space used per row is ~300 bytes
> including indexes.
>
> The mapping from table to column family is straightforward - there is
> a one-to-one relationship between table columns and column family
> column names. The relational table has 19 columns. The combined length
> of the column names is nearly 200 bytes, whereas the average amount of
> data per row is only 130 bytes.
>
> Initially I used the identity map for this translation - i.e. my
> Cassandra column names were the same as the relational column names. I
> then found out I could save a lot of disk space by using single-letter
> column names instead of the original relational names, i.e. use 'L'
> instead of 'LINK_IDENTIFIER' for a column name.
>
> The procedure I use to determine space used is:
>
> 1. rm -rf the cassandra var-lib directory
> 2. start cassandra, create keyspace, column families, etc.
> 3. insert records
> 4. stop cassandra
> 5. re-start cassandra
> 6. measure disk space with du -s on the cassandra var-lib directory
>
> This seems to replace the commit logs with .db files.
>
> My questions are:
>
> 1. Is this a common practice (i.e. making the client responsible for
> shortening the column names) when dealing with a large number of fixed
> column names and a high volume of inserts? Is there any way that
> Cassandra can help out here?

Yes. We're working on a new, compressed format: CASSANDRA-674.

> 2. Is there another way to transform the commit logs into .db files
> without stopping and starting the server?

nodetool flush.

-ryan
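
For reference, the client-side renaming discussed above can be sketched roughly as follows. This is only an illustration, not code from the thread: the only mapping taken from the original message is 'LINK_IDENTIFIER' -> 'L'; the other column names and all function names are hypothetical. The savings estimate counts name bytes only (nearly 200 bytes of names against 19 single-letter names, at 700,000 rows a day) and ignores whatever per-column overhead Cassandra adds on disk.

```python
# Sketch of client-side column-name shortening.
# Only LINK_IDENTIFIER -> L comes from the thread; the rest is hypothetical.
LONG_TO_SHORT = {
    "LINK_IDENTIFIER": "L",
    "CREATED_TIMESTAMP": "C",  # hypothetical column
    "SOURCE_ADDRESS": "S",     # hypothetical column
}
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}

# Fail fast if two long names were assigned the same short name.
assert len(SHORT_TO_LONG) == len(LONG_TO_SHORT), "short names must be unique"


def shorten_row(row):
    """Translate relational column names to short names before an insert."""
    return {LONG_TO_SHORT[name]: value for name, value in row.items()}


def expand_row(row):
    """Translate short names back to relational names after a read."""
    return {SHORT_TO_LONG[name]: value for name, value in row.items()}


def name_bytes_saved_per_day(rows_per_day, long_name_bytes, short_name_bytes):
    """Rough estimate: bytes of column names no longer written each day."""
    return rows_per_day * (long_name_bytes - short_name_bytes)


row = {"LINK_IDENTIFIER": "abc123", "SOURCE_ADDRESS": "10.0.0.1"}
stored = shorten_row(row)
saved = name_bytes_saved_per_day(700_000, 200, 19)

print(stored)  # {'L': 'abc123', 'S': '10.0.0.1'}
print(saved)   # 126700000
```

Keeping the mapping in one table on the client means reads can be translated back to the relational names, at the cost of every client needing the same table.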