Re: 8 million Cassandra data files on disk

Jonathan Ellis Tue, 02 Aug 2011 14:37:39 -0700

I don't remember a removing-compacted-files bug in 0.7.0, but you
should absolutely upgrade to 0.7.8 for several dozen other fixes,
including some severe ones -- see NEWS.txt.
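
On question 2: the collection has to run in the Cassandra node's JVM, so
calling System.gc() from the Hector client would only collect the client's
own heap. What jconsole does can be scripted over JMX instead. Below is a
minimal sketch, not a tested tool. It assumes the node exposes JMX without
authentication on localhost:8080 (check the port in your node's startup
options), and the class name ForceGc and its helper methods are made up for
illustration. It invokes the standard "gc" operation on the JVM's
java.lang:type=Memory MBean.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: asks a remote JVM to run a full GC over JMX.
class ForceGc {

    // Builds the usual RMI-based JMX service URL for host:port.
    public static JMXServiceURL jmxUrl(String host, int port) throws Exception {
        return new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    // Invokes the no-argument "gc" operation on the Memory MBean
    // (object name java.lang:type=Memory) of the connected JVM.
    public static void forceGc(MBeanServerConnection mbs) throws Exception {
        ObjectName memory = new ObjectName(ManagementFactory.MEMORY_MXBEAN_NAME);
        mbs.invoke(memory, "gc", null, null);
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults; pass the node's real JMX host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;

        // JMXConnector is Closeable, so the connection is released on exit.
        try (JMXConnector connector = JMXConnectorFactory.connect(jmxUrl(host, port))) {
            forceGc(connector.getMBeanServerConnection());
        }
    }
}
```

Run it as "java ForceGc <host> <jmx-port>". If the node's JVM was started
with -XX:+DisableExplicitGC, the request is ignored and the files will
remain until the next full collection or a restart.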


On Tue, Aug 2, 2011 at 4:29 PM, Yiming Sun <yiming....@gmail.com> wrote:
> Hi Jeremiah,
>
> Thank you for the information - it certainly is a relief.  Two questions
> though:
>
> 1. I came across an old thread which seemed to say that Cassandra 0.7.0 has
> a bug and doesn't remove these compacted files properly.  Should we upgrade
> to a newer version that has this bug fixed?
>
> 2. Must we run the garbage collection manually via JConsole?  Is there
> any way I can force the GC from our code? (We are using Hector as our Java
> client.)
>
> Thanks!
>
>
>
> On Tue, Aug 2, 2011 at 5:19 PM, Jeremiah Jordan
> <jeremiah.jor...@morningstar.com> wrote:
>>
>> Connect with jconsole and run garbage collection.
>> All of the files that have a -Compacted file with the same name will get
>> deleted the next time a full garbage collection runs, or when the node
>> is restarted.  They have already been combined into new files; the old
>> ones just haven't been deleted yet.
>>
>> On Tue, 2011-08-02 at 16:09 -0400, Yiming Sun wrote:
>> > Hi,
>> >
>> > I am new to Cassandra, and am hoping someone could help me understand
>> > the large number of small data files that Cassandra generates on
>> > disk.
>> >
>> > We are using Cassandra because we are dealing with thousands to
>> > millions of small text files on disk, so we are experimenting with
>> > Cassandra hoping that by dropping the files' contents into Cassandra,
>> > we will achieve more efficient disk usage, because Cassandra is going
>> > to aggregate them into bigger files (one file per column family,
>> > according to the wiki).
>> >
>> > But after we pushed a subset of the files into a single node Cassandra
>> > v0.7.0 instance, we noted that in the Cassandra data directory for the
>> > keyspace, there are 8.5 million very small files, most are named
>> >
>> >     <SuperColumnFamilyName>-e-<nnnnn>.Filter.db
>> >     <SuperColumnFamilyName>-e-<nnnnn>.Compacted.db
>> >     <SuperColumnFamilyName>-e-<nnnnn>.Index.db
>> >     <SuperColumnFamilyName>-e-<nnnnn>.Statistics.db
>> >
>> > and among these files, the Compacted.db files are always empty, the
>> > Filter and Index files are under 100 bytes, and the Statistics files
>> > are around 4k.
>> >
>> > What are these files? Why are there so many of them?  We originally
>> > hoped that Cassandra would solve our problem with small files, but
>> > now it doesn't seem to help -- we still end up with tons of small
>> > files.  Is there any way to reduce or combine these small files?
>> >
>> > Thanks.
>> >
>> > -- Y.
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
