On Wed, Feb 23, 2011 at 4:51 PM, buddhasystem <potek...@bnl.gov> wrote:
>
> I know that theoretically it should not (apart from compaction issues), but
> maybe somebody has experience showing otherwise:
>
> My test cluster now has 250GB of data and will have 1.5TB in its
> reincarnation. If all these data is in a single CF -- will it cause read or
> write performance problems? Should I "shard" it? One advantage of splitting
> the data would be reducing the impact of compaction and repairs (or so I
> naively assume).
>
> TIA
>
> Maxim
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Will-the-large-datafile-size-affect-the-performance-tp6057991p6057991.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>
http://wiki.apache.org/cassandra/LargeDataSetConsiderations

By dividing your data you get the benefit of being able to apply different
settings at the Column Family or keyspace level. For example, you might have
some batch data that you only want to replicate twice, or some small subset
of data that is read frequently and should be heavily cached.

Also, as you said, having several smaller CFs helps you avoid a single very
long-running and intensive operation such as a repair or a major compaction.

If you always need to read from every CF to satisfy a request from your
application, splitting is not a good idea.
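
To put rough numbers on the compaction point, here is a back-of-the-envelope
sketch in Python. It is not a measurement and does not call Cassandra. The
node count (6) and replication factor (2) are made up for illustration, and
it assumes the worst case, where a major compaction rewrites a whole CF and
so temporarily needs about as much free disk as that CF occupies on the node.

```python
def per_node_gb(total_gb, replication_factor, nodes):
    """Approximate data held by each node, assuming an even distribution."""
    return total_gb * replication_factor / nodes


def major_compaction_headroom_gb(cf_sizes_gb):
    """Worst-case temporary disk space for a major compaction.

    Assumption: CFs are compacted one at a time and a major compaction
    rewrites the whole CF, so the largest CF sets the requirement.
    """
    return max(cf_sizes_gb)


# Hypothetical cluster: 1.5TB of data, RF=2, 6 nodes.
node_load = per_node_gb(1500, 2, 6)

# One CF holding everything vs. three equally sized CFs.
single_cf = major_compaction_headroom_gb([node_load])
three_cfs = major_compaction_headroom_gb([node_load / 3] * 3)

print("per-node load: %.0f GB" % node_load)
print("headroom, single CF: %.0f GB" % single_cf)
print("headroom, three CFs: %.0f GB" % three_cfs)
```

The total amount of data to compact and repair is the same either way. What
changes is how large the biggest single operation is, which is the part that
hurts operationally.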