Will the large datafile size affect the performance?

2011-02-23 Thread buddhasystem

I know that theoretically it should not (apart from compaction issues), but
maybe somebody has experience showing otherwise:

My test cluster now has 250GB of data and will have 1.5TB in its
reincarnation. If all of this data is in a single CF, will it cause read or
write performance problems? Should I shard it? One advantage of splitting
the data would be reducing the impact of compaction and repairs (or so I
naively assume).

TIA

Maxim

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Will-the-large-datafile-size-affect-the-performance-tp6057991p6057991.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Will the large datafile size affect the performance?

2011-02-23 Thread Edward Capriolo
On Wed, Feb 23, 2011 at 4:51 PM, buddhasystem potek...@bnl.gov wrote:

 I know that theoretically it should not (apart from compaction issues), but
 maybe somebody has experience showing otherwise:

 My test cluster now has 250GB of data and will have 1.5TB in its
 reincarnation. If all of this data is in a single CF, will it cause read or
 write performance problems? Should I shard it? One advantage of splitting
 the data would be reducing the impact of compaction and repairs (or so I
 naively assume).

 TIA

 Maxim



http://wiki.apache.org/cassandra/LargeDataSetConsiderations

By dividing your data you gain the ability to apply different settings
at the Column Family or keyspace level. For example, you might have some
batch data that you only want to replicate twice, or some small subset of
data that is read frequently and should be heavily cached. Also, as you
said, having three smaller CFs helps you avoid a single very long-running
and intensive operation such as repair or major compaction.
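
As a rough illustration of that last point, here is a minimal Python sketch.
It assumes that a repair or major compaction works on one CF at a time, so
the largest CF bounds the size of any single run. The CF names and the even
three-way split of the 1.5TB are hypothetical, and the sketch ignores
replication and how the data is spread across nodes.

```python
from dataclasses import dataclass


@dataclass
class ColumnFamilyPlan:
    """A hypothetical planning record: one CF and its expected data size."""
    name: str
    size_gb: float


def total_gb(plans):
    """Total data across all CFs; splitting does not change this."""
    return sum(p.size_gb for p in plans)


def largest_single_operation_gb(plans):
    """Upper bound on the data one per-CF repair or major compaction touches."""
    return max(p.size_gb for p in plans)


# Layout 1: everything in one CF (1.5TB taken as 1500GB for simplicity).
single = [ColumnFamilyPlan("all_data", 1500.0)]

# Layout 2: an assumed even split into three CFs (names are made up).
split = [
    ColumnFamilyPlan("batch_data", 500.0),
    ColumnFamilyPlan("hot_data", 500.0),
    ColumnFamilyPlan("archive_data", 500.0),
]

for label, plans in (("single CF", single), ("three CFs", split)):
    print(label, "total GB:", total_gb(plans),
          "largest single operation GB:", largest_single_operation_gb(plans))
```

The total amount of work is the same in both layouts; what changes is how
long any one operation runs and how much data it touches at once.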

If you always need to read from every CF to satisfy your application,
splitting the data is not a good idea.