Re: Nodes frozen in GC

2011-03-06 Thread ruslan usifov
2011/3/6 aaron morton aa...@thelastpickle.com

 Your node is under memory pressure, after the GC there is still 5.7GB in
 use. In fact it looks like memory usage went up during the GC process.

 Can you reduce the memtable size, caches or the number of CF's, or increase
 the JVM heap size? Also, is this happening under heavy load?

 Yes, I do bulk loads to the cluster.


I reduced the memtable size to 64MB and the situation became better, but very
occasionally a GC of more than 15 seconds still happens. Would reducing
flush_largest_memtables_at help? Or maybe some GC options?
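
For reference, the settings in question are the emergency valves in the 0.7
cassandra.yaml; a minimal sketch with the stock values (illustrative, not a
tuning recommendation):

    # flush the largest memtables when, after a full GC, the heap is still
    # this full -- an emergency release valve, not a substitute for tuning
    flush_largest_memtables_at: 0.75
    # when the heap gets this full, start shrinking the key/row caches...
    reduce_cache_sizes_at: 0.85
    # ...down to this fraction of their current capacity
    reduce_cache_capacity_to: 0.6

Lowering flush_largest_memtables_at makes the valve trip earlier; the GC
options themselves (e.g. the CMS settings such as
-XX:CMSInitiatingOccupancyFraction) live in conf/cassandra-env.sh.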


Re: cant seem to figure out secondary index definition

2011-03-06 Thread Jürgen Link

Hi Roshan,
could you please post a small sample from your yaml file?
As documentation of indexes is quite sparse, we're grateful for any 
working example.


Cheers
Jürgen

On 04.03.2011 19:27, Roshan Dawrani wrote:
On Fri, Mar 4, 2011 at 11:52 PM, Jürgen Link juergen.l...@googlemail.com wrote:


Hi Jonathan,
as Roland is already out of office, I'd like to jump in.
Maybe this somehow got lost in the middle of this thread; indexing
works fine in our real cassandra cluster.
For our test cases, we use an embedded cassandra instance, which
is configured via yaml.
In case indexes cannot be defined via yaml (for embedded
instances), is there a more preferred way to do so?


Sorry, I haven't followed the whole of this thread yet, but I just 
noticed this mail and would like to add that we also use an embedded 
Cassandra instance for our dev / tests and our yaml has a number of 
indexes and they work just fine. We are on Cassandra 0.7.0.





Re: cant seem to figure out secondary index definition

2011-03-06 Thread Roshan Dawrani
Hi,

Sure. Here is a sample of how we define it at the end of cassandra.yaml.

In the keyspace MyApp, it defines a column family MyUser that has secondary
indexes on two String columns: firstname and lastname.

Does it help?

--
keyspaces:
    - name: MyApp
      replica_placement_strategy: org.apache.cassandra.locator.SimpleStrategy
      replication_factor: 1
      column_families:
        - name: MyUser
          compare_with: org.apache.cassandra.db.marshal.BytesType
          column_metadata:
            - name: firstname
              validator_class: UTF8Type
              index_type: KEYS
              index_name: FirstNameIdx
            - name: firstname
              validator_class: UTF8Type
              index_type: KEYS
              index_name: LastNameIdx
--

Regards,
Roshan

On Sun, Mar 6, 2011 at 4:34 PM, Jürgen Link juergen.l...@googlemail.com wrote:

 [...]






Re: cant seem to figure out secondary index definition

2011-03-06 Thread Roshan Dawrani
On Sun, Mar 6, 2011 at 4:54 PM, Roshan Dawrani roshandawr...@gmail.com wrote:



 --
 keyspaces:
 ...
            - name: firstname
              validator_class: UTF8Type
              index_type: KEYS
              index_name: LastNameIdx


Please correct a typo and change the name for the 2nd column to lastname in
the sample. :-)
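
With that correction applied, the second column entry of the sample would read:

            - name: lastname
              validator_class: UTF8Type
              index_type: KEYS
              index_name: LastNameIdx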


Re: confirm unsubscribe from user@cassandra.apache.org

2011-03-06 Thread Thomas cmdln Gideon
On 03/06/2011 08:08 AM, user-h...@cassandra.apache.org wrote:
 Hi! This is the ezmlm program. I'm managing the
 user@cassandra.apache.org mailing list.
 
 To confirm that you would like
 
cm...@thecommandline.net
 
 removed from the user mailing list, please send a short reply 
 to this address:
 

 user-uc.1299416928.cgclfmokcblhbpnocdoo-cmdln=thecommandline@cassandra.apache.org
 
 Usually, this happens when you just hit the reply button.
 If this does not work, simply copy the address and paste it into
 the To: field of a new message.
 
 I haven't checked whether your address is currently on the mailing list.
 To see what address you used to subscribe, look at the messages you are
 receiving from the mailing list. Each message has your address hidden
 inside its return path; for example, m...@xdd.ff.com receives messages
 with return path: user-return-number-mary=xdd.ff@cassandra.apache.org.
 
 Some mail programs are broken and cannot handle long addresses. If you
 cannot reply to this request, instead send a message to
 user-requ...@cassandra.apache.org and put the entire address listed above
 into the Subject: line.
 
 
 --- Administrative commands for the user list ---
 
 I can handle administrative requests automatically. Please
 do not send them to the list address! Instead, send
 your message to the correct command address:
 
 To subscribe to the list, send a message to:
user-subscr...@cassandra.apache.org
 
 To remove your address from the list, send a message to:
user-unsubscr...@cassandra.apache.org
 
 Send mail to the following for info and FAQ for this list:
user-i...@cassandra.apache.org
user-...@cassandra.apache.org
 
 Similar addresses exist for the digest list:
user-digest-subscr...@cassandra.apache.org
user-digest-unsubscr...@cassandra.apache.org
 
 To get messages 123 through 145 (a maximum of 100 per request), mail:
user-get.123_...@cassandra.apache.org
 
 To get an index with subject and author for messages 123-456 , mail:
user-index.123_...@cassandra.apache.org
 
 They are always returned as sets of 100, max 2000 per request,
 so you'll actually get 100-499.
 
 To receive all messages with the same subject as message 12345,
 send a short message to:
user-thread.12...@cassandra.apache.org
 
 The messages should contain one line or word of text to avoid being
 treated as sp@m, but I will ignore their content.
 Only the ADDRESS you send to is important.
 
 You can start a subscription for an alternate address,
 for example john@host.domain, just add a hyphen and your
 address (with '=' instead of '@') after the command word:
 user-subscribe-john=host.dom...@cassandra.apache.org
 
 To stop subscription for this address, mail:
 user-unsubscribe-john=host.dom...@cassandra.apache.org
 
 In both cases, I'll send a confirmation message to that address. When
 you receive it, simply reply to it to complete your subscription.
 
 If despite following these instructions, you do not get the
 desired results, please contact my owner at
 user-ow...@cassandra.apache.org. Please be patient, my owner is a
 lot slower than I am ;-)
 
 --- Enclosed is a copy of the request I received.
 
 Return-Path: cm...@thecommandline.net
 Received: (qmail 57902 invoked by uid 99); 6 Mar 2011 13:08:48 -
 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
 by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 06 Mar 2011 13:08:48 +
 X-ASF-Spam-Status: No, hits=-1.4 required=10.0
   tests=ASF_LIST_OPS,HK_RANDOM_ENVFROM,RCVD_IN_DNSWL_LOW,SPF_NEUTRAL
 X-Spam-Check-By: apache.org
 Received-SPF: neutral (athena.apache.org: local policy)
 Received: from [209.85.216.172] (HELO mail-qy0-f172.google.com) 
 (209.85.216.172)
 by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 06 Mar 2011 13:08:42 +
 Received: by qyk29 with SMTP id 29so1221884qyk.10
 for user-unsubscr...@cassandra.apache.org; Sun, 06 Mar 2011 
 05:08:21 -0800 (PST)
 Received: by 10.224.27.5 with SMTP id g5mr2299477qac.97.1299416901045;
 Sun, 06 Mar 2011 05:08:21 -0800 (PST)
 Received: from [192.168.1.4] ([96.231.195.66])
 by mx.google.com with ESMTPS id y17sm1072286qci.45.2011.03.06.05.08.18
 (version=SSLv3 cipher=OTHER);
 Sun, 06 Mar 2011 05:08:19 -0800 (PST)
 Message-ID: 4d738742.3070...@thecommandline.net
 Date: Sun, 06 Mar 2011 08:08:18 -0500
 From: Thomas "cmdln" Gideon cm...@thecommandline.net
 User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.14) 
 Gecko/20110223 Lightning/1.0b2 Thunderbird/3.1.8
 MIME-Version: 1.0
 To: user-unsubscr...@cassandra.apache.org
 Subject: unsubscribe
 X-Enigmail-Version: 1.1.2
 Content-Type: text/plain; charset=ISO-8859-1
 Content-Transfer-Encoding: 7bit
 
 unsubscribe


-- 
The Command Line Podcast
http://thecommandline.net/


Re: Nodes frozen in GC

2011-03-06 Thread Peter Schuller
Do you have the row cache enabled? Disable it. If that fixes it and you want
it, re-enable it, but consider row sizes and the cap on the cache size.
-- 
/ Peter Schuller
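
For reference, the row cache is configured per column family; a minimal
0.7-style cassandra.yaml fragment (attribute names from the stock 0.7 config,
values purely illustrative):

    column_families:
        - name: MyUser
          rows_cached: 0         # row cache off; whole rows are cached, so large rows hurt
          keys_cached: 200000    # the key cache is much cheaper and can usually stay on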


Re: OOM exceptions

2011-03-06 Thread Mark

If it's determined that this is due to a very large row, what are my options?

Thanks

On 3/5/11 7:11 PM, aaron morton wrote:
First question is which version are you running? I am guessing 0.6 
something.


If you have an OOM in the compaction thread it may be because of a very 
large row. The CF information available through JConsole will give you 
the max row size for the CF.


Your setting for RowWarningThresholdInMB is 512; however, the default 
setting in 0.6.12 is 64MB. Here's the inline help from 
storage-conf.xml for 0.6.12:


<!--
   ~ Size of compacted row above which to log a warning.  If compacted
   ~ rows do not fit in memory, Cassandra will crash.  (This is explained
   ~ in http://wiki.apache.org/cassandra/CassandraLimitations and is
   ~ scheduled to be fixed in 0.7.)  Large rows can also be a problem
   ~ when row caching is enabled.
  -->
<RowWarningThresholdInMB>64</RowWarningThresholdInMB>

In 0.7 the equivalent setting is in_memory_compaction_limit_in_mb; here's 
the inline help from the 0.7.3 conf file:


# Size limit for rows being compacted in memory.  Larger rows will spill
# over to disk and use a slower two-pass compaction process.  A message
# will be logged specifying the row key.
in_memory_compaction_limit_in_mb: 64
Hope that helps.
Aaron


On 5/03/2011, at 10:54 AM, Mark wrote:


That's very nice of you. Thanks

<Storage>
  <ClusterName>MyCluster</ClusterName>
  <AutoBootstrap>true</AutoBootstrap>
  <HintedHandoffEnabled>true</HintedHandoffEnabled>
  <IndexInterval>128</IndexInterval>

  <Keyspaces>
    <Keyspace Name="MyCompany">
      <ColumnFamily Name="SearchLog"
                    ColumnType="Super"
                    CompareWith="TimeUUIDType"
                    CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ProductLog"
                    ColumnType="Super"
                    CompareWith="TimeUUIDType"
                    CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="RequestLog"
                    ColumnType="Super"
                    CompareWith="TimeUUIDType"
                    CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ClickLog"
                    ColumnType="Super"
                    CompareWith="TimeUUIDType"
                    CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ItemTranslation"
                    ColumnType="Super"
                    CompareWith="BytesType"
                    CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ItemTranslationIndex"
                    CompareWith="BytesType"/>
      <ColumnFamily Name="RelatedObject"
                    CompareWith="LongType"/>

      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>2</ReplicationFactor>
      <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>
  </Keyspaces>

  <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
  <InitialToken></InitialToken>

  <SavedCachesDirectory>/var/lib/cassandra/saved_caches</SavedCachesDirectory>
  <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
  <DataFileDirectories>
    <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
  </DataFileDirectories>

  <Seeds>
    <Seed>cassandra1</Seed>
    <Seed>cassandra2</Seed>
  </Seeds>

  <RpcTimeoutInMillis>1</RpcTimeoutInMillis>
  <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB>

  <ListenAddress></ListenAddress>
  <StoragePort>7000</StoragePort>

  <ThriftAddress></ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <ThriftFramedTransport>false</ThriftFramedTransport>

  <DiskAccessMode>mmap_index_only</DiskAccessMode>

  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>

  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>1</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>

  <DoConsistencyChecksBoolean>true</DoConsistencyChecksBoolean>
</Storage>


On 3/4/11 1:05 PM, Narendra Sharma wrote:
I have been through tuning for GC and OOM recently. If you can 
provide the cassandra.yaml, I can help. Mostly I had to play with 
memtable thresholds.


Thanks,
Naren

On Fri, Mar 4, 2011 at 12:43 PM, Mark static.void@gmail.com wrote:


We have 7 column families and we are not using the default key
cache (20).

These were our initial settings so it was not in response to
anything. Would you recommend anything else? Thanks



On 3/4/11 12:34 PM, Chris Burroughs wrote:

- Are you using a key cache?  How many keys do you have?
 Across how
many column families

You configuration is unusual both in terms of not setting
min heap ==
max heap and the 
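
As an aside on checking the max row size Aaron mentions: besides the JConsole
route, per-CF statistics are also printed by nodetool, e.g. (the option syntax
and the exact stats reported vary between versions, so JConsole is the sure
route on 0.6):

    nodetool -host localhost cfstats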

What would be a good strategy for Storing the large text contents like blog posts in Cassandra.

2011-03-06 Thread Aditya Narayan
What would be a good strategy to store large text content (blog posts
of around 1500-3000 characters) in Cassandra? I need to store these
blog posts along with their metadata, like bloggerId and blogTags. I am
planning to store this data in a single row, giving each
attribute a single column. So, one blog per row. Is using a single
column for a large blog post like this a good strategy?

Next, I also need to store the blogComments, which I am planning to
store all in another single row, 1 comment per column. Thus the
entire information about a single comment, like commentBody and
commenter, would be serialized (using Google Protocol Buffers) and
stored in a single column.
For storing the number of likes of each comment, I am planning to
keep a counter column in the same row for each comment that will
hold the number of 'likes' of that comment.

Any suggestions on the above design are highly appreciated. Thanks.


Re: how large can a cluster over the WAN be?

2011-03-06 Thread Mimi Aluminium
Hi,

Please, can you help with the following? It will lead us in some design
decisions.
Are you familiar with a Cassandra cluster that is installed in datacenters
spread across the WAN? Can you comment on the performance of such an
installation?
What is the largest size of such a cluster you are aware of?

Thanks a lot,
Miriam


On Tue, Mar 1, 2011 at 9:47 PM, Mimi Aluminium mimi.alumin...@gmail.com wrote:

  Hi,
 Are there clusters of 100 nodes? More? Please can you refer me to such
 installations/systems?
 Can you comment on over-the-WAN clusters of this size or less? And can you
 point to systems with nodes in different DCs connected by WAN (could be
 dedicated or internet)?
 Thanks a lot,
 Miriam







Re: cant seem to figure out secondary index definition

2011-03-06 Thread Jürgen Link

Hi Roshan,
thanks for your post. I quickly ran over it, and the only difference I 
can actually see is the compare_with type (we use TimeUUIDType).


Any other suggestions, anyone?



On 06.03.2011 12:24, Roshan Dawrani wrote:

 [...]








Re: OOM exceptions

2011-03-06 Thread Aaron Morton
Under 0.6 I am not sure off the top of my head; I would need to dig into it, but 
it's probably been discussed here before.

Check the row size and let us know what version you are using first.

Aaron

On 7/03/2011, at 5:50 AM, Mark static.void@gmail.com wrote:

 If it's determined that this is due to a very large row, what are my options?
 
 Thanks
 
 On 3/5/11 7:11 PM, aaron morton wrote:
 
  [...]

Re: What would be a good strategy for Storing the large text contents like blog posts in Cassandra.

2011-03-06 Thread Aaron Morton
Sounds reasonable: one CF for the blog posts, one CF for the comments. You could 
also use a single CF if you will often read the blog and the comments at the 
same time. The best design is the one that suits how your app works; try one 
and be prepared to change.

Note that counters are only in the 0.8 trunk and are still under development, 
they are not going to be released for a couple of months.

Your per-column data size is nothing to be concerned about.

Hope that helps.
Aaron 

On 7/03/2011, at 6:35 AM, Aditya Narayan ady...@gmail.com wrote:

 [...]
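
A minimal sketch of the two-CF layout Aaron describes, using the pycassa
client (keyspace, CF and column names are illustrative; it assumes the
comments CF is defined with compare_with: TimeUUIDType so comments sort
chronologically):

    import uuid
    import pycassa

    pool = pycassa.ConnectionPool('Blog', ['localhost:9160'])
    posts = pycassa.ColumnFamily(pool, 'BlogPosts')        # one row per post
    comments = pycassa.ColumnFamily(pool, 'BlogComments')  # one row per post,
                                                           # one column per comment

    post_id = str(uuid.uuid4())
    posts.insert(post_id, {
        'bloggerId': 'user42',
        'blogTags': 'cassandra,nosql',
        'body': 'A 1500-3000 character post fits comfortably in one column.',
    })

    # Column name is a TimeUUID, column value is the serialized comment
    # (e.g. a Protocol Buffers blob, as suggested above).
    comments.insert(post_id, {uuid.uuid1(): 'serialized-comment-bytes'})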


Re: OOM exceptions

2011-03-06 Thread Mark

Sorry, I forgot to mention. I am running 0.6.6

On 3/6/11 3:27 PM, Aaron Morton wrote:
Under 0.6 I am not sure off the top of my head; I would need to dig into 
it, but it's probably been discussed here before.


Check the row size and let us know what version you are using first.

Aaron

On 7/03/2011, at 5:50 AM, Mark static.void@gmail.com wrote:


If it's determined that this is due to a very large row, what are my 
options?


Thanks

On 3/5/11 7:11 PM, aaron morton wrote:
[...]

Re: cant seem to figure out secondary index definition

2011-03-06 Thread Tyler Hobbs
On Sun, Mar 6, 2011 at 4:49 PM, Jürgen Link juergen.l...@googlemail.com wrote:

  Hi Roshan,
 thanks for your post. I quickly ran over it, and the only difference I can
 actually see is the compare_with type (we use TimeUUIDType).

 Any other suggestions, anyone?


You want to add an index on a CF with TimeUUIDType column names? I think
you've probably mistaken the purpose of compare_with.

If you haven't, I think you'll need to add the index programmatically in
order to specify a non-ASCII/Unicode column name for the index.

-- 
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra
Python client library
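
A sketch of the programmatic route, assuming a pycassa release that ships the
SystemManager API (the method and constant names here are an assumption;
verify them against your pycassa version before relying on this):

    from pycassa.system_manager import SystemManager, UTF8_TYPE

    sys_mgr = SystemManager('localhost:9160')
    # Add a KEYS index on the 'firstname' column of MyApp.MyUser. The column
    # name must be expressible by the client, hence the caveat above about
    # TimeUUIDType comparators.
    sys_mgr.create_index('MyApp', 'MyUser', 'firstname', UTF8_TYPE,
                         index_name='FirstNameIdx')
    sys_mgr.close()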


Designing a decent data model for an online music shop...confused/stuck on decisions

2011-03-06 Thread Courtney
We're in a bit of a predicament: we have an e-music store currently built in 
PHP using CodeIgniter/MySQL...
The current system has 100+K users and a decent song collection. Over the last 
few months I've been playing with
Cassandra... needless to say I'm impressed, but I have a few questions.
Firstly, I want to avoid re-writing the entire site if possible, so my instincts 
have made me inclined to replace the database layer
in CodeIgniter... is this something anyone would recommend, and are there any 
gotchas in doing that?

I can't say I've been terribly happy with PHP accessing Cassandra. When sample 
data of the same size/type was put into MySQL and into Cassandra,
the pages with PHP connecting to Cassandra took longer to load (30K records in 
the table).
I've thought maybe it was my setup that needed tweaking, and I've played with as 
many options as I could, but the best I've gotten is matching query time.
The query speed test was simply getting timestamps right before and after the 
query call returned...

Is this something anyone else has seen? Any comments or suggestions? I've tried 
using Thrift, phpcassa and Pandra with pretty similar numbers.

My other thought turned to maybe it was the way I designed my CFs. At first I 
used super columns to model the user account CF, based on a post I read
by Arin (WTF is a super column), but I later changed to using normal CFs.

I'm trying to make this work but I get the feeling my approach is somewhat... I 
don't know, misguided.

Here's a breakdown of the current model.
CF:Users{
uid
fname
lname
username
password
street

}
Some additional columns in place for a user but keeping it simple...
CF:Library{
uid
songid
...
other info about user library
}

CF:Songs{
songid
title
artistid
}

This all is still very relational-like (considering I go on to have a CF for 
playlists and artists) and I'm not sure if this is a good design for the data, 
but... when I looked into
combining some of the info and removing some CFs, I ran into the issue of 
replicating data all over the place. If, for example, I stored the artist name in 
the library for each record,
then the artist would be replicated for every song they have, for 
every user who has that song in their library.

Where do you sort of draw the line on deciding how much is okay to be 
replicated?

As much as I am not liking the idea of building the application from scratch, 
I'm considering the possibility of building from scratch in Java/JSP just to 
get the benefit of using
the Hector client. (Efforts from the guys doing the PHP libs are much 
appreciated, but PHP doesn't seem to go too well with Cassandra.)

I'm in the process of making decisions because the upgrade/rebuild needs to have 
a fairly steady working version for October, and I don't want to go wrong before 
even starting.

Recommendations, suggestions, and advice are all welcome. (Any experience with 
PHP and Cassandra is also welcome; since all my favourite libs are in PHP, I'm 
reluctant to turn away.)

Re: Designing a decent data model for an online music shop...confused/stuck on decisions

2011-03-06 Thread Tyler Hobbs
Regarding PHP performance with Cassandra, THRIFT-638
(https://issues.apache.org/jira/browse/THRIFT-638) was recently resolved and it
shows some big performance improvements. I'll
be upgrading the Thrift package that ships with phpcassa soon to include
this fix, so you may want to compare performance numbers before and after.

On Sun, Mar 6, 2011 at 8:03 PM, Courtney e-mailadr...@hotmail.com wrote:

  [...]




-- 
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra
Python client library


Re: question about replicas dynamic response to load

2011-03-06 Thread Shaun Cutts
Thanks for the answers, Dan, Aaron. 

...
OK, so one question is: if I haven't made any writes at all, can I decommission 
without delay? (Is there a force drop option or something, or will the 
cluster recognize the lack of writes?)

I may be able to segregate writes to the reference collection so that they 
occur late at night and/or on weekends, when I don't have much load otherwise. 
(NB it would be nice to be able to control replication strategy by keyspace; as 
it is, I can probably put the reference data in its own cluster.)

But thanks for the suggestions about a caching layer -- I had already thought 
of memcache (as noted, problematic due to the amount of data), but hadn't 
considered some of the other options you've mentioned. I didn't know, for 
instance, that you could use the queueing services this way.

As for S3, etc.... I guess it's possible, but the costs seem to mount quickly as 
well. Typically I have one sporadic writer and many readers, but I do write 
sometimes.

Another use case is to have expanded capacity for writes & reads of 
intermediate results while running Hadoop. Should I perhaps just start a whole 
other cluster for these?

Gratefully,

-- Shaun



On Mar 5, 2011, at 10:52 PM, aaron morton wrote:

 Agree. Cassandra generally assumes a reasonably static cluster membership. 
 There are some tricks that can be done with copying SSTables but they will 
 only reduce the need to stream data around, not eliminate it.
 
 This may not suit your problem domain but, speaking of the AWS infrastructure 
 how about using the SQS messaging service (or similar e.g. RabbitMQ) to 
 smooth out your throughput ? You could then throttle the inserts into the 
 cassandra cluster to a maximum level and spec your HW against that. During 
 peak the message queue can soak up the overflow. 
 
 Hope that helps. 
 Aaron
 
 On 4/03/2011, at 2:07 PM, Dan Hendry wrote:
 
 To some extent, the boot-strapping problem will be an issue with most
 solutions: the data has to be duplicated from somewhere. Bootstrapping
 should not cause much performance degradation unless you are already pushing
 capacity limits. It's the decommissioning problem which makes Cassandra
 somewhat problematic in your case. You grow your cluster x5, then write to
 it. You have to perform a proper decommission when shrinking the cluster
 again, which involves validating and streaming data to the remaining
 replicas: a fairly serious operation with TBs of data. For most realistic
 situations, unless the cluster is completely read-only, you can't just kill
 most of the nodes in the cluster.
 
 I can't really think of a good, general way to do this with just Cassandra,
 although there may be some hacktastical possibilities. I think a more
 statically sized Cassandra cluster plus a variable cache layer (memcached or
 similar) is probably a better solution. This option kind of falls apart at
 the terabytes-of-data range. 
 
 Have you considered using S3, Amazon CloudFront or some other CDN instead
 of rolling your own solution? For immutable data, it's what they excel at.
 Cassandra has amazing write capacity and its design focus is on scaling
 writes. I would not really consider it a good tool for the job of serving
 massive amounts of static content.
 
 Dan
 
 -Original Message-
 From: Shaun Cutts [mailto:sh...@cuttshome.net] 
 Sent: March-03-11 13:00
 To: user@cassandra.apache.org
 Subject: question about replicas & dynamic response to load
 
 Hello,
 
 In our project our usage pattern is likely to be quite variable -- high for
 a few days, then lower, etc.; it could vary as much as 10x (or more) from peak
 to non-peak. Also, much of our data is immutable -- but there is a
 considerable amount of it -- perhaps in the single-digit TBs. Finally, we
 are hosting with Amazon.
 
 I'm looking for advice on how to vary the number of nodes dynamically, in
 order to reduce our hosting costs at non-peak times. I worry that just
 adding new nodes in response to demand will make things worse -- at least
 temporarily -- as the new node copies data to itself; then bringing it down
 will also cause a degradation.
 
 I'm wondering if it is possible to bring up exact copies of other nodes? Or
 alternately to take down a populated node containing (only?) immutable data,
 then bring it up again when the need arises?
 
 Are there reference/reading materials(/blogs) concerning dynamically varying
 number of nodes in response to demand?
 
 Thanks!
 
 -- Shaun
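
A minimal sketch of the queue-smoothing idea Aaron suggests above: a consumer
that drains a message queue into Cassandra at a bounded rate. The queue
object, message shape and rate cap are hypothetical placeholders; real SQS or
RabbitMQ clients differ:

    import time

    MAX_WRITES_PER_SEC = 500  # hypothetical cap, sized to what the cluster sustains

    def drain(queue, column_family):
        # The queue (SQS, RabbitMQ, ...) soaks up bursts; this loop writes at
        # a fixed maximum rate so the cluster never sees the peak directly.
        interval = 1.0 / MAX_WRITES_PER_SEC
        while True:
            key, columns = queue.get()   # blocks until a message is available
            column_family.insert(key, columns)
            time.sleep(interval)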
 
 
 



Re: Secondary indexes

2011-03-06 Thread aaron morton
Info on secondary indexes 
http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes

Some answers to your other questions are also in there as well as a discussion 
about the limitations. 

Hope that helps. 
Aaron



On 7/03/2011, at 3:54 PM, Mark wrote:

 I haven't looked at Cassandra since 0.6.6 and now I notice in 0.7+ there is 
 support for secondary indexes. I haven't found much material on how these are 
 used and when one should use one. Can someone point me in the right direction?
 
 Also can these be created (and deleted) as needed without affecting the 
 underlying CF? If so, I'm guessing there is no need to create another CF just 
 for indexing. What are the disadvantages of this method?
 
 Thanks all.



Re: What would be a good strategy for Storing the large text contents like blog posts in Cassandra.

2011-03-06 Thread Aditya Narayan
Thanks Aaron!

I didn't know about the upcoming facility for built-in counters. This
sounds really great for my use case! Could you let me know where I can
read more about this, if it has been blogged about somewhere?

I'll go forward with the one (entire) blog per column design.

Thanks



On Mon, Mar 7, 2011 at 5:10 AM, Aaron Morton aa...@thelastpickle.com wrote:
 [...]