Re: Nodes frozen in GC
2011/3/6 aaron morton aa...@thelastpickle.com: Your node is under memory pressure; after the GC there is still 5.7GB in use. In fact, it looks like memory usage went up during the GC process. Can you reduce the memtable size, the caches, or the number of CFs, or increase the JVM heap size? Also, is this happening under heavy load? Yes, I do bulk loads to the cluster. I reduced the memtable size to 64MB and the situation became better, but occasionally a GC of more than 15 seconds still happens. Would reducing flush_largest_memtables_at help? Or maybe some GC options?
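When chasing long GC pauses like this, a first step is usually to turn on GC logging so you can see which collector phase is responsible. The snippet below is a minimal sketch of the kind of lines one might add to cassandra-env.sh on a pre-Java-9 JVM of that era; the heap size and occupancy fraction are placeholder values to be tuned, not recommendations:

```shell
# Hypothetical cassandra-env.sh additions -- all values are examples only.
JVM_OPTS="$JVM_OPTS -Xms4G -Xmx4G"                          # min heap == max heap avoids resize pauses
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"                # CMS was the usual collector for 0.6/0.7
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"  # start CMS earlier under write pressure
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"      # inspect this log for long CMS/full-GC phases
```

With the log in hand, pauses caused by promotion failures or full collections are easy to distinguish from normal CMS cycles.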
Re: cant seem to figure out secondary index definition
Hi Roshan, could you please post a small sample from your yaml file? As documentation of indexes is quite sparse, we're grateful for any working example. Cheers, Jürgen On 04.03.2011 19:27, Roshan Dawrani wrote: On Fri, Mar 4, 2011 at 11:52 PM, Jürgen Link juergen.l...@googlemail.com wrote: Hi Jonathan, as Roland is already out of office, I'd like to jump in. Maybe this somehow got lost in the middle of this thread: indexing works fine in our real Cassandra cluster. For our test cases, we use an embedded Cassandra instance, which is configured via yaml. In case indexes cannot be defined via yaml (for embedded instances), is there a preferred way to do so? Sorry, I haven't followed the whole of this thread yet, but I just noticed this mail and would like to add that we also use an embedded Cassandra instance for our dev/tests, and our yaml has a number of indexes and they work just fine. We are on Cassandra 0.7.0.
Re: cant seem to figure out secondary index definition
Hi, sure. Here is a sample of how we define it at the end of cassandra.yaml. In the keyspace MyApp, it defines a column family MyUser that has secondary indexes on 2 string columns: firstname and lastname. Does it help?

--
keyspaces:
    - name: MyApp
      replica_placement_strategy: org.apache.cassandra.locator.SimpleStrategy
      replication_factor: 1
      column_families:
        - name: MyUser
          compare_with: org.apache.cassandra.db.marshal.BytesType
          column_metadata:
            - name: firstname
              validator_class: UTF8Type
              index_type: KEYS
              index_name: FirstNameIdx
            - name: firstname
              validator_class: UTF8Type
              index_type: KEYS
              index_name: LastNameIdx
--

Regards, Roshan On Sun, Mar 6, 2011 at 4:34 PM, Jürgen Link juergen.l...@googlemail.com wrote: [...]
Re: cant seem to figure out secondary index definition
On Sun, Mar 6, 2011 at 4:54 PM, Roshan Dawrani roshandawr...@gmail.com wrote:

--
keyspaces:
    - name: firstname
      validator_class: UTF8Type
      index_type: KEYS
      index_name: LastNameIdx
--

Please correct a typo and change the name for the 2nd column to lastname in the sample. :-)
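With the typo pointed out above fixed, the second index definition in the posted sample would read (same fields, only the column name changed):

```yaml
- name: lastname
  validator_class: UTF8Type
  index_type: KEYS
  index_name: LastNameIdx
```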
Re: Nodes frozen in GC
Do you have the row cache enabled? Disable it. If that fixes it and you want the cache, re-enable it, but consider row sizes and cap the cache size. -- / Peter Schuller
Re: OOM exceptions
If it's determined that this is due to a very large row, what are my options? Thanks On 3/5/11 7:11 PM, aaron morton wrote: First question is which version are you running? Am guessing 0.6-something. If you have an OOM in the compaction thread, it may be because of a very large row. The CF information available through JConsole will give you the max row size for the CF. Your setting for RowWarningThresholdInMB is 512; however, the default setting in 0.6.12 is 64MB. Here's the inline help from storage-conf.xml for 0.6.12:

<!--
 ~ Size of compacted row above which to log a warning. If compacted
 ~ rows do not fit in memory, Cassandra will crash. (This is explained
 ~ in http://wiki.apache.org/cassandra/CassandraLimitations and is
 ~ scheduled to be fixed in 0.7.) Large rows can also be a problem
 ~ when row caching is enabled.
-->
<RowWarningThresholdInMB>64</RowWarningThresholdInMB>

In 0.7 the equivalent setting is in_memory_compaction_limit_in_mb; here's the help from the 0.7.3 conf file:

# Size limit for rows being compacted in memory. Larger rows will spill
# over to disk and use a slower two-pass compaction process. A message
# will be logged specifying the row key.
in_memory_compaction_limit_in_mb: 64

Hope that helps. Aaron On 5/03/2011, at 10:54 AM, Mark wrote: That's very nice of you.
Thanks

<Storage>
  <ClusterName>MyCluster</ClusterName>
  <AutoBootstrap>true</AutoBootstrap>
  <HintedHandoffEnabled>true</HintedHandoffEnabled>
  <IndexInterval>128</IndexInterval>
  <Keyspaces>
    <Keyspace Name="MyCompany">
      <ColumnFamily Name="SearchLog" ColumnType="Super" CompareWith="TimeUUIDType" CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ProductLog" ColumnType="Super" CompareWith="TimeUUIDType" CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="RequestLog" ColumnType="Super" CompareWith="TimeUUIDType" CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ClickLog" ColumnType="Super" CompareWith="TimeUUIDType" CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ItemTranslation" ColumnType="Super" CompareWith="BytesType" CompareSubcolumnsWith="BytesType"/>
      <ColumnFamily Name="ItemTranslationIndex" CompareWith="BytesType"/>
      <ColumnFamily Name="RelatedObject" CompareWith="LongType"/>
      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>2</ReplicationFactor>
      <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>
  </Keyspaces>
  <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
  <InitialToken></InitialToken>
  <SavedCachesDirectory>/var/lib/cassandra/saved_caches</SavedCachesDirectory>
  <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
  <DataFileDirectories>
    <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
  </DataFileDirectories>
  <Seeds>
    <Seed>cassandra1</Seed>
    <Seed>cassandra2</Seed>
  </Seeds>
  <RpcTimeoutInMillis>1</RpcTimeoutInMillis>
  <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB>
  <ListenAddress></ListenAddress>
  <StoragePort>7000</StoragePort>
  <ThriftAddress></ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <ThriftFramedTransport>false</ThriftFramedTransport>
  <DiskAccessMode>mmap_index_only</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>1</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>
  <DoConsistencyChecksBoolean>true</DoConsistencyChecksBoolean>
</Storage>

On 3/4/11 1:05 PM, Narendra Sharma wrote: I have been through tuning for GC and OOM recently. If you can provide the cassandra.yaml, I can help. Mostly I had to play with memtable thresholds. Thanks, Naren On Fri, Mar 4, 2011 at 12:43 PM, Mark static.void@gmail.com wrote: We have 7 column families and we are not using the default key cache (20). These were our initial settings, so it was not in response to anything. Would you recommend anything else? Thanks On 3/4/11 12:34 PM, Chris Burroughs wrote: - Are you using a key cache? How many keys do you have? Across how many column families? Your configuration is unusual both in terms of not setting min heap == max heap and the
What would be a good strategy for Storing the large text contents like blog posts in Cassandra.
What would be a good strategy to store large text content (blog posts of around 1500-3000 characters) in Cassandra? I need to store these blog posts along with their metadata, like bloggerId and blogTags. I am looking at storing this data in a single row, giving each attribute its own column, so one blog per row. Is using a single column for a large blog post like this a good strategy?

Next, I also need to store the blog comments, which I am planning to keep together in another single row, one comment per column. The entire information about a single comment, like commentBody and commentor, would be serialized (using Google Protocol Buffers) and stored in a single column. For storing the number of likes of each comment, I am planning to keep a counter column in the same row for each comment, holding the number of 'likes' of that comment. Any suggestions on the above design are highly appreciated. Thanks.
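As a rough illustration of the layout being proposed, here is a sketch using plain Python dicts to stand in for Cassandra rows, and json to stand in for Protocol Buffers; all names and values are made up for the example:

```python
import json

# One row per blog post: each attribute gets its own column.
blog_row = {
    "bloggerId": "user42",
    "blogTags": "cassandra,data-modeling",
    "body": "A 1500-3000 character post lives in this single column...",
}

# One row per post holding all its comments: one serialized comment per
# column, keyed by a comment id (a TimeUUID in a real schema).
comment = {"commentBody": "Nice post!", "commentor": "user7"}
comments_row = {
    "comment-0001": json.dumps(comment),  # json stands in for protobuf here
}

# Likes per comment: plain integers in this sketch (real counter columns
# only arrive with Cassandra 0.8).
likes_row = {"comment-0001": 3}

# Reading a comment back means deserializing the column value.
decoded = json.loads(comments_row["comment-0001"])
print(decoded["commentor"])  # user7
```

The sketch makes the access pattern visible: fetching one blog is one row read, and fetching all comments for a post is a second single-row read.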
Re: how large can a cluster over the WAN be?
Hi, please, can you help with the following? It will guide some of our design decisions. Are you familiar with a Cassandra cluster installed in datacenters spread across the WAN? Can you comment on the performance of such an installation? What is the largest such cluster you are aware of? Thanks a lot, Miriam On Tue, Mar 1, 2011 at 9:47 PM, Mimi Aluminium mimi.alumin...@gmail.com wrote: Hi, Are there clusters of 100 nodes? More? Please can you refer me to such installations/systems? Can you comment on over-the-WAN clusters of this size or smaller? And can you point to systems with nodes in different DCs connected by WAN (which could be dedicated links or the internet)? Thanks a lot, Miriam
Re: cant seem to figure out secondary index definition
Hi Roshan, thanks for your post. I quickly ran over it, and the only difference I can actually see is the compare_with type (we use TimeUUIDType). Any other suggestions, anyone? On 06.03.2011 12:24, Roshan Dawrani wrote: [...]
Re: OOM exceptions
Under 0.6 I am not sure off the top of my head; I would need to dig into it, though it has probably been discussed here before. Check the row size and let us know what version you are using first. Aaron On 7/03/2011, at 5:50 AM, Mark static.void@gmail.com wrote: If it's determined that this is due to a very large row, what are my options? Thanks On 3/5/11 7:11 PM, aaron morton wrote: [...]
Re: What would be a good strategy for Storing the large text contents like blog posts in Cassandra.
Sounds reasonable: one CF for the blog posts, one CF for the comments. You could also use a single CF if you will often read the blog and the comments at the same time. The best design is the one that suits how your app works; try one and be prepared to change. Note that counters are only in the 0.8 trunk and are still under development; they are not going to be released for a couple of months. Your per-column data size is nothing to be concerned about. Hope that helps. Aaron On 7/03/2011, at 6:35 AM, Aditya Narayan ady...@gmail.com wrote: [...]
Re: OOM exceptions
Sorry, I forgot to mention: I am running 0.6.6. On 3/6/11 3:27 PM, Aaron Morton wrote: Under 0.6 I am not sure off the top of my head; I would need to dig into it, though it has probably been discussed here before. Check the row size and let us know what version you are using first. Aaron On 7/03/2011, at 5:50 AM, Mark static.void@gmail.com wrote: [...]
Re: cant seem to figure out secondary index definition
On Sun, Mar 6, 2011 at 4:49 PM, Jürgen Link juergen.l...@googlemail.com wrote: Hi Roshan, thanks for your post. I quickly ran over it, and the only difference I can actually see is the compare_with type (we use TimeUUIDType). Any other suggestions, anyone? You want to add an index on a CF with TimeUUIDType column names? I think you've probably mistaken the purpose of compare_with. If you haven't, I think you'll need to add the index programmatically in order to specify a non-ASCII/Unicode column name for the index. -- Tyler Hobbs Software Engineer, DataStax http://datastax.com/ Maintainer of the pycassa (http://github.com/pycassa/pycassa) Cassandra Python client library
Designing a decent data model for an online music shop...confused/stuck on decisions
We're in a bit of a predicament: we have an e-music store currently built in PHP using CodeIgniter/MySQL. The current system has 100K+ users and a decent song collection. Over the last few months I've been playing with Cassandra; needless to say I'm impressed, but I have a few questions.

Firstly, I want to avoid re-writing the entire site if possible, so my instinct is to replace the database layer in CodeIgniter. Is this something anyone would recommend, and are there any gotchas in doing that?

I can't say I've been terribly happy with PHP accessing Cassandra. When sample data of the same size/type was put into MySQL and into Cassandra, the pages with PHP connecting to Cassandra took longer to load (30K records in the table). I thought maybe it was my setup that needed tweaking, and I've played with as many options as I could, but the best I've gotten is matching query time. The query speed test was simply taking timestamps right before and after the query call returned. Is this something anyone else has seen? Any comments or suggestions? I've tried using Thrift, phpcassa and Pandra with pretty similar numbers.

My other thought turned to maybe it was the way I designed my CFs. At first I used super columns to model the user account CF, based on a post I read by Arin (WTF is a super column), but I later changed to using normal CFs. I'm trying to make this work, but I get the feeling my approach is somewhat mis-guided. Here's a breakdown of the current model:

CF:Users {
    uid
    fname
    lname
    username
    password
    street
}

Some additional columns are in place for a user, but I'm keeping it simple...

CF:Library {
    uid
    songid
    ... other info about the user's library
}

CF:Songs {
    songid
    title
    artistid
}

This is all still very relational-like (considering I go on to have CFs for playlists and artists) and I'm not sure if this is a good design for the data, but when I looked into combining some of the info and removing some CFs, I ran into the issue of replicating data all over the place. If, for example, I stored the artist name in the library for each record, then the artist name would be replicated for every song they have, for every user who has that song in their library. Where do you draw the line on deciding how much is okay to be replicated?

As much as I dislike the idea of building the application from scratch, I'm considering the possibility of rebuilding in Java/JSP just to get the benefit of using the hector client. (The efforts from the guys doing the PHP libs are much appreciated, but PHP doesn't seem to go too well with Cassandra.) I'm in the process of making decisions because the upgrade/rebuild needs to have a fairly steady working version for October, and I don't want to go wrong before even starting. Recommendations, suggestions and advice are all welcome. (Any experience with PHP and Cassandra is also welcome; since all my favourite libs are in PHP, I'm reluctant to turn away.)
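One way to reason about the "where do you draw the line" question is to count the write fan-out a given denormalization creates: if the artist name is copied into every library entry, renaming an artist means rewriting one column per (user, song) pair. A toy sketch, with entirely made-up data, that counts those copies:

```python
# Toy model: how many copies of an artist's name would exist if the name
# were denormalized into every user's library entry? (All data made up.)
libraries = {
    "alice": ["song1", "song2", "song3"],
    "bob":   ["song2"],
    "carol": ["song1", "song2"],
}
songs_by_artist = {"artistX": {"song1", "song2"}}

def copies_of_artist_name(artist):
    """Count library entries that would each hold a copy of the name."""
    owned = songs_by_artist[artist]
    return sum(1 for lib in libraries.values() for s in lib if s in owned)

print(copies_of_artist_name("artistX"))  # 5 copies to rewrite on a rename
```

If the counted fan-out is large and the copied value changes often, a lookup CF is the safer trade; if it rarely or never changes (song titles, artist names in practice), heavy duplication is usually acceptable in Cassandra.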
Re: Designing a decent data model for an online music shop...confused/stuck on decisions
Regarding PHP performance with Cassandra, THRIFT-638https://issues.apache.org/jira/browse/THRIFT-638was recently resolved and it shows some big performance improvements. I'll be upgrading the Thrift package that ships with phpcassa soon to include this fix, so you may want to compare performance numbers before and after. On Sun, Mar 6, 2011 at 8:03 PM, Courtney e-mailadr...@hotmail.com wrote: We're in a bit of a predicament, we have an e-music store currently built in PHP using codeigniter/mysql... The current system has 100+K users and a decent song collection. Over the last few months I've been playing with Cassandra... needless to say I'm impressed but I have a few questions. Firstly, I want to avoid re-writing the entire site if possible so my instincts have made me inclined to replace the database layer in code igniter... is this something anyone would recommend and are there any gotchas in doing that? I can't say I've been terribly happy with PHP accessing cassandra, when sample data of the same size was put into mysql and in cassandra (of the same size/type) The pages with php connecting to Cassandra took longer to load, (30K records in table). I've thought maybe it was my setup that needed tweaking and I've played with as many a options as I could but the best I've gotten is matching query time. Query speed test was simply getting time stamps right before and after query call returned... Is this something anyone else has seen, any comments suggestions? I've tried using thrift, phpcassa and pandra with pretty similar numbers. My other thought turned to maybe it was the way I designed my CFs, at first I used super columns to model user account CF based on a post I read by Arin (WTF is a super column) but I later changed to using normal CFs. I'm trying to make this work but I get the feeling my approach is somewhat...I don't mis-guided. Here's a break down of the current model. 
CF:Users { uid, fname, lname, username, password, street } -- some additional columns in place for a user, but keeping it simple.
CF:Library { uid, songid, ... other info about the user's library }
CF:Songs { songid, title, artistid }

This is all still very relational-like (considering I go on to have CFs for playlists and artists), and I'm not sure if this is a good design for the data. But when I looked into combining some of the info and removing some CFs, I ran into the issue of replicating data all over the place. If, for example, I stored the artist name in the library for each record, then the artist name would be replicated for every song they have, for every user who has that song in their library. Where do you draw the line on deciding how much is okay to be replicated?

As much as I don't like the idea of rebuilding the application from scratch, I'm considering the possibility of building from scratch in Java/JSP just to get the benefit of using the Hector client. (Efforts from the guys doing the PHP libs are much appreciated, but PHP doesn't seem to go too well with Cassandra.) I'm in the process of making decisions because the upgrade/rebuild needs a fairly steady working version by October, and I don't want to go wrong before even starting. Recommendations, suggestions, and advice are all welcome. (Any experience with PHP and Cassandra is also welcome; since all my favorite libraries are in PHP, I'm reluctant to turn away.)

-- Tyler Hobbs Software Engineer, DataStax http://datastax.com/ Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra Python client library
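The read-path difference behind the denormalization question can be shown with a toy in-memory model. Plain dicts stand in for the column families here; all names and rows (Songs, Library, "user1", the sample songs) are made up for the sketch, not from the original post.

```python
# Toy model of the trade-off: normalized needs one extra lookup per song
# (extra reads in Cassandra); denormalized serves the page from one row
# but duplicates the artist name once per (user, song) pair.

# Normalized: the library row stores only song ids.
songs = {
    "s1": {"title": "Song One", "artist": "Artist A"},
    "s2": {"title": "Song Two", "artist": "Artist A"},
}
library_normalized = {"user1": ["s1", "s2"]}

def library_page_normalized(uid):
    # One Songs lookup per song id to resolve title/artist.
    return [(songs[sid]["title"], songs[sid]["artist"])
            for sid in library_normalized[uid]]

# Denormalized: title and artist are copied into each user's library row.
library_denormalized = {
    "user1": [("Song One", "Artist A"), ("Song Two", "Artist A")],
}

def library_page_denormalized(uid):
    # Single row read serves the whole page.
    return library_denormalized[uid]
```

Both return the same page; the usual Cassandra guidance is to pay the write-time duplication when the read path (here, rendering a user's library) is the hot one.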
Re: question about replicas dynamic response to load
Thanks for the answers, Dan, Aaron. ... OK, so one question is: if I haven't made any writes at all, can I decommission without delay? (Is there a force-drop option or something, or will the cluster recognize the lack of writes?) Otherwise, I may be able to segregate writes to the reference collection so that they occur late at night and/or on weekends when I don't have much load. (NB: it would be nice to be able to control replication strategy by keyspace; as it is, I can probably put the reference data in its own cluster.) But thanks for the suggestions about a caching layer -- I had already thought of memcached (problematic, as noted, due to the amount of data), but hadn't considered some of the other options you've mentioned. I didn't know, for instance, that you could use the queueing services this way. As for S3, etc., I guess it's possible, but the costs seem to mount quickly as well. Typically I have one sporadic writer and many readers, but I do write sometimes. Another use case is to have expanded capacity for writes/reads of intermediate results while running Hadoop. Should I perhaps just start a whole other cluster for these? Gratefully, -- Shaun

On Mar 5, 2011, at 10:52 PM, aaron morton wrote: Agreed. Cassandra generally assumes a reasonably static cluster membership. There are some tricks that can be done by copying SSTables, but they will only reduce the need to stream data around, not eliminate it. This may not suit your problem domain, but speaking of AWS infrastructure, how about using the SQS messaging service (or similar, e.g. RabbitMQ) to smooth out your throughput? You could then throttle the inserts into the Cassandra cluster to a maximum level and spec your HW against that. During peaks, the message queue can soak up the overflow. Hope that helps. Aaron

On 4/03/2011, at 2:07 PM, Dan Hendry wrote: To some extent, the bootstrapping problem will be an issue with most solutions: the data has to be duplicated from somewhere.
Bootstrapping should not cause much performance degradation unless you are already pushing capacity limits. It's the decommissioning problem which makes Cassandra somewhat problematic in your case. You grow your cluster 5x, then write to it. You have to perform a proper decommission when shrinking the cluster again, which involves validating and streaming data to the remaining replicas: a fairly serious operation with TBs of data. For most realistic situations, unless the cluster is completely read-only, you can't just kill most of the nodes in the cluster. I can't really think of a good, general way to do this with just Cassandra, although there may be some hacktastical possibilities. I think a more statically sized Cassandra cluster with a variable cache layer (memcached or similar) in front is probably a better solution, though this option kind of falls apart at the terabytes-of-data range. Have you considered using S3, Amazon CloudFront, or some other CDN instead of rolling your own solution? For immutable data, that's what they excel at. Cassandra has amazing write capacity and its design focus is on scaling writes; I would not really consider it a good tool for the job of serving massive amounts of static content. Dan

-----Original Message----- From: Shaun Cutts [mailto:sh...@cuttshome.net] Sent: March-03-11 13:00 To: user@cassandra.apache.org Subject: question about replicas dynamic response to load

Hello, In our project our usage pattern is likely to be quite variable -- high for a few days, then lower, etc.; it could vary by as much as 10x (or more) from peak to off-peak. Also, much of our data is immutable -- but there is a considerable amount of it, perhaps in the single-digit TBs. Finally, we are hosting with Amazon. I'm looking for advice on how to vary the number of nodes dynamically, in order to reduce our hosting costs at non-peak times.
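The queue-smoothing idea suggested in this thread (SQS/RabbitMQ soaking up bursts while writes drain into the cluster at a capped rate) amounts to a token bucket in front of the datastore. This is a minimal generic sketch, not any real SQS or Cassandra client code; the class name, the rate, and the `sink` callback are all placeholders.

```python
# Hypothetical sketch: producers enqueue writes at a bursty rate; a
# consumer drains the queue into the datastore at a fixed maximum rate.
import collections
import time

class ThrottledDrainer:
    """Drain a burst-absorbing queue into a sink at a capped rate (token bucket)."""

    def __init__(self, max_writes_per_sec, clock=time.monotonic):
        self.rate = float(max_writes_per_sec)
        self.queue = collections.deque()
        self.tokens = 0.0
        self.clock = clock
        self.last = clock()

    def enqueue(self, write):
        # The message queue (SQS, RabbitMQ, ...) soaks up the peak overflow.
        self.queue.append(write)

    def drain(self, sink):
        # Refill tokens for elapsed time, capped at one second's allowance.
        now = self.clock()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        done = 0
        while self.queue and self.tokens >= 1.0:
            sink(self.queue.popleft())  # e.g. the actual insert into the cluster
            self.tokens -= 1.0
            done += 1
        return done
```

The point of the design is that the cluster only ever sees `max_writes_per_sec`, so hardware can be specced for that rate rather than for the peak.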
I worry that just adding new nodes in response to demand will make things worse -- at least temporarily -- as the new node copies data to itself; then bringing it down will also cause a degradation. I'm wondering if it is possible to bring up exact copies of other nodes? Or alternately, to take down a populated node containing (only?) immutable data, then bring it up again when the need arises? Are there reference/reading materials (or blogs) concerning dynamically varying the number of nodes in response to demand? Thanks! -- Shaun
Re: Secondary indexes
Info on secondary indexes: http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes Some answers to your other questions are also in there, as well as a discussion of the limitations. Hope that helps. Aaron

On 7/03/2011, at 3:54 PM, Mark wrote: I haven't looked at Cassandra since 0.6.6, and now I notice that in 0.7+ there is support for secondary indexes. I haven't found much material on how these are used and when one should use them. Can someone point me in the right direction? Also, can these be created (and deleted) as needed without affecting the underlying CF? If so, I'm guessing there is no need to create another CF just for indexing. What are the disadvantages of this method? Thanks all.
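Conceptually, a 0.7 KEYS index maintains a hidden mapping from each indexed column value to the row keys holding it, updated as rows are written. This toy in-memory model illustrates that idea only; the dicts, column names, and sample rows are invented for the sketch and do not reflect Cassandra's actual on-disk structures.

```python
# Toy model of a KEYS-style secondary index: alongside the column family
# (row key -> columns), keep an index mapping indexed value -> row keys.
from collections import defaultdict

users = {}                       # row key -> {column: value}
state_index = defaultdict(set)   # indexed column value -> set of row keys

def insert(row_key, columns):
    # Remove the old index entry before applying the update, as the index
    # must track the current value of the indexed column.
    old = users.get(row_key, {})
    if "state" in old:
        state_index[old["state"]].discard(row_key)
    users[row_key] = {**old, **columns}
    if "state" in users[row_key]:
        state_index[users[row_key]["state"]].add(row_key)

def get_indexed_slices(value):
    # Rough analogue of an indexed query like "get rows where state = value".
    return sorted(state_index[value])

insert("bob", {"state": "TX"})
insert("sue", {"state": "TX"})
insert("amy", {"state": "CA"})
```

Because the index is maintained per write, updating an indexed column pays an extra read-and-reindex cost, which is one of the limitations the blog post linked above discusses.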
Re: What would be a good strategy for Storing the large text contents like blog posts in Cassandra.
Thanks, Aaron!! I didn't know about the upcoming facility for built-in counters. This sounds really great for my use case!! Could you let me know where I can read more about this, if it has been blogged about somewhere? I'll go forward with the one-(entire)-blog-per-column design. Thanks

On Mon, Mar 7, 2011 at 5:10 AM, Aaron Morton aa...@thelastpickle.com wrote: Sounds reasonable: one CF for the blog posts, one CF for the comments. You could also use a single CF if you will often read the blog and the comments at the same time. The best design is the one that suits how your app works; try one and be prepared to change. Note that counters are only in the 0.8 trunk and are still under development; they are not going to be released for a couple of months. Your per-column data size is nothing to be concerned about. Hope that helps. Aaron

On 7/03/2011, at 6:35 AM, Aditya Narayan ady...@gmail.com wrote: What would be a good strategy to store large text content (blog posts of around 1500-3000 characters) in Cassandra? I need to store these blog posts along with their metadata like bloggerId and blogTags. I am looking to store this data in a single row, giving each attribute a single column -- so one blog per row. Is using a single column for a large blog post like this a good strategy? Next, I also need to store the blog comments, which I am planning to store all in another single row, one comment per column. Thus the entire information about a single comment, like commentBody and commenter, would be serialized (using Google Protocol Buffers) and stored in a single column. For storing the number of likes of each comment, I am planning to keep a counter column, in the same row, for each comment, holding a number specifying the number of 'likes' of that comment. Any suggestions on the above design are highly appreciated. Thanks.
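The row layout described in this thread (one blog per row with one column per attribute; one row per post holding all comments, one serialized comment per column plus a likes counter column) can be sketched with dicts standing in for the column families. JSON is used here in place of Protocol Buffers for self-containment, and every name and row is illustrative, not from the original posts.

```python
# Hypothetical sketch of the blog/comments layout discussed above.
import json

# One blog post per row; each attribute is a column.
blog_posts = {
    "blog42": {
        "bloggerId": "user7",
        "blogTags": "cassandra,data-modeling",
        "body": "Full post text, possibly a few thousand characters...",
    }
}

# One row holds all comments for a post: one serialized comment per column,
# with a per-comment likes counter column alongside it (in 0.8+, that
# column could be a real counter column).
blog_comments = {
    "blog42": {
        "comment-001": json.dumps(
            {"commentBody": "Nice post", "commenter": "user9"}),
        "comment-001-likes": 3,
    }
}

def read_comment(blog_id, comment_id):
    # Deserialize the comment column and attach its likes counter.
    row = blog_comments[blog_id]
    comment = json.loads(row[comment_id])
    comment["likes"] = row[comment_id + "-likes"]
    return comment
```

With this shape, rendering a post's comment thread is a single row read, which matches the one-row-per-post design being proposed.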