After deleting some data from the cluster under Solandra, we keep seeing this assertion.

2011-11-11 Thread Jacob, Arun
After removing some data from Solandra via a Solr query, we are getting 
DecoratedKey assertions.

Our setup:
latest version of Solandra (I think it supports 0.8.6, please correct if wrong)

3 solandra nodes, with replication set to 2 and sharding set to 3.

No systems are currently running (ingest or read) other than a simple schema query:

http://tn7cldsolandra01/solandra/schema/WDPRO-NGELOG-DEV

The delete command I ran was:

curl http://tn7cldsolandra01/solandra/WDPRO-NGELOG-DEV/update -H "Content-Type: text/xml" \
  --data-binary "<delete><query>time:[0 TO 131814360]</query></delete>"

This is what’s getting dumped into one node (solandra01). I’ve run nodetool scrub and nodetool repair on this node, and nodetool repair on our solandra02 box (solandra03 is still running nodetool repair).

ERROR [ReadStage:3709] 2011-11-11 16:54:06,185 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[ReadStage:3709,5,main]
java.lang.AssertionError: DecoratedKey(144225997531208877267913204104447190682, 44434c4f55442d52414e44592d444556) != DecoratedKey(144225997531208877267913204104447190682, 313434323235393937353331323038383737323637393133323034313034343437313930363832efbfbf736861726473) in /data/dcloud-querysvc/data/L/SI-g-94-Data.db
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:59)
    at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:66)
    at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1407)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1304)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1261)
    at org.apache.cassandra.db.Table.getRow(Table.java:385)
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:61)
    at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:668)
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1133)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
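
For reference, the two row keys in the assertion are hex-encoded byte strings. A quick decoding sketch (not part of the original post, purely illustrative) shows what they contain:

# Decode the two hex row keys from the assertion above (illustrative sketch).
expected = bytes.fromhex("44434c4f55442d52414e44592d444556")
actual = bytes.fromhex("313434323235393937353331323038383737323637393133323034313034343437313930363832efbfbf736861726473")
print(expected)                # b'DCLOUD-RANDY-DEV'
print(actual.decode("utf-8"))  # '144225997531208877267913204104447190682' + U+FFFF + 'shards'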


Solandra: connection refused errors

2011-10-06 Thread Jacob, Arun
I'm seeing this error when trying to insert data into a core I've defined in 
Solandra

INFO [pool-7-thread-319] 2011-10-06 16:21:34,328 HttpMethodDirector.java (line 445) Retrying request
INFO [pool-7-thread-1070] 2011-10-06 16:21:34,328 HttpMethodDirector.java (line 445) Retrying request
INFO [pool-7-thread-335] 2011-10-06 16:21:34,327 HttpMethodDirector.java (line 439) I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO [pool-7-thread-335] 2011-10-06 16:21:34,329 HttpMethodDirector.java (line 445) Retrying request
ERROR [1926426205@qtp-673795938-11] 2011-10-06 16:21:34,327 SolrException.java (line 151) org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused


Has anyone seen this behavior before? The problem is that it seems to be intermittent (if it were failing all of the time, I would suspect a port or IP misconfiguration).




invalid shard name encountered

2011-10-06 Thread Jacob, Arun
I'm seeing this in my logs:

WARN [1832199239@qtp-673795938-0] 2011-10-06 16:15:46,424 CassandraIndexManager.java (line 364) invalid shard name encountered: WDPRO-NGELOG-DEV 1

WDPRO-NGELOG-DEV  is the name of the index I'm creating. Is there a restriction 
on characters in the name?


Cassandra Node Requirements

2011-08-24 Thread Jacob, Arun
I'm trying to determine a node configuration for Cassandra. From what I've been 
able to determine from reading around:


 1.  we need to cap data size at 50% of total node storage capacity for compaction
 2.  with RF=3, that means I effectively have only 1/6th of total storage capacity
 3.  SSDs are preferred, but of course reduce storage capacity
 4.  using standard storage means you bump up your RAM to keep as much in memory as possible

Right now we are looking at storage requirements of 42-60TB, assuming a baseline of 3TB/day and expiring data after 14-20 days (depending on use case). Based on the above, I would assume we need 252-360TB of total storage at most.
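
For concreteness, a rough sketch of that arithmetic (using the RF=3 and 50%-compaction-headroom assumptions above; a hedged estimate, not a sizing recommendation):

# Rough capacity arithmetic for the numbers above (sketch, not authoritative).
raw_low_tb, raw_high_tb = 42, 60   # live data kept at any time (3 TB/day for 14-20 days)
replication_factor = 3             # RF=3, as assumed above
compaction_headroom = 2            # keep data under ~50% of node capacity

total_low = raw_low_tb * replication_factor * compaction_headroom     # 252 TB
total_high = raw_high_tb * replication_factor * compaction_headroom   # 360 TB
usable_per_8tb_node = 8 / (replication_factor * compaction_headroom)  # ~1.33 TB, per question 1 below
print(total_low, total_high, usable_per_8tb_node)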

My questions:

 1.  is 8TB (meaning 1.33 actual TB of storage per node) a reasonable per-node storage size for Cassandra? I don't want to use SSDs because of their reduced storage capacity -- I don't want to buy hundreds of nodes just to make up for it. Given that I will be using standard drives, what is a reasonable/effective per-node storage capacity?
 2.  other than splitting the commit log onto a separate drive, is there any 
other drive allocation I should be doing?
 3.  Assuming I'm not using SSDs, what would be a good memory size for a node? 
I've heard anything from 32-48 GB, but need more guidance.

Anything else that anyone has run into? What are common configurations being 
used by others?

Thanks in advance,

-- Arun




Re: Cassandra Node Requirements

2011-08-24 Thread Jacob, Arun
Thanks for the links and the answers. The vagueness of my initial questions 
reflects the fact that I'm trying to configure for a general case — I will 
clarify below:

I need to account for a variety of use cases.
(1) They will be both read and write heavy. I was assuming that SSDs would be really good for handling the heavy read load, but with the amount of data I need to store, SSDs aren't economical.
(2) I should have clarified: the main use case has 95% of writes going to a single column family. The other CFs are going to be much smaller in relation to the primary CF, which will be sized per the numbers below. Given that this is the case, are my assumptions about storage correct for that use case?
(3) In that use case, the majority of reads will actually come from the most recently inserted 7% of the data. In other use cases, reads will be random. Another use case uses Solandra, and I am assuming that use case results in random reads.

Assuming 250-360TB of storage needed for the primary use case, I'm still trying to determine how many nodes I need to stand up to service that much data. What is a reasonable amount of storage per node? You mentioned a memory-to-storage ratio: I'm assuming that ratio trends higher the more random reads you do. Could you provide an example ratio for a heavy-read use case?

Thanks,

-- Arun

From: Edward Capriolo <edlinuxg...@gmail.com>
Reply-To: user@cassandra.apache.org
Date: Wed, 24 Aug 2011 14:54:56 -0700
To: user@cassandra.apache.org
Subject: Re: Cassandra Node Requirements



On Wed, Aug 24, 2011 at 2:54 PM, Jacob, Arun <arun.ja...@disney.com> wrote:
[...]

I would suggest checking out:
http://wiki.apache.org/cassandra/CassandraHardware
http://wiki.apache.org/cassandra/LargeDataSetConsiderations
http://www.slideshare.net/edwardcapriolo/real-world-capacity

1. we need to cap data size at 50% of total node storage capacity for compaction

False. You need 50% of the capacity of your largest column family free, with some other room for overhead. This changes all your numbers.
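
To illustrate how this changes the numbers, a hedged sketch (the 1.33 TB per-node figure and the 95% single-CF share come from earlier in this thread; the headroom and slack factors, and the 25% example, are assumptions for illustration):

# Sketch: free-space headroom scales with the largest column family, not total node capacity.
def required_node_capacity_tb(live_data_tb, largest_cf_fraction,
                              headroom_per_cf=0.5, slack=1.1):
    """Disk per node: live data + headroom proportional to the largest CF, plus slack.
    headroom_per_cf=0.5 follows the reply above; use 1.0 for room to rewrite the whole CF."""
    headroom = live_data_tb * largest_cf_fraction * headroom_per_cf
    return (live_data_tb + headroom) * slack

# ~95% of the data in one CF (as described earlier in this thread):
print(required_node_capacity_tb(1.33, 0.95))  # ~2.2 TB of disk for 1.33 TB of live data
# Largest CF holding only 25% of the data (hypothetical):
print(required_node_capacity_tb(1.33, 0.25))  # ~1.6 TB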

3. SSDs are preferred, but of course  reduce storage capacity

Avoid generalizations. Many use cases may get little benefit from SSD disks.

4. using standard storage means you bump up your RAM to keep as much in memory 
as possible.

In most cases you want to maintain some RAM / Hard disk ratio. SSD setups still 
likely need sizable RAM.

Your 3 questions are hard to answer because the hardware you need is workload dependent. It really depends on the active set: what percentage of the data is active at any time. It also depends on your latency requirements. If you are modeling something like the Wayback Machine, that has a different usage profile than a stock ticker application, which is again different from the usage patterns of an email system.

Generally people come to Cassandra because they are looking for low-latency access to read and write data. This is hard to achieve with 8TB of disk per node. The bloom filters and index files are themselves substantial with 8TB of data. You will also need a large amount of RAM on such a node to minimize disk seeks (or something super-fast like an SSD RAID-0; does this sound like a bad idea to you? It does to me :))
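
To give a feel for the bloom filter overhead mentioned above, a rough sketch using the standard bloom filter sizing formula (the row count, average row size, and false-positive rate below are illustrative assumptions, not figures from the thread):

import math

def bloom_filter_gb(num_row_keys, false_positive_rate=0.01):
    """Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits for n keys."""
    bits = -num_row_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    return bits / 8 / 1e9  # bits -> bytes -> GB

# Assume 8 TB per node at an average row size of ~10 KB -> ~800 million row keys.
rows = 8e12 / 10e3
print(f"~{bloom_filter_gb(rows):.1f} GB of bloom filter held in memory per node")  # ~1.0 GB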

The only way to answer the question of how much hardware you need is with load testing.

question on capacity planning

2011-06-29 Thread Jacob, Arun
If I'm planning to store 20TB of new data per week, and expire all data every 2 weeks, with a replication factor of 3, do I only need approximately 120 TB of disk? I'm going to use TTLs in my column values to automatically expire data. Or would I need more capacity to handle SSTable merges? Given this amount of data, would you recommend node storage at 2TB per node, or more? This application will have a heavy-write / moderate-read use profile.

-- Arun
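
For reference, a quick sketch of the arithmetic in the question above (the 2x compaction headroom and the 2 TB-per-node figure are assumptions carried over from the question and the earlier thread, not recommendations):

# Capacity arithmetic for the question above (sketch only).
weekly_tb = 20
retention_weeks = 2
rf = 3

replicated_live = weekly_tb * retention_weeks * rf  # 120 TB, the figure in the question
with_headroom = replicated_live * 2                 # ~240 TB if you also keep ~50% free for compaction
nodes_at_2tb = with_headroom / 2                    # ~120 nodes if each node has 2 TB of disk
print(replicated_live, with_headroom, nodes_at_2tb)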