Re: avro + cassandra + ruby

2010-11-17 Thread Benjamin Black
Cassandra.new(keyspace, server, {:protocol =>
Thrift::BinaryProtocolAccelerated})

On Tue, Nov 16, 2010 at 5:13 PM, Ryan King r...@twitter.com wrote:
 On Tue, Nov 16, 2010 at 10:25 AM, Jonathan Ellis jbel...@gmail.com wrote:
 On Tue, Sep 28, 2010 at 6:35 PM, Ryan King r...@twitter.com wrote:
 One thing you should try is to make thrift use
 BinaryProtocolAccelerated, rather than the pure-ruby implementation
 (we should change the default).

 Dumb question time: how do you do this?

 $ find . -name '*.rb' | xargs grep -i binaryprotocol

 in the fauna cassandra gem repo turns up no hits.

 I believe we're relying on the default from thrift_client (which
 defaults to BinaryProtocol): https://github.com/fauna/thrift_client/

 -ryan



Re: How to Retrieve all the rows from a ColumnFamily

2010-09-27 Thread Benjamin Black
http://wiki.apache.org/cassandra/FAQ#iter_world
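
For reference, a rough sketch of the paging approach that FAQ entry describes,
using the 0.6 Thrift Java API (get_range_slices, paging by key); Keyspace1,
Standard1 and localhost are placeholders, and the same loop works whether you
have a single node or a cluster:

    import java.util.List;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class IterateAllRows {
        public static void main(String[] args) throws Exception {
            TSocket socket = new TSocket("localhost", 9160);
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
            socket.open();

            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 100));
            ColumnParent parent = new ColumnParent("Standard1");

            String startKey = "";
            while (true) {
                // Page through the ring 100 keys at a time, restarting each
                // page at the last key seen (the range is start-inclusive).
                KeyRange range = new KeyRange();
                range.setCount(100);
                range.setStart_key(startKey);
                range.setEnd_key("");
                List<KeySlice> batch = client.get_range_slices("Keyspace1", parent,
                        predicate, range, ConsistencyLevel.ONE);
                for (KeySlice slice : batch) {
                    if (!slice.getKey().equals(startKey)) {
                        System.out.println(slice.getKey());
                    }
                }
                if (batch.size() < 100) break;   // reached the end of the ring
                startKey = batch.get(batch.size() - 1).getKey();
            }
            socket.close();
        }
    }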

On Sun, Sep 26, 2010 at 11:51 PM, sekhar kosuru kosurusek...@gmail.com wrote:
 Hi
 I am new to Cassandra Database.
 I want to know how to retrieve all the records from a column family. Is this
 different on clustered servers vs. a single server?
 Please suggest a piece of code if possible.

 /Regards
 Sekhar.



Re: 0.7 memory usage problem

2010-09-27 Thread Benjamin Black
On Mon, Sep 27, 2010 at 12:59 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote:
 Thanks for the help.
 we have 2 drives using basic configurations, commitlog on one drive and data
 on another.
 and Yes the CL for writes is 3, however, the CL for reads is 1.


It is simply not possible that you are inserting at CL.ALL (which is
what I assume you mean by CL for writes is 3) given how frequently you
are flushing memtables.  Flushing every 1.7 seconds with 300,000 ops
and your 60 columns per row indicates you are inserting 3000 rows/sec,
not 600.  The behavior shown in those logs is almost certainly from
inserting with CL.ZERO.  The code you provided does not include the
definition of _writeConsistencyLevel.  Where is that set and what is
it set to?


b


Re: 0.7 memory usage problem

2010-09-27 Thread Benjamin Black
On Mon, Sep 27, 2010 at 2:51 PM, Benjamin Black b...@b3k.us wrote:
 On Mon, Sep 27, 2010 at 12:59 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote:
 Thanks for the help.
 we have 2 drives using basic configurations, commitlog on one drive and data
 on another.
 and Yes the CL for writes is 3, however, the CL for reads is 1.


 It is simply not possible that you are inserting at CL.ALL (which is
 what I assume you mean by CL for writes is 3) given how frequently you
 are flushing memtables.  Flushing every 1.7 seconds with 300,000 ops
 and your 60 columns per row indicates you are inserting 3000 rows/sec,

Sorry, that should be _5000_ rows/sec, not 3000.


b


Re: UnavailableException when data grows

2010-09-27 Thread Benjamin Black
Your ring is wildly unbalanced and you are almost certainly out of I/O
on one or more nodes.  You should be monitoring via JMX and common
systems tools to know when you are starting to have issues.  It is
going to take you some effort to get out of this situation now.
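
For the JMX part, a minimal sketch of the sort of check I mean, assuming the
default 0.6 JMX port of 8080 (host and stage name here are placeholders; the
exact MBean names are easy to confirm in jconsole):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class StageCheck {
        public static void main(String[] args) throws Exception {
            // Connect to one node's JMX agent.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://192.168.202.1:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url, null);
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();

            // Pending tasks on the read stage; a value that keeps growing is a
            // good sign the node is falling behind on I/O.
            ObjectName readStage = new ObjectName(
                    "org.apache.cassandra.concurrent:type=ROW-READ-STAGE");
            System.out.println("ROW-READ-STAGE pending: "
                    + mbeans.getAttribute(readStage, "PendingTasks"));
            connector.close();
        }
    }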


b

On Mon, Sep 27, 2010 at 2:55 PM, Rana Aich aichr...@gmail.com wrote:
 Hi Peter,
 Thanks for your detailed query...
 I have an 8-machine cluster: KVSHIGH1,2,3,4 and KVSLOW1,2,3,4. As the names suggest,
 the KVSLOWs have low disk space, ~350GB,
  whereas the KVSHIGHs have 1.5 terabytes.
 Yet my nodetool shows the following:
 192.168.202.202  Down  319.94 GB   7200044730783885730400843868815072654      |--|
 192.168.202.4    Up    382.39 GB   23719654286404067863958492664769598669     |   ^
 192.168.202.2    Up    106.81 GB   36701505058375526444137310055285336988     v   |
 192.168.202.3    Up    149.81 GB   65098486053779167479528707238121707074     |   ^
 192.168.202.201  Up    154.72 GB   79420606800360567885560534277526521273     v   |
 192.168.202.204  Up    72.91 GB    85219217446418416293334453572116009608     |   ^
 192.168.202.1    Up    29.78 GB    87632302962564279114105239858760976120     v   |
 192.168.202.203  Up    9.35 GB     87790520647700936489181912967436646309     |--|
 As you can see, one of our KVSLOW boxes is already down. It's 100% full, whereas
 boxes having 1.5 terabytes hold only 29.78 GB (192.168.202.1)! I'm using
 RandomPartitioner. When I run the client program the Cassandra daemon takes
 around 85-130% CPU.
 Regards,
 Rana


 On Mon, Sep 27, 2010 at 2:31 PM, Peter Schuller
 peter.schul...@infidyne.com wrote:

  How can I handle this kind of situation?

 In terms of surviving the problem, a re-try on the client side might
 help assuming the problem is temporary.

 However,  certainly the fact that you're seeing an issue to begin with
 is interesting, and the way to avoid it would depend on what the
 problem is. My understanding is that the UnavailableException
 indicates that the node you are talking to was unable to read
 from/write to a sufficient number of nodes to satisfy your consistency
 level. Presumably either because individual requests failed to return
 in time, or because the node considers other nodes to be flat out
 down.

 Can you correlate these issues with server-side activity on the nodes,
 such as background compaction, commitlog rotation or memtable
 flushing? Do you see your nodes saying that other nodes in the cluster
 are DOWN and UP (flapping)?

 How large is the data set in total (in terms of sstable size on disk),
 and how much memory do you have in your machines (going to page
 cache)?

 Have you observed the behavior of your nodes during compaction; in
 particular whether compaction is CPU bound or I/O bound? (That would
 tend to depend on data; generally the larger the individual values the
 more disk bound you'd tend to be.)

 Just trying to zero in on what the likely root cause is in this case.

 --
 / Peter Schuller




Re: 0.7 memory usage problem

2010-09-27 Thread Benjamin Black
What is your RF?

On Mon, Sep 27, 2010 at 3:13 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote:
  Sorry, 3 means QUORUM.


 On 9/27/2010 2:55 PM, Benjamin Black wrote:

 On Mon, Sep 27, 2010 at 2:51 PM, Benjamin Black b...@b3k.us wrote:

 On Mon, Sep 27, 2010 at 12:59 PM, Alaa Zubaidi alaa.zuba...@pdf.com
  wrote:

 Thanks for the help.
 we have 2 drives using basic configurations, commitlog on one drive and
 data
 on another.
 and Yes the CL for writes is 3, however, the CL for reads is 1.

 It is simply not possible that you are inserting at CL.ALL (which is
 what I assume you mean by CL for writes is 3) given how frequently you
 are flushing memtables.  Flushing every 1.7 seconds with 300,000 ops
 and your 60 columns per row indicates you are inserting 3000 rows/sec,

 Sorry, that should be _5000_ rows/sec, not 3000.


 b



 --
 Alaa Zubaidi
 PDF Solutions, Inc.
 333 West San Carlos Street, Suite 700
 San Jose, CA 95110  USA
 Tel: 408-283-5639 (or 408-280-7900 x5639)
 fax: 408-938-6479
 email: alaa.zuba...@pdf.com





Re: 0.7 memory usage problem

2010-09-27 Thread Benjamin Black
Does that mean you are doing 600 rows/sec per process or 600/sec total
across all processes?

On Mon, Sep 27, 2010 at 3:14 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote:
  It's actually split across 8 different processes that are doing the insertion.

 Thanks

 On 9/27/2010 2:03 PM, Peter Schuller wrote:

 [note: i put user@ back on CC but I'm not quoting the source code]

  Here is the code I am using (this is only for testing Cassandra; it is not
  going to be used in production). I am new to Java, but I tested this and it
  seems to work fine when running for a short amount of time:

  If you mean to ask about how to distribute writes - the general
 recommendation is to use a high-level Cassandra client (such as Hector
 at http://github.com/rantav/hector or Pelops at
 http://github.com/s7/scale7-pelops) rather than using the Thrift API
 directly. This is probably especially a good idea if you're new to
 Java as you say.

 But in any case, if you're having performance issues w.r.t. the write
 speed - are you in fact doing writes concurrently or is it a single
 sequential client doing the insertions? If you are maxing out without
 being disk bound, make sure that in addition to spreading writes
 across all nodes in the cluster, you are submitting writes with
 sufficient concurrency to allow Cassandra to scale to use available
 CPU across all cores.


 --
 Alaa Zubaidi
 PDF Solutions, Inc.
 333 West San Carlos Street, Suite 700
 San Jose, CA 95110  USA
 Tel: 408-283-5639 (or 408-280-7900 x5639)
 fax: 408-938-6479
 email: alaa.zuba...@pdf.com





Re: 0.7 memory usage problem

2010-09-27 Thread Benjamin Black
On Mon, Sep 27, 2010 at 3:48 PM, Alaa Zubaidi alaa.zuba...@pdf.com wrote:
  RF=2

With RF=2, QUORUM and ALL are the same.  Again, your logs show you are
attempting to insert about 180,000 columns/sec.  The only way that is
possible with your hardware is if you are using CL.ZERO.  The
available information does not add up.


b


Re: Curious as to how Cassandra handles the following

2010-09-26 Thread Benjamin Black
On Sun, Sep 26, 2010 at 11:04 AM, Lucas Nodine lucasnod...@gmail.com wrote:
 I'm looking at a design where multiple clients will connect to Cassandra and
 get/mutate resources, possibly concurrently.  After planning a bit, I ran
 into the following scenario, for which I have not been able to find an
 answer sufficient for my needs.  I have found where others have
 recommended Zookeeper for such tasks, but I want to determine if there is a
 simpler solution before including another product in my design.

 Make the following assumptions for all of the situations below:
 there are multiple clients, where a client is someone accessing Cassandra
 using Thrift.  All reads and writes are performed using the QUORUM
 consistency level.

 Situation 1:
 Client A (A) connects to Cassandra and requests a QUORUM consistency level
 get of an entire row.  At or very shortly thereafter (before A's request
 completes), Client B (B) connects to Cassandra and inserts (or mutates) a
 column (or multiple columns) within the row.

 Does A receive the new data saved by B or does A receive the data prior to
 B's save?


Depends on the exact order of operations across several nodes.  Since
you can't know what that ordering will be (or what it was), you can't
predict whether you see the pre- or post-update version.

 Situaton 2:
 B connects and mutates multiple columns within a row.  A requests some data
 therein while B is processing.

 Result?


Which call was used to make the changes?

 Situation 3:
 B mutates multiple columns within multiple rows.  A requests some data
 therein while B is processing.

 Result?


Undefined, as in situation 1.

 Justification: At certain points I want to essentially lock a resource (row)
 in cassandra for exclusive write access (think checkout a resource) by
 setting a flag value of a column within that row.  I'm just considering race
 conditions.


If you really can't fix your design to avoid locks, then you need a
system to permit locking.  That usually means Zookeeper.
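
If you do go the Zookeeper route, the usual recipe is one ephemeral sequential
znode per lock attempt.  A rough sketch only (a real recipe also watches the
predecessor znode instead of retrying, and this assumes the parent lock znode
already exists):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class RowLock {
        // Try to take the lock for one resource by creating an ephemeral
        // sequential znode under its lock directory; we hold the lock if ours
        // has the lowest sequence number.  The ephemeral node disappears if the
        // client session dies, so a crashed client cannot leave the row locked.
        public static boolean tryLock(ZooKeeper zk, String lockDir) throws Exception {
            String ourNode = zk.create(lockDir + "/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            List<String> children = zk.getChildren(lockDir, false);
            Collections.sort(children);
            return ourNode.endsWith(children.get(0));
        }
    }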


b


Re: Curious as to how Cassandra handles the following

2010-09-26 Thread Benjamin Black
On Sun, Sep 26, 2010 at 4:01 PM, Lucas Nodine lucasnod...@gmail.com wrote:
 Ok, so based on everyone's input it seems that I need to put some sort of
 server in front of Cassandra to handle locking and exclusive access.

 I am planning on building a system (DMS) that will store resources
 (document, images, media, etc) using Cassandra for data.  As my target user
 is going to be someone without any understanding of a 'diff' I have elected
 for locking instead of conflict resolution in versions.

Good thing, as versioned conflict resolution is not available in Cassandra.


b


Re: 0.7 memory usage problem

2010-09-25 Thread Benjamin Black
Looking further, I would expect your 36000 writes/sec to trigger a
memtable flush every 8-9 seconds (which is already crazy), but you are
actually flushing them every ~1.7 seconds, leading me to believe you
are writing a _lot_ faster than you think you are.

 INFO [ROW-MUTATION-STAGE:21] 2010-09-24 13:13:23,203
ColumnFamilyStore.java (line 422) switching in a fresh Memtable for
HiFreq at 
CommitLogContext(file='C:\Cassandra\Cass07\commitlog\CommitLog-1285358848765.log',
position=13796967)
 INFO [ROW-MUTATION-STAGE:4] 2010-09-24 13:13:25,171
ColumnFamilyStore.java (line 422) switching in a fresh Memtable for
HiFreq at 
CommitLogContext(file='C:\Cassandra\Cass07\commitlog\CommitLog-1285358848765.log',
position=29372124)
 INFO [ROW-MUTATION-STAGE:8] 2010-09-24 13:13:26,937
ColumnFamilyStore.java (line 422) switching in a fresh Memtable for
HiFreq at 
CommitLogContext(file='C:\Cassandra\Cass07\commitlog\CommitLog-1285358848765.log',
position=44950820)


b

On Sat, Sep 25, 2010 at 7:53 PM, Benjamin Black b...@b3k.us wrote:
 The log posted shows _10_ pending in MPF stage, and the errors show
 repeated failures trying to flush memtables at all:

  INFO [GC inspection] 2010-09-24 13:16:11,281 GCInspector.java (line
 156) MEMTABLE-POST-FLUSHER             1        10

 You are also flushing _really_ small memtables to disk (looks to be
 triggered by the default ops threshold):

  INFO [FLUSH-WRITER-POOL:1] 2010-09-24 12:55:27,296 Memtable.java
 (line 150) Writing memtable-hif...@741540175(15105576 bytes, 314640
 operations)

 Based on what you said initially:

 600 rows (60 columns per row) per second, ~3K size rows

 If that is so, you are writing 36000 columns per second to a single
 machine (why are you not distributing the client load across the
 cluster, as is best practice?).  If your RF is 3 on your 3 node
 cluster, every node is taking every write, so you are trying to
 maintain 36000 writes per second per node.  Even with a dedicated
 (spinning media) commitlog drive, you can't possibly keep up with
 that.

 What is your disk setup?

 What CL are you using for these writes?

 Can you post your client code for doing the writes?

 It is odd that you are able to do 36000/sec _at all_ unless you are
 using CL.ZERO, which would quickly lead to OOM.


 b



Re: Backporting Data Center Shard Strategy

2010-09-22 Thread Benjamin Black
You might be confusing the RackAware strategy (which puts 1 replica in
a remote DC) and the DatacenterShard strategy (which puts M of N
replicas in remote DCs).  Both are in 0.6.5.

https://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.6.5/src/java/org/apache/cassandra/locator/DatacenterShardStategy.java

On Tue, Sep 21, 2010 at 10:23 PM, rbukshin rbukshin rbuks...@gmail.com wrote:
 The one in 0.6 doesn't allow controlling the number of replicas to place in
 the other DC. At most 1 copy of the data can be placed in the other DC.

 What are other differences between the implementation in 0.6 vs 0.7?



 On Tue, Sep 21, 2010 at 10:03 PM, Benjamin Black b...@b3k.us wrote:

 DCShard is in 0.6.  It has been rewritten in 0.7.

 On Tue, Sep 21, 2010 at 10:02 PM, rbukshin rbukshin rbuks...@gmail.com
 wrote:
  Is there any plan to backport DataCenterShardStrategy to 0.6.x from 0.7?
  It
  will be very useful for those who don't want to make drastic changes in
  their code and get the benefits of this replica placement strategy.
 
  --
  Thanks,
  -rbukshin
 
 



 --
 Thanks,
 -rbukshin




Re: Backporting Data Center Shard Strategy

2010-09-21 Thread Benjamin Black
DCShard is in 0.6.  It has been rewritten in 0.7.

On Tue, Sep 21, 2010 at 10:02 PM, rbukshin rbukshin rbuks...@gmail.com wrote:
 Is there any plan to backport DataCenterShardStrategy to 0.6.x from 0.7? It
 will be very useful for those who don't want to make drastic changes in
 their code and get the benefits of this replica placement strategy.

 --
 Thanks,
 -rbukshin




Re: timestamp parameter for Thrift insert API ??

2010-09-21 Thread Benjamin Black
On Mon, Sep 20, 2010 at 7:25 PM, Kuan(謝冠生) lakersg...@mail2000.com.tw wrote:
 When using the cassandra-cli tool, we don't have to input a timestamp during
 insertion. Does it mean that Cassandra has time synchronization built in
 already?

No, it means the cassandra-cli program is inserting a timestamp, which
it then provides to the cluster via thrift, just like any other
client.
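
For example, with the 0.6 Thrift API it looks roughly like this (a sketch only,
imports omitted; Keyspace1/Standard1 are the sample schema names and 'client'
is an already-opened Cassandra.Client):

    // Microsecond timestamp chosen by the client, not by the cluster.
    long timestamp = System.currentTimeMillis() * 1000;
    ColumnPath path = new ColumnPath("Standard1");
    path.setColumn("first".getBytes("UTF-8"));
    client.insert("Keyspace1", "jsmith", path,
            "John".getBytes("UTF-8"), timestamp, ConsistencyLevel.QUORUM);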

 Since Cassandra depends on the timestamp parameter very much (both
 read/write), the most ideal way to deal with timestamps is by Cassandra
 itself, considering data safety and consistency.


This doesn't fix anything, unfortunately.  Time synchronization/event
ordering in distributed systems is a notoriously hard problem.  Having
Cassandra nodes (remember, there are many in a cluster) assign
timestamps just means their clocks need to be tightly synchronized,
exactly as is the case for having clients insert timestamps.  They
will never be in sync enough to deal with badly designed apps
attempting to simultaneously write to the same cell.

Further, as jbellis mentioned, there are other reasons to not want the
_current_ time used as the timestamp.  The end result is that it is
neither advantageous nor desirable to have cluster nodes assign
timestamps.  If that is the requirement, you need to a) fix your
application, b) use a locking service like Zookeeper, or c) use an
ACID database.


b


Re: Cassandra performance

2010-09-17 Thread Benjamin Black
It appears you are doing several things that assure terrible
performance, so I am not surprised you are getting it.

On Tue, Sep 14, 2010 at 3:40 PM, Kamil Gorlo kgs4...@gmail.com wrote:
 My main tool was stress.py for benchmarks (or an equivalent written in
 C++ to deal with python 2.5's lack of multiprocessing). I will focus only
 on reads (random with normal distribution, which is the default in
 stress.py) because writes were /quite/ good.

 I have 8 machines (Xen guests with a dedicated pair of 2TB SATA disks
 combined in RAID-0 for every guest). Every machine has 4 individual
 cores of 2.4 GHz and 4GB RAM.


First problem: I/O in Xen is very poor and Cassandra is generally very
sensitive to I/O performance.

 Cassandra commitlog and data dirs were on the same disk,

This is not recommended if you want best performance.  You should have
a dedicated commitlog drive.

  I gave 2.5GB
  of heap to Cassandra; key and row caches were disabled (standard
  Keyspace1 schema, all tests use the Standard1 CF).
  All other options were
  defaults. I disabled the caches because I was testing random (or semi-
  random - normal distribution) reads so they wouldn't help much (and
  also because 4GB of RAM is not a lot).


Disabling row cache in this case makes sense, but disabling key cache
is probably hurting your performance quite a bit.  If you wrote 20GB
of data per node, with narrow rows as you describe, and had default
memtable settings, you now have a huge number of sstables on disk.
You did not indicate you use nodetool compact to trigger a major
compaction, so I'm assuming you did not.

 For first test I installed Cassandra on only one machine to test it
 and remember results for further comparisons with large cluster and
 other DBs.

 1) RF was set to 1. I've inserted ~20GB of data (this is number
 reported in load column form nodetool ring output) using stress.py
 (100 colums per row). Then I've tested reads and got 200 rows/second
 (reading 100 columns per row, CL=ONE, disks were bottleneck, util was
 100%). There was no other operation pending during reads (compaction,
 insertion, etc..).


This is normal behavior under random reads for _any_ database.  If
the dataset can't fit in RAM, you are I/O bound.  I don't know why you
would expect anything else.  You did not indicate your disk access
mode, but if it is mmap and you are not using code that calls
mlockall, then with that size dataset you are almost certainly
swapping, as well.  You can check that with vmstat.

Given the combination of very little RAM in comparison to the data
set, very little disk I/O, key caching disabled, a large number of
sstables, and likely mmap I/O without mlockall, you have created about
the worst possible setup.  If you are _actually_ dealing with that
much data AND random reads, then you either need enough RAM to hold it
all, or you need SSDs.  And that is not specific to Cassandra.

If you are saying you have similarly misconfigured MySQL and still
gotten better performance, then kudos.  You are very lucky.


b


Re: questions on cassandra (repair and multi-datacenter)

2010-09-16 Thread Benjamin Black
On Thu, Sep 16, 2010 at 3:19 PM, Gurpreet Singh
gurpreet.si...@gmail.com wrote:
 1.  I was looking to increase the RF to 3. This process entails changing the
 config and calling repair on the keyspace one at a time, right?
 So, I started with one node at a time, changed the config file on the first
 node for the keyspace, restarted the node. And then called a nodetool repair
 on the node.

You need to change the RF on _all_ nodes in the cluster _before_
running repair on _any_ of them.  If nodes disagree on which nodes
should have replicas for keys, repair will not work correctly.
Different RF for the same keyspace creates that disagreement.


b


Re: Connect to localhost is ok,but the ip fails.

2010-09-09 Thread Benjamin Black
Do you mean you are changing the yaml file?  Does 'netstat -an | grep
9160' indicate cassandra is bound to ipv4 or ipv6 (tcp vs tcp6 in the
netstat output)?


b

On Thu, Sep 9, 2010 at 1:06 AM, Ying Tang ivytang0...@gmail.com wrote:
 I'm using cassandra 0.7 .
 And in storage-conf .

 # The address to bind the Thrift RPC service to
 rpc_address: localhost
 # port for Thrift to listen on
 rpc_port: 9160

 In my client , the code below works successfully.

          TSocket socket = new TSocket("localhost", 9160);
          TTransport trans =
              Boolean.valueOf(System.getProperty("cassandra.framed", "true"))
                  ? new TFramedTransport(socket) : socket;
          trans.open();

 But if i changed localhost to the localhost's ip , throws out the
 java.net.ConnectException: Connection refused.

 And the connecting to other ip also fails.

 --
 Best regards,
 Ivy Tang





Re: Connect to localhost is ok,but the ip fails.

2010-09-09 Thread Benjamin Black
when you say localhost's ip do you mean 127.0.0.1 or do you mean an
ip on its local interface?

On Thu, Sep 9, 2010 at 1:29 AM, Ying Tang ivytang0...@gmail.com wrote:
 oh.solve it.

 Change the rpc_address to my localhost's ip ,then in the client code ,the
 TSocket can connect to the ip.

 On Thu, Sep 9, 2010 at 4:14 AM, Ying Tang ivytang0...@gmail.com wrote:

 no , i didn't change the yaml file.

 On Thu, Sep 9, 2010 at 4:10 AM, Benjamin Black b...@b3k.us wrote:

 Do you mean you are changing the yaml file?  Does 'netstat -an | grep
 9160' indicate cassandra is bound to ipv4 or ipv6 (tcp vs tcp6 in the
 netstat output)?


 b

 On Thu, Sep 9, 2010 at 1:06 AM, Ying Tang ivytang0...@gmail.com wrote:
  I'm using cassandra 0.7 .
  And in storage-conf .
 
  # The address to bind the Thrift RPC service to
  rpc_address: localhost
  # port for Thrift to listen on
  rpc_port: 9160
 
  In my client , the code below works successfully.
 
           TSocket socket = new TSocket("localhost", 9160);
           TTransport trans =
               Boolean.valueOf(System.getProperty("cassandra.framed", "true"))
                   ? new TFramedTransport(socket) : socket;
           trans.open();
 
  But if i changed localhost to the localhost's ip , throws out the
  java.net.ConnectException: Connection refused.
 
  And the connecting to other ip also fails.
 
  --
  Best regards,
  Ivy Tang
 
 
 



 --
 Best regards,
 Ivy Tang





 --
 Best regards,
 Ivy Tang





Re: ganglia plugin

2010-09-09 Thread Benjamin Black
Nice!

On Wed, Sep 8, 2010 at 6:45 PM, Scott Dworkis s...@mylife.com wrote:
 in case the community is interested, my gmetric collector:

 http://github.com/scottnotrobot/gmetric/tree/master/database/cassandra/

 note i have only tested with a special csv mode of gmetric... you can bypass
 this mode and use vanilla gmetric with --nocsv, but beware it will generate
 over 100 forks on a trivial cassandra schema.  the patch for the csv enabled
 gmetric is here:

 http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=273

 -scott



Re: Connect to localhost is ok,but the ip fails.

2010-09-09 Thread Benjamin Black
correct, 0.0.0.0 is a wildcard.

On Thu, Sep 9, 2010 at 1:19 PM, Aaron Morton aa...@thelastpickle.com wrote:
 I  set this to 0.0.0.0 I think the original storage_config.xml had a comment
 that it would make thrift respond on all interfaces.
 Aaron
 On 09 Sep, 2010,at 08:37 PM, Benjamin Black b...@b3k.us wrote:

 when you say localhost's ip do you mean 127.0.0.1 or do you mean an
 ip on its local interface?

 On Thu, Sep 9, 2010 at 1:29 AM, Ying Tang ivytang0...@gmail.com wrote:
 oh.solve it.

 Change the rpc_address to my localhost's ip ,then in the client code ,the
 TSocket can connect to the ip.

 On Thu, Sep 9, 2010 at 4:14 AM, Ying Tang ivytang0...@gmail.com wrote:

 no , i didn't change the yaml file.

 On Thu, Sep 9, 2010 at 4:10 AM, Benjamin Black b...@b3k.us wrote:

 Do you mean you are changing the yaml file?  Does 'netstat -an | grep
 9160' indicate cassandra is bound to ipv4 or ipv6 (tcp vs tcp6 in the
 netstat output)?


 b

 On Thu, Sep 9, 2010 at 1:06 AM, Ying Tang ivytang0...@gmail.com wrote:
  I'm using cassandra 0.7 .
  And in storage-conf .
 
  # The address to bind the Thrift RPC service to
  rpc_address: localhost
  # port for Thrift to listen on
  rpc_port: 9160
 
  In my client , the code below works successfully.
 
           TSocket socket = new TSocket("localhost", 9160);
           TTransport trans =
               Boolean.valueOf(System.getProperty("cassandra.framed", "true"))
                   ? new TFramedTransport(socket) : socket;
           trans.open();
 
  But if i changed localhost to the localhost's ip , throws out the
  java.net.ConnectException: Connection refused.
 
  And the connecting to other ip also fails.
 
  --
  Best regards,
  Ivy Tang
 
 
 



 --
 Best regards,
 Ivy Tang





 --
 Best regards,
 Ivy Tang






Re: Azure Cloud Storage - Tables

2010-09-08 Thread Benjamin Black
They are not copying Cassandra with that, as it was in development for
some time before Cassandra was released (possibly even before
Cassandra development started).  The BigTable-esque aspects, if they
are 'copied' from anywhere, are copied from BigTable, just as they are
in Cassandra.  The underlying storage originated with their Cosmos
project, and the layers with various semantics came about at least 3
years ago (possibly more) when various things were consolidated under
the Azure umbrella.  Definitely a lot of interesting technology in
there.


b

public reference for some of that -
http://www.zdnet.com/blog/microsoft/a-microsoft-code-name-a-day-cosmos/632?tag=mantle_skin;content

On Wed, Sep 8, 2010 at 2:20 PM, Peter Harrison cheetah...@gmail.com wrote:
 Microsoft has essentially copied the Cassandra approach for its Table
 Storage. See here:

 http://www.codeproject.com/KB/azure/AzureStorage.aspx

 It is I believe a compliment of sorts, in the sense that it is a
 validation of the Cassandra approach. The reason I know about this is
 that I attended a presentation about Azure last week, and one of the
 Azure team told us all about it. The Table Storage is essentially
 Cassandra; although I guess reimplemented.

 That said it would have been better for them to actually use Cassandra
 and commit development effort to help the project than simply
 reimplement. Given their desire to promote Azure as a platform for
 open source applications I would have thought this would be a no
 brainer.



Re: Azure Cloud Storage - Tables

2010-09-08 Thread Benjamin Black
And having said all that: the Azure Table storage model doesn't look like
Cassandra.  There is a schema, there are partition keys.  It more
resembles something like VoltDB than the map of maps (of maps) of
Cassandra (and BigTable, and HBase).


b

On Wed, Sep 8, 2010 at 2:20 PM, Peter Harrison cheetah...@gmail.com wrote:
 Microsoft has essentially copied the Cassandra approach for its Table
 Storage. See here:

 http://www.codeproject.com/KB/azure/AzureStorage.aspx

 It is I believe a compliment of sorts, in the sense that it is a
 validation of the Cassandra approach. The reason I know about this is
 that I attended a presentation about Azure last week, and one of the
 Azure team told us all about it. The Table Storage is essentially
 Cassandra; although I guess reimplemented.

 That said it would have been better for them to actually use Cassandra
 and commit development effort to help the project than simply
 reimplement. Given their desire to promote Azure as a platform for
 open source applications I would have thought this would be a no
 brainer.



Re: Few questions regarding cassandra deployment on windows

2010-09-07 Thread Benjamin Black
This does not sound like a good application for Cassandra at all.  Why
are you using it?

On Tue, Sep 7, 2010 at 3:42 PM, kannan chandrasekaran
ckanna...@yahoo.com wrote:
 Hi All,

 We are currently considering Cassandra for our application.

 Platform:
 * a single-node cluster.
 * windows '08
 * 64-bit jvm

 For the sake of brevity let,
 Cassandra service =  a single node cassandra server running as an embedded
 service inside a JVM


 My use cases:
 1) Start with a schema ( keyspace and set of column families under it) in a
 cassandra service
 2) Need to be able to replicate the same schema structure (add new
 keyspace/columnfamilies with different names ofcourse).
 3) Because of some existing limitations in my application, I need to be able
 to write to the keyspace/column-families from a cassandra service and read
 the written changes from a different cassandra service. Both the write and
 the read cassandra-services are sharing the same Data directory. I
 understand that the application has to take care of any naming collisions.


 Couple Questions related to the above mentioned usecases:
 1) I want to spawn a new JVM and launch Cassandra as an embedded service
 programatically instead of using the startup.bat. I would like to know if
 that is possible and any pointers in that direction would be really helpful.
 ( use-case1)
 2) I understand that there are provisions for live schema changes in 0.7 (
 thank you guys !!!), but since I cant use a beta version in production, I am
 restricted to 0.6 for now. Is it possible to to support use-case 2 in 0.6.5
 ? More specifically, I am planning to make runtime changes to the
 storage.conf xml file followed by a cassandra service restart
 3) Can I switch the data directory at run-time ?  (use-case 3). In order to
 not disrupt read while the writes are in progress, I am thinking something
 like, copy the existing data-dir into a new location; write to a new data
 directory; once the write is complete; switch pointers and restart the
 cassandra service to read from the new directory to pick up the updated
 changes

 Any help is greatly appreciated.

 Thanks
 Kannan





Re: 4k keyspaces... Maybe we're doing it wrong?

2010-09-06 Thread Benjamin Black
On Mon, Sep 6, 2010 at 12:41 AM, Janne Jalkanen
janne.jalka...@ecyrd.com wrote:

 So if I read this right, using lots of CF's is also a Bad Idea(tm)?


Yes, lots of CFs is bad means lots of CFs is also bad.


Re: question about Cassandra error

2010-09-03 Thread Benjamin Black
You seem to be typing 0.7 commands on a 0.6 cli.  Please follow the
README in the version you are using, e.g.:

set Keyspace1.Standard2['jsmith']['first'] = 'John'

On Thu, Sep 2, 2010 at 5:35 PM, Simon Chu simonchu@gmail.com wrote:
 I downloaded cassandra 0.6.5 and ran it, got this error:

 bin/cassandra -f
  INFO 16:46:06,198 JNA not found. Native methods will be disabled.
  INFO 16:46:06,875 DiskAccessMode 'auto' determined to be mmap,
 indexAccessMode is mmap

 is this an issue?

 When I tried to run cassandra cli from the example, I got the following
 errors:

  cassandra> use Keyspace1 sc 'blah$'
  line 1:0 no viable alternative at input 'use'
  Invalid Statement (Type: 0)
  cassandra> set Standard2['jsmith']['first'] = 'John';
  line 1:13 mismatched input '[' expecting DOT

 is this a setup issue?

 Simon


Re: the process of reading and writing

2010-09-03 Thread Benjamin Black
On Thu, Sep 2, 2010 at 8:19 PM, Ying Tang ivytang0...@gmail.com wrote:
 Recently , i read the paper about Cassandra again .
 And now i have some concepts about  the reading and writing .
 We all know Cassandra uses NWR.
 When reading:
 the request ---> a random node in Cassandra. This node acts as a proxy, and
 it routes the request.
 Here,
 1. Does the proxy node route this request to the key's coordinator, and the
 coordinator then routes the request to the other N-1 nodes, OR does the proxy
 route the read request to N nodes?

The coordinator node is the proxy node.

 2. If it is the former situation, does the read repair occur on the key's
 coordinator?
    If it is the latter, does the read repair occur on the proxy node?

Depends on the CL requested.  QUORUM and ALL cause the RR to be
performed by the coordinator.  ANY and ONE cause RR to be delegated to
one of the replicas for the key.

 When writing:
 the request ---> a random node in Cassandra. This node acts as a proxy, and
 it routes the request.
 Here,
 3. Does the proxy node route this request to the key's coordinator, and the
 coordinator then routes the request to the other N-1 nodes, OR does the proxy
 route the request to N nodes?


For writes, the coordinator sends the writes directly to the replicas
regardless of CL (rather than delegating for weakly consistent CLs).

 4. N isn't the number of copies of the data, it's just a range. In this N
 range, there must be W copies, so W is the number of copies.
 So in this N range, R+W>N can guarantee the data's validity. Right?


Sorry, I can't even parse this.


b


Re: Cassandra on AWS across Regions

2010-09-02 Thread Benjamin Black
On Thu, Sep 2, 2010 at 5:52 AM, Phil Stanhope stanh...@gmail.com wrote:
 Ben, can you elaborate on some infrastructure topology issues that would
 break this approach?


As noted, the naive approach results in nodes behind the same NAT
having to communicate with each other through that NAT rather than
directly.  You can use different property files for the property snitch on
different nodes, as that is directly encoding topology.  You could do
the same with /etc/hosts.  You could do the same with DNS.  The
problem is that in all these cases you have a different view of the
world depending on where you are.  Does this node have the right
information for connecting to local nodes and remote nodes?  Is it
failing to connect to some other node because of a hostname resolution
failure, or because it has the wrong topology information, or ...?

And this only assumes 1:1 NAT.  What is the solution for PAT (which is
quite common)?  It's a deep dark hole of edge cases.  I would rather
have a dead simple 80% solution than a 100% solution with dynamics I
can't understand.


b


Re: Data Center Move

2010-09-02 Thread Benjamin Black
You will likely need to rename some of the files to avoid collisions
(they are only unique per node).  Otherwise, yes, this can work.
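
For the renaming, something along these lines is the idea.  A rough sketch
only, assuming the 0.6 on-disk naming of <ColumnFamily>-<generation>-<Component>.db;
pick an offset larger than any generation already in the target directory so
nothing collides, and verify against your actual file names first:

    import java.io.File;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RenumberSSTables {
        // Shift the generation number of every sstable component in a directory
        // by a fixed offset, so files copied in from another node cannot collide
        // with the ones already there.
        public static void main(String[] args) {
            File dir = new File(args[0]);
            int offset = Integer.parseInt(args[1]);
            Pattern name = Pattern.compile("(.+)-(\\d+)-(Data|Index|Filter)\\.db");
            for (File f : dir.listFiles()) {
                Matcher m = name.matcher(f.getName());
                if (!m.matches()) continue;
                int generation = Integer.parseInt(m.group(2)) + offset;
                File target = new File(dir,
                        m.group(1) + "-" + generation + "-" + m.group(3) + ".db");
                if (!f.renameTo(target)) {
                    System.err.println("failed to rename " + f + " -> " + target);
                }
            }
        }
    }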

On Thu, Sep 2, 2010 at 11:09 AM, Anthony Molinaro
antho...@alumni.caltech.edu wrote:
 Hi,

  We're running cassandra 0.6.4, and need to do a data center move of
 a cluster (from EC2 to our own data center).   Because of the way the
 networks are set up we can't actually connect these boxes directly, so
 the original plan of add some nodes in the new colo, let them bootstrap
 then decommission nodes in the old colo until the data is all transfered
 will not work.

 So I'm wondering if the following will work

 1. take a snapshot on the source cluster
 2. rsync all the files from the old machines to the new machines (we'd most
   likely be reducing the total number of machines, so would do things like
   take 4-5 machines worth of data and put it onto 1 machine)
 3. bring up the new machines in the new colo
 4. run cleanup on all new nodes?
 5. run repair on all new nodes?

 So will this work?  If so, are steps 4 and 5 correct?

 I realize we will miss any new data that happens between the snapshot
 and turning on writes on the new cluster, but I think we might be able
 to just tune compaction such that it doesn't happen, then just sync
 the files that change while the data transfers happen?

 Thanks,

 -Anthony

 --
 
 Anthony Molinaro                           antho...@alumni.caltech.edu



Re: Cassandra on AWS across Regions

2010-09-01 Thread Benjamin Black
It's not gossiping hostnames, it's gossiping IP addresses.  The
purpose of Peter's patch is to have the system gossip its external
address (so other nodes can connect), but bind its internal address.
As Edward notes, it helps with NAT in general, not just EC2.  Not
perfect, but a great start.


b

On Wed, Sep 1, 2010 at 2:57 PM, Andres March ama...@qualcomm.com wrote:
 Is it not possible to put the external host name in cassandra.yaml and add a
 host entry in /etc/hosts for that name to resolve to the local interface?

 On 09/01/2010 01:24 PM, Benjamin Black wrote:

 The issue is this:

 The IP address by which an EC2 instance is known _externally_ is not
 actually on the instance itself (the address being translated), and
 the _internal_ address is not accessible across regions.  Since you
 can't bind a specific address that is not on one of your local
 interfaces, and Cassandra nodes don't have a notion of internal vs
 external you need a mechanism by which a node is told to bind one IP
 (the internal one), while it gossips another (the external one).

 I like what this patch does conceptually, but would prefer
 configuration options to cause it to happen (obviously a much larger
 patch).  Very cool, Peter!


 b

 On Wed, Sep 1, 2010 at 1:10 PM, Andres March ama...@qualcomm.com wrote:

 Could you explain this point further?  Was there an exception?

 On 09/01/2010 09:26 AM, Peter Fales wrote:

 that doesn't quite work with the stock Cassandra, as it will
 try to bind and listen on those addresses and give up because they
 don't appear to be valid network addresses.

 --
 Andres March
 ama...@qualcomm.com
 Qualcomm Internet Services

 --
 Andres March
 ama...@qualcomm.com
 Qualcomm Internet Services


Re: Cassandra on AWS across Regions

2010-09-01 Thread Benjamin Black
On Wed, Sep 1, 2010 at 3:18 PM, Andres March ama...@qualcomm.com wrote:
 I thought you might say that.  Is there some reason to gossip IP addresses
 vs hostnames?  I thought that layer of indirection could be useful in more
 than just this use case.


The trade-off for that flexibility is that nodes are now dependent on
name resolution during normal operation, rather than only at startup.
The opportunities for horribly confusing failure scenarios are
numerous and frightening.  Other than NAT (which can clearly be dealt
with without gossiping hostnames), what do you think this would
enable?


b


Re: Cassandra on AWS across Regions

2010-09-01 Thread Benjamin Black
On Wed, Sep 1, 2010 at 4:16 PM, Andres March ama...@qualcomm.com wrote:
 I didn't have anything specific in mind. I understand all the issues around
 DNS and not advocating only supporting hostnames (just thought it would be a
 nice option).  I also wouldn't expect name resolution to be done all the
 time, only when the node is first being started or during initial discovery.


All nodes would have to resolve whenever topology changed.

 One use case might be when nodes are spread out over multiple networks as
 the poster describes, nodes on the same network on a private interface could
 incur less network overhead than if they go out through the public
 interface.  I'm not sure that this is even possible given that cassandra
 binds to only one interface.


This case is not actually solved more simply by gossiping hostnames.
It requires much more in-depth understanding of infrastructure
topology.


b


Re: column family names

2010-08-31 Thread Benjamin Black
Exactly.

On Mon, Aug 30, 2010 at 11:39 PM, Janne Jalkanen
janne.jalka...@ecyrd.com wrote:

 I've been doing it for years with no technical problems. However, using %
 as the escape char tends to, in some cases, confuse a certain operating
 system whose name may or may not begin with W, so using something else
 makes sense.
 However, it does require an extra cognitive step for the maintainer, since
 the mapping between filenames and logical names is no longer immediately
 obvious. Especially with multiple files this can be a pain (e.g. Chinese
 logical names which map to pretty incomprehensible sequences that are
 laborious to look up).
 So my experience suggests to avoid it for ops reasons, and just go with
 simplicity.
 /Janne
 On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote:

 Beyond aesthetics, specific reasons?

 Terje

 On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black b...@b3k.us wrote:

 URL encoding.





Re: column family names

2010-08-31 Thread Benjamin Black
This is not the Unix way for good reason: it creates all manner of
operational challenges for no benefit.  This is how Windows does
everything and automation and operations for large-scale online
services is _hellish_ because of it.  This horse is sufficiently
beaten, though.


b

On Mon, Aug 30, 2010 at 11:55 PM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Another option would of course be to store a mapping between dir/filenames
 and Keyspace/columns familes together with other info related to keyspaces
 and column families. Just add API/command line tools to look up the
 filenames and maybe store the values in the files as well for recovery
 purposes.

 Terje

 On Tue, Aug 31, 2010 at 3:39 PM, Janne Jalkanen janne.jalka...@ecyrd.com
 wrote:

 I've been doing it for years with no technical problems. However, using
 % as the escape char tends to, in some cases, confuse a certain operating
 system whose name may or may not begin with W, so using something else
 makes sense.
 However, it does require an extra cognitive step for the maintainer, since
 the mapping between filenames and logical names is no longer immediately
 obvious. Especially with multiple files this can be a pain (e.g. Chinese
 logical names which map to pretty incomprehensible sequences that are
 laborious to look up).
 So my experience suggests to avoid it for ops reasons, and just go with
 simplicity.
 /Janne
 On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote:

 Beyond aesthetics, specific reasons?

 Terje

 On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black b...@b3k.us wrote:

 URL encoding.






Re: column family names

2010-08-31 Thread Benjamin Black
Then make a CF in which you store the mappings from UTF8 (or byte[]!)
names to CFs.  Now all clients can read the same mappings.  Problem
solved.

Still not solved because you have arbitrary, uncontrolled clients
doing arbitrary, uncontrolled things in the same Cassandra cluster?
You're doing it wrong.

On Tue, Aug 31, 2010 at 7:26 AM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Sure, but as I am likely to have multiple clients (which I may not control)
 accessing a single store, I would prefer to keep such custom mappings out of
 the client for consistency reasons (much bigger problem than any of the
 operational issues highlighted so far).
 Terje
 On 31 Aug 2010, at 23:03, David Boxenhorn da...@lookin2.com wrote:

 It's not so hard to implement your mapping suggestion in your application,
 rather than in Cassandra, if you really want it.

 On Tue, Aug 31, 2010 at 1:05 PM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:

 No benefit?
 Making it easier to use column families as part of your data model is a
 fairly good benefit, at least given the somewhat special data model
 cassandra offers. Much more of a benefit than the disadvantages I can
 imagine.

 fileprefix=`sometool -fileprefix tablename`
 is something I would say is a lot more unixy than windows like.

 Sorry, I don't share your concern for large scale operations here, but
 sure, '_' does the trick for me now so thanks to Aaron for reminding me
 about that.

  Some day I am sure it will be realized that Unicode strings/byte arrays
  are useful here like in most other places in Cassandra (\w is a bit limited for
  some of us living in the non-ASCII part of the world...), but what is the
  XXX way is not the type of topic I find interesting, so another time.

 Terje


 On Tue, Aug 31, 2010 at 5:30 PM, Benjamin Black b...@b3k.us wrote:

 This is not the Unix way for good reason: it creates all manner of
 operational challenges for no benefit.  This is how Windows does
 everything and automation and operations for large-scale online
 services is _hellish_ because of it.  This horse is sufficiently
 beaten, though.


 b

 On Mon, Aug 30, 2010 at 11:55 PM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:
  Another option would of course be to store a mapping between
  dir/filenames
  and Keyspace/columns familes together with other info related to
  keyspaces
  and column families. Just add API/command line tools to look up the
  filenames and maybe store the values in the files as well for recovery
  purposes.
 
  Terje
 
  On Tue, Aug 31, 2010 at 3:39 PM, Janne Jalkanen
  janne.jalka...@ecyrd.com
  wrote:
 
  I've been doing it for years with no technical problems. However,
  using
  % as the escape char tends to, in some cases, confuse a certain
  operating
  system whose name may or may not begin with W, so using something
  else
  makes sense.
  However, it does require an extra cognitive step for the maintainer,
  since
  the mapping between filenames and logical names is no longer
  immediately
  obvious. Especially with multiple files this can be a pain (e.g.
  Chinese
  logical names which map to pretty incomprehensible sequences that are
  laborious to look up).
  So my experience suggests to avoid it for ops reasons, and just go
  with
  simplicity.
  /Janne
  On Aug 31, 2010, at 08:39 , Terje Marthinussen wrote:
 
  Beyond aesthetics, specific reasons?
 
  Terje
 
  On Tue, Aug 31, 2010 at 11:54 AM, Benjamin Black b...@b3k.us wrote:
 
  URL encoding.
 
 
 
 





Re: get_slice sometimes returns previous result on php

2010-08-30 Thread Benjamin Black
On Mon, Aug 30, 2010 at 6:05 AM, Juho Mäkinen juho.maki...@gmail.com wrote:
 The application is using the
 same cassandra thrift connection (it doesn't close it in between) and
 everything is happening inside same php process.


This is why you are seeing this problem (and is specific to connection
reuse in certain languages, not a general problem with connection
reuse).


b


Re: column family names

2010-08-30 Thread Benjamin Black
URL encoding.

On Mon, Aug 30, 2010 at 5:55 PM, Aaron Morton aa...@thelastpickle.com wrote:
 under scores or URL encoding ?
 Aaron
 On 31 Aug, 2010,at 12:27 PM, Benjamin Black b...@b3k.us wrote:

 Please don't do this.

 On Mon, Aug 30, 2010 at 5:22 AM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:
 Ah, sorry, I forgot that underscore was part of \w.
 That will do the trick for now.

 I do not see the big issue with file names though. Why not expand the
 allowed characters a bit and escape the file names? Maybe some sort of URL
 like escaping.

 Terje

 On Mon, Aug 30, 2010 at 6:29 PM, Aaron Morton aa...@thelastpickle.com
 wrote:

 Moving to the user list.
 The new restrictions were added as part of  CASSANDRA-1377 for 0.6.5 and
 0.7, AFAIK it's to ensure the file names created for the CFs can be
 correctly parsed. So it's probably not going to change.
 The names have to match the \w reg ex class, which includes the
 underscore
 character.

 Aaron

 On 30 Aug 2010, at 21:01, Terje Marthinussen tmarthinus...@gmail.com
 wrote:

 Hi,

 Now that we can make columns families on the fly, it gets interesting to
 use
 column families more as part of the data model (can reduce diskspace
 quite
 a
 bit vs. super columns in some cases).

 However, currently, the column family name validator is pretty strict
 allowing only word characters and in some cases it is pretty darned nice
 to
 be able to put something like a - inbetweenallthewords.

 Any reason to be this strict or could it be loosened up a little bit?

 Terje





Re: Cassandra HAProxy

2010-08-29 Thread Benjamin Black
On Sun, Aug 29, 2010 at 11:04 AM, Anthony Molinaro
antho...@alumni.caltech.edu wrote:


  I don't know, it seems to tax our setup of 39 extra large EC2 nodes; it's
  also closer to 24000 reqs/sec at peak since there are different tables
  (2 tables for each read and 2 for each write).


Could you clarify what you mean here?  On the face of it, this
performance seems really poor given the number and size of nodes.


b


Re: RowMutationVerbHandler.java (line 78) Error in row mutation

2010-08-28 Thread Benjamin Black
Have you tried with beta1 and is there a repro you can put in a bug
report in jira?

On Sat, Aug 28, 2010 at 11:28 AM, Todd Burruss bburr...@real.com wrote:
 Trunk



 -Original Message-
 From: Benjamin Black [...@b3k.us]
 Received: 8/28/10 10:05 AM
 To: user@cassandra.apache.org [u...@cassandra.apache.org]
 Subject: Re: RowMutationVerbHandler.java (line 78) Error in row mutation

 Todd,

 Are you using beta1 or trunk code?


 b

 On Fri, Aug 27, 2010 at 3:58 PM, B. Todd Burruss bburr...@real.com wrote:
 i got the latest code this morning.  i'm testing with 0.7


 ERROR [ROW-MUTATION-STAGE:388] 2010-08-27 15:54:58,053
 RowMutationVerbHandler.java (line 78) Error in row mutation
 org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
 cfId=1002
    at

 org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:113)
    at

 org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:372)
    at

 org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:382)
    at

 org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:340)
    at

 org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:46)
    at

 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:50)
    at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)





Re: Cassandra HAProxy

2010-08-28 Thread Benjamin Black
Because you create a bottleneck at the HAProxy and because the
presence of the proxy precludes clients properly backing off from
nodes returning errors.  The proper approach is to have clients
maintain connection pools with connections to multiple nodes in the
cluster, and then to spread requests across those connections.  Should
a node begin returning errors (for example, because it is overloaded),
clients can remove it from rotation.

On Sat, Aug 28, 2010 at 11:27 AM, Mark static.void@gmail.com wrote:
  On 8/28/10 11:20 AM, Benjamin Black wrote:

 no and no.

 On Sat, Aug 28, 2010 at 10:28 AM, Mark static.void@gmail.com wrote:

  I will be loadbalancing between nodes using HAProxy. Is this
 recommended?

 Also is there a some sort of ping/health check uri available?

 Thanks

  Any reason why load balancing client connections using HAProxy isn't
  recommended?



Re: Cassandra HAProxy

2010-08-28 Thread Benjamin Black
munin is the simplest thing.  There are numerous JMX stats of interest.

As a symmetric distributed system, you should not expect to monitor
Cassandra like you would a web server.  Intelligent clients use
connection pools and react to current node behavior in making choices
of where to send requests, including using describe_ring to discover
nodes and open new connections as needed.
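
For the discovery piece, a minimal sketch of the describe_ring call, against
the 0.6/0.7 Thrift API (imports omitted; 'client' is an already-opened
Cassandra.Client and Keyspace1 is a placeholder):

    // Ask any one node for the full ring layout, then open direct connections
    // to the endpoints it reports instead of funneling everything through an LB.
    List<TokenRange> ring = client.describe_ring("Keyspace1");
    for (TokenRange range : ring) {
        System.out.println(range.getStart_token() + " -> " + range.getEnd_token()
                + " owned by " + range.getEndpoints());
    }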

On Sat, Aug 28, 2010 at 11:29 AM, Mark static.void@gmail.com wrote:
  On 8/28/10 11:20 AM, Benjamin Black wrote:

 no and no.

 On Sat, Aug 28, 2010 at 10:28 AM, Mark static.void@gmail.com wrote:

  I will be loadbalancing between nodes using HAProxy. Is this
 recommended?

 Also is there a some sort of ping/health check uri available?

 Thanks

 Also, what would be a good way of monitoring the health of the cluster?



Re: Cassandra HAProxy

2010-08-28 Thread Benjamin Black
On Sat, Aug 28, 2010 at 2:34 PM, Anthony Molinaro
antho...@alumni.caltech.edu wrote:
 I think maybe he thought you meant put a layer between cassandra internal
 communication.

No, I took the question to be about client connections.

 There's no problem balancing client connections with
 haproxy, we've been pushing several billion requests per month through
 haproxy to cassandra.


Can it be done: yes.  Is it best practice: no.  Even 10 billion
requests/month is an average of less than 4000 reqs/sec.   Just not
that many for a distributed database like Cassandra.

 we use

  mode tcp
  balance leastconn
  server local 127.0.0.1:12350 check

 so basically just a connect based check, and it works fine


Cassandra can, and does, fail in ways that do not stop it from
answering TCP connection requests.  Are you saying it works fine
because you have seen numerous types of node failures and this was
sufficient?  I would be quite surprised if that were so.  Using an LB
for service discovery is a fine thing (connect to a VIP, call
describe_ring, open direct connections to cluster nodes).  Relying on
an LB to do the right thing when it is totally ignorant of what is
going across those client connections (as is implied by simply
checking for connectivity) is asking for trouble.  Doubly so when you
use a leastconn policy (a failing node can spit out an error and close
a connection with impressive speed, sucking all the traffic to itself;
common problem with HTTP servers giving back errors).


b


Re: Benchmarking Cassandra 0.6.5 with YCSB client ... drags to a halt

2010-08-28 Thread Benjamin Black
cassandra.in.sh?
storage-conf.xml?
output of iostat -x while this is going on?
turn GC log level to debug?

On Sat, Aug 28, 2010 at 2:02 PM, Fernando Racca fra...@gmail.com wrote:
 Hi,
  I'm currently executing some benchmarks against 0.6.5, which I plan to
  compare against 0.7-beta1, using the YCSB client.
  I'm experiencing some strange behaviour when running a small 2-node cluster
  using OrderPreservingPartitioner. Does anybody have any experience using
  the client to generate load?
  It's the first benchmark that I've tried, so I'm probably doing something dumb.
  A detailed post with screenshots of the VM and CPU history can be seen in
  this post: http://quantleap.blogspot.com/2010/08/cassandra-065-benchmarking-first-run.html
  I would very much appreciate your help since I'm doing these benchmarks as
  part of my master's dissertation.
 A previous official benchmark is documented
 here http://research.yahoo.com/files/ycsb-v4.pdf
 Thanks!
 Fernando Racca


Re: Benchmarking Cassandra 0.6.5 with YCSB client ... drags to a halt

2010-08-28 Thread Benjamin Black
 MESSAGE-STREAMING-POOL            0         0
  INFO 23:56:20,618 LOAD-BALANCER-STAGE               0         0
  INFO 23:56:20,625 FLUSH-SORTER-POOL                 0         0

 the problem seems to be with the second node...
 any ideas?
 On 28 August 2010 22:49, Benjamin Black b...@b3k.us wrote:

 cassandra.in.sh?
 storage-conf.xml?
 output of iostat -x while this is going on?
 turn GC log level to debug?

 On Sat, Aug 28, 2010 at 2:02 PM, Fernando Racca fra...@gmail.com wrote:
  Hi,
  I'm currently executing some benchmarks against 0.6.5, which i plan to
  compare against 0.7-beta1, using the YCSB client
  I'm experiencing some strange behaviour when running a small 2 nodes
  cluster
  using OrderPreservingPartitioner. Does anybody have any experience on
  using
  the client to generate load?
  It's the first benchmark that i try so i'm probably doing something
  dumb.
  A detailed post with screenshots of the VM and CPU history can be seen
  in
  this
 
  post.http://quantleap.blogspot.com/2010/08/cassandra-065-benchmarking-first-run.html
  I would very much appreciate your help since i'm doing this benchmarks
  as
  part of my master's dissertation
  A previous official benchmark is documented
  here http://research.yahoo.com/files/ycsb-v4.pdf
  Thanks!
  Fernando Racca




Re: Follow-up post on cassandra configuration with some experiments on GC tuning

2010-08-27 Thread Benjamin Black
ecapriolo's testing seemed to indicate it _did_ change the behavior.
wonder what the difference is?

On Fri, Aug 27, 2010 at 6:23 AM, Mikio Braun mi...@cs.tu-berlin.de wrote:

 Dear all,

 thanks for your comments, and I'm glad that you found my post helpful.

 Concerning the incremental CMS, I've recently updated my post and added
 the experiments repeated on one of our cluster nodes, and for some
 reason incremental CMS doesn't look that different anymore. So I guess
 it's ok to stick with the non-incremental CMS for now.

 -M

 On 08/27/2010 09:12 AM, Peter Schuller wrote:
 Whether or not this is likely to happen with Cassandra I don't know. I
 don't know much about the incremental duty cycles are scheduled and it
 may be the case that Cassandra is not even remotely close to having a
 problem with incremental mode.

 I should further weaken my statement by pointing out that I never did
 any exhaustive tweaking to get around the problem (other than
 disabling incremental mode, since my primary goal has tended to be
 ensure low pause times and not so much even GC activity). It may be
 the case that even in stressful cases where it fails by default it is
 simply a matter of tweaking.

 So, I guess I should re-phrase: In terms of just turning on
 incremental mode without at least application specific tweaking (if
 not deployment specific testing), I would suggest caution.




 - --
 Dr. Mikio Braun                        email: mi...@cs.tu-berlin.de
 TU Berlin                              web: ml.cs.tu-berlin.de/~mikio
 Franklinstr. 28/29                     tel: +49 30 314 78627
 10587 Berlin, Germany



 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.9 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

 iEYEARECAAYFAkx3vFUACgkQtnXKX8rQtgDUlgCfWb/euA2mgVJAWDY2tBSyAN+I
 604AoKVua1+5bYK2yF9CWwFQmLHDt0Fn
 =CIal
 -END PGP SIGNATURE-



Re: is it my cassandra cluster ok?

2010-08-26 Thread Benjamin Black
No, it means manually assigning tokens so the ring range is evenly
distributed across the existing nodes.
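
For illustration, a minimal sketch of computing evenly spaced initial tokens
for RandomPartitioner (the node count here is made up; tokens live in the
0..2^127 range):

    import java.math.BigInteger;

    public class EvenTokens {
        public static void main(String[] args) {
            int nodeCount = 4; // illustrative cluster size
            // RandomPartitioner tokens live in [0, 2^127)
            BigInteger ringSize = BigInteger.valueOf(2).pow(127);
            for (int i = 0; i < nodeCount; i++) {
                // evenly spaced: token_i = i * 2^127 / nodeCount
                BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                           .divide(BigInteger.valueOf(nodeCount));
                System.out.println("node " + i + ": " + token);
            }
        }
    }

The resulting values can then be assigned one per node, e.g. as InitialToken in
storage-conf.xml before first start, or with nodetool move on a running node.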

On Wed, Aug 25, 2010 at 7:29 PM, john xie shanfengg...@gmail.com wrote:
 load balancing?  is it  means add more nodes?


 2010/8/26 Ryan King r...@twitter.com

 Looks like you need to do some load balancing.

 -ryan

 On Wed, Aug 25, 2010 at 12:33 AM, john xie shanfengg...@gmail.com wrote:
  /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.100 ring
  Address   Status Load  Range
   Ring
 
  162027259805094200094770502377853667196
  192.168.123.101Up 183.43 GB
  26404162423947656621914545677405489813 |--|
  192.168.123.5 Up 196.18 GB
  97646479029625162367516203572215570207 |   |
  192.168.123.100Up 302.86 GB
  150826772797302282411816801037163789836|   |
  192.168.123.102Up 235.83 GB
  162027259805094200094770502377853667196|--|
 
 
  /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.5 tpstats
  Pool NameActive   Pending  Completed
  FILEUTILS-DELETE-POOL 0 0610
  STREAM-STAGE  0 0  0
  RESPONSE-STAGE0 0   18879316
  ROW-READ-STAGE0 0  0
  LB-OPERATIONS 0 0  0
  MESSAGE-DESERIALIZER-POOL 0 0   86925654
  GMFD  0 0 168769
  LB-TARGET 0 0  0
  CONSISTENCY-MANAGER   0 0  0
  ROW-MUTATION-STAGE0 0   66657550
  MESSAGE-STREAMING-POOL0 0  0
  LOAD-BALANCER-STAGE   0 0  0
  FLUSH-SORTER-POOL 0 0  0
  MEMTABLE-POST-FLUSHER 0 0   1125
  FLUSH-WRITER-POOL 0 0   1125
  AE-SERVICE-STAGE  0 0  0
  HINTED-HANDOFF-POOL   1 4  0
   /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.5 info
  97646479029625162367516203572215570207
  Load : 196.01 GB
  Generation No: 1282656715
  Uptime (seconds) : 64437
  Heap Memory (MB) : 2245.43 / 5111.69
 
  /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.100 tpstats
  Pool NameActive   Pending  Completed
  FILEUTILS-DELETE-POOL 0 0950
  STREAM-STAGE  0 0  0
  RESPONSE-STAGE0 0   88290400
  ROW-READ-STAGE0 0  0
  LB-OPERATIONS 0 0  0
  MESSAGE-DESERIALIZER-POOL 0 0  149317269
  GMFD  0 0 187571
  LB-TARGET 0 0  0
  CONSISTENCY-MANAGER   0 0  0
  ROW-MUTATION-STAGE0 0  104055920
  MESSAGE-STREAMING-POOL0 0  0
  LOAD-BALANCER-STAGE   0 0  0
  FLUSH-SORTER-POOL 0 0  0
  MEMTABLE-POST-FLUSHER 0 0   1749
  FLUSH-WRITER-POOL 0 0   1749
  AE-SERVICE-STAGE  0 0  0
  HINTED-HANDOFF-POOL   1 4 17
   /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.100 info
  150826772797302282411816801037163789836
  Load : 302.79 GB
  Generation No: 1282656138
  Uptime (seconds) : 65439
  Heap Memory (MB) : 1854.20 / 6135.69
  /opt/apache-cassandra-0.6.4/bin/nodetool --host 192.168.123.101 tpstats
  Pool NameActive   Pending  Completed
  FILEUTILS-DELETE-POOL 0 0594
  STREAM-STAGE  0 0  0
  RESPONSE-STAGE0 0  120024993
  ROW-READ-STAGE0 0  0
  LB-OPERATIONS 0 0  0
  MESSAGE-DESERIALIZER-POOL 0 0  154158111
  GMFD  0 0 193471
  LB-TARGET 0 0  0
  CONSISTENCY-MANAGER   0 0  0
  ROW-MUTATION-STAGE0 0   64174075
  MESSAGE-STREAMING-POOL0 0  0
  LOAD-BALANCER-STAGE   0 0  0
  FLUSH-SORTER-POOL 0 0  0
  MEMTABLE-POST-FLUSHER 0 0   1091
  FLUSH-WRITER-POOL 0 0   1091
  AE-SERVICE-STAGE  0 0  0
  HINTED-HANDOFF-POOL   1

Re: Repair help

2010-08-26 Thread Benjamin Black
recommend testing the waters on release software (0.6.x), not beta.

On Thu, Aug 26, 2010 at 2:53 PM, Mark static.void@gmail.com wrote:
  I have a 2 node cluster  (testing the waters) w/ a replication factor of 2.
 One node got completed screwed up (see any of my previous messages from
 today) so I deleted the commit log and data directory. I restarted the node
 and rain nodetool repair as describe in
 http://wiki.apache.org/cassandra/Operations. I waited for over an hour and
 checked my ring only to find that nothing was repaired/replicated??? I only
 have a mere 7gigs of data so I would have thought this would have been
 fairly quick?

 Address         Status State   Load            Token

 129447565151094499156612104441060791022
 x.x.x.x   Up     Normal  7.31 GB
 12949228055906550350782255148181029323
 x.x.x.x   Up     Normal  30.01 MB
  129447565151094499156612104441060791022

 I tried the alternative method of manually removing the token and then
 bootstrapping however when I tried to remove the token via nodetool
 removetoken an IllegalStateException was thrown... replication factor (2)
 exceeds number of endpoints (1)

 What should I do in this situation to get my node back up to where it should
 be? Is there anywhere I can check that the repair is actually running?

 Thanks for any suggestions

 ps I'm using 0.7.0 beta 1






Re: Follow-up post on cassandra configuration with some experiments on GC tuning

2010-08-26 Thread Benjamin Black
imo, these should be part of the defaults.

On Tue, Aug 24, 2010 at 8:29 AM, Mikio Braun mi...@cs.tu-berlin.de wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Dear all,

 thanks again for all the comments I got on my last post. I've played a
 bit with different GC settings and got my Cassandra instance to run
 very nicely with 8GB of heap.

 I summarized my experiences with GC tuning in this follow-up post:

 http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html

 - -M

 - --
 Dr. Mikio Braun                        email: mi...@cs.tu-berlin.de
 TU Berlin                              web: ml.cs.tu-berlin.de/~mikio
 Franklinstr. 28/29                     tel: +49 30 314 78627
 10587 Berlin, Germany



 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.9 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

 iEYEARECAAYFAkxz5WcACgkQtnXKX8rQtgDiiwCeLknuTcr65eehwIcsivInjv4W
 LaQAn3RY9pH19r8SuUhVBvtE6LeyFUvB
 =MYsY
 -END PGP SIGNATURE-



Re: get_slice slow

2010-08-25 Thread Benjamin Black
Todd,

This is a really bad idea.  What you are likely doing is spreading
that single row across a large number of sstables.  The more columns
you insert, the more sstables you are likely inspecting, the longer
the get_slice operations will take.  You can test whether this is so
by running nodetool compact when things start slowing down.  If it
speeds up, that is likely the problem.  If you are deleting that much,
you should also tune GCGraceSeconds way down (from the default of 10
days) so the space is reclaimed on major compaction and, again, there
are fewer things to inspect.

Long rows written over long periods of time are almost certain to give
worse read performance, even far worse, than rows written all at once.


b
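
For context, a rough sketch of the pop pattern under discussion, written
against the 0.7 Thrift interface (keyspace, column family, and row key names
are illustrative, and exact generated-API details may differ between 0.7 betas):

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class QueuePop {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            transport.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            client.set_keyspace("Keyspace1");                 // illustrative keyspace

            ByteBuffer rowKey = ByteBuffer.wrap("queue-1".getBytes("UTF-8"));
            ColumnParent parent = new ColumnParent("Queue");  // illustrative column family

            // Empty start/finish with count=1 returns the first column in comparator
            // order (the oldest TimeUUID); reversed=true would return the newest.
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(
                    ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1));

            List<ColumnOrSuperColumn> result =
                    client.get_slice(rowKey, parent, predicate, ConsistencyLevel.QUORUM);

            if (!result.isEmpty()) {
                Column oldest = result.get(0).column;
                // The delete leaves a tombstone in the row; tombstones are only purged
                // after GCGraceSeconds has elapsed and a major compaction runs, which
                // is why a heavily popped queue row gets slower and slower to slice.
                ColumnPath path = new ColumnPath("Queue");
                path.column = oldest.name;
                client.remove(rowKey, path, System.currentTimeMillis() * 1000,
                        ConsistencyLevel.QUORUM);
            }

            transport.close();
        }
    }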

On Tue, Aug 24, 2010 at 10:17 PM, B. Todd Burruss bburr...@real.com wrote:
 thx artie,

 i haven't used a super CF because i thought it has more trouble doing slices
 because the entire row must be deserialized to get to the subcolumn you
 want?

 iostat is nothing, 0.0.  i have plenty of RAM and the OS is I/O caching
 nicely

 i haven't used the key cache, because i only have one key, the row of the
 queue ;)

 i haven't used row cache because i need the row to grow quite large,
 millions of columns.  and the size of data could be arbitrary - right now i
 am testing with < 32 byte values per column.

 i do need quorum consistency.

 i have read previous that some folks are using a single row with millions of
 columns.  is anyone using get_slice to pick off the first or the last column
 in the row?

 On 08/24/2010 09:25 PM, Artie Copeland wrote:

 Have you tried using a super column, it seems that having a row with over
 100K columns and growing would be alot for cassandra to deserialize?  what
 is iostat and jmeter telling you? it would be interesting to see that data.
  also what are you using for you key or row caching?  do you need to use a
 quorum consistency as that can slow down reads as well, can you use a lower
 consistency level?

 Artie
 On Tue, Aug 24, 2010 at 9:14 PM, B. Todd Burruss bburr...@real.com wrote:

 i am using get_slice to pull columns from a row to emulate a queue.
  column names are TimeUUID and the values are small, < 32 bytes.  simple
 ColumnFamily.

 i am using SlicePredicate like this to pull the first (oldest) column in
 the row:

        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(new byte[] {}, new byte[] {}, false, 1));

        get_slice(rowKey, colParent, predicate, QUORUM);

 once i get the column i remove it.  so there are a lot of gets and
 mutates, leaving lots of deleted columns.

 get_slice starts off performing just fine, but then falls off dramatically
 as the number of columns grows.  at its peak there are 100,000 columns and
 get_slice is taking over 100ms to return.

 i am running a single instance of cassandra 0.7 on localhost, default
 config.  i've done some googling and can't find any tweaks or tuning
 suggestions specific to get_slice.  i already know about separating
 commitlog and data, watching iostat, GC, etc.

 any low hanging tuning fruit anyone can think of?  in 0.6 i recall an
 index for columns, maybe that is what i need?

 thx



 --
 http://yeslinux.org
 http://yestech.org



Re: Does the scan speed with CL.ALL is faster than CL.QUORUM and CL.ONE?

2010-08-25 Thread Benjamin Black
Did you run the tests in this order without changing anything but CL?
You may be seeing the effects of OS page caching.  Run then in the
reverse order and see if the difference persists.

On Tue, Aug 24, 2010 at 11:52 PM, ring_ayumi_king
ring_ayumi_k...@yahoo.com.tw wrote:
 Hi all,

 I ran my benchmark(OPP via get_range_slices) and found the following:
 Why does the scan speed with CL.ALL is faster than CL.QUORUM and CL.ONE?

 CL.ONE (1k per row, Count:5)
 scan :11095 ms
 scan per:0.2219 ms
 scan thput:4506.5347 ops/sec

 CL.QUORUM
 scan :11072 ms
 scan per:0.22144 ms
 scan thput:4515.896 ops/sec

 CL.ALL
 scan :7869 ms
 scan per:0.15738 ms
 scan thput:6354.0474 ops/sec

 Thanks.

 Shen






Re: Cassandra and Lucene

2010-08-25 Thread Benjamin Black
Please put your storage-conf.xml and cassandra.in.sh files on
pastie/dpaste/gist and send the link.

(moving it back to the user list again)

On Sun, Jul 25, 2010 at 11:51 PM, Michelan Arendse miche...@hermanus.cc wrote:
 I have 2 seeds in my cluster, with a replication of 2. I am using cassandra
 0.6.2.

 It keeps running out of memory so I don't know if there are some memory
 leaks.
 This is what is in the log:

        at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        ... 2 more
 ERROR [GC inspection] 2010-07-22 18:41:57,157 CassandraDaemon.java (line 78)
 Fatal exception in thread Thread[GC inspection,5,main]
 java.lang.OutOfMemoryError: Java heap space
        at java.util.AbstractList.iterator(AbstractList.java:273)
        at
 org.apache.cassandra.service.GCInspector.logIntervalGCStats(GCInspector.java:82)
        at
 org.apache.cassandra.service.GCInspector.access$000(GCInspector.java:38)
        at
 org.apache.cassandra.service.GCInspector$1.run(GCInspector.java:74)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)


 On Mon, Jul 26, 2010 at 2:14 AM, Aaron Morton aa...@thelastpickle.comwrote:

 You may need to provide a some more information. What's the cluster
 configuration, what version, what's in the logs etc.

 Aaron


 On 24 Jul, 2010,at 03:40 AM, Michelan Arendse miche...@hermanus.cc
 wrote:

 Hi

 I have recently started working on Cassandra as I need to make a distribute
 Lucene index and found that Lucandra was the best for this. Since then I
 have configured everything and it's working ok.

 Now the problem comes in when I need to write this Lucene index to
 Cassandra
 or convert it so that Cassandra can read it. The test index is 32 gigs and
 i
 find that Cassandra times out alot.

 What happens can't Cassandra take that load? Please any help will be great.

 Kind Regards,





Re: Node OOM Problems

2010-08-22 Thread Benjamin Black
How much storage do you need?  240G SSDs quite capable of saturating a
3Gbps SATA link are $600.  Larger ones are also available with similar
performance.  Perhaps you could share a bit more about the storage and
performance requirements.  How needing SSDs to sustain 10k writes/sec PER NODE
WITH LINEAR SCALING breaks down the commodity server concept eludes
me.


b

On Sat, Aug 21, 2010 at 11:27 PM, Wayne wav...@gmail.com wrote:
 Thank you for the advice, I will try these settings. I am running defaults
 right now. The disk subsystem is one SATA disk for commitlog and 4 SATA
 disks in raid 0 for the data.

 From your email you are implying this hardware can not handle this level of
 sustained writes? That kind of breaks down the commodity server concept for
 me. I have never used anything but a 15k SAS disk (fastest disk money could
 buy until SSD) ALWAYS with a database. I have tried to throw out that
 mentality here but are you saying nothing has really changed/ Spindles
 spindles spindles as fast as you can afford is what I have always known...I
 guess that applies here? Do I need to spend $10k per node instead of $3.5k
 to get SUSTAINED 10k writes/sec per node?



 On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black b...@b3k.us wrote:

 My guess is that you have (at least) 2 problems right now:

 You are writing 10k ops/sec to each node, but have default memtable
 flush settings.  This is resulting in memtable flushing every 30
 seconds (default ops flush setting is 300k).  You thus have a
 proliferation of tiny sstables and are seeing minor compactions
 triggered every couple of minutes.

 You have started a major compaction which is now competing with those
 near constant minor compactions for far too little I/O (3 SATA drives
 in RAID0, perhaps?).  Normally, this would result in a massive
 ballooning of your heap use as all sorts of activities (like memtable
 flushes) backed up, as well.

 I suggest you increase the memtable flush ops to at least 10 (million)
 if you are going to sustain that many writes/sec, along with an
 increase in the flush MB to match, based on your typical bytes/write
 op.  Long term, this level of write activity demands a lot faster
 storage (iops and bandwidth).


 b
 On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote:
  I am already running with those options. I thought maybe that is why
  they
  never get completed as they keep pushed pushed down in priority? I am
  getting timeouts now and then but for the most part the cluster keeps
  running. Is it normal/ok for the repair and compaction to take so long?
  It
  has been over 12 hours since they were submitted.
 
  On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com
  wrote:
 
  yes, the AES is the repair.
 
  if you are running linux, try adding the options to reduce compaction
  priority from
  http://wiki.apache.org/cassandra/PerformanceTuning
 
  On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote:
   I could tell from munin that the disk utilization was getting crazy
   high,
   but the strange thing is that it seemed to stall. The utilization
   went
   way
   down and everything seemed to flatten out. Requests piled up and the
   node
   was doing nothing. It did not crash but was left in a useless
   state. I
   do
   not have access to the tpstats when that occurred. Attached is the
   munin
   chart, and you can see the flat line after Friday at noon.
  
   I have reduced the writers from 10 per to 8 per node and they seem to
   be
   still running, but I am afraid they are barely hanging on. I ran
   nodetool
   repair after rebooting the failed node and I do not think the repair
   ever
   completed. I also later ran compact on each node and some it finished
   but
   some it did not. Below is the tpstats currently for the node I had to
   restart. Is the AE-SERVICE-STAGE the repair and compaction queued up?
   It
   seems several nodes are not getting enough free cycles to keep up.
   They
   are
   not timing out (30 sec timeout) for the most part but they are also
   not
   able
   to compact. Is this normal? Do I just give it time? I am migrating
   2-3
   TB of
   data from Mysql so the load is constant and will be for days and it
   seems
   even with only 8 writer processes per node I am maxed out.
  
   Thanks for the advice. Any more pointers would be greatly
   appreciated.
  
   Pool Name    Active   Pending  Completed
   FILEUTILS-DELETE-POOL 0 0   1868
   STREAM-STAGE  1 1  2
   RESPONSE-STAGE    0 2  769158645
   ROW-READ-STAGE    0 0 140942
   LB-OPERATIONS 0 0  0
   MESSAGE-DESERIALIZER-POOL 1 0 1470221842
   GMFD  0 0 169712
   LB-TARGET 0 0  0
   CONSISTENCY-MANAGER

Re: Node OOM Problems

2010-08-22 Thread Benjamin Black
I see no reason to make that assumption.  Cassandra currently has no
mechanism to alternate in that manner.  At the update rate you
require, you just need more disk I/O (bandwidth and iops).
Alternatively, you could use a bunch more, smaller nodes with the same
SATA RAID setup so they each take many fewer writes/sec, and so can
keep up with compaction.

On Sun, Aug 22, 2010 at 12:00 AM, Wayne wav...@gmail.com wrote:
 Due to compaction being so expensive in terms of disk resources, does it
 make more sense to have 2 data volumes instead of one? We have 4 data disks
 in raid 0, would this make more sense to be 2 x 2 disks in raid 0? That way
 the reader and writer I assume would always be a different set of spindles?

 On Sun, Aug 22, 2010 at 8:27 AM, Wayne wav...@gmail.com wrote:

 Thank you for the advice, I will try these settings. I am running defaults
 right now. The disk subsystem is one SATA disk for commitlog and 4 SATA
 disks in raid 0 for the data.

 From your email you are implying this hardware can not handle this level
 of sustained writes? That kind of breaks down the commodity server concept
 for me. I have never used anything but a 15k SAS disk (fastest disk money
 could buy until SSD) ALWAYS with a database. I have tried to throw out that
 mentality here but are you saying nothing has really changed/ Spindles
 spindles spindles as fast as you can afford is what I have always known...I
 guess that applies here? Do I need to spend $10k per node instead of $3.5k
 to get SUSTAINED 10k writes/sec per node?



 On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black b...@b3k.us wrote:

 My guess is that you have (at least) 2 problems right now:

 You are writing 10k ops/sec to each node, but have default memtable
 flush settings.  This is resulting in memtable flushing every 30
 seconds (default ops flush setting is 300k).  You thus have a
 proliferation of tiny sstables and are seeing minor compactions
 triggered every couple of minutes.

 You have started a major compaction which is now competing with those
 near constant minor compactions for far too little I/O (3 SATA drives
 in RAID0, perhaps?).  Normally, this would result in a massive
 ballooning of your heap use as all sorts of activities (like memtable
 flushes) backed up, as well.

 I suggest you increase the memtable flush ops to at least 10 (million)
 if you are going to sustain that many writes/sec, along with an
 increase in the flush MB to match, based on your typical bytes/write
 op.  Long term, this level of write activity demands a lot faster
 storage (iops and bandwidth).


 b
 On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote:
  I am already running with those options. I thought maybe that is why
  they
  never get completed as they keep pushed pushed down in priority? I am
  getting timeouts now and then but for the most part the cluster keeps
  running. Is it normal/ok for the repair and compaction to take so long?
  It
  has been over 12 hours since they were submitted.
 
  On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com
  wrote:
 
  yes, the AES is the repair.
 
  if you are running linux, try adding the options to reduce compaction
  priority from
  http://wiki.apache.org/cassandra/PerformanceTuning
 
  On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote:
   I could tell from munin that the disk utilization was getting crazy
   high,
   but the strange thing is that it seemed to stall. The utilization
   went
   way
   down and everything seemed to flatten out. Requests piled up and the
   node
   was doing nothing. It did not crash but was left in a useless
   state. I
   do
   not have access to the tpstats when that occurred. Attached is the
   munin
   chart, and you can see the flat line after Friday at noon.
  
   I have reduced the writers from 10 per to 8 per node and they seem
   to be
   still running, but I am afraid they are barely hanging on. I ran
   nodetool
   repair after rebooting the failed node and I do not think the repair
   ever
   completed. I also later ran compact on each node and some it
   finished
   but
   some it did not. Below is the tpstats currently for the node I had
   to
   restart. Is the AE-SERVICE-STAGE the repair and compaction queued
   up?
   It
   seems several nodes are not getting enough free cycles to keep up.
   They
   are
   not timing out (30 sec timeout) for the most part but they are also
   not
   able
   to compact. Is this normal? Do I just give it time? I am migrating
   2-3
   TB of
   data from Mysql so the load is constant and will be for days and it
   seems
   even with only 8 writer processes per node I am maxed out.
  
   Thanks for the advice. Any more pointers would be greatly
   appreciated.
  
   Pool Name    Active   Pending  Completed
   FILEUTILS-DELETE-POOL 0 0   1868
   STREAM-STAGE  1 1  2
   RESPONSE-STAGE    0

Re: Node OOM Problems

2010-08-22 Thread Benjamin Black
Is the need for 10k/sec/node just for bulk loading of data or is it
how your app will operate normally?  Those are very different things.

On Sun, Aug 22, 2010 at 4:11 AM, Wayne wav...@gmail.com wrote:
 Currently each node has 4x1TB SATA disks. In MySQL we have 15tb currently
 with no replication. To move this to Cassandra replication factor 3 we need
 45TB assuming the space usage is the same, but it is probably more. We had
 assumed a 30 node cluster with 4tb per node would suffice with head room for
 compaction and to growth (120 TB).

 SSD drives for 30 nodes in this size range are not cost feasible for us. We
 can try to use 15k SAS drives and have more spindles but then our per node
 cost goes up. I guess I naively thought cassandra would do its magic and a
 few commodity SATA hard drives would be fine.

 Our performance requirement does not need 10k writes/node/sec 24 hours a
 day, but if we can not get really good performance the switch from MySQL
 becomes harder to rationalize. We can currently restore from a MySQL dump a
 2.5 terabyte backup (plain old insert statements) in 4-5 days. I expect as
 much or more from cassandra and I feel years away from simply loading 2+tb
 into cassandra without so many issues.

 What is really required in hardware for a 100+tb cluster with near 10k/sec
 write performance sustained? If the answer is SSD what can be expected from
 15k SAS drives and what from SATA?

 Thank you for your advice, I am struggling with how to make this work. Any
 insight you can provide would be greatly appreciated.



 On Sun, Aug 22, 2010 at 8:58 AM, Benjamin Black b...@b3k.us wrote:

 How much storage do you need?  240G SSDs quite capable of saturating a
 3Gbps SATA link are $600.  Larger ones are also available with similar
 performance.  Perhaps you could share a bit more about the storage and
 performance requirements.  How SSDs to sustain 10k writes/sec PER NODE
 WITH LINEAR SCALING breaks down the commodity server concept eludes
 me.


 b

 On Sat, Aug 21, 2010 at 11:27 PM, Wayne wav...@gmail.com wrote:
  Thank you for the advice, I will try these settings. I am running
  defaults
  right now. The disk subsystem is one SATA disk for commitlog and 4 SATA
  disks in raid 0 for the data.
 
  From your email you are implying this hardware can not handle this level
  of
  sustained writes? That kind of breaks down the commodity server concept
  for
  me. I have never used anything but a 15k SAS disk (fastest disk money
  could
  buy until SSD) ALWAYS with a database. I have tried to throw out that
  mentality here but are you saying nothing has really changed/ Spindles
  spindles spindles as fast as you can afford is what I have always
  known...I
  guess that applies here? Do I need to spend $10k per node instead of
  $3.5k
  to get SUSTAINED 10k writes/sec per node?
 
 
 
  On Sat, Aug 21, 2010 at 11:03 PM, Benjamin Black b...@b3k.us wrote:
 
  My guess is that you have (at least) 2 problems right now:
 
  You are writing 10k ops/sec to each node, but have default memtable
  flush settings.  This is resulting in memtable flushing every 30
  seconds (default ops flush setting is 300k).  You thus have a
  proliferation of tiny sstables and are seeing minor compactions
  triggered every couple of minutes.
 
  You have started a major compaction which is now competing with those
  near constant minor compactions for far too little I/O (3 SATA drives
  in RAID0, perhaps?).  Normally, this would result in a massive
  ballooning of your heap use as all sorts of activities (like memtable
  flushes) backed up, as well.
 
  I suggest you increase the memtable flush ops to at least 10 (million)
  if you are going to sustain that many writes/sec, along with an
  increase in the flush MB to match, based on your typical bytes/write
  op.  Long term, this level of write activity demands a lot faster
  storage (iops and bandwidth).
 
 
  b
  On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote:
   I am already running with those options. I thought maybe that is why
   they
   never get completed as they keep pushed pushed down in priority? I am
   getting timeouts now and then but for the most part the cluster keeps
   running. Is it normal/ok for the repair and compaction to take so
   long?
   It
   has been over 12 hours since they were submitted.
  
   On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com
   wrote:
  
   yes, the AES is the repair.
  
   if you are running linux, try adding the options to reduce
   compaction
   priority from
   http://wiki.apache.org/cassandra/PerformanceTuning
  
   On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote:
I could tell from munin that the disk utilization was getting
crazy
high,
but the strange thing is that it seemed to stall. The
utilization
went
way
down and everything seemed to flatten out. Requests piled up and
the
node
was doing nothing. It did not crash

Re: Node OOM Problems

2010-08-22 Thread Benjamin Black
Wayne,

Bulk loading this much data is a very different prospect from needing
to sustain that rate of updates indefinitely.  As was suggested
earlier, you likely need to tune things differently, including
disabling minor compactions during the bulk load, to make this work
efficiently.


b

On Sun, Aug 22, 2010 at 12:40 PM, Wayne wav...@gmail.com wrote:
 Has anyone loaded 2+ terabytes of real data in one stretch into a cluster
 without bulk loading and without any problems? How long did it take? What
 kind of nodes were used? How many writes/sec/node can be sustained for 24+
 hours?



 On Sun, Aug 22, 2010 at 8:22 PM, Peter Schuller
 peter.schul...@infidyne.com wrote:

 I only sifted recent history of this thread (for time reasons), but:

  You have started a major compaction which is now competing with those
  near constant minor compactions for far too little I/O (3 SATA drives
  in RAID0, perhaps?).  Normally, this would result in a massive
  ballooning of your heap use as all sorts of activities (like memtable
  flushes) backed up, as well.

 AFAIK memtable flushing is unrelated to compaction in the sense that
 they occur concurrently and don't block each other (except to the
 extent that they truly do compete for e.g. disk or CPU resources).

 While small memtables do indeed mean more compaction activity in
 total, the expensiveness of any given compaction should not be
 severely affecting.

 As far as I can tell, the two primary effects of small memtable sizes are:

 * An increase in total amount of compaction work done in total for a
 given database size.
 * An increase in the number of sstables that may accumulate while
 larger compactions are running.
 ** That in turn is particularly relevant because it can generate a lot
 of seek-bound activity; consider for example range queries that end up
 spanning 10 000 files on disk.

 If memtable flushes are not able to complete fast enough to cope with
 write activity, even if that is the case only during concurrenct
 compaction (for whatever reason), that suggests to me that write
 activity is too high. Increasing memtable sizes may help on average
 due to decreased compaction work, but I don't see why it would
 significantly affect the performance one compactions *do* in fact run.

 With respect to timeouts on writes: I make no claims as to whether it
 is expected, because I have not yet investigated, but I definitely see
 sporadic slowness when benchmarking high-throughput writes on a
 cassandra trunk snapshot somewhere between 0.6 and 0.7. This occurs
 even when writing to a machine where the commit log and data
 directories are both on separate RAID volumes that are battery backed
 and should have no trouble eating write bursts (and the data is such
 that one is CPU bound  rather than diskbound on average; so it only
 needs to eat bursts).

 I've had to add re-try to the benchmarking tool (or else up the
 timeout) because the default was not enough.

 I have not investigated exactly why this happens but it's an
 interesting effect that as far as I can tell should not be there.
 Haver other people done high-throughput writes (to the point of CPU
 saturation) over extended periods of time while consistently seeing
 low latencies (consistencty meaning never exceeding hundreds of ms
 over several days)?


 --
 / Peter Schuller




Re: Node OOM Problems

2010-08-22 Thread Benjamin Black
On Sun, Aug 22, 2010 at 2:03 PM, Wayne wav...@gmail.com wrote:
 From a testing whether cassandra can take the load long term I do not see it
 as different. Yes bulk loading can be made faster using very different

Then you need far more I/O, whether it comes from faster drives or more
nodes.  If you can achieve 10k writes/sec/node and linear scaling
without sharding in MySQL on cheap, commodity hardware then I am
impressed.

 methods, but my purpose is to test cassandra with a large volume of writes
 (and not to bulk load as efficiently as possible). I have scaled back to 5
 writer threads per node and still see 8k writes/sec/node. With the larger
 memory table settings we shall see how it goes. I have no idea how to change
 a JMX setting and prefer to use std options to be frank. For us this is

If you want best performance, you must tune the system appropriately.
If you want to use the base settings (which are intended for the 1G
max heap which is way too small for anything interesting), expect
suboptimal performance for your application.

 after all an evaluation of whether Cassandra can replace Mysql.

 I thank everyone for their help.

 On Sun, Aug 22, 2010 at 10:37 PM, Benjamin Black b...@b3k.us wrote:

 Wayne,

 Bulk loading this much data is a very different prospect from needing
 to sustain that rate of updates indefinitely.  As was suggested
 earlier, you likely need to tune things differently, including
 disabling minor compactions during the bulk load, to make this work
 efficiently.


 b

 On Sun, Aug 22, 2010 at 12:40 PM, Wayne wav...@gmail.com wrote:
  Has anyone loaded 2+ terabytes of real data in one stretch into a
  cluster
  without bulk loading and without any problems? How long did it take?
  What
  kind of nodes were used? How many writes/sec/node can be sustained for
  24+
  hours?
 
 
 
  On Sun, Aug 22, 2010 at 8:22 PM, Peter Schuller
  peter.schul...@infidyne.com wrote:
 
  I only sifted recent history of this thread (for time reasons), but:
 
   You have started a major compaction which is now competing with those
   near constant minor compactions for far too little I/O (3 SATA drives
   in RAID0, perhaps?).  Normally, this would result in a massive
   ballooning of your heap use as all sorts of activities (like memtable
   flushes) backed up, as well.
 
  AFAIK memtable flushing is unrelated to compaction in the sense that
  they occur concurrently and don't block each other (except to the
  extent that they truly do compete for e.g. disk or CPU resources).
 
  While small memtables do indeed mean more compaction activity in
  total, the expensiveness of any given compaction should not be
  severely affecting.
 
  As far as I can tell, the two primary effects of small memtable sizes
  are:
 
  * An increase in total amount of compaction work done in total for a
  given database size.
  * An increase in the number of sstables that may accumulate while
  larger compactions are running.
  ** That in turn is particularly relevant because it can generate a lot
  of seek-bound activity; consider for example range queries that end up
  spanning 10 000 files on disk.
 
  If memtable flushes are not able to complete fast enough to cope with
  write activity, even if that is the case only during concurrenct
  compaction (for whatever reason), that suggests to me that write
  activity is too high. Increasing memtable sizes may help on average
  due to decreased compaction work, but I don't see why it would
  significantly affect the performance one compactions *do* in fact run.
 
  With respect to timeouts on writes: I make no claims as to whether it
  is expected, because I have not yet investigated, but I definitely see
  sporadic slowness when benchmarking high-throughput writes on a
  cassandra trunk snapshot somewhere between 0.6 and 0.7. This occurs
  even when writing to a machine where the commit log and data
  directories are both on separate RAID volumes that are battery backed
  and should have no trouble eating write bursts (and the data is such
  that one is CPU bound  rather than diskbound on average; so it only
  needs to eat bursts).
 
  I've had to add re-try to the benchmarking tool (or else up the
  timeout) because the default was not enough.
 
  I have not investigated exactly why this happens but it's an
  interesting effect that as far as I can tell should not be there.
  Haver other people done high-throughput writes (to the point of CPU
  saturation) over extended periods of time while consistently seeing
  low latencies (consistencty meaning never exceeding hundreds of ms
  over several days)?
 
 
  --
  / Peter Schuller
 
 




Re: Cassandra Nodes Freeze/Down for ConcurrentMarkSweep GC?

2010-08-22 Thread Benjamin Black
http://riptano.blip.tv/file/4012133/

On Sun, Aug 22, 2010 at 12:11 PM, Moleza Moleza mole...@gmail.com wrote:
 Hi,
 I am setting up a cluster on a linux box.
 Everything seems to be working great and I am watching the ring with:
 watch -d -n 2 nodetool -h localhost ring
 Suddenly, I see that one of the nodes just went down (at 14:07):
 Status changed from Up to Down.
 13 minutes later (without any intervention) the node comes back Up (by 
 itself).
 I check the logs (see at end of text) on that node and see that there
 is nothing in the log from 14:07 until 14:20 (13 minutes later).
 I also notice the GC ConcurrentMarkSweep took 13 minutes.
 Here are my questions:
 [1] Is this behavior normal?
 [2] Has it been observed by someone else before?
 [3] The node being down means that nodetool, and any other client,
 wont be able to connect to it (clients should use other nodes in
 cluster to write data). Correct?
 [4] Is GC ConcurrentMarkSweep a Stop-The-World situation? Where the
 JVM cannot do anything else? Hence then node is technically Down?
 Correct?
 [5] Why is this GC taking such a long time? (see JMV ARGS posted bellow).
 [6] Any JMV Args (switches) I can use to prevent this?
 --
 JVM_OPTS= \
       -Dprog=Cassandra \
       -ea \
       -Xms12G \
       -Xmx12G \
       -XX:+UseParNewGC \
       -XX:+UseConcMarkSweepGC \
       -XX:+CMSParallelRemarkEnabled \
       -XX:SurvivorRatio=8 \
       -XX:MaxTenuringThreshold=1 \
       -XX:+HeapDumpOnOutOfMemoryError \
       -Dcom.sun.management.jmxremote.port=8080 \
       -Dcom.sun.management.jmxremote.ssl=false \
       -Dcom.sun.management.jmxremote.authenticate=false

 
  Log Extract ##
 INFO [GC inspection] 2010-08-22 14:06:48,622 GCInspector.java (line
 116) GC for ParNew: 235 ms, 134504976 reclaimed leaving 12721498296
 used; max is 13005881344
 INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java
 (line 357)HintsColumnFamily has reached its threshold; switching in a
 fresh Memtable at
 CommitLogContext(file='/var/nes/data1/cassandra_commitlog/CommitLog-1282500306160.log',
 position=55517352)
 INFO [FLUSH-TIMER] 2010-08-22 14:19:45,429 ColumnFamilyStore.java
 (line 609) Enqueuing flush of
 memtable-hintscolumnfam...@1935604258(3147 bytes, 433 operations)
 INFO [FLUSH-WRITER-POOL:1] 2010-08-22 14:19:45,430 Memtable.java (line
 148) Writing memtable-hintscolumnfam...@1935604258(3147 bytes, 433
 operations)
 INFO [GC inspection] 2010-08-22 14:19:45,917 GCInspector.java (line
 116) GC for ParNew: 215 ms, 130254256 reclaimed leaving 12742982208
 used; max is 13005881344
 INFO [GC inspection] 2010-08-22 14:19:45,973 GCInspector.java (line
 116)GC for ConcurrentMarkSweep: 775679 ms, 12685881488 reclaimed
 leaving 196692400 used; max is 13005881344
 --



Re: Node OOM Problems

2010-08-21 Thread Benjamin Black
Perhaps I missed it in one of the earlier emails, but what is your
disk subsystem config?

On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote:
 I am already running with those options. I thought maybe that is why they
 never get completed as they keep pushed pushed down in priority? I am
 getting timeouts now and then but for the most part the cluster keeps
 running. Is it normal/ok for the repair and compaction to take so long? It
 has been over 12 hours since they were submitted.

 On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com wrote:

 yes, the AES is the repair.

 if you are running linux, try adding the options to reduce compaction
 priority from
 http://wiki.apache.org/cassandra/PerformanceTuning

 On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote:
  I could tell from munin that the disk utilization was getting crazy
  high,
  but the strange thing is that it seemed to stall. The utilization went
  way
  down and everything seemed to flatten out. Requests piled up and the
  node
  was doing nothing. It did not crash but was left in a useless state. I
  do
  not have access to the tpstats when that occurred. Attached is the munin
  chart, and you can see the flat line after Friday at noon.
 
  I have reduced the writers from 10 per to 8 per node and they seem to be
  still running, but I am afraid they are barely hanging on. I ran
  nodetool
  repair after rebooting the failed node and I do not think the repair
  ever
  completed. I also later ran compact on each node and some it finished
  but
  some it did not. Below is the tpstats currently for the node I had to
  restart. Is the AE-SERVICE-STAGE the repair and compaction queued up?
  It
  seems several nodes are not getting enough free cycles to keep up. They
  are
  not timing out (30 sec timeout) for the most part but they are also not
  able
  to compact. Is this normal? Do I just give it time? I am migrating 2-3
  TB of
  data from Mysql so the load is constant and will be for days and it
  seems
  even with only 8 writer processes per node I am maxed out.
 
  Thanks for the advice. Any more pointers would be greatly appreciated.
 
  Pool Name    Active   Pending  Completed
  FILEUTILS-DELETE-POOL 0 0   1868
  STREAM-STAGE  1 1  2
  RESPONSE-STAGE    0 2  769158645
  ROW-READ-STAGE    0 0 140942
  LB-OPERATIONS 0 0  0
  MESSAGE-DESERIALIZER-POOL 1 0 1470221842
  GMFD  0 0 169712
  LB-TARGET 0 0  0
  CONSISTENCY-MANAGER   0 0  0
  ROW-MUTATION-STAGE    0 1  865124937
  MESSAGE-STREAMING-POOL    0 0  6
  LOAD-BALANCER-STAGE   0 0  0
  FLUSH-SORTER-POOL 0 0  0
  MEMTABLE-POST-FLUSHER 0 0   8088
  FLUSH-WRITER-POOL 0 0   8088
  AE-SERVICE-STAGE  1    34 54
  HINTED-HANDOFF-POOL   0 0  7
 
 
 
  On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra b...@dehora.net wrote:
 
  On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
 
    WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
   MessageDeserializationTask.java (line 47) dropping message
   (1,078,378ms past timeout)
    WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
   MessageDeserializationTask.java (line 47) dropping message
   (1,078,378ms past timeout)
 
  MESSAGE-DESERIALIZER-POOL usually backs up when other stages are bogged
  downstream, (eg here's Ben Black describing the symptom when the
  underlying cause is running out of disk bandwidth, well worth a watch
  http://riptano.blip.tv/file/4012133/).
 
  Can you send all of nodetool tpstats?
 
  Bill
 
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com




Re: Node OOM Problems

2010-08-21 Thread Benjamin Black
My guess is that you have (at least) 2 problems right now:

You are writing 10k ops/sec to each node, but have default memtable
flush settings.  This is resulting in memtable flushing every 30
seconds (default ops flush setting is 300k).  You thus have a
proliferation of tiny sstables and are seeing minor compactions
triggered every couple of minutes.

You have started a major compaction which is now competing with those
near constant minor compactions for far too little I/O (3 SATA drives
in RAID0, perhaps?).  Normally, this would result in a massive
ballooning of your heap use as all sorts of activities (like memtable
flushes) backed up, as well.

I suggest you increase the memtable flush ops to at least 10 (million)
if you are going to sustain that many writes/sec, along with an
increase in the flush MB to match, based on your typical bytes/write
op.  Long term, this level of write activity demands a lot faster
storage (iops and bandwidth).
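
To put rough numbers on the flush interval implied above:

    300,000 ops (default)      / 10,000 writes/sec  =  ~30 seconds between flushes
    10,000,000 ops (suggested) / 10,000 writes/sec  =  ~17 minutes between flushes

assuming the write rate per node stays roughly constant.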


b
On Sat, Aug 21, 2010 at 2:18 AM, Wayne wav...@gmail.com wrote:
 I am already running with those options. I thought maybe that is why they
 never get completed as they keep pushed pushed down in priority? I am
 getting timeouts now and then but for the most part the cluster keeps
 running. Is it normal/ok for the repair and compaction to take so long? It
 has been over 12 hours since they were submitted.

 On Sat, Aug 21, 2010 at 10:56 AM, Jonathan Ellis jbel...@gmail.com wrote:

 yes, the AES is the repair.

 if you are running linux, try adding the options to reduce compaction
 priority from
 http://wiki.apache.org/cassandra/PerformanceTuning

 On Sat, Aug 21, 2010 at 3:17 AM, Wayne wav...@gmail.com wrote:
  I could tell from munin that the disk utilization was getting crazy
  high,
  but the strange thing is that it seemed to stall. The utilization went
  way
  down and everything seemed to flatten out. Requests piled up and the
  node
  was doing nothing. It did not crash but was left in a useless state. I
  do
  not have access to the tpstats when that occurred. Attached is the munin
  chart, and you can see the flat line after Friday at noon.
 
  I have reduced the writers from 10 per to 8 per node and they seem to be
  still running, but I am afraid they are barely hanging on. I ran
  nodetool
  repair after rebooting the failed node and I do not think the repair
  ever
  completed. I also later ran compact on each node and some it finished
  but
  some it did not. Below is the tpstats currently for the node I had to
  restart. Is the AE-SERVICE-STAGE the repair and compaction queued up?
  It
  seems several nodes are not getting enough free cycles to keep up. They
  are
  not timing out (30 sec timeout) for the most part but they are also not
  able
  to compact. Is this normal? Do I just give it time? I am migrating 2-3
  TB of
  data from Mysql so the load is constant and will be for days and it
  seems
  even with only 8 writer processes per node I am maxed out.
 
  Thanks for the advice. Any more pointers would be greatly appreciated.
 
  Pool Name    Active   Pending  Completed
  FILEUTILS-DELETE-POOL 0 0   1868
  STREAM-STAGE  1 1  2
  RESPONSE-STAGE    0 2  769158645
  ROW-READ-STAGE    0 0 140942
  LB-OPERATIONS 0 0  0
  MESSAGE-DESERIALIZER-POOL 1 0 1470221842
  GMFD  0 0 169712
  LB-TARGET 0 0  0
  CONSISTENCY-MANAGER   0 0  0
  ROW-MUTATION-STAGE    0 1  865124937
  MESSAGE-STREAMING-POOL    0 0  6
  LOAD-BALANCER-STAGE   0 0  0
  FLUSH-SORTER-POOL 0 0  0
  MEMTABLE-POST-FLUSHER 0 0   8088
  FLUSH-WRITER-POOL 0 0   8088
  AE-SERVICE-STAGE  1    34 54
  HINTED-HANDOFF-POOL   0 0  7
 
 
 
  On Fri, Aug 20, 2010 at 11:56 PM, Bill de hÓra b...@dehora.net wrote:
 
  On Fri, 2010-08-20 at 19:17 +0200, Wayne wrote:
 
    WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
   MessageDeserializationTask.java (line 47) dropping message
   (1,078,378ms past timeout)
    WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-20 16:57:02,602
   MessageDeserializationTask.java (line 47) dropping message
   (1,078,378ms past timeout)
 
  MESSAGE-DESERIALIZER-POOL usually backs up when other stages are bogged
  downstream, (eg here's Ben Black describing the symptom when the
  underlying cause is running out of disk bandwidth, well worth a watch
  http://riptano.blip.tv/file/4012133/).
 
  Can you send all of nodetool tpstats?
 
  Bill
 
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 

Re: Privileges

2010-08-21 Thread Benjamin Black
No.

On Sat, Aug 21, 2010 at 4:19 PM, Mark static.void@gmail.com wrote:
  Is there anyway to remove drop column family/keyspace privileges?



Re: Privileges

2010-08-21 Thread Benjamin Black
My mistake, the access levels in 0.7 do now distinguish these
operations (at access level FULL).

On Sat, Aug 21, 2010 at 4:19 PM, Mark static.void@gmail.com wrote:
  Is there anyway to remove drop column family/keyspace privileges?



Re: Privileges

2010-08-21 Thread Benjamin Black
For reference, I learned this from reading the source:
thrift/CassandraServer.java

On Sat, Aug 21, 2010 at 4:19 PM, Mark static.void@gmail.com wrote:
  Is there anyway to remove drop column family/keyspace privileges?



Re: questions regarding read and write in cassandra

2010-08-19 Thread Benjamin Black
More recent.  Newest timestamp always wins.  And I am moving this to
the user list (again) so it can be with all its friendly threads on
the exact same topic.

On Thu, Aug 19, 2010 at 10:22 AM, Maifi Khan maifi.k...@gmail.com wrote:
 Hi David
 Thanks for your reply.
 But what happens if I read and get 2 nodes has value 10 with older
 time stamp and the third node has 20 with more recent time stamp?
 Will cassandra return 10(majority) or 20(more recent)?

 thanks
 Maifi

 On Thu, Aug 19, 2010 at 1:20 PM, David Timothy Strauss
 da...@fourkitchens.com wrote:
 The quorum write would fail, but the data would not be rolled back. Assuming 
 the offline nodes recover, the data would eventually replicate.

 This question belongs on the user list, though.

 -Original Message-
 From: Maifi Khan maifi.k...@gmail.com
 Date: Thu, 19 Aug 2010 13:00:47
 To: d...@cassandra.apache.org
 Reply-To: d...@cassandra.apache.org
 Subject: questions regarding read and write in cassandra

 Hi
 I have a question in the following scenario.
 Say we have 10 nodes, Replication factor is 5.
 Now, say, for Row X, Column Y, data is replicated to node 1,2,3,4,5
 and current value is 10
 Say, I issue a write command with value “20” to Row X, column Y with
 quorum(n/2+1=3 nodes).  Say it updated 1 and 2 and failed to update
 any other node. So it failed to write to 3 nodes. What happens in such
 scenario?

 Q: Will the user returned failed?

 Now, assuming that the write failed.
 What value will I see if I want to read the same cell with Quorum?
 Now, say I read the data with quorum. It read from 1, 4, 5 and see
 that node 1 has the most recent data (“20” which is still there as
 cassandra does not roll back).
 Will it will return the data “20” to the user or will it return the
 earlier value 10 as it is returned by the node 4 and 5?

 Also, does read repair tries to propagate 20 to all the replicas
 although cassandra returned failed to the user?


 thanks




Re: Cassandra gem

2010-08-18 Thread Benjamin Black
great, thanks!

On Tue, Aug 17, 2010 at 11:30 PM, Mark static.void@gmail.com wrote:
  On 8/17/10 5:44 PM, Benjamin Black wrote:

 Updated code is now in my master branch, with the reversion to 10.0.0.
  Please let me know of further trouble.


 b

 On Tue, Aug 17, 2010 at 8:31 AM, Mark static.void@gmail.com  wrote:

  On 8/16/10 11:37 PM, Benjamin Black wrote:

 I'm testing with the default cassandra.yaml.

 I cannot reproduce the output in that gist, however:

 thrift_client = client.instance_variable_get(:@client)

 => nil
 Also, the Thrift version for 0.7 is 11.0.0, according to the code I
 have.  Can someone comment on whether 0.7 beta1 is at Thrift interface
 version 10.0.0 or 11.0.0?


 b

 On Mon, Aug 16, 2010 at 9:03 PM, Mark static.void@gmail.com
  wrote:

  On 8/16/10 8:51 PM, Mark wrote:

  On 8/16/10 6:19 PM, Benjamin Black wrote:

 client = Cassandra.new('system', '127.0.0.1:9160')

 Brand new download of beta-0.7.0-beta1

 http://gist.github.com/528357

 Which thrift/thrift_client versions are you using?

 FYI also tested similar setup on another machine and same results. Is
 there
 any configuration change I need in cassandra.yaml or something?

 thrift_client = client.instance_variable_get(:@client)

 The above client will only be instantiated after making (or attempting in
 my
 case) a request.


 Works like a charm. Thanks



Re: Cassandra gem

2010-08-17 Thread Benjamin Black
thrift (0.2.0.4)
thrift_client (0.4.6, 0.4.3)

On Mon, Aug 16, 2010 at 8:51 PM, Mark static.void@gmail.com wrote:
  On 8/16/10 6:19 PM, Benjamin Black wrote:

 client = Cassandra.new('system', '127.0.0.1:9160')

 Brand new download of beta-0.7.0-beta1

 http://gist.github.com/528357

 Which thrift/thrift_client versions are you using?



Re: Cassandra gem

2010-08-17 Thread Benjamin Black
I'm testing with the default cassandra.yaml.

I cannot reproduce the output in that gist, however:

 thrift_client = client.instance_variable_get(:@client)
=> nil


Also, the Thrift version for 0.7 is 11.0.0, according to the code I
have.  Can someone comment on whether 0.7 beta1 is at Thrift interface
version 10.0.0 or 11.0.0?
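
One way to settle it is to ask a running node directly; a minimal Java Thrift
sketch (default host/port, framed transport as the 0.7 beta ships with):

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class ThriftVersionCheck {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
            transport.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            // Prints the Thrift interface version the server reports, e.g. "10.0.0"
            System.out.println(client.describe_version());
            transport.close();
        }
    }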


b

On Mon, Aug 16, 2010 at 9:03 PM, Mark static.void@gmail.com wrote:
  On 8/16/10 8:51 PM, Mark wrote:

  On 8/16/10 6:19 PM, Benjamin Black wrote:

 client = Cassandra.new('system', '127.0.0.1:9160')

 Brand new download of beta-0.7.0-beta1

 http://gist.github.com/528357

 Which thrift/thrift_client versions are you using?

 FYI also tested similar setup on another machine and same results. Is there
 any configuration change I need in cassandra.yaml or something?



Re: Cassandra gem

2010-08-17 Thread Benjamin Black
Then this may be the issue.  I'll see if I can regenerate something
with 10.0.0 version tomorrow.

On Mon, Aug 16, 2010 at 11:45 PM, Thorvaldsson Justus
justus.thorvalds...@svenskaspel.se wrote:
 Using beta, made a describe_version(), got 10.0.0 as reply, aint using gem 
 though, just thrift from java
 /Justus

 -Original Message-
 From: Benjamin Black [mailto:b...@b3k.us]
 Sent: 17 August 2010 08:37
 To: user@cassandra.apache.org
 Subject: Re: Cassandra gem

 I'm testing with the default cassandra.yaml.

 I cannot reproduce the output in that gist, however:

 thrift_client = client.instance_variable_get(:@client)
 => nil


 Also, the Thrift version for 0.7 is 11.0.0, according to the code I
 have.  Can someone comment on whether 0.7 beta1 is at Thrift interface
 version 10.0.0 or 11.0.0?


 b

 On Mon, Aug 16, 2010 at 9:03 PM, Mark static.void@gmail.com wrote:
  On 8/16/10 8:51 PM, Mark wrote:

  On 8/16/10 6:19 PM, Benjamin Black wrote:

 client = Cassandra.new('system', '127.0.0.1:9160')

 Brand new download of beta-0.7.0-beta1

 http://gist.github.com/528357

 Which thrift/thrift_client versions are you using?

 FYI also tested similar setup on another machine and same results. Is there
 any configuration change I need in cassandra.yaml or something?




Re: move data between clusters

2010-08-17 Thread Benjamin Black
without answering your whole question, just fyi: there is a matching
json2sstable command for going the other direction.

On Tue, Aug 17, 2010 at 10:48 AM, Artie Copeland yeslinux@gmail.com wrote:
 what is the best way to move data between clusters.  we currently have a 4
 node prod cluster with 80G of data and want to move it to a dev env with 3
 nodes.  we have plenty of disk were looking into nodetool snapshot, but it
 look like that wont work because of the system tables.  sstabletojson does
 look like it would work as it would miss the index files.  am i missing
 something?  have others tried to do the same and been successful.
 thanx
 artie

 --
 http://yeslinux.org
 http://yestech.org



Re: indexing rows ordered by int

2010-08-17 Thread Benjamin Black
http://code.google.com/p/redis/wiki/SortedSets
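
To make that concrete, a small sketch of a vote-ordered index kept in a Redis
sorted set (using the Jedis Java client; key and member names are made up):

    import java.util.Set;
    import redis.clients.jedis.Jedis;

    public class VoteIndex {
        public static void main(String[] args) {
            Jedis redis = new Jedis("localhost"); // assumes a local Redis instance

            // Record a vote: bump the comment's score in the sorted set. Redis keeps
            // the set ordered by score, so there is no separate re-indexing step.
            redis.zincrby("story:42:comments-by-votes", 1.0, "comment:1001");

            // Read the current top 10 comments, highest score first.
            Set<String> top = redis.zrevrange("story:42:comments-by-votes", 0, 9);
            System.out.println(top);
        }
    }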

On Tue, Aug 17, 2010 at 12:33 PM, S Ahmed sahmed1...@gmail.com wrote:
 So when using Redis, how do you go about updating the index?
 Do you serialize changes to the index i.e. when someone votes, you then
 update the index?
 Little confused as to how to go about updating a huge index.
 Say you have 1 million stores, and you want to order by the top votes, how
 would you maintain such an index since they are being constantly voted on.
 On Sun, Aug 15, 2010 at 10:48 PM, Chris Goffinet c...@chrisgoffinet.com
 wrote:

 Digg is using redis for such a feature as well.  We use it on the MyNews -
 Top in 24 hours. Since we need timestamp ordering + sorting by how many
 friends touch a story.

 -Chris

 On Aug 15, 2010, at 7:34 PM, Benjamin Black wrote:

  http://code.google.com/p/redis/
 
  On Sat, Aug 14, 2010 at 11:51 PM, S Ahmed sahmed1...@gmail.com wrote:
  For CF that I need to perform range scans on, I create separate CF that
  have
  custom ordering.
  Say a CF holds comments on a story (like comments on a reddit or digg
  story
  post)
  So if I need to order comments by votes, it seems I have to re-index
  every
  time someone votes on a comment (or batch it every x minutes).
 
 
  Right now I think I have to pull all the comments into memory, then
  sort by
  votes, then re-write the index.
  Are there any best-practises for this type of index?





Re: Cassandra gem

2010-08-17 Thread Benjamin Black
Updated code is now in my master branch, with the reversion to 10.0.0.
 Please let me know of further trouble.


b

On Tue, Aug 17, 2010 at 8:31 AM, Mark static.void@gmail.com wrote:
  On 8/16/10 11:37 PM, Benjamin Black wrote:

 I'm testing with the default cassandra.yaml.

 I cannot reproduce the output in that gist, however:

 thrift_client = client.instance_variable_get(:@client)

 =>  nil
 Also, the Thrift version for 0.7 is 11.0.0, according to the code I
 have.  Can someone comment on whether 0.7 beta1 is at Thrift interface
 version 10.0.0 or 11.0.0?


 b

 On Mon, Aug 16, 2010 at 9:03 PM, Mark static.void@gmail.com  wrote:

  On 8/16/10 8:51 PM, Mark wrote:

  On 8/16/10 6:19 PM, Benjamin Black wrote:

 client = Cassandra.new('system', '127.0.0.1:9160')

 Brand new download of beta-0.7.0-beta1

 http://gist.github.com/528357

 Which thrift/thrift_client versions are you using?

 FYI also tested similar setup on another machine and same results. Is
 there
 any configuration change I need in cassandra.yaml or something?


 thrift_client = client.instance_variable_get(:@client)

 The above client will only be instantiated after making (or attempting in my
 case) a request.




Re: cassandra for a inbox search with high reading qps

2010-08-17 Thread Benjamin Black
On Tue, Aug 17, 2010 at 7:55 PM, Chen Xinli chen.d...@gmail.com wrote:
 Hi,

 We are going to use cassandra for searching purpose like inbox search.
 The reading qps is very high, we'd like to use ConsitencyLevel.One for
 reading and disable read-repair at the same time.


In 0.7 you can set a probability for read repair, but disabling it is
a spectacularly bad idea.  Any write problems on a node will result in
persistent inconsistency.

 For reading consistency in this condition, the writing should use
 ConsistencyLevel.ALL. But the writing will fail if one node fails.

You are free to read and write with consistency levels where R + W <= N,
it just means you have weaker consistency guarantees.
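As a quick illustration of that arithmetic (the replication factor and level choices below are just examples):

n = 3   # replication factor (example)
{ 'ONE reads / ONE writes'       => [1, 1],
  'QUORUM reads / QUORUM writes' => [2, 2],
  'ONE reads / ALL writes'       => [1, 3] }.each do |name, rw|
  r, w = rw
  strong = r + w > n   # reads overlap the latest write only when R + W > N
  puts "#{name}: R+W=#{r + w} vs N=#{n} -> #{strong ? 'strong' : 'weaker'} consistency"
end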

 We want a ConsistencyLevel for writing/reading such that:
 1. writing will succeed if there is any node alive for this key
 2. reading will not be forwarded to a node that has just recovered and is
 receiving hinted handoff

 So that, if some node fails, the other replica nodes will receive the data
 and serve reads successfully;
 when the failed node recovers, it will receive hinted handoff from the other
 nodes and will not serve reads until hinted handoff is done.

 Does cassandra support the cases already? or should I modify the code to
 meet our requirements?


You are phrasing these requirements in terms of a specific
implementation.  What are your actual consistency goals?  If node
failure is such a common occurrence in your system, you are going to
have _numerous_ problems.


b


Re: data deleted came back after 9 days.

2010-08-17 Thread Benjamin Black
On Tue, Aug 17, 2010 at 7:49 PM, Zhong Li z...@voxeo.com wrote:
 That data was inserted on one node, then deleted on a remote node less
 than 2 seconds later. So it is very possible some node lost the tombstone when
 the connection was lost.
 My question: can a ConsistencyLevel.ALL read retrieve the lost tombstone
 instead of running repair?


No.  Read repair does not replay operations.  You must run nodetool repair.


b


Re: File write errors but cassandra isn't crashing

2010-08-16 Thread Benjamin Black
Useful config option, perhaps?

On Mon, Aug 16, 2010 at 8:51 AM, Jonathan Ellis jbel...@gmail.com wrote:
 That's a tough call -- you can also come up with scenarios where you'd
 rather have it read-only than completely dead.

 On Wed, Aug 11, 2010 at 12:38 PM, Ran Tavory ran...@gmail.com wrote:
 Due to an administrative error one of the hosts in the cluster lost permission
 to write to its data directory.
 So I started seeing errors in the log; however, the server continued serving
 traffic. It wasn't able to compact and do other write operations, but it
 didn't crash.
 I was wondering whether that's by design and, if so, whether it is a good one... I
 guess I want to know if really bad things happen to my cluster...
 logs look like that...

  INFO [FLUSH-TIMER] 2010-08-11 07:53:14,683 ColumnFamilyStore.java (line
 357) KvAds has reached its threshold; switching in a fresh Memtable at
 CommitLogContext(file='/outbrain/cassandra/commitlog/Commi
 tLog-1281505164614.log', position=88521163)
  INFO [FLUSH-TIMER] 2010-08-11 07:53:14,683 ColumnFamilyStore.java (line
 609) Enqueuing flush of Memtable(KvAds)@851225759
  INFO [FLUSH-WRITER-POOL:1] 2010-08-11 07:53:14,684 Memtable.java (line 148)
 Writing Memtable(KvAds)@851225759
 ERROR [FLUSH-WRITER-POOL:1] 2010-08-11 07:53:14,688
 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
 java.util.concurrent.ExecutionException: java.lang.RuntimeException:
 java.io.FileNotFoundException:
 /outbrain/cassandra/data/outbrain_kvdb/KvAds-tmp-249-Data.db (Permission
 denied)
         at
 java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
         at java.util.concurrent.FutureTask.get(FutureTask.java:83)
         at
 org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
         at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
         at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.RuntimeException: java.io.FileNotFoundException:
 /outbrain/cassandra/data/outbrain_kvdb/KvAds-tmp-249-Data.db (Permission
 denied)
         at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
         at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
         at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 ... more



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com



Re: Cassandra gem

2010-08-16 Thread Benjamin Black
If you pulled before a couple hours ago and did not use the 'trunk'
branch, then you don't have current code.  I merged the trunk branch
to master earlier today and sent a pull request for the fauna repo to
get the changes, as well.  Also fixed a bug another user found when
running with Ruby 1.9.

Summary: pull again, use master, have fun.  If it still doesn't work,
please open an issue to me.


b

On Mon, Aug 16, 2010 at 2:13 PM, Mark static.void@gmail.com wrote:

 Just upgraded my cassandra gem today to b/cassandra fork and noticed that
 the transport changed. I re-enabled TFramedTransport in cassandra.yml but my
 client no longer works. I keep receiving the following error.

 Thrift::ApplicationException: describe_keyspace failed: unknown result
    from
 workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:346:in
 `recv_describe_keyspace'
    from
 workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:335:in
 `describe_keyspace'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in
 `send'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in
 `send_rpc'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:164:in
 `send_rpc'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:63:in
 `proxy'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:154:in
 `proxy'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:53:in
 `handled_proxy'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:150:in
 `handled_proxy'
    from
 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:23:in
 `describe_keyspace'
    from (irb):14

 Any clues?



Re: Cassandra gem

2010-08-16 Thread Benjamin Black
can you gist the code?

On Mon, Aug 16, 2010 at 5:46 PM, Mark static.void@gmail.com wrote:
  On 8/16/10 3:58 PM, Benjamin Black wrote:

 If you pulled before a couple hours ago and did not use the 'trunk'
 branch, then you don't have current code.  I merged the trunk branch
 to master earlier today and sent a pull request for the fauna repo to
 get the changes, as well.  Also fixed a bug another user found when
 running with Ruby 1.9.

 Summary: pull again, use master, have fun.  If it still doesn't work,
 please open an issue to me.


 b

 On Mon, Aug 16, 2010 at 2:13 PM, Markstatic.void@gmail.com  wrote:

 Just upgraded my cassandra gem today to b/cassandra fork and noticed that
 the transport changed. I re-enabled TFramedTransport in cassandra.yml but
 my
 client no longer works. I keep receiving the following error.

 Thrift::ApplicationException: describe_keyspace failed: unknown result
    from

 workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:346:in
 `recv_describe_keyspace'
    from

 workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:335:in
 `describe_keyspace'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in
 `send'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in
 `send_rpc'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:164:in
 `send_rpc'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:63:in
 `proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:154:in
 `proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:53:in
 `handled_proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:150:in
 `handled_proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:23:in
 `describe_keyspace'
    from (irb):14

 Any clues?

 Still getting the same error. describe_keyspaces doesn't work. Does it work
 for you?

 I am using:

 apache-cassandra-0.7.0-beta1-bin,
 thrift (0.2.0.4)
 thrift_client (0.4.6)

 Any clue? Thanks!




Re: Cassandra gem

2010-08-16 Thread Benjamin Black
$ irb
>> require 'lib/cassandra/0.7'
=> true
>> client = Cassandra.new('system', '127.0.0.1:9160')
=> #<Cassandra:2160486220, @keyspace="system", @schema={}, @servers=["127.0.0.1:9160"]>
>> client.keyspaces
=> ["system"]
>> client.partitioner
=> "org.apache.cassandra.dht.RandomPartitioner"


On Mon, Aug 16, 2010 at 5:46 PM, Mark static.void@gmail.com wrote:
  On 8/16/10 3:58 PM, Benjamin Black wrote:

 If you pulled before a couple hours ago and did not use the 'trunk'
 branch, then you don't have current code.  I merged the trunk branch
 to master earlier today and sent a pull request for the fauna repo to
 get the changes, as well.  Also fixed a bug another user found when
 running with Ruby 1.9.

 Summary: pull again, use master, have fun.  If it still doesn't work,
 please open an issue to me.


 b

 On Mon, Aug 16, 2010 at 2:13 PM, Markstatic.void@gmail.com  wrote:

 Just upgraded my cassandra gem today to b/cassandra fork and noticed that
 the transport changed. I re-enabled TFramedTransport in cassandra.yml but
 my
 client no longer works. I keep receiving the following error.

 Thrift::ApplicationException: describe_keyspace failed: unknown result
    from

 workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:346:in
 `recv_describe_keyspace'
    from

 workspace/vendor/plugins/cassandra/lib/../vendor/0.7/gen-rb/cassandra.rb:335:in
 `describe_keyspace'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in
 `send'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:67:in
 `send_rpc'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:164:in
 `send_rpc'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:63:in
 `proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:154:in
 `proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:53:in
 `handled_proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:150:in
 `handled_proxy'
    from

 workspace/vendor/plugins/thrift_client/lib/thrift_client/abstract_thrift_client.rb:23:in
 `describe_keyspace'
    from (irb):14

 Any clues?

 Still getting the same error. describe_keyspaces doesn't work. Does it work
 for you?

 I am using:

 apache-cassandra-0.7.0-beta1-bin,
 thrift (0.2.0.4)
 thrift_client (0.4.6)

 Any clue? Thanks!




Re: indexing rows ordered by int

2010-08-15 Thread Benjamin Black
http://code.google.com/p/redis/

On Sat, Aug 14, 2010 at 11:51 PM, S Ahmed sahmed1...@gmail.com wrote:
 For CFs that I need to perform range scans on, I create separate CFs that have
 custom ordering.
 Say a CF holds comments on a story (like comments on a reddit or digg story
 post)
 So if I need to order comments by votes, it seems I have to re-index every
 time someone votes on a comment (or batch it every x minutes).


 Right now I think I have to pull all the comments into memory, then sort by
 votes, then re-write the index.
 Are there any best-practises for this type of index?


Re: Data Distribution / Replication

2010-08-13 Thread Benjamin Black
On Fri, Aug 13, 2010 at 9:48 AM, Oleg Anastasjev olega...@gmail.com wrote:
 Benjamin Black b at b3k.us writes:

  3. I waited for the data to replicate, which didn't happen.

 Correct, you need to run nodetool repair because the nodes were not
 present when the writes came in.  You can also use a higher
 consistency level to force read repair before returning data, which
 will incrementally repair things.


 Alternatively you could configure new nodes to bootstrap in storage-conf.xml.
 If
 bootstrap is enabled you'll get data replicated on them as soon as they join
 the cluster.


My recommendation is to leave Autobootstrap disabled, copy the
datafiles over, and then run cleanup.  It is faster and more reliable
than streaming, in my experience.


b


Re: Data Distribution / Replication

2010-08-13 Thread Benjamin Black
Number of bugs I've hit doing this with scp: 0
Number of bugs I've hit with streaming: 2 (and others found more)

Also easier to monitor progress, manage bandwidth, etc.  I just prefer
using specialized tools that are really good at specific things.  This
is such a case.

b

On Fri, Aug 13, 2010 at 2:05 PM, Bill de hÓra b...@dehora.net wrote:
 On Fri, 2010-08-13 at 09:51 -0700, Benjamin Black wrote:

 My recommendation is to leave Autobootstrap disabled, copy the
 datafiles over, and then run cleanup.  It is faster and more reliable
 than streaming, in my experience.

 What is less reliable about streaming?

 Bill




Re: Data Distribution / Replication

2010-08-12 Thread Benjamin Black
On Thu, Aug 12, 2010 at 8:30 AM, Stefan Kaufmann sta...@gmail.com wrote:
 Hello again,

 Over the last few days I started several tests with Cassandra and learned quite a few
 facts.

 However, of course, there are still enough things I need to
 understand. One thing is, how the data replication works.
 For my Testing:
 1. I set the replication factor to 3, started with 1 active node (the
 seed) and inserted some test keys.

This is not a correct concept of what a seed is.  I suggest you not
use the word 'seed' for it.

 2. I started 2 more nodes, which joined the cluster.
 3. I waited for the data to replicate, which didn't happen.

Correct, you need to run nodetool repair because the nodes were not
present when the writes came in.  You can also use a higher
consistency level to force read repair before returning data, which
will incrementally repair things.

 4. I inserted more keys, and it looked like they were distributed to
 all three nodes.


Correct, they were up at the time and received the write operations directly.

Seems like you might benefit from reading the operations wiki:
http://wiki.apache.org/cassandra/Operations

b


Re: Growing commit log directory.

2010-08-09 Thread Benjamin Black
what does the io load look like on those nodes?

On Mon, Aug 9, 2010 at 1:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 I have a 16 node 6.3 cluster and two nodes from my cluster are giving
 me major headaches.

 10.71.71.56   Up         58.19 GB
 10827166220211678382926910108067277    |   ^
 10.71.71.61   Down       67.77 GB
 123739042516704895804863493611552076888    v   |
 10.71.71.66   Up         43.51 GB
 127605887595351923798765477786913079296    |   ^
 10.71.71.59   Down       90.22 GB
 139206422831293007780471430312996086499    v   |
 10.71.71.65   Up         22.97 GB
 148873535527910577765226390751398592512    |   ^

 The symptoms I am seeing are nodes 61 and nodes 59 have huge 6 GB +
 commit log directories. They keep growing, along with memory usage,
 eventually the logs start showing GCInspection errors and then the
 nodes will go OOM

 INFO 14:20:01,296 Creating new commitlog segment
 /var/lib/cassandra/commitlog/CommitLog-1281378001296.log
  INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving
 7955651792 used; max is 9773776896
  INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving
 8137412920 used; max is 9773776896
  INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving
 8310139720 used; max is 9773776896
  INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving
 8480136592 used; max is 9773776896
  INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving
 8648872520 used; max is 9773776896
  INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving
 8816581312 used; max is 9773776896
  INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving
 8986063136 used; max is 9773776896
  INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving
 9153134392 used; max is 9773776896
  INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving
 9318140296 used; max is 9773776896
 java.lang.OutOfMemoryError: Java heap space
 Dumping heap to java_pid10913.hprof ...
  INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead.
  INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead.
  INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200
 reclaimed leaving 9334753480 used; max is 9773776896
  INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead.

 Heap dump file created [12730501093 bytes in 253.445 secs]
 ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main]
 java.lang.OutOfMemoryError: Java heap space
        at 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
 ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main]
 java.lang.OutOfMemoryError: Java heap space
        at 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
  INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880
 reclaimed leaving 9335215296 used; max is 9773776896

 Does anyone have any ideas what is going on?



Re: TokenRange contains endpoints without any port information?

2010-08-08 Thread Benjamin Black
On Sun, Aug 8, 2010 at 5:21 AM, Carsten Krebs carsten.kr...@gmx.net wrote:

 I'm wondering why a TokenRange returned by describe_ring(keyspace) of the 
 thrift API just returns endpoints consisting only of an address but omits any 
 port information?
 My first thought was, this method could be used to expose some information 
 about the ring structure to the client, i.e. to do some client side load 
 balancing. But now, I'm not sure about this anymore. Additionally, when 
 looking into the code, I guess the address returned as part of the TokenRange 
 is the address of the storage service which could differ from the thrift 
 address, which in turn would make the returned endpoint useless for the 
 client.

Not just _could_ differ, is _guaranteed_ to differ.  The inter-node
protocol is not Thrift.  The returned endpoint is not useless for the
client: you had to connect to the RPC port to even make the call.  Use
the same port when connecting to the other nodes.  It is bad practice
to have RPC ports differ between nodes in the same cluster.
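For example, a rough sketch of doing that from Ruby; the class names follow the Ruby thrift 0.2.x bindings and the gem's generated CassandraThrift code (an assumption on my part), and you should switch to an unframed or buffered transport if that is what your cluster is configured for:

require 'thrift'
require 'cassandra'   # pulls in the generated CassandraThrift classes

rpc_port  = 9160
socket    = Thrift::Socket.new('127.0.0.1', rpc_port)
transport = Thrift::FramedTransport.new(socket)
protocol  = Thrift::BinaryProtocol.new(transport)
client    = CassandraThrift::Cassandra::Client.new(protocol)
transport.open

# describe_ring returns storage addresses only; reuse the RPC port you
# already connected on when talking to the other nodes.
client.describe_ring('Keyspace1').each do |range|
  endpoints = range.endpoints.map { |addr| "#{addr}:#{rpc_port}" }
  puts "#{range.start_token}..#{range.end_token} -> #{endpoints.join(', ')}"
end
transport.close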

 What is the purpose of this method or respectively why is the port 
 information omitted?


Discovering which nodes are in the ring and which node claims each range.


b


Re: Columns limit

2010-08-07 Thread Benjamin Black
Right, this is an index row per time interval (your previous email was not).

On Sat, Aug 7, 2010 at 11:43 AM, Mark static.void@gmail.com wrote:
 On 8/7/10 11:30 AM, Mark wrote:

 On 8/7/10 4:22 AM, Thomas Heller wrote:

 Ok, I think the part I was missing was the concatenation of the key and
 partition to do the look ups. Is this the preferred way of accomplishing
 needs such as this? Are there alternatives ways?

 Depending on your needs you can concat the row key or use super columns.

 How would one then query over multiple days? Same question for all
 days.
 Should I use range_slice or multiget_slice? And if its range_slice does
 that
 mean I need OrderPreservingPartitioner?

 The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
 '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
 and use multiget_slice.
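A tiny sketch of that (assuming the fauna 'cassandra' gem, whose multi_get wraps multiget_slice; the keyspace name is made up and the CF name is taken from the example schema below):

require 'cassandra'
require 'date'

client = Cassandra.new('Keyspace1', '127.0.0.1:9160')

# Row keys for the last 7 days, generated client-side...
keys = (0...7).map { |i| (Date.today - i).strftime('%Y-%m-%d') }

# ...then fetched in one round trip; no range scan or OPP required.
rows = client.multi_get(:SearchLog, keys)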

 If you want to get all days where a specific ip address had some
 requests you'll just need another CF where the row key is the addr and
 column names are the days (values optional again). Pretty much the
 same all over again, just add another CF and insert the data you need.

 get_range_slice in my experience is better used for offline tasks
 where you really want to process every row there is.

 /thomas

 Ok... as an example using looking up logs by ip for a certain
 timeframe/range would this work?

 <ColumnFamily Name="SearchLog"/>

 <ColumnFamily Name="IPSearchLog"
               ColumnType="Super"
               CompareWith="UTF8Type"
               CompareSubcolumnsWith="TimeUUIDType"/>

 Resulting in a structure like:

 {
  127.0.0.1 : {
       2010080711 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
      2010080712 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
   }
  some.other.ip : {
       2010080711 : {
            uuid1 : 
       }
   }
 }

 Whereas each uuid is the key used for SearchLog.  Is there anything wrong
 with this? I know there is a 2 billion column limit but in this case that
 would never be exceeded because each column represents an hour. However, does
 the above schema imply that for any given IP there can only be a maximum
 of 2GB of data stored?

 Or should I invert the ip with the time slices? The limitation of this seems
 like there can only be 2 billion unique ips per hour which is more than
 enough for our application :)

 {
  2010080711 : {
       127.0.0.1 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
      some.other.ip : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
   }
  2010080712 : {
       127.0.0.1 : {
            uuid1 : 
       }
   }
 }




Re: Columns limit

2010-08-07 Thread Benjamin Black
certainly it matters: your previous version is not bounded on time, so
will grow without bound.  ergo, it is not a good fit for cassandra.

On Sat, Aug 7, 2010 at 2:51 PM, Mark static.void@gmail.com wrote:
 On 8/7/10 2:33 PM, Benjamin Black wrote:

 Right, this is an index row per time interval (your previous email was
 not).

 On Sat, Aug 7, 2010 at 11:43 AM, Markstatic.void@gmail.com  wrote:


 On 8/7/10 11:30 AM, Mark wrote:


 On 8/7/10 4:22 AM, Thomas Heller wrote:


 Ok, I think the part I was missing was the concatenation of the key
 and
 partition to do the look ups. Is this the preferred way of
 accomplishing
 needs such as this? Are there alternatives ways?


 Depending on your needs you can concat the row key or use super
 columns.



 How would one then query over multiple days? Same question for all
 days.
 Should I use range_slice or multiget_slice? And if its range_slice
 does
 that
 mean I need OrderPreservingPartitioner?


 The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
 '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
 and use multiget_slice.

 If you want to get all days where a specific ip address had some
 requests you'll just need another CF where the row key is the addr and
 column names are the days (values optional again). Pretty much the
 same all over again, just add another CF and insert the data you need.

 get_range_slice in my experience is better used for offline tasks
 where you really want to process every row there is.

 /thomas


 Ok... as an example using looking up logs by ip for a certain
 timeframe/range would this work?

 ColumnFamily Name=SearchLog/

 ColumnFamily Name=IPSearchLog
                           ColumnType=Super
                           CompareWith=UTF8Type
                           CompareSubcolumnsWith=TimeUUIDType/

 Resulting in a structure like:

 {
  127.0.0.1 : {
       2010080711 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
      2010080712 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
   }
  some.other.ip : {
       2010080711 : {
            uuid1 : 
       }
   }
 }

 Whereas each uuid is the key used for SearchLog.  Is there anything
 wrong
 with this? I know there is a 2 billion column limit but in this case
 that
 would never be exceeded because each column represents an hour. However
 does
 the above schema imply that for any certain IP there can only be a
 maxium
 of 2GB of data stored?


 Or should I invert the ip with the time slices? The limitation of this
 seems
 like there can only be 2 billion unique ips per hour which is more than
 enough for our application :)

 {
  2010080711 : {
       127.0.0.1 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
      some.other.ip : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
   }
  2010080712 : {
       127.0.0.1 : {
            uuid1 : 
       }
   }
 }




 In the end does it really matter which one to go with? I kind of like the
 previous version so I don't have to build up all the keys for the multi_get
  and instead I can just provide a start and finish for the columns (time
  frames).



Re: Columns limit

2010-08-07 Thread Benjamin Black
Certainly.  There is also a performance penalty to unbounded row
sizes.  That penalty is your nodes OOMing.  I strongly recommend you
abandon that direction.

On Sat, Aug 7, 2010 at 9:06 PM, Mark static.void@gmail.com wrote:
 On 8/7/10 7:04 PM, Benjamin Black wrote:

 certainly it matters: your previous version is not bounded on time, so
 will grow without bound.  ergo, it is not a good fit for cassandra.

 On Sat, Aug 7, 2010 at 2:51 PM, Markstatic.void@gmail.com  wrote:


 On 8/7/10 2:33 PM, Benjamin Black wrote:


 Right, this is an index row per time interval (your previous email was
 not).

 On Sat, Aug 7, 2010 at 11:43 AM, Markstatic.void@gmail.com
  wrote:



 On 8/7/10 11:30 AM, Mark wrote:



 On 8/7/10 4:22 AM, Thomas Heller wrote:



 Ok, I think the part I was missing was the concatenation of the key
 and
 partition to do the look ups. Is this the preferred way of
 accomplishing
 needs such as this? Are there alternatives ways?



 Depending on your needs you can concat the row key or use super
 columns.




 How would one then query over multiple days? Same question for all
 days.
 Should I use range_slice or multiget_slice? And if its range_slice
 does
 that
 mean I need OrderPreservingPartitioner?



 The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
 '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
 and use multiget_slice.

 If you want to get all days where a specific ip address had some
 requests you'll just need another CF where the row key is the addr
 and
 column names are the days (values optional again). Pretty much the
 same all over again, just add another CF and insert the data you
 need.

 get_range_slice in my experience is better used for offline tasks
 where you really want to process every row there is.

 /thomas



 Ok... as an example using looking up logs by ip for a certain
 timeframe/range would this work?

 ColumnFamily Name=SearchLog/

 ColumnFamily Name=IPSearchLog
                           ColumnType=Super
                           CompareWith=UTF8Type
                           CompareSubcolumnsWith=TimeUUIDType/

 Resulting in a structure like:

 {
  127.0.0.1 : {
       2010080711 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
      2010080712 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
   }
  some.other.ip : {
       2010080711 : {
            uuid1 : 
       }
   }
 }

 Whereas each uuid is the key used for SearchLog.  Is there anything
 wrong
 with this? I know there is a 2 billion column limit but in this case
 that
 would never be exceeded because each column represents an hour.
 However
 does
 the above schema imply that for any certain IP there can only be a
 maxium
 of 2GB of data stored?



 Or should I invert the ip with the time slices? The limitation of this
 seems
 like there can only be 2 billion unique ips per hour which is more than
 enough for our application :)

 {
  2010080711 : {
       127.0.0.1 : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
      some.other.ip : {
            uuid1 : 
            uuid2: 
            uuid3: 
       }
   }
  2010080712 : {
       127.0.0.1 : {
            uuid1 : 
       }
   }
 }





 In the end does it really matter which one to go with? I kind of like the
 previous version so I don't have to build up all the keys for the
 multi_get
 and instead I can just provide and start  finish for the columns (time
 frames).



 Is there any performance penalty for a multi_get that includes x keys versus
 a get on 1 key with a start/finish range of x?

 Using your gem,

 multi_get(SearchLog, [20090101...20100807], 127.0.0.1)
 vs
 get(SearchLog, 127.0.0.1, :start => 20090101, :finish => 127.0.0.1)

 Thanks



Re: Question on load balancing in a cluster

2010-08-06 Thread Benjamin Black
Yes, imo, it should be renamed.

On Fri, Aug 6, 2010 at 10:10 AM, Bill Au bill.w...@gmail.com wrote:
 If nodetool loadbalance does not do what its name implies, should it be
 renamed or maybe even removed altogether since the recommendation is to
 _never_ use it in production?

 Bill

 On Thu, Aug 5, 2010 at 6:41 AM, aaron morton aa...@thelastpickle.com
 wrote:

 This comment from Ben Black may help...

 I recommend you _never_ use nodetool loadbalance in production because
 it will _not_ result in balanced load.  The correct process is manual
 calculation of tokens (the algorithm for RP is on the Operations wiki
 page) and nodetool move.
 
 http://www.mail-archive.com/user@cassandra.apache.org/msg04933.html

 So the recommendation is to manually set initial tokens and then manually
 move them.
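For RandomPartitioner the usual formula is that the i-th node gets token i * (2**127 / N); a quick sketch:

node_count = 4   # example cluster size
tokens = (0...node_count).map { |i| i * (2**127 / node_count) }
tokens.each_with_index { |t, i| puts "node #{i}: initial token #{t}" }
# Apply by setting the initial token in each new node's config, or with
# nodetool move <token> for existing nodes.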

 As for the need to decommission I'm guessing it's for reasons such as
 making it easier to avoid overlapping tokens and to avoid accepting writes
 that will soon be moved.

 Others may be able to add more.

 Aaron


 On 5 Aug 2010, at 14:49, anand_s wrote:

 
  Hi,
 
  Have some thoughts on load balancing on current / new nodes. I have come
  across some posts around this, but not sure of what is being finally
  proposed, so..
 
  From what I have read, a nodebalance on a node does a decommission and
  bootstrap of that node. Is there a reason why it is that way
  (decommission
  and bootstrap) and not just a simple look at my next neighbor and just
  split
  the load with it? As in if the ring has nodes A, B, C and D with load
  (in
  GB) on these respectively is 100, 70, 100, 80. Then a nodetool balance
  on B
  should result in 100, 85, 85, 80 (some tokens move from C to B). It is
  still
  manual but data movement is only what is needed – 15 GB instead of the
  100+GB (decommission and bootstrap) . The idea is not to get a perfect
  balance, but an acceptable balance with less data movement.
 
  Also when a new node is added, it takes 50% from the most loaded node.
  Don't
  we want to rebalance such that the load is more or less evenly
  distributed
  across the cluster? Would it not help if I could just specify the % load
  as
  a parameter to rebalance command, so that I can optimize the moment of
  data
  for rebalancing. E.g. A,B,C,E is a cluster with load being 80, 78, 83,
  84.
  Now I add a new node D (position will be before E), so eventually after
  all
  the rebalance activity I want the load to be ~66 (245/5) . Now to
  minimize
  the movement of data and still get a good balance, we move only what is
  needed (so data sort of flows from more to less loaded nodes until
  balanced). This could be a manual process (I am basically suggesting a
  similar approach as in paragraph one).
 
  Another thought is that instead of using pure current usage on a node to
  determine load, shouldn't there be higher level concept like node
  weight
  to handle heterogeneous nodes or is the expectation that all nodes are
  more
  or less equal?
 
 
  Thanks
  Anand
  --
  View this message in context:
  http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Question-on-load-balancing-in-a-cluster-tp5375140p5375140.html
  Sent from the cassandra-u...@incubator.apache.org mailing list archive
  at Nabble.com.





Re: one question about cassandra write

2010-08-06 Thread Benjamin Black
On Fri, Aug 6, 2010 at 12:51 PM, Maifi Khan maifi.k...@gmail.com wrote:
 Hi
 I have a question about the internal of cassandra write.
 Say, I already have the following in the database -
 (row_x,col_y,val1)

 Now if I try to insert
 (row_x,col_y,val100), what will happen?
 Will it overwrite the old data?
 I mean, will it overwrite the data physically or will it keep both the
 old version and the new version of the data?

Assuming the old version is already on disk in an SSTable, the new
version will not overwrite it, and both versions will be in the
system.  A compaction will remove the old version, however.

 If the later is the case, can I retrieve the old version of the data?


No.  And no, there is no plan to add that functionality.  If it is
needed it is simple to emulate in a variety of ways with the current
feature set.

This is recommended reading:
http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/


b


Re: set ReplicationFactor and Token at Column Family/SuperColumn level.

2010-08-06 Thread Benjamin Black
Additional keyspaces have very little overhead (unlike CFs).

On Fri, Aug 6, 2010 at 9:42 AM, Zhong Li z...@voxeo.com wrote:

 If I create 3-4 keyspaces, will this impact performance and resources (esp.
 memory and disk I/O) too much?

 Thanks,

 Zhong

 On Aug 5, 2010, at 4:52 PM, Benjamin Black wrote:

 On Thu, Aug 5, 2010 at 12:59 PM, Zhong Li z...@voxeo.com wrote:

 The big thing bothering me is the initial ring token. We have some Column
 Families.
 It is very hard to choose one token suitable for all CFs. Also some
 Column
 Families need a higher Consistency Level and some don't. If we set

 Consistency Level is set by clients, per request.  If you require
 different _Replication Factors_ for different CFs, then just put them
 in different keyspaces.  Additional keyspaces have very little
 overhead (unlike CFs).

 ReplicationFactor too high, it is too costly for cross-datacenter replication,
 especially to the other side of the world.

 I know we can setup multiple rings, but it costs more hardware.

 If Cassandra could implement Ring, Token and RF at the CF level, or even the
 SuperColumn level, it would make design much easier and more efficient.

 Is it possible?


 The approach I described above is what you can do.  The rest of what
 you asked is not happening.



 b




Re: How to migrate any relational database to Cassandra

2010-08-06 Thread Benjamin Black
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
http://www.slideshare.net/benjaminblack/cassandra-basics-indexing

On Fri, Aug 6, 2010 at 11:42 AM, sonia gehlot sonia.geh...@gmail.com wrote:
 Thanks for reply,

 I am sorry It seems my question comes out wrong..

 * My question is: what considerations should I keep in mind to migrate
 to Cassandra?

 * Like we do in ETL: to extract data from a source we write a query and then
 load it into our database after applying the desired transformations. How can we
 do this if we want to extract data from MySQL and load it into Cassandra?

 * I think I can write script for these kind of stuff but do anyone have any
 example script?

 * What kind of setup I need to do this?

 -Sonia


 On Fri, Aug 6, 2010 at 11:24 AM, Michael Dürgner mich...@duergner.de
 wrote:

 In my opinion it's the wrong approach to ask how to migrate from
 MySQL to Cassandra from a database-level view. The lack of joins in NoSQL
 should lead you to think about what you want to get out of your persistent storage
 and afterwards think about how to migrate and, most of the time, how to
 denormalize the data you have in order to insert it into a NoSQL storage
 like Cassandra.

 Simply migrating the data and moving the joins up to the application
 level might work in the beginning but most of the time doesn't scale in the
 end.

 Am 06.08.2010 um 20:00 schrieb sonia gehlot:

  Hi All,
 
 
  A little background about myself: I am an ETL engineer who has worked only with
  relational databases.
 
  I have been reading and trying Cassandra for 3-4 weeks. I kind of
  understood the Cassandra data model, its structure, nodes etc. I also
  installed
  Cassandra and played around with it, like
 
    cassandra> set Keyspace1.Standard2['jsmith']['first'] = 'John'
    Value inserted.
    cassandra> get Keyspace1.Standard2['jsmith']
      (column=first, value=John; timestamp=1249930053103)
    Returned 1 rows.
 
  But don't know what to do next? Like if someone says me this is MySQL
  database migrate it to cassandra, then I dont know what should be my next
  step?
 
  Can you please help me how to move forward? How should I do all the
  setup for this?
 
  Any help is appreciated.
 
  Thanks,
  Sonia
 





Re: Cassandra 0.7 Ruby/Thrift Bindings

2010-08-06 Thread Benjamin Black
Ryan,

I believe my branch was merged into fauna some time ago by jmhodges.
However, 0.7 support must be explicitly enabled by require
'cassandra/0.7' as it currently defaults to 0.6.
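In other words, something like this (the server address is just an example) when talking to a 0.7 cluster:

require 'cassandra/0.7'   # without this the gem speaks the 0.6 protocol

client = Cassandra.new('system', '127.0.0.1:9160')
puts client.keyspaces.inspect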


b

On Fri, Aug 6, 2010 at 10:02 AM, Ryan King r...@twitter.com wrote:
 On Fri, Aug 6, 2010 at 9:57 AM, Mark static.void@gmail.com wrote:
 Wow.. fast answer AND correct. In Cassandra.yml

 # Frame size for thrift (maximum field length).
 # 0 disables TFramedTransport in favor of TSocket.
 thrift_framed_transport_size_in_mb: 15

 I just had to change that value to 0 and everything worked. Now for my
 follow up question :)  What is the difference between these two and why does
 0.7 default to true while earlier versions default to false? Thanks again!

 Ah, you're using 0.7. fauna-cassandra has not been updated for 0.7.
 There's an experimental branch for it here:
 http://github.com/b/cassandra/tree/0.7

 -ryan



Re: Columns limit

2010-08-06 Thread Benjamin Black
Yes, it is common to create distinct CFs for indices.
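A minimal sketch of that shape (fauna 'cassandra' gem assumed; the keyspace, the CF names LogRecords and LogByRemoteAddrAndDate, and the day format are all illustrative, and both CFs are assumed to compare columns with TimeUUIDType), answering "all logs for this address over the last X days" from an index CF:

require 'cassandra'
require 'date'
require 'json'

client = Cassandra.new('Logs', '127.0.0.1:9160')

# Write the record once into the per-day row, and write its TimeUUID into
# an index row keyed by "remote_addr:day" with an empty value.
def log_request(client, record)
  uuid = SimpleUUID::UUID.new
  day  = Time.now.utc.strftime('%Y%m%d')
  client.insert(:LogRecords, day, { uuid => record.to_json })
  client.insert(:LogByRemoteAddrAndDate, "#{record['remote_addr']}:#{day}", { uuid => '' })
end

# Collect the TimeUUIDs for one address over the last N days; the full
# records can then be fetched from LogRecords by those column names.
def uuids_for_addr(client, addr, days = 7)
  (0...days).map do |i|
    day = (Date.today - i).strftime('%Y%m%d')
    client.get(:LogByRemoteAddrAndDate, "#{addr}:#{day}").keys
  end.flatten
end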

On Fri, Aug 6, 2010 at 4:40 PM, Software Dev static.void@gmail.com wrote:

 Thanks for the suggestion.

 I somewhat understand all that; the point where my head begins to explode
 is when I want to figure out something like:

 Continuing with your example: Over the last X amount of days give me all
 the logs for remote_addr:XXX.
 I'm guessing I would need to create a separate index ColumnFamily???

 On Fri, Aug 6, 2010 at 4:32 PM, Thomas Heller i...@zilence.net wrote:

 Howdy,

 Thought I'd jump in here. I did something similar, meaning I had lots of
 items coming in per day and wanted to somehow partition them to avoid
 running into the column limit (it was also logging related). Solution
 was pretty simple, log data is immutable, so no SuperColumn needed.

 ColumnFamily Standard: LogRecords, CompareWith=TimeUUIDType

 Row Key 20100806:
  Column Name: TimeUUID.new Value: JSON({'remote_addr':...,
 'user_agent':, 'url':)
  ..., more Columns

 In my case I chose to partition by day, if you are getting too many
 columns per day, just get hours in there. If you want an extra
 seperation level (foo, bar) in your example you could either go for a
 SuperColumn or just adjust your row key accordingly (eg.
 foo:20100806)

 HTH,
 /thomas




Re: Columns limit

2010-08-06 Thread Benjamin Black
Same answer as on other thread right now about how to index:

http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
http://www.slideshare.net/benjaminblack/cassandra-basics-indexing

On Fri, Aug 6, 2010 at 6:18 PM, Mark static.void@gmail.com wrote:
 On 8/6/10 4:50 PM, Thomas Heller wrote:

 Thanks for the suggestion.

 I've somewhat understand all that, the point where my head begins to
 explode
 is when I want to figure out something like

 Continuing with your example: Over the last X amount of days give me all
 the logs for remote_addr:XXX.
 I'm guessing I would need to create a separate index ColumnFamily???



 Depending on your needs you can either insert them directly or pull
 them out later in some map/reduce fashion. What you want is another
 column Family and a similar structure.

 ColumnFamily Standard LogByRemoteAddrAndDate CompareWith: TimeUUID

 Row: 127.0.0.1:20100806 Column TimeUUID/JSON as usual. If you want
 to link to the actual log record (to avoid writing it multiple
 times) just insert the same timeuuid you inserted into the other CF
 and leave the value empty. So you have your Index, aka list of
 column names, and you can look up the actual values using get_slice
 with column_names.

 Confusing at first, but really quite simple once you get used to the
 idea. Just a lot more work than letting SQL do it for you. ;)

 HTH,
 /thomas


 Ok, I think the part I was missing was the concatenation of the key and
 partition to do the look ups. Is this the preferred way of accomplishing
 needs such as this? Are there alternatives ways?

 How would one then query over multiple days? Same question for all days.
 Should I use range_slice or multiget_slice? And if its range_slice does that
 mean I need OrderPreservingPartitioner?





Re: when should new nodes be added to a cluster

2010-08-02 Thread Benjamin Black
you have insufficient i/o bandwidth and are seeing reads suffer due to
competition from memtable flushes and compaction.  adding additional
nodes will help some, but i recommend increasing the disk i/o
bandwidth, regardless.


b

On Mon, Aug 2, 2010 at 11:47 AM, Artie Copeland yeslinux@gmail.com wrote:
 I have a question about what the signs are from Cassandra that new nodes should
 be added to the cluster.  We are currently seeing long read times from the
 one node that has about 70GB of data with 60GB in one column family.  We are
 using a replication factor of 3.  I have tracked the slowness down to occurring when
 either row-read-stage or message-deserializer-pool is high, like at least
 4000.  My systems are 16-core, 3 TB, 48GB-memory servers.  We would like to be
 able to use more of the server than just 70GB.
 The system is a realtime system that needs to scale quite large.  Our
 current heap size is 25GB and we are getting at least 50% row cache hit rates.
  Does it seem strange that Cassandra is not able to handle the workload?
  We perform multislice gets when reading, similar to what twissandra does.  This
 is to cut down on the network ops.  Looking at iostat it doesn't appear to
 have a lot of queued reads.
 What are others seeing when they have to add new nodes?  What data sizes are
 they seeing?  This is needed so we can plan our growth and server purchase
 strategy.
 thanx
 Artie

 --
 http://yeslinux.org
 http://yestech.org



Re: Cassandra cookbook for Chef

2010-08-02 Thread Benjamin Black
Correct, it is on its own branch.

On Mon, Aug 2, 2010 at 9:08 AM, Sal Fuentes fuente...@gmail.com wrote:
 I'm guessing its the cassandra branch from that repo. You can find it here:
 http://github.com/b/cookbooks/tree/cassandra

 On Mon, Aug 2, 2010 at 1:49 AM, Boris Shulman shulm...@gmail.com wrote:

 I can't find this cookbook anymore at the specified URL. Where can I find
 it?

 On Tue, Mar 16, 2010 at 6:40 AM, Benjamin Black b...@b3k.us wrote:
  I've just pushed a rough but useful chef cookbook for Cassandra:
  http://github.com/b/cookbooks/tree/master/cassandra
 
  It is lacking in documentation and assumes you have a Cassandra
  package handy to install.  I'd really appreciate if folks could try it
  out and give feed back (or, even better, patches to improve it).
 
 
  b
 



 --
 Salvador Fuentes Jr.



Re: Columns limit

2010-07-31 Thread Benjamin Black
The proper way to handle this is to have a row per time interval such
that the number of columns per row is constrained.

On Thu, Jul 29, 2010 at 2:39 PM, Mark static.void@gmail.com wrote:
 Are there any limitations on the number of columns a row can have? Does all
 the data for a single key need to reside on a single host? If so, wouldn't
 that mean there is an implicit limit on the number of columns one can
 have... i.e. the disk size of that machine?

 What is the proper way to handle timelines in this manner? For example, let's
 say I wanted to store all user searches in a super column.

 <ColumnFamily Name="SearchLogs"
               ColumnType="Super"
               CompareWith="TimeUUIDType"
               CompareSubcolumnsWith="BytesType"/>

 Which results in a structure as follows
 {
   SearchLogs : {
       foo : {
            timeuuid_1 : { metadata goes here}
            timeuuid_2: { metadata goes here}
       },
       bar : {
            timeuuid_1 : { metadata goes here}
            timeuuid_2: { metadata goes here}
       }
  }
 }

 Couldn't this theoretically run out of columns for the same search term
 because for each unique term there can (and will) be many timeuuid columns?

 Thanks for clearing this up for me.




Re: Unreliable transport layer

2010-07-31 Thread Benjamin Black
Because it is extremely well-understood, handles a lot of the
reliability needs itself, and nothing more is required for the
application.

On Thu, Jul 29, 2010 at 7:02 PM, ChingShen chingshenc...@gmail.com wrote:
 Why? What reasons did you choose TCP?

 Shen

 On Sat, Mar 6, 2010 at 9:15 AM, Jonathan Ellis jbel...@gmail.com wrote:

 In 0.6 gossip is over TCP.

 On Fri, Mar 5, 2010 at 6:54 PM, Ashwin Jayaprakash
 ashwin.jayaprak...@gmail.com wrote:
  Hey guys! I have a simple question. I'm a casual observer, not a real
  Cassandra user yet. So, excuse my ignorance.
 
  I see that the Gossip feature uses UDP. I was curious to know if you
  guys
  faced issues with unreliable transports in your production clusters?
  Like
  faulty switches, dropped packets etc during heavy network loads?
 
  If I'm not mistaken are all client reads/writes doing point-to-point
  over
  TCP?
 
  Thanks,
  Ashwin.
 
 
 






Re: Columns limit

2010-07-31 Thread Benjamin Black
Have the TimeUUID as the key, and then index rows named for the time
intervals, each containing columns with TimeUUID names giving the data
in those intervals.
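A compact sketch of that layout (fauna 'cassandra' gem assumed; here SearchLogs is taken as a standard CF keyed by the TimeUUID, and SearchLogsByDay is an illustrative index CF compared with TimeUUIDType):

require 'cassandra'

client = Cassandra.new('Keyspace1', '127.0.0.1:9160')

# One row per search, keyed by a TimeUUID; one index row per day whose
# column names are the TimeUUIDs created that day (values left empty).
uuid = SimpleUUID::UUID.new
day  = Time.now.utc.strftime('%Y%m%d')
client.insert(:SearchLogs, uuid.to_guid, { 'term' => 'foo', 'user' => 'jsmith' })
client.insert(:SearchLogsByDay, day, { uuid => '' })

# Read back today's searches: index row first, then the full rows.
todays_uuids = client.get(:SearchLogsByDay, day).keys
todays_rows  = client.multi_get(:SearchLogs, todays_uuids.map { |u| u.to_guid })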

On Sat, Jul 31, 2010 at 5:13 PM, Mark static.void@gmail.com wrote:
 So have the TimeUUID as the key?

 SearchLogs : {
    TimeUUID_1 : { metadata goes here},
    TimeUUID_2 : { metadata goes here},
    TimeUUID_3 : { metadata goes here},
    ...
 }

 On 7/31/10 3:42 PM, Benjamin Black wrote:

 The proper way to handle this is to have a row per time interval such
 that the number of columns per row is constrained.

 On Thu, Jul 29, 2010 at 2:39 PM, Markstatic.void@gmail.com  wrote:


 Is there any limitations on the number of columns a row can have? Does
 all
 the day for a single key need to reside on a single host? If so, wouldn't
 that mean there is an implicit limit on the number of columns one can
 have... ie the disk size of that machine.

 What is the proper way to handle timelines in this matter. For example
 lets
 say I wanted to store all user searches in a super column.

 ColumnFamily Name=SearchLogs
                    ColumnType=Super
                    CompareWith=TimeUUIDType
                    CompareSubcolumnsWith=BytesType/

 Which results in a structure as follows
 {
   SearchLogs : {
       foo : {
            timeuuid_1 : { metadata goes here}
            timeuuid_2: { metadata goes here}
       },
       bar : {
            timeuuid_1 : { metadata goes here}
            timeuuid_2: { metadata goes here}
       }
  }
 }

 Couldn't this theoretically run out of columns for the same search term
 because for each unique term there can (and will) be many timeuuid
 columns?

 Thanks for clearing this up for me.







Re: Consequences of Cassandra key NOT unique

2010-07-29 Thread Benjamin Black
You are both confusing columns with rows.  Columns have timestamps,
row keys do not.

On Wed, Jul 28, 2010 at 11:37 PM, Thorvaldsson Justus
justus.thorvalds...@svenskaspel.se wrote:
 You insert 500 rows with key “x”

 And 1000 rows with key “y”

 You make a query getting all rows.

 It will only show two rows, the ones with the latest timestamps.

 /Justus



 From: Rana Aich [mailto:aichr...@gmail.com]
 Sent: 29 July 2010 08:23
 To: user@cassandra.apache.org
 Subject: Re: Consequences of Cassandra key NOT unique



 Thanks for your reply! I thought in that case a new row would be inserted
 with a new timestamp and Cassandra would report the new row. But how will this
 affect my range query?

 It would not affect it.



 On Wed, Jul 28, 2010 at 7:03 PM, Benjamin Black b...@b3k.us wrote:

 If you write new data with a key that is already present, the existing
 columns are overwritten or new columns are added.  There is no way to
 cause a duplicate key to be inserted.

 On Wed, Jul 28, 2010 at 6:16 PM, Rana Aich aichr...@gmail.com wrote:
 Hello,
 I was wondering what the pitfalls may be in Cassandra when the key value is
 not
 UNIQUE?
 Will it affect the range query performance?
 Thanks and regards,
 raich






Re: Cassandra vs MongoDB

2010-07-28 Thread Benjamin Black
They have approximately nothing in common.  And, no, Cassandra is
definitely not dying off.

On Tue, Jul 27, 2010 at 8:14 AM, Mark static.void@gmail.com wrote:
 Can someone quickly explain the differences between the two? Other than the
 fact that MongoDB supports ad-hoc querying I don't know what's different. It
 also appears (using Google Trends) that MongoDB seems to be growing while
 Cassandra is dying off. Is this the case?

 Thanks for the help



Re: Consequences of Cassandra key NOT unique

2010-07-28 Thread Benjamin Black
If you write new data with a key that is already present, the existing
columns are overwritten or new columns are added.  There is no way to
cause a duplicate key to be inserted.

On Wed, Jul 28, 2010 at 6:16 PM, Rana Aich aichr...@gmail.com wrote:
 Hello,
 I was wondering what the pitfalls may be in Cassandra when the key value is not
 UNIQUE?
 Will it affect the range query performance?
 Thanks and regards,
 raich




Re: Quick Poll: Server names

2010-07-27 Thread Benjamin Black
[role][sequence].[airport code][sequence].[domain].[tld]

