Re: how to implement a client with off-heap memory

2012-10-29 Thread aaron morton
The Thrift client is just auto-generated code; if you really wanted to, you
could change / override it to modify the SerDe when it pulls things off
the wire.

Not sure if this does what you are looking for 
https://issues.apache.org/jira/browse/CASSANDRA-2478
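One way to approximate the row cache's serializing cache from client code,
without touching the generated Thrift classes, is to copy each value into a
direct ByteBuffer as soon as it has been deserialized, so the bulk of the data
does not sit on the GC-managed heap. A minimal sketch (class and method names
are hypothetical):

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical off-heap value store, similar in spirit to the row cache's
// SerializingCacheProvider: values live in direct ByteBuffers, so only the
// small key map stays on the JVM heap.
public class OffHeapStore
{
    private final Map<String, ByteBuffer> values = new HashMap<String, ByteBuffer>();

    public void put(String key, byte[] value)
    {
        ByteBuffer buf = ByteBuffer.allocateDirect(value.length);
        buf.put(value);
        buf.flip();
        values.put(key, buf);
    }

    public byte[] get(String key)
    {
        ByteBuffer buf = values.get(key);
        if (buf == null)
            return null;
        byte[] copy = new byte[buf.remaining()];
        buf.duplicate().get(copy); // duplicate() so the stored position is untouched
        return copy;
    }
}

Note the values still pass through the heap briefly during Thrift
deserialization; avoiding even that transient allocation is what would require
overriding the generated SerDe as described above.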


Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/10/2012, at 4:59 PM, Manu Zhang owenzhang1...@gmail.com wrote:

 Hi all,
 
 I've been writing a client on the Cassandra Thrift API. The client reads
 almost 1 GB of data into the JVM heap, so its performance suffers from GC
 pauses.
 To reduce latency, I'm currently thinking about implementing an off-heap
 store (just like that of the RowCache) to hold the data and manage it myself.
 The problem is that with the Thrift API I read all the data as List<KeySlice>
 directly into the heap.
 Is there a workaround? Any other suggestions would also be appreciated.
 Thanks!
 Thanks!
 



Re: High bandwidth usage between datacenters for cluster

2012-10-29 Thread aaron morton
Outbound messages for other DCs are grouped, and a single instance is sent to a
single node in the remote DC. The remote node then forwards the message on to
the other recipients in its DC. All remote DC nodes will, however, reply
directly to the coordinator.
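As a rough sanity check for the numbers below (illustrative, assuming RF = 3
in the remote DC): with this forwarding scheme, writing W MB/s should cost
roughly W MB/s of outbound WAN traffic (one copy per remote DC) plus small
replies coming back. Seeing about 3 x W is what you would expect if each
remote replica received its own copy from the coordinator, e.g. when the
snitch or replication strategy is not actually DC-aware.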

 Normally this isn’t an issue for us, but at times we are writing 
 approximately 1MB a sec of data, and seeing a corresponding 3MB of traffic 
 across the WAN to all the Cassandra DR servers.

Can you break the traffic down by port and direction ?

Cheers



-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 28/10/2012, at 12:18 PM, Bryce Godfrey bryce.godf...@azaleos.com wrote:

 Network topology with the topology file filled out is already the 
 configuration we are using. 
  
 From: sankalp kohli [mailto:kohlisank...@gmail.com] 
 Sent: Thursday, October 25, 2012 11:55 AM
 To: user@cassandra.apache.org
 Subject: Re: High bandwidth usage between datacenters for cluster
  
 Use placement_strategy =
 'org.apache.cassandra.locator.NetworkTopologyStrategy' and also fill in the
 cassandra-topology.properties file. This will tell Cassandra that you have two
 DCs. You can verify that by looking at the output of the ring command.
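 For reference, a minimal sketch of the two pieces (1.1-era cassandra-cli
 syntax; the keyspace name, IPs, DC and rack names are made up):

 create keyspace MyKS
   with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
   and strategy_options = {DC1:3, DC2:3};

 And in cassandra-topology.properties (read by the PropertyFileSnitch):

 192.168.1.10=DC1:RAC1
 192.168.1.11=DC1:RAC1
 10.20.30.10=DC2:RAC1
 10.20.30.11=DC2:RAC1
 default=DC1:RAC1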
  
 If your DCs are set up properly, only one copy of each write will go over the
 WAN, though the responses from all nodes in the other DC will go over the WAN.
  
 On Thu, Oct 25, 2012 at 10:44 AM, Bryce Godfrey bryce.godf...@azaleos.com 
 wrote:
 We have a 5 node cluster, with a matching 5 nodes for DR in another data 
 center.   With a replication factor of 3, does the node I send a write too 
 attempt to send it to the 3 servers in the DR also?  Or does it send it to 1 
 and let it replicate locally in the DR environment to save bandwidth across 
 the WAN?
 Normally this isn’t an issue for us, but at times we are writing 
 approximately 1MB a sec of data, and seeing a corresponding 3MB of traffic 
 across the WAN to all the Cassandra DR servers.
  
 If my assumptions are right, is this configurable somehow for writing to one 
 node and letting it do local replication?  We are on 1.1.5
  
 Thanks



Re: Roadmap/Changelog?

2012-10-29 Thread aaron morton
For committed changes https://github.com/apache/cassandra/blob/trunk/CHANGES.txt

For interesting changes per release
https://github.com/apache/cassandra/blob/trunk/NEWS.txt

For the road map 
https://issues.apache.org/jira/browse/CASSANDRA#selectedTab=com.atlassian.jira.plugin.system.project%3Aroadmap-panel

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 28/10/2012, at 11:56 AM, Timmy Turner timm.t...@gmail.com wrote:

 Hi everyone,
 
 I wrote a library/extension for Cassandra 0.8 a while back and would
 like to update it to the current version now; however, I can't really
 find any articles on what has changed in Cassandra. I read the
 changelog, but those entries are too detailed, and it's hard to
 determine what impact they really have on the functionality.

 The last things I remember are that CQL v3 was scheduled for 1.1 and
 supercolumns would be removed and replaced by compound columns (and
 included in CQL). Has that already happened?
 
 Also it would be interesting to know whether there is any kind of
 roadmap for Cassandra for new features or functionality that may be
 introduced in upcoming versions, or features that may be removed in
 future versions.
 
 
 Thanks!



Re: compression

2012-10-29 Thread Tamar Fraenkel
Hi!
Thanks Aaron!
Today I restarted Cassandra on that node and ran scrub again, and now it is
fine.

I am worried though that if I decide to change another CF to use
compression I will have that issue again. Any clue how to avoid it?

Thanks.

Tamar Fraenkel
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Wed, Sep 26, 2012 at 3:40 AM, aaron morton aa...@thelastpickle.comwrote:

 Check the logs on nodes 2 and 3 to see if the scrub started. The logs on
 node 1 will be a good help with that.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 24/09/2012, at 10:31 PM, Tamar Fraenkel ta...@tok-media.com wrote:

 Hi!
 I ran
 UPDATE COLUMN FAMILY cf_name WITH
 compression_options={sstable_compression:SnappyCompressor,
 chunk_length_kb:64};

 I then ran on all my nodes (3)
 sudo nodetool -h localhost scrub tok cf_name

 I have replication factor 3. The size of the data on disk was cut in half
 on the first node, and in JMX I can see that the compression
 ratio is indeed 0.46. But on nodes 2 and 3 nothing happened: in JMX I can see
 that the compression ratio is 0 and the size of the files on disk stayed the
 same.

 In the CLI:

 ColumnFamily: cf_name
   Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
   Default column value validator:
 org.apache.cassandra.db.marshal.UTF8Type
   Columns sorted by:
 org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
   Row cache size / save period in seconds / keys to save : 0.0/0/all
   Row Cache Provider:
 org.apache.cassandra.cache.SerializingCacheProvider
   Key cache size / save period in seconds: 20.0/14400
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 1.0
   Replicate on write: true
   Bloom Filter FP chance: default
   Built indexes: []
   Compaction Strategy:
 org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
   Compression Options:
 chunk_length_kb: 64
 sstable_compression:
 org.apache.cassandra.io.compress.SnappyCompressor

 Can anyone help?
 Thanks

  Tamar Fraenkel
 Senior Software Engineer, TOK Media


 ta...@tok-media.com
 Tel:   +972 2 6409736
 Mob:  +972 54 8356490
 Fax:   +972 2 5612956





 On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel ta...@tok-media.comwrote:

 Thanks all, that helps. Will start with one - two CFs and let you know
 the effect


 Tamar Fraenkel
 Senior Software Engineer, TOK Media


 ta...@tok-media.com
 Tel:   +972 2 6409736
 Mob:  +972 54 8356490
 Fax:   +972 2 5612956





 On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean dean.hil...@nrel.govwrote:

 Also, your unlimited column names may all have the same prefix,
 right? Like accounts.rowkey56, accounts.rowkey78, etc., so the
 accounts prefix gets a ton of compression.

 Later,
 Dean

 From: Tyler Hobbs ty...@datastax.com
 Reply-To: user@cassandra.apache.org
 Date: Sunday, September 23, 2012 11:46 AM
 To: user@cassandra.apache.org
 Subject: Re: compression

  column metadata, you're still likely to get a reasonable amount of
 compression.  This is especially true if there is some amount of repetition
 in the column names, values, or TTLs in wide rows.  Compression will almost
 always be beneficial unless you're already somehow CPU bound or are using
 large column values that are high in entropy, such as pre-compressed or
 encrypted data.






Re: Hinted Handoff storage inflation

2012-10-29 Thread aaron morton
 With both data centers functional, the test takes just a few minutes to run, 
 with one data center down, 15x the amount of time.
Could you provide the numbers? It's easier to get a feel for how the throughput
is dropping. Does the latency reported by nodetool cfstats change?
I'm also interested to know how long hints were collected for.

Each coordinator will be writing three hints, which will be slowing down the 
other writes it needs to do. 

 but I found that the storage overhead was the same regardless of the size of 
 the batch mutation (i.e., 5 vs 25 mutations made no difference).
Batch size makes no difference. Each row mutation is treated as an individual 
command, the batch is simply a way to reduce network calls. 

 Each write is new data only (no overwrites). Each mutation adds a row to one 
 column family with a column containing about ~100 bytes of data and a new row 
 to another column family with a SuperColumn containing 2x17KiB payloads.
I cannot remember anyone raising this sort of issue about HH before. It may be 
that no one has looked at how that level of hints is handled. 
Could you reproduce the problem with a smaller test case ? 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 27/10/2012, at 7:56 AM, Mattias Larsson mlars...@yahoo-inc.com wrote:

 
 On Oct 24, 2012, at 6:05 PM, aaron morton wrote:
 
 Hints store the columns, row key, KS name and CF id(s) for each mutation to
 each node, whereas an executed mutation will store the most recent columns
 collated with others under the same row key. So depending on the type of
 mutation, hints will take up more space.
 
 The worst case would be lots of overwrites. After that, writing a small
 amount of data to many rows would result in a lot of the serialised space
 being devoted to row keys, the KS name and CF ids.

 16 GB is a lot though. What was the write workload like ?
 
 Each write is new data only (no overwrites). Each mutation adds a row to one
 column family with a column containing about 100 bytes of data and a new row
 to another column family with a SuperColumn containing 2x17KiB payloads.
 These are sent in batches with several in them, but I found that the storage 
 overhead was the same regardless of the size of the batch mutation (i.e., 5 
 vs 25 mutations made no difference). A total of 1,000,000 mutations like 
 these are sent over the duration of the test.
 
 
 You can get an estimate on the number of keys in the Hints CF using nodetool 
 cfstats. Also some metrics in the JMX will tell you how many hints are 
 stored. 
 
 This has a huge impact on write performance as well.
 Yup. Hints are added to the same Mutation thread pool as normal mutations. 
 They are processed async to the mutation request but they still take 
 resources to store. 
 
 You can adjust how long hints are collected for with max_hint_window_in_ms in
 the yaml file.
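 For reference, that is a single setting in cassandra.yaml (the value here is
 illustrative, not a recommendation):

 max_hint_window_in_ms: 3600000   # stop storing hints for a node that has
                                  # been unreachable for more than 1 hour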
 
 How long did the test run for ? 
 
 
 With both data centers functional, the test takes just a few minutes to run, 
 with one data center down, 15x the amount of time.
 
 /dml
 
 



Re: compression

2012-10-29 Thread Alain RODRIGUEZ
I have no clue; I have never done it myself, though I am planning to.

1 - Did you just spend a month with a cluster in an unstable state? Did you
have any issues during this time related to the transitional state of your
cluster?

I am currently storing counters with:
row = objectId, column name = date#event, data = counter (date format
20121029).

2 - Is it a good idea to compress this kind of data?

I am looking into using composite columns.

3 - What are the benefits of a column name like CompositeType(UTF8Type,
UTF8Type) over a simple UTF8 column name with event and date separated by a
'#', as I am doing right now? (A sketch of both options follows this
message.)

4 - Would compression be a good idea in this case?

Thanks for your help on any of these 4 points :).

Alain
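A sketch of the two modelings from question 3 (cassandra-cli syntax; the CF
names are hypothetical). Option A packs both parts into one UTF8 name that the
client splits on '#':

create column family counters_flat
  with comparator = UTF8Type
  and default_validation_class = CounterColumnType;

Option B uses a composite comparator, so each component stays typed:

create column family counters_comp
  with comparator = 'CompositeType(UTF8Type,UTF8Type)'
  and default_validation_class = CounterColumnType;

The practical difference is that with the composite, the server can slice on
the first component (e.g. all events for one date) without any string parsing,
and each component keeps its own type and sort order.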



CQL3: Unknown property 'comparator'?

2012-10-29 Thread Timmy Turner
Does CQL3 not allow dynamic columns (column names) any more?


Re: CQL3: Unknown property 'comparator'?

2012-10-29 Thread Sylvain Lebresne
CQL3 does absolutely allow dynamic column families, but does it
differently from CQL2. See
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts.

--
Sylvain

On Mon, Oct 29, 2012 at 12:34 PM, Timmy Turner timm.t...@gmail.com wrote:
 Does CQL3 not allow dynamic columns (column names) any more?


Re: CQL3: Unknown property 'comparator'?

2012-10-29 Thread Timmy Turner
Thank you! That article helps clear up a lot of my confusion about the
changes between CQL 2 and 3, since I was wondering how to
access/manipulate CompositeType/DynamicCompositeType columns through
CQL.

So does this mean that in CQL 3 an explicit schema is absolutely
mandatory? Is it now impossible (within CQL) to add new
(non-primary-key) columns to individual rows implicitly with
DML queries (insert/update)?




2012/10/29 Sylvain Lebresne sylv...@datastax.com:
 CQL3 does absolutely allow dynamic column families, but does it
 differently from CQL2. See
 http://www.datastax.com/dev/blog/cql3-for-cassandra-experts.

 --
 Sylvain

 On Mon, Oct 29, 2012 at 12:34 PM, Timmy Turner timm.t...@gmail.com wrote:
 Does CQL3 not allow dynamic columns (column names) any more?


Re: ColumnFamilyInputFormat - error when column name is UUID

2012-10-29 Thread Marcelo Elias Del Valle
Answering myself: it seems we can't have non-type-1 UUIDs in column
names. I used the UTF8 comparator and saved my UUIDs as strings, and it worked.
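For what it's worth, that exception usually just means the server received
something other than 16 bytes for a UUID-typed column name (for example the
UUID's string form). Any java.util.UUID, type 1 or type 4, serializes to
exactly 16 bytes; a minimal sketch (the helper class is hypothetical):

import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidSerializer
{
    // A java.util.UUID is always 128 bits; writing both halves yields the
    // 16-byte representation a UUIDType comparator expects.
    public static ByteBuffer toByteBuffer(UUID uuid)
    {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        buf.flip();
        return buf;
    }
}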

2012/10/29 Marcelo Elias Del Valle mvall...@gmail.com

 Hello,

 I am using ColumnFamilyInputFormat the same way it's described in this
 example:
 https://github.com/apache/cassandra/blob/trunk/examples/hadoop_word_count/src/WordCount.java#L215

 I have been able to successfully process data in cassandra by using
 hadoop. However, as this solution doesn't let me choose which data in
 cassandra I want to process, I decided to create a query column family to
 list the data I want to process in hadoop. This column family is as follows:

 row key: MM
 column name: UUID - user ID
 column value: timestamp - last processed date

  The problem is, when I run hadoop, I get the exception below. Is
 there any limitation on having UUIDs as column names? I am generating my
 user IDs with java.util.UUID.randomUUID() for now. I could change the
 method later, but aren't only type 1 UUIDs exactly 16 bytes long?


 java.lang.RuntimeException: InvalidRequestException(why:UUIDs must be
 exactly 16 bytes)
 at
 org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:391)
  at
 org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:397)
 at
 org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:323)
  at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
  at
 org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:188)
 at
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
  at
 org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
  at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: InvalidRequestException(why:UUIDs must be exactly 16 bytes)
  at
 org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12254)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
  at
 org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:683)
 at
 org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:667)
  at
 org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:356)
 ... 11 more

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: ColumnFamilyInputFormat - error when column name is UUID

2012-10-29 Thread Andre Tavares
Marcelo,

The times I hit this problem, it was usually because the UUID value being
handed to Cassandra did not correspond to an exact UUID value. To handle that
I relied heavily on UUID.randomUUID() (to generate a valid UUID) and
UUID.fromString(081f4500-047e-401c-8c0b-a41fefd099d7) - the latter to
transform a String into a valid UUID.

Since we have 2 keyspaces in Cassandra (dmp_input-Astyanax) and (dmp-PlayOrm),
it can happen that these frameworks treat UUID keys differently (in the
implementation we did).

So I think the solution you found is valid (sorry for not having spotted the
problem before, if that is indeed your case ...).

Regards,

André




Re: ColumnFamilyInputFormat - error when column name is UUID

2012-10-29 Thread Hiller, Dean
Hmm, this raises the question of which UUID libraries others are using. I know
this one generates type 1 UUIDs from two longs, so it is 16 bytes:

http://johannburkard.de/software/uuid/

Thanks,
Dean

From: Marcelo Elias Del Valle mvall...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Monday, October 29, 2012 1:17 PM
To: user@cassandra.apache.org
Subject: Re: ColumnFamilyInputFormat - error when column name is UUID

Answering myself: it seems we can't have any non type 1 UUIDs in column names. 
I used the UTF8 comparator and saved my UUIDs as strings, it worked.


Re: Simulating a failed node

2012-10-29 Thread Andrew Bialecki
Thanks, extremely helpful. The key bit was I wasn't flushing the old
Keyspace before re-running the stress test, so I was stuck at RF = 1 from a
previous run despite passing RF = 2 to the stress tool.

On Sun, Oct 28, 2012 at 2:49 AM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  Operation [158320] retried 10 times - error inserting key 0158320
 ((UnavailableException))

 This means that at the point where the thrift request to write data
 was handled, the co-ordinator node (the one your client is connected
 to) believed that, among the replicas responsible for the key, too
 many were down to satisfy the consistency level. Most likely causes
 would be that you're in fact not using RF >= 2 (e.g., is the RF really
 > 1 for the keyspace you're inserting into), or you're in fact not
 using ONE.
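 (One quick way to verify the effective RF: in cassandra-cli, show keyspaces;
 prints each keyspace's placement strategy and strategy options.)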

  I'm sure my naive setup is flawed in some way, but what I was hoping for
 was when the node went down it would fail to write to the downed node and
 instead write to one of the other nodes in the clusters. So question is why
 are writes failing even after a retry? It might be the stress client
 doesn't pool connections (I took

 Writes always go to all responsible replicas that are up, and when
 enough return (according to the consistency level), the insert succeeds.

 If replicas fail to respond you may get a TimeoutException.

 UnavailableException means it didn't even try because it didn't have
 enough replicas to even try to write to.

 (Note though: Reads are a bit of a different story and if you want to
 test behavior when nodes go down I suggest including that. See
 CASSANDRA-2540 and CASSANDRA-3927.)

 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)



Re: ColumnFamilyInputFormat - error when column name is UUID

2012-10-29 Thread Marcelo Elias Del Valle
Dean,

 Are type 1 UUIDs the best ones to use if I want to avoid conflicts? I
saw this page: http://en.wikipedia.org/wiki/Universally_unique_identifier
 The only problem with type 1 UUIDs is that they are not opaque? I know
there is one kind of UUID that can generate two equal values if you
generate them at the same millisecond, but I guess I was confusing them...

Best regards,
Marcelo Valle.

2012/10/29 Hiller, Dean dean.hil...@nrel.gov

 Hmm, this brings the question of what uuid libraries are others using?  I
 know this one generates type 1 UUIDs with two longs so it is 16 bytes.

 http://johannburkard.de/software/uuid/

 Thanks,
 Dean

 From: Marcelo Elias Del Valle mvall...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Monday, October 29, 2012 1:17 PM
 To: user@cassandra.apache.org
 Subject: Re: ColumnFamilyInputFormat - error when column name is UUID

 Answering myself: it seems we can't have any non type 1 UUIDs in column
 names. I used the UTF8 comparator and saved my UUIDs as strings, it worked.




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: ColumnFamilyInputFormat - error when column name is UUID

2012-10-29 Thread Marcelo Elias Del Valle
Err... Guess you replied in Portuguese to the list :D






-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: Benefits by adding nodes to the cluster

2012-10-29 Thread Andrey Ilinykh
This is how Cassandra scales: more nodes mean more capacity and higher throughput.

thank you,
  Andrey

On Mon, Oct 29, 2012 at 2:57 PM, Roshan codeva...@gmail.com wrote:
 Hi All

 This may be a silly question, but what kind of benefits can we get by adding
 new nodes to the cluster?

 Some may be high availability. Any others?

 /Roshan



 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Benifits-by-adding-nodes-to-the-cluster-tp7583437.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


RE: Hinted Handoff runs every ten minutes

2012-10-29 Thread Stephen Pierce
I'm running 1.1.5; the bug says it's fixed in 1.0.9/1.1.0. 

How can I check to see why it keeps running HintedHandoff?

Steve
 

-Original Message-
From: Brandon Williams [mailto:dri...@gmail.com] 
Sent: Wednesday, October 24, 2012 4:56 AM
To: user@cassandra.apache.org
Subject: Re: Hinted Handoff runs every ten minutes

On Sun, Oct 21, 2012 at 6:44 PM, aaron morton aa...@thelastpickle.com wrote:
 I *think* this may be ghost rows which have not being compacted.

You would be correct in the case of 1.0.8:
https://issues.apache.org/jira/browse/CASSANDRA-3955

-Brandon


Re: Hinted Handoff runs every ten minutes

2012-10-29 Thread Radim Kolar

On 29.10.2012 23:24, Stephen Pierce wrote:

I'm running 1.1.5; the bug says it's fixed in 1.0.9/1.1.0.

How can I check to see why it keeps running HintedHandoff?
You probably have tombstones in system.HintsColumnFamily; use the list command
in cassandra-cli to check.
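For example, from cassandra-cli (HintsColumnFamily lives in the system
keyspace):

use system;
list HintsColumnFamily;

Rows that come back with a key but no columns are the ghost/tombstoned hints.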




ideal drive layout - 4 drives + RAID question

2012-10-29 Thread Ran User
For a server with 4 drive slots only, I'm thinking:

either:

- OS (1 drive)
- Commit Log (1 drive)
- Data (2 drives, software RAID 0)

vs

- OS + Data (3 drives, software RAID 0)
- Commit Log (1 drive)

or something else?

also, if I can spare the wasted storage, would RAID 10 for cassandra data
improve read performance and have no effect on write performance?

Thank you!


Re: compression

2012-10-29 Thread aaron morton
  Any clue how to avoid it?
Not really sure what went wrong. Diagnosing that sort of problem usually takes
access to the running node and time to poke around and see what it does in
response to various things.

Rebooting works for Windows 95 and Cassandra is not that different. 

Cheers
 
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/10/2012, at 9:12 PM, Tamar Fraenkel ta...@tok-media.com wrote:

 Hi!
 Thanks Aaron!
 Today I restarted Cassandra on that node and ran scrub again, and now it is fine.
 
 I am worried though that if I decide to change another CF to use compression 
 I will have that issue again. Any clue how to avoid it?
 
 Thanks.
 



Re: ideal drive layout - 4 drives + RAID question

2012-10-29 Thread Timmy Turner
I'm not sure the RAID 0 gets you anything other than headaches
should one of the drives fail. You can already distribute the
individual Cassandra column families across different drives by
setting up symlinks to the individual folders, as sketched below.
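A minimal sketch of the symlink approach (the paths are illustrative; do this
with the node stopped):

# move one column family's directory onto another disk,
# then link it back into the configured data file location
mv /var/lib/cassandra/data/MyKeyspace/MyCF /mnt/disk2/MyCF
ln -s /mnt/disk2/MyCF /var/lib/cassandra/data/MyKeyspace/MyCF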

2012/10/30 Ran User ranuse...@gmail.com:
 For a server with 4 drive slots only, I'm thinking:

 either:

 - OS (1 drive)
 - Commit Log (1 drive)
 - Data (2 drives, software raid 0)

 vs

 - OS  + Data (3 drives, software raid 0)
 - Commit Log (1 drive)

 or something else?

 also, if I can spare the wasted storage, would RAID 10 for cassandra data
 improve read performance and have no effect on write performance?

 Thank you!


Re: CQL3: Unknown property 'comparator'?

2012-10-29 Thread aaron morton
More background http://www.datastax.com/dev/blog/thrift-to-cql3

 So does this mean that in CQL 3 an explicit schema is absolutely
 mandatory? 
Not really, it sort of depends on your view.

Let's say this is a schema-free CF definition in the CLI:

  create column family clicks
with key_validation_class = UTF8Type
 and comparator = DateType
 and default_validation_class = UTF8Type

It could be used for wide rows with lots of columns, where the name is a date. 

As the article at the top says, this CQL 3 DDL is equivalent:

CREATE TABLE clicks (
  key text,
  column1 timestamp,
  value text,
  PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE

This creates a single storage engine row inside C*, where each column name is a
date. The difference is that CQL 3 pivots this one storage engine row into
multiple CQL 3 rows. (See the article.)

So far so good. Let's add some schema:

CREATE TABLE clicks (
  user_id text,
  click_time timestamp,
  click_url text,
  PRIMARY KEY (user_id, click_time)
) WITH COMPACT STORAGE

That's functionally the same but has some more schema in it. It tells CQL 3 
that the label to use for the name of a column is click_time. Previously the 
label was column1. 
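To make the pivot concrete, adding what the CLI world would call a new column
to a row is just an INSERT of a new CQL 3 row (the values here are made up):

INSERT INTO clicks (user_id, click_time, click_url)
VALUES ('jbellis', '2012-10-29 12:00:00', 'http://example.com');

Each such INSERT appends one more (click_time, click_url) pair to the same
storage engine row for 'jbellis', so rows stay just as wide and as dynamic as
they were through the old interfaces.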


 Is it now impossible (within CQL) to add new
 (non-primary-key) columns to individual rows implicitly with
 DML queries (insert/update)?
Is your use case covered in the article above ?
 
Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com




Re: ideal drive layout - 4 drives + RAID question

2012-10-29 Thread Ran User
I was hoping to achieve approx. 2x IO (write and read) performance via RAID
0 (by accepting a higher MTBF).

Do you believe the performance gains of RAID 0 are much lower and/or not
worth it vs the increased server failure rate?

From my understanding, RAID 10 would achieve the read performance benefits
of RAID 0, but not the write benefits.  I'm also considering RAID 10 to
maximize server IO performance.

Currently, we're working with 1 CF.


Thank you

On Mon, Oct 29, 2012 at 11:51 PM, Timmy Turner timm.t...@gmail.com wrote:

 I'm not sure whether the raid 0 gets you anything other than headaches
 should one of the drives fail. You can already distribute the
 individual Cassandra column families on different drives by just
 setting up symlinks to the individual folders.




Re: ideal drive layout - 4 drives + RAID question

2012-10-29 Thread Ran User
Have you considered running RAID 10 for the data drives to improve MTBF?

On one hand, Cassandra already handles redundancy; on the other, reducing the
frequency of dealing with failed nodes is attractive if it's cheap (switching
RAID levels to 10).

We have no experience with software RAID (have always used hardware raid
with BBU).  I'm assuming software RAID 1 or 10 (the mirroring part) is
inherently reliable (perhaps minus some edge case).

On Tue, Oct 30, 2012 at 1:07 AM, Tupshin Harper tups...@tupshin.com wrote:

 I would generally recommend 1 drive for OS and commit log and 3 drive raid
 0 for data. The raid does give you good performance benefit, and it can be
 convenient to have the OS on a side drive for configuration ease and better
 MTBF.

 -Tupshin





Throughput decreases as latency increases with YCSB

2012-10-29 Thread Peter Bailis
Hi,

I'm currently benchmarking Cassandra and have encountered some interesting
behavior. As I increase the number of client threads (and connections),
latency increases as expected but, at some point, throughput actually
decreases.

I've seen a few posts about this online, with no clear resolution:

If we move to higher thread counts, throughput does not
 increase and even decreases. Do you have any idea why this is
 happening, and possibly suggestions for how to scale throughput to much
 higher numbers? [1]


If you want to increase throughput, try increasing the number of clients.
 Of course, it doesn't mean that throughput will always increase. My
 observation was that it will increase, and after a certain number of clients
 throughput decreases again. [2]


You can see a graph of the behavior I'm experiencing here:
https://dl.dropbox.com/u/34647904/cassandra-lat-thru.pdf

I'm using YCSB on EC2 with one m1.large instance to drive client load and
one m1.large instance for a single Cassandra node with maximum connections
set to 1024 and with Cassandra's files on RAID0 ephemeral storage. This
problem occurs when commitlog sync is both batch and periodic, with HSHA
and sync on, and with a variety of heapsize settings. As far as I can tell,
this isn't due to GC and nodetool tpstats isn't showing any dropped
requests or even serious queuing. Any thoughts?

My guess is that this reflects some sort of overhead due to the extra
connections--perhaps something due to context switching?
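For what it's worth, simple queueing math predicts this shape (illustrative
numbers): by Little's law, throughput is roughly concurrency / latency. 128
threads at 4 ms/op gives about 32,000 ops/s; if moving to 512 threads pushes
per-op latency past 16 ms (more context switches, contention, per-connection
overhead), throughput stays flat or drops despite 4x the concurrency.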

Thanks,
Peter

[1]
http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3C12ECB704F2665F40A9C09018C73D95AEC92A8F3618@IE2RD2XVS011.red002.local%3E
[2]
http://grokbase.com/t/cassandra/user/127h25p3hy/cassandra-evaluation-benchmarking-throughput-not-scaling-as-expected-neither-latency-showing-good-numbers#20120718x3cpg6enq250gbjg19ns14678g
[3] Example Bash script: https://gist.github.com/3978273


Re: Throughput decreases as latency increases with YCSB

2012-10-29 Thread Peter Bailis

 I'm using YCSB on EC2 with one m1.large instance to drive client load


To add, I don't believe this is due to YCSB. I've done a fair bit of
client-side profiling, and neither client CPU nor NIC (nor server NIC) is a
bottleneck.

I'll also add that this dataset fits in memory.

Thanks!
Peter