Re: Data tombstoned during bulk loading 1.2.10 -> 2.0.3

2014-02-04 Thread olek.stas...@gmail.com
I don't know what the real cause of my problem is. We are still guessing.
All operations I have done on the cluster are described in this timeline:
1.1.7 -> 1.2.10 -> upgradesstable -> 2.0.2 -> normal operations -> 2.0.3
-> normal operations -> now
"normal operations" means reads/writes/repairs.
Could you please describe briefly how to recover the data? I have a
problem with the scenario described at this link:
http://thelastpickle.com/blog/2011/12/15/Anatomy-of-a-Cassandra-Partition.html
and I can't apply that solution to my case.
regards
Olek

2014-02-03 Robert Coli rc...@eventbrite.com:
 On Mon, Feb 3, 2014 at 2:17 PM, olek.stas...@gmail.com
 olek.stas...@gmail.com wrote:

 No, I've done repair after upgrading sstables. In fact it was about 4
 weeks after, because of a bug:


 If you only did a repair after you upgraded SSTables, when did you have an
 opportunity to hit :

 https://issues.apache.org/jira/browse/CASSANDRA-6527

 ... which relies on you having multiple versions of SStables while
 streaming?

 Did you do any operation which involves streaming? (Add/Remove/Replace a
 node?)

 =Rob



Maximum size and number of datafiles

2014-02-04 Thread Bonnet Jonathan .
Hello here,

   Could you tell me whether it is possible to choose the maximum size for a
datafile, to prevent filesystem saturation? And when does Cassandra choose to
add a new datafile?

   Thanks for all your answers.

Regards,

Bonnet Jonathan.



Keyspace directory not getting created in 1 machine

2014-02-04 Thread Hari Rajendhran
 
Dear Team ,

I have a 3 node Cassandra 1.1.12 (opensource version) installed in our lab. The
db files for column families are getting created on 2 machines, while on one of
the machines the data directory is empty. I have tried the following option:

nodetool -h [IP address of the not working machine] rebuild --- to auto
bootstrap the system keyspace and other column family information


Still the error persists.



Best Regards
Hari Krishnan Rajendhran
Hadoop Admin
DESS-ABIM ,Chennai BIGDATA Galaxy
Tata Consultancy Services
Cell:- 9677985515
Mailto: hari.rajendh...@tcs.com
Website: http://www.tcs.com





Re: Keyspace directory not getting created in 1 machine

2014-02-04 Thread Duncan Sands

Hi Hari,

On 04/02/14 10:38, Hari Rajendhran wrote:


Dear Team ,

I have a 3 node cassandra 1.1.12 opensource version installed in our lab.The db
files for columnfamilies are getting created in 2 machines while in one of the
machine the data directory
is empty.I have tried with the following option

nodetool  -h [IP address of the not working machine] rebuild --- to auto
bootstrap system keyspace and other column family information


Still the error persist


Do all of the nodes know about all of the other nodes?  Try using nodetool 
status on each node to check.  Another possibility is that you configured that 
one node to store its data in a different directory rather than the standard 
one, but you are mistakenly looking for the data in the standard location.
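
A rough sketch of both checks (paths assume a default package install, so
adjust for your layout; on 1.1.x, nodetool ring gives much the same overview
as status):

  # run against each of the three nodes; every node should list all its peers
  nodetool -h <node_ip> ring

  # confirm which directory this node is actually configured to write to
  grep -A 2 data_file_directories /etc/cassandra/cassandra.yaml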


Ciao, Duncan.










what tool will create noncql columnfamilies in cassandra 3a

2014-02-04 Thread Edward Capriolo
The Cassandra 2.0.4 cli is informing me that it will no longer exist in the
next major release.

How will users adjust the metadata of non-CQL column families and other
CFs that do not fit into the CQL model?

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Ultra wide row anti pattern

2014-02-04 Thread Edward Capriolo
I have actually been building something similar in my spare time. You can
hang around and wait for it or build your own. Here are the basics. Not
perfect, but it will work.

Create column family queue with gc_grace_period=[1 day]

set queue [timeuuid()] [z+timeuuid()] = [work to do]

The producer can decide how it wants to roll over the row key and the
column key; it does not matter.

Supposing there are N consumers. We need a way for the consumers to not do
the same work. We can use something like the bakery algorithm. Remember at
QUORUM a reader sees writes.

A consumer needs an identifier (it could be another uuid or an ip address).
A consumer calls get_range_slice on the queue; the slice is from new byte[]
to byte[], limit 100.

The consumer sees data like this.

[1234] [z-$timeuuid] = data

Now we register that this consumer wants to consume this queue

set [1234] [a-${ip}] at quorum

Now we do a slice
get_slice [1234] from new byte[] to 'b'

There are a few possible returns.
1) 1 bidder...
[1234] [a-$myip]
You won; start consuming

2)  2 bidders
[1234] [a-$myip]
[1234] [a-$otherip]
compare $myip vs $otherip; higher wins

Whoever wins can then start consuming the columns in the queue and delete
them when done.






On Friday, January 31, 2014, DuyHai Doan doanduy...@gmail.com wrote:
 Thanks Nat for your ideas.
This could be as simple as adding year and month to the primary key (in
the form 'mm'). Alternatively, you could add this in the partition in
the definition. Either way, it then becomes pretty easy to re-generate
these based on the query parameters.

  The thing is that it's not that simple. My customer has a very BAD idea,
using Cassandra as a queue (the perfect anti-pattern ever).
  Before trying to tell them to redesign their entire architecture and put
in some queueing system like ActiveMQ or something similar, I would like to
see how I can use wide rows to meet the requirements.
  The functional need is quite simple:
  1) A process A loads users into Cassandra and sets the status on this
user to be 'TODO'. When using the bucketing technique, we can limit a row
width to, let's say 100 000 columns. So at the end of the current row,
process A knows that it should move to next bucket. Bucket is coded using
composite partition key, in our example it would be 'TODO:1', 'TODO:2' 
etc

  2) A process B reads the wide row for 'TODO' status. It starts at bucket
1 so it will read row with partition key 'TODO:1'. The users are processed
and inserted in a new row 'PROCESSED:1' for example to keep track of the
status. After retrieving 100 000 columns, it will switch automatically to
the next bucket. Simple. Fair enough

  3) Now what sucks is that sometimes process B does not have enough data
to perform functional logic on the user it fetched from the wide row, so it
has to REPUT some users back into the 'TODO' status rather than
transitioning to 'PROCESSED' status. That's exactly a queue behavior.
  A simplistic idea would be to insert again those m users with 'TODO:n',
with n higher than the current bucket number so it can be processed later.
But then it screws up all the counting system. Process A, which inserts data,
will not know that there are already m users in row n, so will happily add
100 000 columns, making the row size grow to 100 000 + m. When process B
reads back again this row, it will stop at the first 100 000 columns and
skip the trailing m elements.
   That's the main reason for which I dropped the idea of bucketing
(which is quite smart in the normal case) to trade for the ultra wide row.
  Any way, I'll follow your advice and play around with the parameters of
SizeTiered
  Regards
  Duy Hai DOAN

 On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall n...@thelastpickle.com
wrote:

  The only drawback for ultra wide row I can see is point 1). But if I
use leveled compaction with a sufficiently large value for
sstable_size_in_mb (let's say 200Mb), will my read performance be
impacted as the row grows ?

 For this use case, you would want to use SizeTieredCompaction and play
around with the configuration a bit to keep a small number of large
SSTables. Specifically: keep min|max_threshold really low, set bucket_low
and bucket_high closer together maybe even both to 1.0, and maybe a larger
min_sstable_size.
 YMMV though - per Rob's suggestion, take the time to run some tests
tweaking these options.


  Of course, splitting wide row into several rows using bucketing
technique is one solution but it forces us to keep track of the bucket
number and it's not convenient. We have one process (jvm) that insert data
and another process (jvm) that read data. Using bucketing, we need to
synchronize the bucket number between the 2 processes.

 This could be as simple as adding year and month to the primary key (in
the form 'mm'). Alternatively, you could add this in the partition in
the definition. Either way, it then becomes pretty easy to re-generate
these based on the query parameters.


 --
 

Re: Ultra wide row anti pattern

2014-02-04 Thread Yogi Nerella
Sorry, I am not understanding the problem; I am new to Cassandra and
want to understand this issue.

Why do we need to use a wide row for this situation? Why not a simple table
in Cassandra?

todolist (user, state)  == is there any other information in this table
which is needed for processing the todo?
processedlist (user, state)



On Tue, Feb 4, 2014 at 7:50 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I have actually been building something similar in my space time. You can
 hang around and wait for it or build your own. Here is the basics. Not
 perfect but it will work.

 Create column family queue with gc_grace_period=[1 day]

 set queue [timeuuid()] [z+timeuuid()] = [ work do do]

 The producer can decide how it wants to role over the row key and the
 column key it does not matter.

 Supposing there are N consumers. We need a way for the consumers to not do
 the same work. We can use something like the bakery algorithm. Remember at
 QUORUM a reader sees writes.

 A consumer needs an identifier (it could be another uuid or an ip address)
 A consumer calls get_range_slice on the queue the slice is from new byte[]
 to byte[] limit 100

 The consumer sees data like this.

 [1234] [z-$timeuuid] = data

 Now we register that this consumer wants to consume this queue

 set [1234] [a-$[ip}] at quorum

 Now we do a slice
 get_slice [1234]  from new byte [] to ' b'

 There are a few possible returns.
 1) 1 bidder...
 [1234] [a-$myip]
 You won start consuming

 2)  2 bidders
 [1234] [a-$myip]
 [1234] [a-$otherip]
 compare $myip vs $otherip higher wins

 Whoever wins can then start consuming the columns in the queue and delete
 them when done.






 On Friday, January 31, 2014, DuyHai Doan doanduy...@gmail.com wrote:
  Thanks Nat for your ideas.
 This could be as simple as adding year and month to the primary key (in
 the form 'mm'). Alternatively, you could add this in the partition in
 the definition. Either way, it then becomes pretty easy to re-generate
 these based on the query parameters.
 
   The thing is that it's not that simple. My customer has a very BAD
 idea, using Cassandra as a queue (the perfect anti-pattern ever).
   Before trying to tell them to redesign their entire architecture and
 put in some queueing system like ActiveMQ or something similar, I would
 like to see how I can use wide rows to meet the requirements.
   The functional need is quite simple:
   1) A process A loads users into Cassandra and sets the status on this
 user to be 'TODO'. When using the bucketing technique, we can limit a row
 width to, let's say 100 000 columns. So at the end of the current row,
 process A knows that it should move to next bucket. Bucket is coded using
 composite partition key, in our example it would be 'TODO:1', 'TODO:2' 
 etc
 
   2) A process B reads the wide row for 'TODO' status. It starts at
 bucket 1 so it will read row with partition key 'TODO:1'. The users are
 processed and inserted in a new row 'PROCESSED:1' for example to keep track
 of the status. After retrieving 100 000 columns, it will switch
 automatically to the next bucket. Simple. Fair enough
 
   3) Now what sucks it that some time, process B does not have enough
 data to perform functional logic on the user it fetched from the wide row,
 so it has to REPUT some users back into the 'TODO' status rather than
 transitioning to 'PROCESSED' status. That's exactly a queue behavior.
   A simplistic idea would be to insert again those m users with 'TODO:n',
 with n higher than the current bucket number so it can be processed later.
 But then it screws up all the counting system. Process A which inserts data
 will not know that there are already m users in row n, so will happily add
 100 000 columns, making the row size grow to  100 000 + m. When process B
 reads back again this row, it will stop at the first 100 000 columns and
 skip the trailing m elements .
That 's the main reason for which I dropped the idea of bucketing
 (which is quite smart in normal case) to trade for ultra wide row.
   Any way, I'll follow your advice and play around with the parameters of
 SizeTiered
   Regards
   Duy Hai DOAN
 
  On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall n...@thelastpickle.com
 wrote:
 
   The only drawback for ultra wide row I can see is point 1). But if I
 use leveled compaction with a sufficiently large value for
 sstable_size_in_mb (let's say 200Mb), will my read performance be
 impacted as the row grows ?
 
  For this use case, you would want to use SizeTieredCompaction and play
 around with the configuration a bit to keep a small number of large
 SSTables. Specifically: keep min|max_threshold really low, set bucket_low
 and bucket_high closer together maybe even both to 1.0, and maybe a larger
 min_sstable_size.
  YMMV though - per Rob's suggestion, take the time to run some tests
 tweaking these options.
 
 
   Of course, splitting wide row into several rows using bucketing
 technique is one solution but it forces us 

Re: Ultra wide row anti pattern

2014-02-04 Thread Edward Capriolo
Generally you need to make a wide row because the row keys in Cassandra are
ordered by their md5/murmur code. As a result you have no way of locating
new rows, but if the row name is predictable, the columns inside the row
are ordered.
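
To make that concrete with made-up CQL3 names:

  CREATE TABLE events (
      day text,        -- partition key: placed by its murmur3 hash, no useful order
      ts timeuuid,     -- clustering column: ordered within the partition
      payload text,
      PRIMARY KEY (day, ts)
  );

  -- columns come back ordered by ts because they all live in one row
  SELECT * FROM events WHERE day = '2014-02-04';

  -- but partitions are returned in token (hash) order, so you cannot
  -- range-scan for "new" days without knowing their keys in advance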


On Tue, Feb 4, 2014 at 12:02 PM, Yogi Nerella ynerella...@gmail.com wrote:

 Sorry, I am not understanding the problem, and I am new to Cassandra, and
 want to understand this issue.

 Why do we need to use wide row for this situation, why not a simple table
 in cassandra?

 todolist  (user, state)   == is there any other information in this table
 which needs for processing todo?
 processedlist (user, state)



 On Tue, Feb 4, 2014 at 7:50 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I have actually been building something similar in my space time. You can
 hang around and wait for it or build your own. Here is the basics. Not
 perfect but it will work.

 Create column family queue with gc_grace_period=[1 day]

 set queue [timeuuid()] [z+timeuuid()] = [ work do do]

 The producer can decide how it wants to role over the row key and the
 column key it does not matter.

 Supposing there are N consumers. We need a way for the consumers to not
 do the same work. We can use something like the bakery algorithm. Remember
 at QUORUM a reader sees writes.

 A consumer needs an identifier (it could be another uuid or an ip
 address)
 A consumer calls get_range_slice on the queue the slice is from new
 byte[] to byte[] limit 100

 The consumer sees data like this.

 [1234] [z-$timeuuid] = data

 Now we register that this consumer wants to consume this queue

 set [1234] [a-$[ip}] at quorum

 Now we do a slice
 get_slice [1234]  from new byte [] to ' b'

 There are a few possible returns.
 1) 1 bidder...
 [1234] [a-$myip]
 You won start consuming

 2)  2 bidders
 [1234] [a-$myip]
 [1234] [a-$otherip]
 compare $myip vs $otherip higher wins

 Whoever wins can then start consuming the columns in the queue and delete
 them when done.






 On Friday, January 31, 2014, DuyHai Doan doanduy...@gmail.com wrote:
  Thanks Nat for your ideas.
 This could be as simple as adding year and month to the primary key (in
 the form 'mm'). Alternatively, you could add this in the partition in
 the definition. Either way, it then becomes pretty easy to re-generate
 these based on the query parameters.
 
   The thing is that it's not that simple. My customer has a very BAD
 idea, using Cassandra as a queue (the perfect anti-pattern ever).
   Before trying to tell them to redesign their entire architecture and
 put in some queueing system like ActiveMQ or something similar, I would
 like to see how I can use wide rows to meet the requirements.
   The functional need is quite simple:
   1) A process A loads users into Cassandra and sets the status on this
 user to be 'TODO'. When using the bucketing technique, we can limit a row
 width to, let's say 100 000 columns. So at the end of the current row,
 process A knows that it should move to next bucket. Bucket is coded using
 composite partition key, in our example it would be 'TODO:1', 'TODO:2' 
 etc
 
   2) A process B reads the wide row for 'TODO' status. It starts at
 bucket 1 so it will read row with partition key 'TODO:1'. The users are
 processed and inserted in a new row 'PROCESSED:1' for example to keep track
 of the status. After retrieving 100 000 columns, it will switch
 automatically to the next bucket. Simple. Fair enough
 
   3) Now what sucks it that some time, process B does not have enough
 data to perform functional logic on the user it fetched from the wide row,
 so it has to REPUT some users back into the 'TODO' status rather than
 transitioning to 'PROCESSED' status. That's exactly a queue behavior.
   A simplistic idea would be to insert again those m users with
 'TODO:n', with n higher than the current bucket number so it can be
 processed later. But then it screws up all the counting system. Process A
 which inserts data will not know that there are already m users in row n,
 so will happily add 100 000 columns, making the row size grow to  100 000 +
 m. When process B reads back again this row, it will stop at the first 100
 000 columns and skip the trailing m elements .
That 's the main reason for which I dropped the idea of bucketing
 (which is quite smart in normal case) to trade for ultra wide row.
   Any way, I'll follow your advice and play around with the parameters
 of SizeTiered
   Regards
   Duy Hai DOAN
 
  On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall n...@thelastpickle.com
 wrote:
 
   The only drawback for ultra wide row I can see is point 1). But if I
 use leveled compaction with a sufficiently large value for
 sstable_size_in_mb (let's say 200Mb), will my read performance be
 impacted as the row grows ?
 
  For this use case, you would want to use SizeTieredCompaction and play
 around with the configuration a bit to keep a small number of large
 SSTables. Specifically: keep min|max_threshold really 

Re: Data tombstoned during bulk loading 1.2.10 -> 2.0.3

2014-02-04 Thread Robert Coli
On Tue, Feb 4, 2014 at 12:21 AM, olek.stas...@gmail.com 
olek.stas...@gmail.com wrote:

 I don't know what is the real cause of my problem. We are still guessing.
 All operations I have done one cluster are described on timeline:
 1.1.7 -> 1.2.10 -> upgradesstable -> 2.0.2 -> normal operations -> 2.0.3
 -> normal operations -> now
 normal operations means reads/writes/repairs.
 Could you please, describe briefly how to recover data? I have a
 problem with scenario described under link:

 http://thelastpickle.com/blog/2011/12/15/Anatomy-of-a-Cassandra-Partition.html,
 I can't apply this solution to my case.


I think your only option is the following :

1) determine which SSTables contain rows that have doomstones (tombstones from
the far future)
2) determine whether these tombstones mask a live or dead version of the
row, by looking at other row fragments
3) dump/filter/re-write all your data via some method, probably
sstable2json/json2sstable
4) load the corrected sstables by starting a node with the sstables in the
data directory
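
The exact sstable2json/json2sstable invocations vary a bit by version, but
step 3 would look roughly like this (keyspace, table, and file names are
placeholders):

  # dump a suspect sstable to JSON (on a copy, or with the node stopped)
  sstable2json /var/lib/cassandra/data/MyKS/mytable/MyKS-mytable-jb-42-Data.db > dump.json

  # edit/filter dump.json to drop the far-future tombstones

  # rebuild an sstable from the corrected JSON
  json2sstable -K MyKS -c mytable dump.json MyKS-mytable-jb-42-Data.db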

I understand you have a lot of data, but I am pretty sure there is no way
for you to fix it within Cassandra. Perhaps ask for advice on the JIRA
ticket mentioned upthread if this answer is not sufficient?

=Rob


Re: Ultra wide row anti pattern

2014-02-04 Thread DuyHai Doan
Great idea for implementing queue pattern. Thank you Edward.

However with your design there are still corner cases for 2 consumers to
read from the same queue. Reading and writing with QUORUM does not prevent
race conditions. I believe the new CAS feature of C* 2.0 might be useful
here, but at the expense of reduced throughput (because of the Paxos round).
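
For reference, the CAS feature is exposed in CQL3 as lightweight
transactions; a claim on a bucket could look something like this (made-up
schema, just to illustrate the idea):

  -- only one consumer's INSERT is applied; the losers see [applied] = false
  INSERT INTO queue_owner (bucket, owner)
  VALUES ('TODO:1', 'consumer-42')
  IF NOT EXISTS;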




On Tue, Feb 4, 2014 at 4:50 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I have actually been building something similar in my space time. You can
 hang around and wait for it or build your own. Here is the basics. Not
 perfect but it will work.

 Create column family queue with gc_grace_period=[1 day]

 set queue [timeuuid()] [z+timeuuid()] = [ work do do]

 The producer can decide how it wants to role over the row key and the
 column key it does not matter.

 Supposing there are N consumers. We need a way for the consumers to not do
 the same work. We can use something like the bakery algorithm. Remember at
 QUORUM a reader sees writes.

 A consumer needs an identifier (it could be another uuid or an ip address)
 A consumer calls get_range_slice on the queue the slice is from new byte[]
 to byte[] limit 100

 The consumer sees data like this.

 [1234] [z-$timeuuid] = data

 Now we register that this consumer wants to consume this queue

 set [1234] [a-$[ip}] at quorum

 Now we do a slice
 get_slice [1234]  from new byte [] to ' b'

 There are a few possible returns.
 1) 1 bidder...
 [1234] [a-$myip]
 You won start consuming

 2)  2 bidders
 [1234] [a-$myip]
 [1234] [a-$otherip]
 compare $myip vs $otherip higher wins

 Whoever wins can then start consuming the columns in the queue and delete
 them when done.






 On Friday, January 31, 2014, DuyHai Doan doanduy...@gmail.com wrote:
  Thanks Nat for your ideas.
 This could be as simple as adding year and month to the primary key (in
 the form 'mm'). Alternatively, you could add this in the partition in
 the definition. Either way, it then becomes pretty easy to re-generate
 these based on the query parameters.
 
   The thing is that it's not that simple. My customer has a very BAD
 idea, using Cassandra as a queue (the perfect anti-pattern ever).
   Before trying to tell them to redesign their entire architecture and
 put in some queueing system like ActiveMQ or something similar, I would
 like to see how I can use wide rows to meet the requirements.
   The functional need is quite simple:
   1) A process A loads users into Cassandra and sets the status on this
 user to be 'TODO'. When using the bucketing technique, we can limit a row
 width to, let's say 100 000 columns. So at the end of the current row,
 process A knows that it should move to next bucket. Bucket is coded using
 composite partition key, in our example it would be 'TODO:1', 'TODO:2' 
 etc
 
   2) A process B reads the wide row for 'TODO' status. It starts at
 bucket 1 so it will read row with partition key 'TODO:1'. The users are
 processed and inserted in a new row 'PROCESSED:1' for example to keep track
 of the status. After retrieving 100 000 columns, it will switch
 automatically to the next bucket. Simple. Fair enough
 
   3) Now what sucks it that some time, process B does not have enough
 data to perform functional logic on the user it fetched from the wide row,
 so it has to REPUT some users back into the 'TODO' status rather than
 transitioning to 'PROCESSED' status. That's exactly a queue behavior.
   A simplistic idea would be to insert again those m users with 'TODO:n',
 with n higher than the current bucket number so it can be processed later.
 But then it screws up all the counting system. Process A which inserts data
 will not know that there are already m users in row n, so will happily add
 100 000 columns, making the row size grow to  100 000 + m. When process B
 reads back again this row, it will stop at the first 100 000 columns and
 skip the trailing m elements .
That 's the main reason for which I dropped the idea of bucketing
 (which is quite smart in normal case) to trade for ultra wide row.
   Any way, I'll follow your advice and play around with the parameters of
 SizeTiered
   Regards
   Duy Hai DOAN
 
  On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall n...@thelastpickle.com
 wrote:
 
   The only drawback for ultra wide row I can see is point 1). But if I
 use leveled compaction with a sufficiently large value for
 sstable_size_in_mb (let's say 200Mb), will my read performance be
 impacted as the row grows ?
 
  For this use case, you would want to use SizeTieredCompaction and play
 around with the configuration a bit to keep a small number of large
 SSTables. Specifically: keep min|max_threshold really low, set bucket_low
 and bucket_high closer together maybe even both to 1.0, and maybe a larger
 min_sstable_size.
  YMMV though - per Rob's suggestion, take the time to run some tests
 tweaking these options.
 
 
   Of course, splitting wide row into several rows using bucketing
 technique 

Re: Ultra wide row anti pattern

2014-02-04 Thread Edward Capriolo
You could use another column of CAS as a management layer. You only have to
consult it when picking up new rows.


On Tue, Feb 4, 2014 at 3:45 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Great idea for implementing queue pattern. Thank you Edward.

 However with your design there are still corner cases for 2 consumers to
 read from the same queue. Reading and writing with QUORUM does not prevent
 race conditions. I believe the new CAS feature of C* 2.0 might be useful
 here but with the expense of reduced throughput (because of the Paxos round)




 On Tue, Feb 4, 2014 at 4:50 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I have actually been building something similar in my space time. You can
 hang around and wait for it or build your own. Here is the basics. Not
 perfect but it will work.

 Create column family queue with gc_grace_period=[1 day]

 set queue [timeuuid()] [z+timeuuid()] = [ work do do]

 The producer can decide how it wants to role over the row key and the
 column key it does not matter.

 Supposing there are N consumers. We need a way for the consumers to not
 do the same work. We can use something like the bakery algorithm. Remember
 at QUORUM a reader sees writes.

 A consumer needs an identifier (it could be another uuid or an ip
 address)
 A consumer calls get_range_slice on the queue the slice is from new
 byte[] to byte[] limit 100

 The consumer sees data like this.

 [1234] [z-$timeuuid] = data

 Now we register that this consumer wants to consume this queue

 set [1234] [a-$[ip}] at quorum

 Now we do a slice
 get_slice [1234]  from new byte [] to ' b'

 There are a few possible returns.
 1) 1 bidder...
 [1234] [a-$myip]
 You won start consuming

 2)  2 bidders
 [1234] [a-$myip]
 [1234] [a-$otherip]
 compare $myip vs $otherip higher wins

 Whoever wins can then start consuming the columns in the queue and delete
 them when done.






 On Friday, January 31, 2014, DuyHai Doan doanduy...@gmail.com wrote:
  Thanks Nat for your ideas.
 This could be as simple as adding year and month to the primary key (in
 the form 'mm'). Alternatively, you could add this in the partition in
 the definition. Either way, it then becomes pretty easy to re-generate
 these based on the query parameters.
 
   The thing is that it's not that simple. My customer has a very BAD
 idea, using Cassandra as a queue (the perfect anti-pattern ever).
   Before trying to tell them to redesign their entire architecture and
 put in some queueing system like ActiveMQ or something similar, I would
 like to see how I can use wide rows to meet the requirements.
   The functional need is quite simple:
   1) A process A loads users into Cassandra and sets the status on this
 user to be 'TODO'. When using the bucketing technique, we can limit a row
 width to, let's say 100 000 columns. So at the end of the current row,
 process A knows that it should move to next bucket. Bucket is coded using
 composite partition key, in our example it would be 'TODO:1', 'TODO:2' 
 etc
 
   2) A process B reads the wide row for 'TODO' status. It starts at
 bucket 1 so it will read row with partition key 'TODO:1'. The users are
 processed and inserted in a new row 'PROCESSED:1' for example to keep track
 of the status. After retrieving 100 000 columns, it will switch
 automatically to the next bucket. Simple. Fair enough
 
   3) Now what sucks it that some time, process B does not have enough
 data to perform functional logic on the user it fetched from the wide row,
 so it has to REPUT some users back into the 'TODO' status rather than
 transitioning to 'PROCESSED' status. That's exactly a queue behavior.
   A simplistic idea would be to insert again those m users with
 'TODO:n', with n higher than the current bucket number so it can be
 processed later. But then it screws up all the counting system. Process A
 which inserts data will not know that there are already m users in row n,
 so will happily add 100 000 columns, making the row size grow to  100 000 +
 m. When process B reads back again this row, it will stop at the first 100
 000 columns and skip the trailing m elements .
That 's the main reason for which I dropped the idea of bucketing
 (which is quite smart in normal case) to trade for ultra wide row.
   Any way, I'll follow your advice and play around with the parameters
 of SizeTiered
   Regards
   Duy Hai DOAN
 
  On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall n...@thelastpickle.com
 wrote:
 
   The only drawback for ultra wide row I can see is point 1). But if I
 use leveled compaction with a sufficiently large value for
 sstable_size_in_mb (let's say 200Mb), will my read performance be
 impacted as the row grows ?
 
  For this use case, you would want to use SizeTieredCompaction and play
 around with the configuration a bit to keep a small number of large
 SSTables. Specifically: keep min|max_threshold really low, set bucket_low
 and bucket_high closer together maybe even both to 1.0, and maybe a larger
 

Re: Cassandra 2.0 with Hadoop 2.x?

2014-02-04 Thread Cyril Scetbon
Hi,

Look for posts from Thunder Stumpges in this mailing list. I know he has 
succeeded in making Hadoop 2.x work with Cassandra 2.x.

For those who are interested in using it with Cassandra 1.2.13 you can use the 
patch 
https://github.com/cscetbon/cassandra/commit/88d694362d8d6bc09b3eeceb6baad7b3cc068ad3.patch

It uses Cloudera CDH4 repository for Hadoop Classes but you can use others.

Regards
-- 
Cyril SCETBON

On 03 Feb 2014, at 19:10, Clint Kelly clint.ke...@gmail.com wrote:

 Folks,
 
 Has anyone out there used Cassandra 2.0 with Hadoop 2.x?  I saw this
 discussion on the Cassandra JIRA:
 
https://issues.apache.org/jira/browse/CASSANDRA-5201
 
 but the fix referenced
 (https://github.com/michaelsembwever/cassandra-hadoop) is for
 Cassandra 1.2.
 
 I put together a similar patch for Cassandra 2.0 for anyone who is interested:
 
https://github.com/wibiclint/cassandra2-hadoop2
 
 but I'm wondering if there is a more official solution to this
 problem.  Any help would be appreciated.  Thanks!
 
 Best regards,
 Clint



Re: Data tombstoned during bulk loading 1.2.10 -> 2.0.3

2014-02-04 Thread olek.stas...@gmail.com
Seems good. I'll discuss it with the data owners and we'll choose the best method.
Best regards,
Aleksander
On 4 Feb 2014 19:40, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Feb 4, 2014 at 12:21 AM, olek.stas...@gmail.com 
 olek.stas...@gmail.com wrote:

 I don't know what is the real cause of my problem. We are still guessing.
 All operations I have done one cluster are described on timeline:
  1.1.7 -> 1.2.10 -> upgradesstable -> 2.0.2 -> normal operations -> 2.0.3
  -> normal operations -> now
 normal operations means reads/writes/repairs.
 Could you please, describe briefly how to recover data? I have a
 problem with scenario described under link:

 http://thelastpickle.com/blog/2011/12/15/Anatomy-of-a-Cassandra-Partition.html,
 I can't apply this solution to my case.


 I think your only option is the following :

 1) determine which SSTables contain rows have doomstones (tombstones from
 the far future)
 2) determine whether these tombstones mask a live or dead version of the
 row, by looking at other row fragments
 3) dump/filter/re-write all your data via some method, probably
 sstable2json/json2sstable
 4) load the corrected sstables by starting a node with the sstables in the
 data directory

 I understand you have a lot of data, but I am pretty sure there is no way
 for you to fix it within Cassandra. Perhaps ask for advice on the JIRA
 ticket mentioned upthread if this answer is not sufficient?

 =Rob




Question 1: JMX binding, Question 2: Logging

2014-02-04 Thread Kyle Crumpton (kcrumpto)
Hi all,

I'm fairly new to Cassandra. I'm deploying it to a PaaS. One thing this entails 
is that it must be able to have more than one instance on a single node. I'm 
running into the problem that JMX binds to 0.0.0.0:7199. My question is this: 
Is there a way to configure this? I have actually found the post that said to 
change the following:
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=127.1.246.3" where 127.1.246.3 
is the IP I want to bind to.
This actually did not change the JMX binding by any means for me. I saw a post 
about a jmx listen address in cassandra.yaml and this also did not work.
Any clarity on whether this is bindable at all? Or if there are plans for it?

Also-

I have logging turned on. For some reason, though, my Cassandra is not actually 
logging as intended. My log folder is actually empty after each (failed) run 
(due to the port being taken by my other cassandra process).

Here is an actual copy of my log4j-server.properties file: 
http://fpaste.org/74470/15510941/

Any idea why this might not be logging?

Thank you and best regards

Kyle


Re: Question 1: JMX binding, Question 2: Logging

2014-02-04 Thread srmore
Hello Kyle,
For your first question, you need to create aliases to localhost, e.g.
127.0.0.2, 127.0.0.3, etc.; this should get you going.
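
On Linux that could look something like this (run as root; other operating
systems need different commands):

  ip addr add 127.0.0.2/8 dev lo
  ip addr add 127.0.0.3/8 dev lo

Each Cassandra instance then gets its own listen_address/rpc_address (and JMX
hostname) pointing at one of the aliases.
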
About the logging issue, I think your instance is failing before it gets to
log anything; as an example, you can start one instance and make sure it
logs correctly.

Hope that helps.
Sandeep




On Tue, Feb 4, 2014 at 4:25 PM, Kyle Crumpton (kcrumpto) kcrum...@cisco.com
 wrote:

  Hi all,

  I'm fairly new to Cassandra. I'm deploying it to a PaaS. One thing this
 entails is that it must be able to have more than one instance on a single
 node. I'm running into the problem that JMX binds to 0.0.0.0:7199. My
 question is this: Is there a way to configure this? I have actually found
 the post that said to change the the following

 JVM_OPTS=$JVM_OPTS -Djava.rmi.server.hostname=127.1.246.3 where
 127.1.246.3 is the IP I want to bind to..

 This actually did not change the JMX binding by any means for me. I saw a
 post about a jmx listen address in cassandra.yaml and this also did not
 work.
 Any clarity on whether this is bindable at all? Or if there are plans for
 it?

  Also-

  I have logging turned on. For some reason, though, my Cassandra is not
 actually logging as intended. My log folder is actually empty after each
 (failed) run (due to the port being taken by my other cassandra process).

  Here is an actual copy of my log4j-server.properites file:
 http://fpaste.org/74470/15510941/

  Any idea why this might not be logging?

  Thank you and best regards

  Kyle



Re: Question 1: JMX binding, Question 2: Logging

2014-02-04 Thread Andrey Ilinykh
JMX stuff is in /conf/cassandra-env.sh
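
For reference, the relevant lines in conf/cassandra-env.sh look roughly like
this (exact option names vary a little between versions; the hostname line is
the one you added, not a default):

  JMX_PORT="7199"
  JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
  JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=127.1.246.3"

Giving each instance its own JMX_PORT is usually the simplest way to run
several instances on one box.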


On Tue, Feb 4, 2014 at 2:25 PM, Kyle Crumpton (kcrumpto) kcrum...@cisco.com
 wrote:

  Hi all,

  I'm fairly new to Cassandra. I'm deploying it to a PaaS. One thing this
 entails is that it must be able to have more than one instance on a single
 node. I'm running into the problem that JMX binds to 0.0.0.0:7199. My
 question is this: Is there a way to configure this? I have actually found
 the post that said to change the the following

 JVM_OPTS=$JVM_OPTS -Djava.rmi.server.hostname=127.1.246.3 where
 127.1.246.3 is the IP I want to bind to..

 This actually did not change the JMX binding by any means for me. I saw a
 post about a jmx listen address in cassandra.yaml and this also did not
 work.
 Any clarity on whether this is bindable at all? Or if there are plans for
 it?

  Also-

  I have logging turned on. For some reason, though, my Cassandra is not
 actually logging as intended. My log folder is actually empty after each
 (failed) run (due to the port being taken by my other cassandra process).

  Here is an actual copy of my log4j-server.properites file:
 http://fpaste.org/74470/15510941/

  Any idea why this might not be logging?

  Thank you and best regards

  Kyle



Re: Cassandra 2.0 with Hadoop 2.x?

2014-02-04 Thread Thunder Stumpges
Hello Clint,

Yes I was able to get it working after a bit of work. I have pushed the
branch with the fix (which is currently quite a ways behind latest). You
can compare to yours I suppose. Let me know if you have any questions.

https://github.com/VerticalSearchWorks/cassandra/tree/Cassandra2-CDH4

regards,
Thunder




On Tue, Feb 4, 2014 at 1:40 PM, Cyril Scetbon cyril.scet...@free.fr wrote:

 Hi,

 Look for posts from Thunder Stumpges in this mailing list. I know he has
 succeeded to make it Hadoop 2.x work with Cassandra 2.x

 For those who are interested in using it with Cassandra 1.2.13 you can use
 the patch
 https://github.com/cscetbon/cassandra/commit/88d694362d8d6bc09b3eeceb6baad7b3cc068ad3.patch

 It uses Cloudera CDH4 repository for Hadoop Classes but you can use others.

 Regards
 --
 Cyril SCETBON

 On 03 Feb 2014, at 19:10, Clint Kelly clint.ke...@gmail.com wrote:

  Folks,
 
  Has anyone out there used Cassandra 2.0 with Hadoop 2.x?  I saw this
  discussion on the Cassandra JIRA:
 
 https://issues.apache.org/jira/browse/CASSANDRA-5201
 
  but the fix referenced
  (https://github.com/michaelsembwever/cassandra-hadoop) is for
  Cassandra 1.2.
 
  I put together a similar patch for Cassandra 2.0 for anyone who is
 interested:
 
 https://github.com/wibiclint/cassandra2-hadoop2
 
  but I'm wondering if there is a more official solution to this
  problem.  Any help would be appreciated.  Thanks!
 
  Best regards,
  Clint




Re: Lots of deletions results in death by GC

2014-02-04 Thread Robert Wille
I ran my test again, and Flush Writer's "All time blocked" increased to 2
and then shortly thereafter GC went into its death spiral. I doubled
memtable_flush_writers (to 2) and memtable_flush_queue_size (to 8) and tried
again.
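
(For anyone following along, both knobs are plain cassandra.yaml settings;
the change is simply:)

  memtable_flush_writers: 2
  memtable_flush_queue_size: 8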

This time, the table that always sat with Memtable data size = 0 now showed
increases in Memtable data size. That was encouraging. It never flushed,
which isn't too surprising, because that table has relatively few rows and
they are pretty wide. However, on the fourth table to clean, Flush Writer's
"All time blocked" went to 1, and then there were no more completed events,
and about 10 minutes later GC went into its death spiral. I assume that each
time Flush Writer completes an event, that means a table was flushed. Is
that right? Also, I got two dropped mutation messages at the same time that
Flush Writer's All time blocked incremented.

I then increased the writers and queue size to 3 and 12, respectively, and
ran my test again. This time All time blocked remained at 0, but I still
suffered death by GC.

I would almost think that this is caused by high load on the server, but
I've never seen CPU utilization go above about two of my eight available
cores. If high load triggers this problem, then that is very disconcerting.
That means that a CPU spike could permanently cripple a node. Okay, not
permanently, but until a manual flush occurs.

If anyone has any further thoughts, I'd love to hear them. I'm quite at the
end of my rope.

Thanks in advance

Robert

From:  Nate McCall n...@thelastpickle.com
Reply-To:  user@cassandra.apache.org
Date:  Saturday, February 1, 2014 at 9:25 AM
To:  Cassandra Users user@cassandra.apache.org
Subject:  Re: Lots of deletions results in death by GC

What's the output of 'nodetool tpstats' while this is happening?
Specifically is Flush Writer All time blocked increasing? If so, play
around with turning up memtable_flush_writers and memtable_flush_queue_size
and see if that helps.


On Sat, Feb 1, 2014 at 9:03 AM, Robert Wille rwi...@fold3.com wrote:
 A few days ago I posted about an issue I'm having where GC takes a long time
 (20-30 seconds), and it happens repeatedly and basically no work gets done.
 I've done further investigation, and I now believe that I know the cause. If I
 do a lot of deletes, it creates memory pressure until the memtables are
 flushed, but Cassandra doesn't flush them. If I manually flush, then life is
 good again (although that takes a very long time because of the GC issue). If
 I just leave the flushing to Cassandra, then I end up with death by GC. I
 believe that when the memtables are full of tombstones, Cassandra doesn't
 realize how much memory the memtables are actually taking up, and so it
 doesn't proactively flush them in order to free up heap.
 
 As I was deleting records out of one of my tables, I was watching it via
 nodetool cfstats, and I found a very curious thing:
 
 Memtable cell count: 1285
 Memtable data size, bytes: 0
 Memtable switch count: 56
 
 As the deletion process was chugging away, the memtable cell count increased,
 as expected, but the data size stayed at 0. No flushing occurred.
 
 Here's the schema for this table:
 
 CREATE TABLE bdn_index_pub (
     tshard VARCHAR,
     pord INT,
     ord INT,
     hpath VARCHAR,
     page BIGINT,
     PRIMARY KEY (tshard, pord)
 ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
 
 
 I have a few tables that I run this cleaning process on, and not all of them
 exhibit this behavior. One of them reported an increasing number of bytes, as
 expected, and it also flushed as expected. Here's the schema for that table:
 
 
 CREATE TABLE bdn_index_child (
     ptshard VARCHAR,
     ord INT,
     hpath VARCHAR,
     PRIMARY KEY (ptshard, ord)
 ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
 
 
 In both cases, I'm deleting the entire record (i.e. specifying just the first
 component of the primary key in the delete statement). Most records in
 bdn_index_pub have 10,000 rows per record. bdn_index_child usually has just a
 handful of rows, but a few records can have up to 10,000.
 
 Still a further mystery, 1285 tombstones in the bdn_index_pub memtable doesn't
 seem like nearly enough to create a memory problem. Perhaps there are other
 flaws in the memory metering. Or perhaps there is some other issue that causes
 Cassandra to mismanage the heap when there are a lot of deletes. One other
 thought I had is that I page through these tables and clean them out as I go.
 Perhaps there is some interaction between the paging and the deleting that
 causes the GC problems and I should create a list of keys to delete and then
 delete them after I've finished reading the entire table.
 
 I reduced memtable_total_space_in_mb from the default (probably 2.7 GB) to 1
 GB, in hopes that it would force Cassandra to flush 

Re: Lots of deletions results in death by GC

2014-02-04 Thread Benedict Elliott Smith
Is it possible you are generating *exclusively* deletes for this table?


On 5 February 2014 00:10, Robert Wille rwi...@fold3.com wrote:

 I ran my test again, and Flush Writer's All time blocked increased to 2
 and then shortly thereafter GC went into its death spiral. I doubled
 memtable_flush_writers (to 2) and memtable_flush_queue_size (to 8) and
 tried again.

 This time, the table that always sat with Memtable data size = 0 now
 showed increases in Memtable data size. That was encouraging. It never
 flushed, which isn't too surprising, because that table has relatively few
 rows and they are pretty wide. However, on the fourth table to clean, Flush
 Writer's All time blocked went to 1, and then there were no more
 completed events, and about 10 minutes later GC went into its death spiral.
 I assume that each time Flush Writer completes an event, that means a table
 was flushed. Is that right? Also, I got two dropped mutation messages at
 the same time that Flush Writer's All time blocked incremented.

 I then increased the writers and queue size to 3 and 12, respectively, and
 ran my test again. This time All time blocked remained at 0, but I still
 suffered death by GC.

 I would almost think that this is caused by high load on the server, but
 I've never seen CPU utilization go above about two of my eight available
 cores. If high load triggers this problem, then that is very disconcerting.
 That means that a CPU spike could permanently cripple a node. Okay, not
 permanently, but until a manual flush occurs.

 If anyone has any further thoughts, I'd love to hear them. I'm quite at
 the end of my rope.

 Thanks in advance

 Robert

 From: Nate McCall n...@thelastpickle.com
 Reply-To: user@cassandra.apache.org
 Date: Saturday, February 1, 2014 at 9:25 AM
 To: Cassandra Users user@cassandra.apache.org
 Subject: Re: Lots of deletions results in death by GC

 What's the output of 'nodetool tpstats' while this is happening?
 Specifically is Flush Writer All time blocked increasing? If so, play
 around with turning up memtable_flush_writers and memtable_flush_queue_size
 and see if that helps.


 On Sat, Feb 1, 2014 at 9:03 AM, Robert Wille rwi...@fold3.com wrote:

 A few days ago I posted about an issue I'm having where GC takes a long
 time (20-30 seconds), and it happens repeatedly and basically no work gets
 done. I've done further investigation, and I now believe that I know the
 cause. If I do a lot of deletes, it creates memory pressure until the
 memtables are flushed, but Cassandra doesn't flush them. If I manually
 flush, then life is good again (although that takes a very long time
 because of the GC issue). If I just leave the flushing to Cassandra, then I
 end up with death by GC. I believe that when the memtables are full of
 tombstones, Cassadnra doesn't realize how much memory the memtables are
 actually taking up, and so it doesn't proactively flush them in order to
 free up heap.

 As I was deleting records out of one of my tables, I was watching it via
 nodetool cfstats, and I found a very curious thing:

 Memtable cell count: 1285
 Memtable data size, bytes: 0
 Memtable switch count: 56

 As the deletion process was chugging away, the memtable cell count
 increased, as expected, but the data size stayed at 0. No flushing
 occurred.

 Here's the schema for this table:

 CREATE TABLE bdn_index_pub (

 tshard VARCHAR,

 pord INT,

 ord INT,

 hpath VARCHAR,

 page BIGINT,

 PRIMARY KEY (tshard, pord)

 ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

 I have a few tables that I run this cleaning process on, and not all of
 them exhibit this behavior. One of them reported an increasing number of
 bytes, as expected, and it also flushed as expected. Here's the schema for
 that table:


 CREATE TABLE bdn_index_child (

 ptshard VARCHAR,

 ord INT,

 hpath VARCHAR,

 PRIMARY KEY (ptshard, ord)

 ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

 In both cases, I'm deleting the entire record (i.e. specifying just the
 first component of the primary key in the delete statement). Most records
 in bdn_index_pub have 10,000 rows per record. bdn_index_child usually has
 just a handful of rows, but a few records can have up 10,000.

 Still a further mystery, 1285 tombstones in the bdn_index_pub memtable
 doesn't seem like nearly enough to create a memory problem. Perhaps there
 are other flaws in the memory metering. Or perhaps there is some other
 issue that causes Cassandra to mismanage the heap when there are a lot of
 deletes. One other thought I had is that I page through these tables and
 clean them out as I go. Perhaps there is some interaction between the
 paging and the deleting that causes the GC problems and I should create a
 list of keys to delete and then delete them after I've finished reading the
 

Re: what tool will create noncql columnfamilies in cassandra 3a

2014-02-04 Thread Patricia Gorla
I am also curious as to how users will manage Thrift-based tables without
the cli.

PyCassaShell comes to mind, as does using Thrift-based clients.


On Tue, Feb 4, 2014 at 9:53 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Cassandra 2.0.4 cli is informing me that it will no longer exist in the
 next major.

 How will users adjust the meta data of non cql column families and other
 cfs that do not fit into the cql model?

 --
 Sorry this was sent from mobile. Will do less grammar and spell check than
 usual.




-- 
Patricia Gorla
@patriciagorla

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com http://thelastpickle.com


Re: Lots of deletions results in death by GC

2014-02-04 Thread srmore
Sorry to hear that, Robert. I ran into a similar issue a while ago. I had an
extremely heavy write and update load; as a result Cassandra (1.2.9) was
constantly flushing to disk and GCing. I tried exactly the same steps you
tried (tuning memtable_flush_writers (to 2) and memtable_flush_queue_size
(to 8)) with no luck. Almost all of the issues went away when I migrated to
1.2.13; that release also had some fixes which I badly needed. What version
are you running? (I tried to look in the thread but couldn't find one; sorry
if this is a repeat question.)

Dropped messages are a sign that Cassandra is taking heavy load; that's the
load shedding mechanism. I would love to see some sort of back-pressure
implemented.

-sandeep


On Tue, Feb 4, 2014 at 6:10 PM, Robert Wille rwi...@fold3.com wrote:

 I ran my test again, and Flush Writer's All time blocked increased to 2
 and then shortly thereafter GC went into its death spiral. I doubled
 memtable_flush_writers (to 2) and memtable_flush_queue_size (to 8) and
 tried again.

 This time, the table that always sat with Memtable data size = 0 now
 showed increases in Memtable data size. That was encouraging. It never
 flushed, which isn't too surprising, because that table has relatively few
 rows and they are pretty wide. However, on the fourth table to clean, Flush
 Writer's All time blocked went to 1, and then there were no more
 completed events, and about 10 minutes later GC went into its death spiral.
 I assume that each time Flush Writer completes an event, that means a table
 was flushed. Is that right? Also, I got two dropped mutation messages at
 the same time that Flush Writer's All time blocked incremented.

 I then increased the writers and queue size to 3 and 12, respectively, and
 ran my test again. This time All time blocked remained at 0, but I still
 suffered death by GC.

 I would almost think that this is caused by high load on the server, but
 I've never seen CPU utilization go above about two of my eight available
 cores. If high load triggers this problem, then that is very disconcerting.
 That means that a CPU spike could permanently cripple a node. Okay, not
 permanently, but until a manual flush occurs.

 If anyone has any further thoughts, I'd love to hear them. I'm quite at
 the end of my rope.

 Thanks in advance

 Robert

 From: Nate McCall n...@thelastpickle.com
 Reply-To: user@cassandra.apache.org
 Date: Saturday, February 1, 2014 at 9:25 AM
 To: Cassandra Users user@cassandra.apache.org
 Subject: Re: Lots of deletions results in death by GC

 What's the output of 'nodetool tpstats' while this is happening?
 Specifically is Flush Writer All time blocked increasing? If so, play
 around with turning up memtable_flush_writers and memtable_flush_queue_size
 and see if that helps.


 On Sat, Feb 1, 2014 at 9:03 AM, Robert Wille rwi...@fold3.com wrote:

 A few days ago I posted about an issue I'm having where GC takes a long
 time (20-30 seconds), and it happens repeatedly and basically no work gets
 done. I've done further investigation, and I now believe that I know the
 cause. If I do a lot of deletes, it creates memory pressure until the
 memtables are flushed, but Cassandra doesn't flush them. If I manually
 flush, then life is good again (although that takes a very long time
 because of the GC issue). If I just leave the flushing to Cassandra, then I
 end up with death by GC. I believe that when the memtables are full of
 tombstones, Cassadnra doesn't realize how much memory the memtables are
 actually taking up, and so it doesn't proactively flush them in order to
 free up heap.

 As I was deleting records out of one of my tables, I was watching it via
 nodetool cfstats, and I found a very curious thing:

 Memtable cell count: 1285
 Memtable data size, bytes: 0
 Memtable switch count: 56

 As the deletion process was chugging away, the memtable cell count
 increased, as expected, but the data size stayed at 0. No flushing
 occurred.

 Here's the schema for this table:

 CREATE TABLE bdn_index_pub (

 tshard VARCHAR,

 pord INT,

 ord INT,

 hpath VARCHAR,

 page BIGINT,

 PRIMARY KEY (tshard, pord)

 ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

 I have a few tables that I run this cleaning process on, and not all of
 them exhibit this behavior. One of them reported an increasing number of
 bytes, as expected, and it also flushed as expected. Here's the schema for
 that table:


 CREATE TABLE bdn_index_child (

 ptshard VARCHAR,

 ord INT,

 hpath VARCHAR,

 PRIMARY KEY (ptshard, ord)

 ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

 In both cases, I'm deleting the entire record (i.e. specifying just the
 first component of the primary key in the delete statement). Most records
 in bdn_index_pub have 10,000 rows per record. bdn_index_child usually has
 

Looking for clarification on the gossip protocol... 3 random nodes every second?

2014-02-04 Thread Sameer Farooqui
Hi, I'm looking to get some clarification on how the gossip protocol works
in Cassandra 2.0.

Does a node contact 3 purely random nodes every second for gossip or is
there more intelligence involved in how it selects the 3 nodes?

*The Apache wiki on Cassandra states this:*
Gossip timer task runs every second. During each of these runs the node
initiates gossip exchange according to following rules:

1) Gossip to random live endpoint (if any)

2) Gossip to random unreachable endpoint with certain probability depending
on number of unreachable and live nodes

3) If the node gossiped to at (1) was not seed, or the number of live nodes
is less than number of seeds, gossip to random seed with certain
probability depending on number of unreachable, seed and live nodes.

These rules were developed to ensure that if the network is up, all nodes
will eventually know about all other nodes.

Link: http://wiki.apache.org/cassandra/ArchitectureGossip
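
To check my reading of those rules, here is a rough Python sketch of one
round (illustrative pseudocode only; the probability formulas are my own
simplifications, not the actual Gossiper code):

  import random

  def gossip_round(live, unreachable, seeds, gossip_to):
      """One per-second gossip round, loosely following the three wiki rules."""
      gossiped_to_seed = False
      # 1) gossip to one random live endpoint (if any)
      if live:
          peer = random.choice(live)
          gossip_to(peer)
          gossiped_to_seed = peer in seeds
      # 2) gossip to a random unreachable endpoint, with a probability that
      #    grows with the number of unreachable nodes
      if unreachable and random.random() < len(unreachable) / float(len(live) + 1):
          gossip_to(random.choice(unreachable))
      # 3) possibly gossip to a random seed if we did not already reach one
      if seeds and (not gossiped_to_seed or len(live) < len(seeds)):
          if random.random() < len(seeds) / float(len(live) + len(unreachable) + 1):
              gossip_to(random.choice(seeds))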

Is the above still true for C* 2.0?

Let's say all 20 nodes in a C* cluster are up. In this case will each node
simply contact 3 random nodes every second?

- SF