nodetool upgradesstables skip major version

2015-10-30 Thread Xu Zhongxing
Can I run
nodetool upgradesstables
after updating a Cassandra 2.0 node directly to Cassandra 3.0?


Or do I have to upgrade to 2.1 and then upgrade to 3.0?

Re:Re: High cpu usage when the cluster is idle

2015-10-24 Thread Xu Zhongxing
Thank you very much. I figured out yesterday that the number of tables is the cause; your analysis confirmed it.






On 2015-10-25 at 05:23, "Graham Sanderson" wrote:

I would imagine you are running on fairly slow machines (given the CPU usage), 
but 2.0.12 and 2.1 use a fairly old version of the yammer/codahale metrics 
library.


It wakes up every 5 seconds and updates the Meters… there are a bunch of 
these Meters per table (embedded in Timers), so your large table count of 1500 is 
basically most of the problem. 


AFAIK there is no way to turn the metrics off; they also power the JMX 
interfaces.


On Oct 24, 2015, at 7:54 AM, Xu Zhongxing  wrote:


The Cassandra version is 2.0.12. We have 1500 tables in a cluster of 6 
nodes, with a total of 2.5 billion rows.


On 2015-10-24 at 20:52, "Xu Zhongxing" wrote:




I saw an average of 10% CPU usage on each node when the Cassandra cluster had no 
load at all.
I checked which threads were using the CPU, and found the following two metrics 
threads, each occupying 5% CPU.


jstack output:  


"metrics-meter-tick-thread-2" daemon prio=10 tic=...
  java.lang.Thread.State: WAITING (parking)
  at sum.misc.Unsafe.park(Native Method)
  -parking to wait for ...
  at ... (LockSupport.java:186)
  at ... (AbstractQueuedSynchronizer.java:2043)
...
at .. (Thread.java:745)


The other thread is the same.


Can someone give me a clue about this problem? Thank you.



Re:High cpu usage when the cluster is idle

2015-10-24 Thread Xu Zhongxing
The Cassandra version is 2.0.12. We have 1500 tables in a cluster of 6 
nodes, with a total of 2.5 billion rows.


On 2015-10-24 at 20:52, "Xu Zhongxing" wrote:




I saw an average of 10% CPU usage on each node when the Cassandra cluster had no 
load at all.
I checked which threads were using the CPU, and found the following two metrics 
threads, each occupying 5% CPU.


jstack output:  


"metrics-meter-tick-thread-2" daemon prio=10 tic=...
  java.lang.Thread.State: WAITING (parking)
  at sum.misc.Unsafe.park(Native Method)
  -parking to wait for ...
  at ... (LockSupport.java:186)
  at ... (AbstractQueuedSynchronizer.java:2043)
...
at .. (Thread.java:745)


The other thread is the same.


Can someone give me a clue about this problem? Thank you.

High cpu usage when the cluster is idle

2015-10-24 Thread Xu Zhongxing


I saw an average of 10% CPU usage on each node when the Cassandra cluster had no 
load at all.
I checked which threads were using the CPU, and found the following two metrics 
threads, each occupying 5% CPU.


jstack output:  


"metrics-meter-tick-thread-2" daemon prio=10 tic=...
  java.lang.Thread.State: WAITING (parking)
  at sum.misc.Unsafe.park(Native Method)
  -parking to wait for ...
  at ... (LockSupport.java:186)
  at ... (AbstractQueuedSynchronizer.java:2043)
...
at .. (Thread.java:745)


The other thread is the same.


Can someone give me a clue about this problem? Thank you.

Re: full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
This is hard to answer. Performance depends on the context; 
you could tune various parameters.

At 2015-01-28 14:43:38, "Shenghua(Daniel) Wan"  wrote:

Cool. What about performance? E.g., how many records in how long?


On Tue, Jan 27, 2015 at 10:16 PM, Xu Zhongxing  wrote:

For the Java driver, there is no special API actually, just:


ResultSet rs = session.execute("select * from ...");
for (Row r : rs) {
   ...
}


For Spark, the code skeleton is:


val rdd = sc.cassandraTable("ks", "table")


then call the various standard Spark APIs to process the table in parallel.


I have not used CqlInputFormat.


At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan"  wrote:
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions of 
records from C* using map-reduce-related APIs like CqlInputFormat.
You mentioned the Java driver. Could you suggest any API you used? Thanks.


On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:


Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTable files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 






--



Regards,
Shenghua (Daniel) Wan





--



Regards,
Shenghua (Daniel) Wan

Re:Re: full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
For the Java driver, there is no special API actually, just:


ResultSet rs = session.execute("select * from ...");
for (Row r : rs) {
   ...
}


For Spark, the code skeleton is:


val rdd = sc.cassandraTable("ks", "table")


then call the various standard Spark APIs to process the table in parallel.


I have not used CqlInputFormat.


At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan"  wrote:
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions of 
records from C* using map-reduce-related APIs like CqlInputFormat.
You mentioned the Java driver. Could you suggest any API you used? Thanks.


On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:


Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTable files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 






--



Regards,
Shenghua (Daniel) Wan

Re:full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
The table has several billion rows.
I think the table size is irrelevant here. The Cassandra driver does paging well, 
and Spark handles data partitioning well, too.
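
As a rough sketch of what that driver-side paging looks like (assuming the DataStax 
Java driver 2.x; the contact point, keyspace, and table name here are placeholders, 
not taken from this thread):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class FullScan {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // The driver pages through the result set transparently;
        // the fetch size only controls how many rows come back per round trip.
        Statement stmt = new SimpleStatement("SELECT * FROM ks.my_table");
        stmt.setFetchSize(1000);

        ResultSet rs = session.execute(stmt);
        long count = 0;
        for (Row r : rs) {
            count++; // process each row here
        }
        System.out.println("rows scanned: " + count);

        cluster.close();
    }
}

The fetch size is a tuning knob, not a limit: iteration still covers the whole 
table, it just bounds how many rows are held in memory at once.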


At 2015-01-28 10:45:17, "Mohammed Guller"  wrote:


How big is your table? How much data does it have?

 

Mohammed

 

From: Xu Zhongxing [mailto:xu_zhong_x...@163.com]
Sent: Tuesday, January 27, 2015 5:34 PM
To: user@cassandra.apache.org
Subject: Re: full-table scan - extracting all data from C*

 

Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 

I use both of them frequently.


At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:



Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTable files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 

Re:full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:


Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTable files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 

Re: Dynamic Columns

2015-01-20 Thread Xu Zhongxing
The original dynamic column idea in the Google BigTable paper is a mapping of:


(row key, raw bytes) -> raw bytes


The restriction imposed by CQL is, as far as I understand, that you need to have a 
type for each column. 


If the value types involved in the schema are limited, e.g. text, int, or 
timestamp, we can approximate the raw-bytes mapping by setting up a few value 
columns of explicit types.
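
As a minimal sketch of that approximation (assuming the DataStax Java driver 2.x 
and the review table quoted below in this thread; the keyspace "ks" and the sample 
values are made up for illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.Date;

public class DynamicColumnSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        long productId = 42L;
        Date createdAt = new Date();

        // Each "dynamic column" becomes one CQL row keyed by data_key;
        // the value goes into whichever typed column matches it.
        session.execute(
            "INSERT INTO review (product_id, created_at, data_key, data_tvalue) VALUES (?, ?, ?, ?)",
            productId, createdAt, "title", "Great product");
        session.execute(
            "INSERT INTO review (product_id, created_at, data_key, data_ivalue) VALUES (?, ?, ?, ?)",
            productId, createdAt, "stars", 5);

        // Reading one "column" (or a slice of them) is a clustering-key lookup,
        // so there is no need to load a whole collection.
        ResultSet rs = session.execute(
            "SELECT data_ivalue FROM review WHERE product_id = ? AND created_at = ? AND data_key = ?",
            productId, createdAt, "stars");
        for (Row r : rs) {
            System.out.println("stars = " + r.getInt("data_ivalue"));
        }

        cluster.close();
    }
}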





At 2015-01-21 10:46:27, "Peter Lin"  wrote:



The thing is, CQL only handles some types of dynamic column use cases. There are 
plenty of examples on datastax.com that show how to do CQL-style dynamic 
columns.


Based on what was described by Chetan, I don't feel CQL3 is a perfect fit for 
what he wants to do. To use CQL3, he'd have to change his approach.

In my temporal database, I use both Thrift and CQL. They complement each other 
very nicely. I don't understand why people have to put down Thrift or pretend it 
supports 100% of the use cases. Lots of people started using Cassandra pre-CQL 
and had no problems using Thrift. Yes, you have to understand more and the 
learning curve is steeper, but taking time to learn the internals of Cassandra 
is a good thing.


Using CQL3 lists or maps forces the query to load the entire 
collection, but that is by design. To get the full power of the old style of 
dynamic columns, Thrift is a better fit. I hope CQL continues to improve so 
that it supports 100% of the existing use cases.





On Tue, Jan 20, 2015 at 8:50 PM, Xu Zhongxing  wrote:

I approximate dynamic columns by data_key and data_value columns.
Is there a better way to get dynamic columns in CQL 3?

At 2015-01-21 09:41:02, "Peter Lin"  wrote:



I think that table example misses the point of Chetan's functional requirement. 
He actually needs dynamic columns.



On Tue, Jan 20, 2015 at 8:12 PM, Xu Zhongxing  wrote:

Maybe this is the closest thing to "dynamic columns" in CQL 3.


create table review (
    product_id bigint,
    created_at timestamp,
    data_key text,
    data_tvalue text,
    data_ivalue int,
    primary key ((product_id, created_at), data_key)
);


data_tvalue and data_ivalue are optional.


At 2015-01-21 04:44:07, "chetan verma"  wrote:

Hi,


Adding to my previous mail. For example, we have a column family named review 
(with some arbitrary data in maps).


CREATE TABLE review (
    product_id bigint,
    created_at timestamp,
    data_int map<text, int>,
    data_text map<text, text>,
    PRIMARY KEY (product_id, created_at)
);


Assume that I use these 2 maps to store arbitrary data (i.e. data_int and 
data_text for int and text values).
When we see the output on cassandra-cli, it looks, within a partition, like 
:data_int:map_key as the column name with the map value as the value.
Suppose I need to get this single value: I couldn't do that with CQL3, but in 
Thrift it's possible. Any solution?


On Wed, Jan 21, 2015 at 1:06 AM, chetan verma  wrote:

Hi,


Most of the time I will be querying on product_id and created_at, but for 
analytics I need to query on almost all columns.
The multiple-collections idea is good, but the only issue is that Cassandra reads 
a collection in its entirety; what if I need a slice of it, I mean 
the columns for certain keys, which is possible with Thrift? Please suggest.


On Wed, Jan 21, 2015 at 12:36 AM, Jonathan Lacefield  
wrote:

Hello,


There are probably lots of options to this challenge.  The more details around 
your use case that you can provide, the easier it will be for this group to 
offer advice.


A few follow-up questions: 
  - How will you query this data?  
  - Do your queries require filtering on specific columns other than product_id 
and created_at, i.e. the dynamic columns?


Depending on the answers to these questions, you have several options, of which 
here are a few:
  - Cassandra efficiently stores sparse data, so you could create columns and not 
populate them, without much of a penalty.
  - You could use a clustering column to store a column's type and another column 
(potentially clustering) to store the value, 
i.e. CREATE TABLE foo (col1 int, attname text, attvalue text, col4...n, PRIMARY 
KEY (col1, attname, attvalue));
where attname stores the name of the attribute/column and attvalue stores the 
value of that attribute. I have seen users use this model and create a "main" 
attribute row within a partition that stores the values associated with col4...n.
  - You could store multiple collections.
  - Others probably have ideas as well.
You may want to look in the archives for a similar discussion topic. I believe 
this item was asked a few months ago as well.



Jonathan Lacefield

Solution Architect |(404) 822 3487 | jlacefi...@datastax.com





On Tue, Jan 20, 2015 at 1:40 PM, chetan verma  wrote:

Hi,


I am creating a review system. For instance, let's assume the following are the 
attributes of the system:


Review{
id bigint,
product_id bigint,
created_at timestamp,
summary text,
description text,
pros set,
cons set,
feature_rating map
etc
}
I created partition key as product_id (so that all the reviews for a given 
product will reside on the same node) and clustering key as created_at and id 
(desc) so that reviews will be sorted by time.

Re:Re: Dynamic Columns

2015-01-20 Thread Xu Zhongxing
I approximate dynamic columns by data_key and data_value columns.
Is there a better way to get dynamic columns in CQL 3?

At 2015-01-21 09:41:02, "Peter Lin"  wrote:



I think that table example misses the point of Chetan's functional requirement. 
He actually needs dynamic columns.



On Tue, Jan 20, 2015 at 8:12 PM, Xu Zhongxing  wrote:

Maybe this is the closest thing to "dynamic columns" in CQL 3.


create table review (
    product_id bigint,
    created_at timestamp,
    data_key text,
    data_tvalue text,
    data_ivalue int,
    primary key ((product_id, created_at), data_key)
);


data_tvalue and data_ivalue are optional.


At 2015-01-21 04:44:07, "chetan verma"  wrote:

Hi,


Adding to my previous mail. For example, we have a column family named review 
(with some arbitrary data in maps).


CREATE TABLE review (
    product_id bigint,
    created_at timestamp,
    data_int map<text, int>,
    data_text map<text, text>,
    PRIMARY KEY (product_id, created_at)
);


Assume that I use these 2 maps to store arbitrary data (i.e. data_int and 
data_text for int and text values).
When we see the output on cassandra-cli, it looks, within a partition, like 
:data_int:map_key as the column name with the map value as the value.
Suppose I need to get this single value: I couldn't do that with CQL3, but in 
Thrift it's possible. Any solution?


On Wed, Jan 21, 2015 at 1:06 AM, chetan verma  wrote:

Hi,


Most of the time I will be querying on product_id and created_at, but for 
analytics I need to query on almost all columns.
The multiple-collections idea is good, but the only issue is that Cassandra reads 
a collection in its entirety; what if I need a slice of it, I mean 
the columns for certain keys, which is possible with Thrift? Please suggest.


On Wed, Jan 21, 2015 at 12:36 AM, Jonathan Lacefield  
wrote:

Hello,


There are probably lots of options to this challenge.  The more details around 
your use case that you can provide, the easier it will be for this group to 
offer advice.


A few follow-up questions: 
  - How will you query this data?  
  - Do your queries require filtering on specific columns other than product_id 
and created_at, i.e. the dynamic columns?


Depending on the answers to these questions, you have several options, of which 
here are a few:
  - Cassandra efficiently stores sparse data, so you could create columns and not 
populate them, without much of a penalty.
  - You could use a clustering column to store a column's type and another column 
(potentially clustering) to store the value, 
i.e. CREATE TABLE foo (col1 int, attname text, attvalue text, col4...n, PRIMARY 
KEY (col1, attname, attvalue));
where attname stores the name of the attribute/column and attvalue stores the 
value of that attribute. I have seen users use this model and create a "main" 
attribute row within a partition that stores the values associated with col4...n.
  - You could store multiple collections.
  - Others probably have ideas as well.
You may want to look in the archives for a similar discussion topic. I believe 
this item was asked a few months ago as well.



Jonathan Lacefield

Solution Architect |(404) 822 3487 | jlacefi...@datastax.com





On Tue, Jan 20, 2015 at 1:40 PM, chetan verma  wrote:

Hi,


I am creating a review system. For instance, let's assume the following are the 
attributes of the system:


Review{
id bigint,
product_id bigint,
created_at timestamp,
summary text,
description text,
pros set,
cons set,
feature_rating map
etc
}
I created partition key as product_id (so that all the reviews for a given 
product will reside on the same node)
and clustering key as created_at and id (desc) so that reviews will be sorted 
by time.


I can have more columns, and that requirement I want to fulfil with dynamic 
columns, but there are limitations to it as explained above.
Could you please let me know the best way?


On Tue, Jan 20, 2015 at 11:59 PM, Jonathan Lacefield  
wrote:

Hello,


  Have you looked at solving this challenge with clustering columns?  Also, 
please describe the problem set details for more specific advice from this 
group.


  Starting new projects on Thrift isn't the recommended approach.  


Jonathan



Jonathan Lacefield

Solution Architect |(404) 822 3487 | jlacefi...@datastax.com





On Tue, Jan 20, 2015 at 1:24 PM, chetan verma  wrote:

Hi,


I am starting a new project with Cassandra as the database.
I have unstructured data, so I need dynamic columns. 
In CQL3 we can achieve this via collections, but there are some downsides 
to them:
1. Collections are meant to store small amounts of data.
2. The maximum size of an item in a collection is 64K.
3. Cassandra reads a collection in its entirety.
4. The number of items in a collection is restricted to 64,000.


And there is no support for getting a single column by map key, which is possible 
via cassandra-cli.
Please suggest whether I should use CQL3 or Thrift, and which driver is best.


--

Regards,
Chetan Verma
+91 99860 86634







--

Regards,
Chetan Verma
+91 99860 86634







--

Regards,
Chetan Verma
+91 99860 86634





--

Regards,
Chetan Verma
+91 99860 86634



Re: Dynamic Columns

2015-01-20 Thread Xu Zhongxing
Maybe this is the closest thing to "dynamic columns" in CQL 3.


create table review (
    product_id bigint,
    created_at timestamp,
    data_key text,
    data_tvalue text,
    data_ivalue int,
    primary key ((product_id, created_at), data_key)
);


data_tvalue and data_ivalue are optional.


At 2015-01-21 04:44:07, "chetan verma"  wrote:

Hi,


Adding to my previous mail. For example, we have a column family named review 
(with some arbitrary data in maps).


CREATE TABLE review (
    product_id bigint,
    created_at timestamp,
    data_int map<text, int>,
    data_text map<text, text>,
    PRIMARY KEY (product_id, created_at)
);


Assume that I use these 2 maps to store arbitrary data (i.e. data_int and 
data_text for int and text values).
When we see the output on cassandra-cli, it looks, within a partition, like 
:data_int:map_key as the column name with the map value as the value.
Suppose I need to get this single value: I couldn't do that with CQL3, but in 
Thrift it's possible. Any solution?


On Wed, Jan 21, 2015 at 1:06 AM, chetan verma  wrote:

Hi,


Most of the time I will be querying on product_id and created_at, but for 
analytics I need to query on almost all columns.
The multiple-collections idea is good, but the only issue is that Cassandra reads 
a collection in its entirety; what if I need a slice of it, I mean 
the columns for certain keys, which is possible with Thrift? Please suggest.


On Wed, Jan 21, 2015 at 12:36 AM, Jonathan Lacefield  
wrote:

Hello,


There are probably lots of options to this challenge.  The more details around 
your use case that you can provide, the easier it will be for this group to 
offer advice.


A few follow-up questions: 
  - How will you query this data?  
  - Do your queries require filtering on specific columns other than product_id 
and created_at, i.e. the dynamic columns?


Depending on the answers to these questions, you have several options, of which 
here are a few:
  - Cassandra efficiently stores sparse data, so you could create columns and not 
populate them, without much of a penalty.
  - You could use a clustering column to store a column's type and another column 
(potentially clustering) to store the value, 
i.e. CREATE TABLE foo (col1 int, attname text, attvalue text, col4...n, PRIMARY 
KEY (col1, attname, attvalue));
where attname stores the name of the attribute/column and attvalue stores the 
value of that attribute. I have seen users use this model and create a "main" 
attribute row within a partition that stores the values associated with col4...n.
  - You could store multiple collections.
  - Others probably have ideas as well.
You may want to look in the archives for a similar discussion topic. I believe 
this item was asked a few months ago as well.



Jonathan Lacefield

Solution Architect |(404) 822 3487 | jlacefi...@datastax.com





On Tue, Jan 20, 2015 at 1:40 PM, chetan verma  wrote:

Hi,


I am creating a review system. For instance, let's assume the following are the 
attributes of the system:


Review{
id bigint,
product_id bigint,
created_at timestamp,
summary text,
description text,
pros set,
cons set,
feature_rating map
etc
}
I created partition key as product_id (so that all the reviews for a given 
product will reside on the same node)
and clustering key as created_at and id (desc) so that reviews will be sorted 
by time.


I can have more columns, and that requirement I want to fulfil with dynamic 
columns, but there are limitations to it as explained above.
Could you please let me know the best way?


On Tue, Jan 20, 2015 at 11:59 PM, Jonathan Lacefield  
wrote:

Hello,


  Have you looked at solving this challenge with clustering columns?  Also, 
please describe the problem set details for more specific advice from this 
group.


  Starting new projects on Thrift isn't the recommended approach.  


Jonathan



Jonathan Lacefield

Solution Architect |(404) 822 3487 | jlacefi...@datastax.com





On Tue, Jan 20, 2015 at 1:24 PM, chetan verma  wrote:

Hi,


I am starting a new project with Cassandra as the database.
I have unstructured data, so I need dynamic columns. 
In CQL3 we can achieve this via collections, but there are some downsides 
to them:
1. Collections are meant to store small amounts of data.
2. The maximum size of an item in a collection is 64K.
3. Cassandra reads a collection in its entirety.
4. The number of items in a collection is restricted to 64,000.


And there is no support for getting a single column by map key, which is possible 
via cassandra-cli.
Please suggest whether I should use CQL3 or Thrift, and which driver is best.


--

Regards,
Chetan Verma
+91 99860 86634







--

Regards,
Chetan Verma
+91 99860 86634







--

Regards,
Chetan Verma
+91 99860 86634





--

Regards,
Chetan Verma
+91 99860 86634

Re: nodetool compact cannot remove tombstone in system keyspace

2015-01-13 Thread Xu Zhongxing
Thanks for confirming "the tombstones will only get removed during compaction 
if they are older than GC_Grace_Seconds for that CF". I didn't find such a 
clarification in the documentation. That answered my question.


Since the table that has too many tombstones is in the system keyspace, I 
cannot alter its gc_grace_seconds setting. gc_grace_seconds is now 7 days, 
which is certainly longer than the age of the tombstones. 


Is there any way that I can remove the tombstones in the system keyspace 
immediately?

At 2015-01-13 19:49:47, "Rahul Neelakantan"  wrote:

I am not sure about the tombstone_failure_threshold, but the tombstones will 
only get removed during compaction if they are older than GC_Grace_Seconds for 
that CF. How old are these tombstones?

Rahul

On Jan 12, 2015, at 11:27 PM, Xu Zhongxing  wrote:


Hi,


When I connect to C* with the driver, I see some warnings in the log (I increased 
tombstone_failure_threshold to 15 to see the warning):


WARN [ReadStage:5] 2015-01-13 12:21:14,595 SliceQueryFilter.java (line 225) 
Read 34188 live and 104186 tombstoned cells in system.schema_columns (see 
tombstone_warn_threshold). 2147483387 columns was requested, slices=[-], 
delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
 WARN [ReadStage:5] 2015-01-13 12:21:15,562 SliceQueryFilter.java (line 225) 
Read 34209 live and 104247 tombstoned cells in system.schema_columns (see 
tombstone_warn_threshold). 2147449199 columns was requested, slices=[-], 
delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}


I run the command:
nodetool compact system 


But the tombstone number does not decrease. I still see the warnings with the 
exact same number of tombstones.
Why is this happening? What should I do to remove the tombstones in the system 
keyspace?

nodetool compact cannot remove tombstone in system keyspace

2015-01-12 Thread Xu Zhongxing
Hi,


When I connect to C* with the driver, I see some warnings in the log (I increased 
tombstone_failure_threshold to 15 to see the warning):


WARN [ReadStage:5] 2015-01-13 12:21:14,595 SliceQueryFilter.java (line 225) 
Read 34188 live and 104186 tombstoned cells in system.schema_columns (see 
tombstone_warn_threshold). 2147483387 columns was requested, slices=[-], 
delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
 WARN [ReadStage:5] 2015-01-13 12:21:15,562 SliceQueryFilter.java (line 225) 
Read 34209 live and 104247 tombstoned cells in system.schema_columns (see 
tombstone_warn_threshold). 2147449199 columns was requested, slices=[-], 
delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}


I run the command:
nodetool compact system 


But the tombstone number does not decrease. I still see the warnings with the 
exact same number of tombstones.
Why is this happening? What should I do to remove the tombstones in the system 
keyspace?

Re: CQLSSTableWriter memory leak

2014-06-06 Thread Xu Zhongxing
We figured out the reason for the growing memory usage. When adding rows, the 
flush-to-disk operation is done in SSTableSimpleUnsortedWriter.newRow(). But 
in the compound primary key case, when the partition key stays the same, 
no new row is created, so the single huge row is kept in memory and no disk 
sync() is done.
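
One possible workaround sketch (not something proposed in this thread, just an 
illustration under assumptions): bound the memory by closing the writer and 
opening a new one every N rows, so each writer only ever buffers a slice of the 
big partition and the resulting SSTables get merged later by compaction. The row 
counts and directory below are assumptions:

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.util.UUID;

class ChunkedWriter {
    static final String SCHEMA = "create table test.t (x uuid, y uuid, primary key (x, y))";
    static final String INSERT = "insert into test.t (x, y) values (?, ?)";
    static final long TOTAL_ROWS = 50000000L;      // total clustering keys to write
    static final long ROWS_PER_WRITER = 1000000L;  // assumed chunk size

    static CQLSSTableWriter newWriter() {
        return CQLSSTableWriter.builder()
                .inDirectory("/tmp/test/t")
                .forTable(SCHEMA)
                .withBufferSizeInMB(32)
                .using(INSERT)
                .build();
    }

    public static void main(String[] args) throws Exception {
        UUID partition = UUID.randomUUID();
        CQLSSTableWriter writer = newWriter();
        for (long i = 0; i < TOTAL_ROWS; i++) {
            writer.addRow(partition, UUID.randomUUID());
            // Close and reopen periodically so the buffered partition is flushed
            // to a new SSTable instead of growing without bound in memory.
            if ((i + 1) % ROWS_PER_WRITER == 0) {
                writer.close();
                writer = newWriter();
            }
        }
        writer.close();
    }
}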






On 2014-06-06 at 00:16:13, "Jack Krupansky" wrote:

How many rows (primary key values) are you writing for each partition of the 
primary key? I mean, are there relatively few, or are these very wide 
partitions?
 
Oh, I see! You’re writing 50,000,000 rows to a single partition! My, that IS 
ambitious.
 
-- Jack Krupansky
 
From: Xu Zhongxing
Sent: Thursday, June 5, 2014 3:34 AM
To: user@cassandra.apache.org
Subject: CQLSSTableWriter memory leak
 

I am using Cassandra's CQLSSTableWriter to import a large amount of data into 
Cassandra. When I use CQLSSTableWriter to write to a table with a compound 
primary key, the memory consumption keeps growing. The JVM GC cannot collect 
any of the used memory. When writing to tables with no compound primary key, the 
JVM GC works fine.

My Cassandra version is 2.0.5. The OS is Ubuntu 14.04 x86-64. JVM parameters 
are -Xms1g -Xmx2g. This is sufficient for all other non-compound primary key 
cases.

The problem can be reproduced by the following test case:

import org.apache.cassandra.io.sstable.CQLSSTableWriter;
import org.apache.cassandra.exceptions.InvalidRequestException;

import java.io.IOException;
import java.util.UUID;

class SS {
    public static void main(String[] args) {
        String schema = "create table test.t (x uuid, y uuid, primary key (x, y))";
        String insert = "insert into test.t (x, y) values (?, ?)";
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/test/t")
                .forTable(schema)
                .withBufferSizeInMB(32)
                .using(insert)
                .build();

        // A single partition key with 50,000,000 clustering keys reproduces the issue.
        UUID id = UUID.randomUUID();
        try {
            for (int i = 0; i < 50000000; i++) {
                UUID id2 = UUID.randomUUID();
                writer.addRow(id, id2);
            }

            writer.close();
        } catch (Exception e) {
            System.err.println("hell");
        }
    }
}

Re: CQLSSTableWriter memory leak

2014-06-05 Thread Xu Zhongxing
Is writing too many rows to a single partition the cause of the memory consumption?


What I want to achieve is this: say I have 5 partition IDs, each corresponding to 
50 million IDs. Given a partition ID, I need to get its corresponding 50 
million IDs. Is there another way to design the schema to avoid such a compound 
primary key?


I could use the clustering IDs as the primary key and create an index on the 
partition ID. But is that equivalent to creating another table with compound keys?


At 2014-06-06 00:16:13, "Jack Krupansky"  wrote:
How many rows (primary key values) are you writing for each partition of the 
primary key? I mean, are there relatively few, or are these very wide 
partitions?
 
Oh, I see! You’re writing 50,000,000 rows to a single partition! My, that IS 
ambitious.
 
-- Jack Krupansky
 
From: Xu Zhongxing
Sent: Thursday, June 5, 2014 3:34 AM
To: user@cassandra.apache.org
Subject: CQLSSTableWriter memory leak
 

I am using Cassandra's CQLSSTableWriter to import a large amount of data into 
Cassandra. When I use CQLSSTableWriter to write to a table with a compound 
primary key, the memory consumption keeps growing. The JVM GC cannot collect 
any of the used memory. When writing to tables with no compound primary key, the 
JVM GC works fine.

My Cassandra version is 2.0.5. The OS is Ubuntu 14.04 x86-64. JVM parameters 
are -Xms1g -Xmx2g. This is sufficient for all other non-compound primary key 
cases.

The problem can be reproduced by the following test case:

import org.apache.cassandra.io.sstable.CQLSSTableWriter;
import org.apache.cassandra.exceptions.InvalidRequestException;

import java.io.IOException;
import java.util.UUID;

class SS {
    public static void main(String[] args) {
        String schema = "create table test.t (x uuid, y uuid, primary key (x, y))";
        String insert = "insert into test.t (x, y) values (?, ?)";
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/test/t")
                .forTable(schema)
                .withBufferSizeInMB(32)
                .using(insert)
                .build();

        // A single partition key with 50,000,000 clustering keys reproduces the issue.
        UUID id = UUID.randomUUID();
        try {
            for (int i = 0; i < 50000000; i++) {
                UUID id2 = UUID.randomUUID();
                writer.addRow(id, id2);
            }

            writer.close();
        } catch (Exception e) {
            System.err.println("hell");
        }
    }
}

CQLSSTableWriter memory leak

2014-06-05 Thread Xu Zhongxing
I am using Cassandra's CQLSSTableWriter to import a large amount of data into 
Cassandra. When I use CQLSSTableWriter to write to a table with a compound 
primary key, the memory consumption keeps growing. The JVM GC cannot collect 
any of the used memory. When writing to tables with no compound primary key, the 
JVM GC works fine.

My Cassandra version is 2.0.5. The OS is Ubuntu 14.04 x86-64. JVM parameters 
are -Xms1g -Xmx2g. This is sufficient for all other non-compound primary key 
cases.

The problem can be reproduced by the following test case:

import org.apache.cassandra.io.sstable.CQLSSTableWriter;
import org.apache.cassandra.exceptions.InvalidRequestException;

import java.io.IOException;
import java.util.UUID;

class SS {
    public static void main(String[] args) {
        String schema = "create table test.t (x uuid, y uuid, primary key (x, y))";
        String insert = "insert into test.t (x, y) values (?, ?)";
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/test/t")
                .forTable(schema)
                .withBufferSizeInMB(32)
                .using(insert)
                .build();

        // A single partition key with 50,000,000 clustering keys reproduces the issue.
        UUID id = UUID.randomUUID();
        try {
            for (int i = 0; i < 50000000; i++) {
                UUID id2 = UUID.randomUUID();
                writer.addRow(id, id2);
            }

            writer.close();
        } catch (Exception e) {
            System.err.println("hell");
        }
    }
}