Unable to add columns to empty row in Column family: Cassandra

2011-05-03 Thread anuya joshi
Hello,

I am using Cassandra for my application. My Cassandra client uses the Thrift APIs
directly. The problem I am currently facing is as follows:

1) I added a row and columns to it dynamically via the Thrift API client.
2) Next, I used the command-line client to delete the row, which actually deleted all
the columns in it, leaving an empty row with the original row id.
3) Now, I am trying to add columns dynamically into this empty row, with the same
row key, using the client program.
However, the columns are not being inserted.
But when I tried it from the command-line client, it worked correctly.

Any pointer on this would be of great use.

Thanks in advance,

Regards,
Anuya


Re: Building from source from behind firewall since Maven switch?

2011-05-03 Thread Stephen Connolly
-autoproxy worked for me when I wrote the original patch

but as I no longer work for the company where I wrote the patch, I don't
have a firewall to deal with

Worst case, you might have to create a ~/.m2/settings.xml with the proxy
details... if that is the case, can you raise a JIRA in MANTTASKS (which is
at jira.codehaus.org for hysterical reasons)?

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random nonsense
words and other nonsense are a direct result of using swype to type on the
screen
On 3 May 2011 01:06, Suan-Aik Yeo s...@enovafinancial.com wrote:


Re: Unable to add columns to empty row in Column family: Cassandra

2011-05-03 Thread chovatia jaydeep
Hi Anuya,

 However, columns are not being inserted.


Do you mean that after the insert operation you couldn't retrieve the same 
data? If so, please check the timestamp you used when you reinserted after the delete 
operation. Your second insertion's timestamp has to be greater than the previous 
insertion's.

Thank you,
Jaydeep


From: anuya joshi anu...@gmail.com
To: user@cassandra.apache.org
Sent: Monday, 2 May 2011 11:34 PM
Subject: Re: Unable to add columns to empty row in Column family: Cassandra


Hello,

I am using Cassandra for my application. My Cassandra client uses the Thrift APIs 
directly. The problem I am currently facing is as follows:

1) I added a row and columns to it dynamically via the Thrift API client.
2) Next, I used the command-line client to delete the row, which actually deleted all 
the columns in it, leaving an empty row with the original row id.
3) Now, I am trying to add columns dynamically into this empty row, with the same 
row key, using the client program.
    However, the columns are not being inserted.
    But when I tried it from the command-line client, it worked correctly.

Any pointer on this would be of great use.

Thanks in advance,

Regards,
Anuya

Re: Unable to add columns to empty row in Column family: Cassandra

2011-05-03 Thread chovatia jaydeep
One small correction to my mail below: 
the second insertion's timestamp has to be greater than the delete timestamp in order 
to retrieve the data.
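
For example, driving the raw Thrift API directly (as the original question does), a 
minimal sketch of the delete-then-reinsert sequence with explicit timestamps might 
look like the following. The keyspace, column family, column name, and host are made 
up for illustration, and the 0.7-era generated Python bindings are assumed:

import time
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import Column, ColumnParent, ColumnPath, ConsistencyLevel

def usecs():
    # Cassandra only compares timestamps; microseconds since the epoch is the convention
    return int(time.time() * 1000000)

transport = TTransport.TFramedTransport(TSocket.TSocket('localhost', 9160))
transport.open()
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
client.set_keyspace('Keyspace1')

# Delete the whole row: this writes a tombstone at t_delete.
t_delete = usecs()
client.remove('rowkey', ColumnPath(column_family='MyCF'), t_delete,
              ConsistencyLevel.QUORUM)

# Re-insert: the new column's timestamp must be strictly greater than t_delete,
# otherwise the tombstone shadows the column and reads return nothing.
client.insert('rowkey',
              ColumnParent(column_family='MyCF'),
              Column(name='colname', value='value', timestamp=t_delete + 1),
              ConsistencyLevel.QUORUM)
transport.close()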

Thank you,
Jaydeep


From: chovatia jaydeep chovatia_jayd...@yahoo.co.in
To: user@cassandra.apache.org user@cassandra.apache.org
Sent: Monday, 2 May 2011 11:52 PM
Subject: Re: Unable to add columns to empty row in Column family: Cassandra


Hi Anuya,

 However, columns are not being inserted.


Do you mean that after the insert operation you couldn't retrieve the same 
data? If so, please check the timestamp you used when you reinserted after the delete 
operation. Your second insertion's timestamp has to be greater than the previous 
insertion's.

Thank you,
Jaydeep


From: anuya joshi anu...@gmail.com
To: user@cassandra.apache.org
Sent: Monday, 2 May 2011 11:34 PM
Subject: Re: Unable to add columns to empty row in Column family: Cassandra


Hello,

I am using Cassandra for my application. My Cassandra client uses the Thrift APIs 
directly. The problem I am currently facing is as follows:

1) I added a row and columns to it dynamically via the Thrift API client.
2) Next, I used the command-line client to delete the row, which actually deleted all 
the columns in it, leaving an empty row with the original row id.
3) Now, I am trying to add columns dynamically into this empty row, with the same 
row key, using the client program.
    However, the columns are not being inserted.
    But when I tried it from the command-line client, it worked correctly.

Any pointer on this would be of great use.

Thanks in advance,

Regards,
Anuya

Using snapshot for backup and restore

2011-05-03 Thread Arsene Lee
Hi,

We are trying to use snapshots for backup and restore. We found out that 
a snapshot doesn't include the secondary indexes.
We are wondering why that is. And is there any way we can rebuild the secondary 
indexes?

Regards,

Arsene


One cluster or many?

2011-05-03 Thread David Boxenhorn
If I have a database that partitions naturally into non-overlapping
datasets, in which there are no references between datasets, where each
dataset is quite large (i.e. large enough to merit its own cluster from the
point of view of quantity of data), should I set up one cluster per database
or one large cluster for everything together?

As I see it:

The primary advantage of separate clusters is total isolation: if I have a
problem with one dataset, my application will continue working normally for
all other datasets.

The primary advantage of one big cluster is usage pooling: when one server
goes down in a large cluster it's much less important than when one server
goes down in a small cluster. Also, different temporal usage patterns of the
different datasets (i.e. there will be different peak hours on different
datasets) can be combined to ease capacity requirements.

Any thoughts?


low performance inserting

2011-05-03 Thread charles THIBAULT
Hello everybody,

first: sorry for my English in advance!!

I'm getting started with Cassandra on a 5-node cluster, inserting data
with the pycassa API.

I've read everywhere on the internet that Cassandra's write performance is better than
MySQL's
because writes are only appended to the commit log files.

When I try to insert 100 000 rows with 10 columns per row using a batch
insert, I get this result: 27 seconds.
But with MySQL (load data infile) this takes only 2 seconds (using indexes).

Here is my configuration:

cassandra version: 0.7.5
nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213,
192.168.1.214
seed: 192.168.1.210

My script
*
#!/usr/bin/env python

import pycassa
import time
import random
from cassandra import ttypes

pool = pycassa.connect('test', ['192.168.1.210:9160'])
cf = pycassa.ColumnFamily(pool, 'test')
b = cf.batch(queue_size=50,
             write_consistency_level=ttypes.ConsistencyLevel.ANY)

tps1 = time.time()
for i in range(100000):
    columns = dict()
    for j in range(10):
        columns[str(j)] = str(random.randint(0, 100))
    b.insert(str(i), columns)
b.send()
tps2 = time.time()

print("execution time: " + str(tps2 - tps1) + " seconds")
*

What am I doing wrong?


Problems recovering a dead node

2011-05-03 Thread Héctor Izquierdo Seliva
Hi everyone. One of the nodes in my 6 node cluster died with disk
failures. I have replaced the disks, and it's clean. It has the same
configuration (same ip, same token).

When I try to restart the node it starts to throw mmap underflow
exceptions till it closes again.

I tried setting io to standard, but it still fails. It gives errors
about two decorated keys being different, and the EOFException.

Here is an excerpt of the log

http://pastebin.com/ZXW1wY6T

I can provide more info if needed. I'm at a loss here so any help is
appreciated.

Thanks all for your time

Héctor Izquierdo



Re: Replica data distributing between racks

2011-05-03 Thread aaron morton
I've been digging into this and was able to reproduce something; not 
sure if it's a fault, and I can't work on it any more tonight. 


To reproduce:
- 2 node cluster on my mac book
- set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g. node 1 
with 85070591730234615865843651857942052864 and node 2 
127605887595351923798765477786913079296 
- set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2
- create a keyspace using NTS and strategy_options = [{DC1:1}]

I inserted 10 rows and they were distributed as: 
- node 1 - 9 rows 
- node 2 - 1 row

I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It often 
says the closest token to a key is the node 1 because in effect...

- node 1 is responsible for 0 to 85070591730234615865843651857942052864
- node 2 is responsible for 85070591730234615865843651857942052864 to 
127605887595351923798765477786913079296
- AND node 1 does the wrap around from 127605887595351923798765477786913079296 
to 0 as keys that would insert past the last token in the ring array wrap to 0 
because  insertMin is false. 

Thoughts ? 

Aaron


On 3 May 2011, at 10:29, Eric tamme wrote:

 On Mon, May 2, 2011 at 5:59 PM, aaron morton aa...@thelastpickle.com wrote:
 My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() 
 work.
 
 Eric, can you show the output from nodetool ring ?
 
 
 
 Sorry if the previous paste was way to unformatted, here is a
 pastie.org link with nicer formatting of nodetool ring output than
 plain text email allows.
 
 http://pastie.org/private/50khpakpffjhsmgf66oetg



Re: Using snapshot for backup and restore

2011-05-03 Thread aaron morton
Looking at the code for snapshot, it looks like it does not include 
secondary indexes. And I cannot see a way to manually trigger an index rebuild 
(via CFS.buildSecondaryIndexes()).

Looking at this, it would probably be handy to snapshot them: 
https://issues.apache.org/jira/browse/CASSANDRA-2470

I'm not sure if there is a reason for excluding them. Is this causing a problem 
right now ?

Aaron



On 3 May 2011, at 20:22, Arsene Lee wrote:

 Hi,
  
 We are trying to use snapshots for backup and restore. We found out that 
 a snapshot doesn't include the secondary indexes.
 We are wondering why that is. And is there any way we can rebuild the 
 secondary indexes?
  
 Regards,
  
 Arsene



Write performance help needed

2011-05-03 Thread Steve Smith
I am working for a client that needs to persist 100K-200K records per second
for later querying.  As a proof of concept, we are looking at several
options including nosql (Cassandra and MongoDB).

I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz,
Dual Core/4 logical cores) and have not been happy with the results.

The best I have been able to accomplish is 100K records in approximately 30
seconds.  Each record has 30 columns, mostly made up of integers.  I have
tried both the Hector and Pelops APIs, and have tried writing in batches
versus one at a time.  The times have not varied much.

I am using the out of the box configuration for Cassandra, and while I know
using 1 disk will have an impact on performance, I would expect to see
better write numbers than I am.

As a point of reference, the same test using MongoDB I was able to
accomplish 100K records in 3.5 seconds.

Any tips would be appreciated.

- Steve


Re: low performance inserting

2011-05-03 Thread Roland Gude
Hi,
Not sure this is the case for your bad performance, but you are measuring data 
creation and insertion together. Your data creation involves lots of conversions 
(casts), which are probably quite slow.
Try timing only the b.send() part and see how long that takes. 
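
For example, something along these lines (a sketch that reuses the same keyspace and 
column family as the script quoted below, so only the insert/send part is timed):

import time
import random
import pycassa

pool = pycassa.connect('test', ['192.168.1.210:9160'])
cf = pycassa.ColumnFamily(pool, 'test')
b = cf.batch(queue_size=50)

# Build all the rows up front...
rows = []
for i in range(100000):
    columns = dict((str(j), str(random.randint(0, 100))) for j in range(10))
    rows.append((str(i), columns))

# ...then time only the part that talks to the cluster.
t0 = time.time()
for key, columns in rows:
    b.insert(key, columns)   # flushed to the cluster every 50 mutations
b.send()                     # flush the remainder
print("send time: " + str(time.time() - t0) + " seconds")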

Roland

Am 03.05.2011 um 12:30 schrieb charles THIBAULT charl.thiba...@gmail.com:

 Hello everybody, 
 
 first: sorry for my english in advance!!
 
 I'm getting started with Cassandra on a 5 nodes cluster inserting data
 with the pycassa API.
 
 I've read everywere on internet that cassandra's performance are better than 
 MySQL
 because of the writes append's only into commit logs files.
 
 When i'm trying to insert 100 000 rows with 10 columns per row with batch 
 insert, I'v this result: 27 seconds
 But with MySQL (load data infile) this take only 2 seconds (using indexes)
 
 Here my configuration
 
 cassandra version: 0.7.5
 nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213, 
 192.168.1.214
 seed: 192.168.1.210
 
 My script
 *
 #!/usr/bin/env python
 
 import pycassa
 import time
 import random
 from cassandra import ttypes
 
 pool = pycassa.connect('test', ['192.168.1.210:9160'])
 cf = pycassa.ColumnFamily(pool, 'test')
 b = cf.batch(queue_size=50, 
 write_consistency_level=ttypes.ConsistencyLevel.ANY)
 
 tps1 = time.time()
 for i in range(100000):
     columns = dict()
     for j in range(10):
         columns[str(j)] = str(random.randint(0, 100))
     b.insert(str(i), columns)
 b.send()
 tps2 = time.time()
 
 
 print("execution time: " + str(tps2 - tps1) + " seconds")
 *
 
 what I'm doing rong ?


Re: Write performance help needed

2011-05-03 Thread Eric tamme
Use more nodes to increase your write throughput.  Testing on a single
machine is not really a viable benchmark for what you can achieve with
cassandra.


RE: Using snapshot for backup and restore

2011-05-03 Thread Arsene Lee
If snapshot doesn't include secondary indexes, then we can't use it for our 
backup and restore procedure.
This means we would need to stop our service when we want to do backups, and this 
would cause longer system downtime.

If there is no particular reason, it is probably a good idea to also include 
secondary indexes when taking the snapshot.


Arsene


From: aaron morton [aa...@thelastpickle.com]
Sent: Tuesday, May 03, 2011 7:28 PM
To: user@cassandra.apache.org
Subject: Re: Using snapshot for backup and restore

Looking at the code for the snapshot it looks like it does not include 
secondary indexes. And I cannot see a way to manually trigger an index rebuild 
(via CFS.buildSecondaryIndexes())

Looking at this it's probably handy to snapshot them 
https://issues.apache.org/jira/browse/CASSANDRA-2470

I'm not sure if there is a reason for excluding them. Is this causing a problem 
right now ?

Aaron



On 3 May 2011, at 20:22, Arsene Lee wrote:

Hi,

We are trying to use snapshots for backup and restore. We found out that 
a snapshot doesn't include the secondary indexes.
We are wondering why that is. And is there any way we can rebuild the secondary 
indexes?

Regards,

Arsene




Re: low performance inserting

2011-05-03 Thread Sylvain Lebresne
There is probably a fair number of things you'd have to make sure you do to
improve the write performance on the Cassandra side (starting by using multiple
threads to do the insertion), but the first thing is probably to start
comparing things
that are at least mildly comparable. If you do inserts in Cassandra,
you should try
to do inserts in MySQL too, not load data infile (which really is
just a bulk loading
utility). And as stated here
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html:
"When loading a table from a text file, use LOAD DATA INFILE. This is
usually 20 times
faster than using INSERT statements."
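
As a rough sketch of the multi-threading suggestion above (it reuses the keyspace and
column family from the original script; the thread count and the one-connection-per-thread
approach are illustrative, not tuned):

import threading
import time
import random
import pycassa

SERVERS = ['192.168.1.210:9160', '192.168.1.211:9160']
NUM_THREADS = 8
TOTAL_ROWS = 100000
ROWS_PER_THREAD = TOTAL_ROWS // NUM_THREADS

def worker(offset):
    # one connection and one batch per thread keeps thread-safety questions out of the way
    pool = pycassa.connect('test', SERVERS)
    cf = pycassa.ColumnFamily(pool, 'test')
    b = cf.batch(queue_size=50)
    for i in range(offset, offset + ROWS_PER_THREAD):
        columns = dict((str(j), str(random.randint(0, 100))) for j in range(10))
        b.insert(str(i), columns)
    b.send()

t0 = time.time()
threads = [threading.Thread(target=worker, args=(n * ROWS_PER_THREAD,))
           for n in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("execution time: " + str(time.time() - t0) + " seconds")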

--
Sylvain

On Tue, May 3, 2011 at 12:30 PM, charles THIBAULT
charl.thiba...@gmail.com wrote:
 Hello everybody,

 first: sorry for my english in advance!!

 I'm getting started with Cassandra on a 5 nodes cluster inserting data
 with the pycassa API.

 I've read everywere on internet that cassandra's performance are better than
 MySQL
 because of the writes append's only into commit logs files.

 When i'm trying to insert 100 000 rows with 10 columns per row with batch
 insert, I'v this result: 27 seconds
 But with MySQL (load data infile) this take only 2 seconds (using indexes)

 Here my configuration

 cassandra version: 0.7.5
 nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213,
 192.168.1.214
 seed: 192.168.1.210

 My script
 *
 #!/usr/bin/env python

 import pycassa
 import time
 import random
 from cassandra import ttypes

 pool = pycassa.connect('test', ['192.168.1.210:9160'])
 cf = pycassa.ColumnFamily(pool, 'test')
 b = cf.batch(queue_size=50,
 write_consistency_level=ttypes.ConsistencyLevel.ANY)

  tps1 = time.time()
  for i in range(100000):
      columns = dict()
      for j in range(10):
          columns[str(j)] = str(random.randint(0, 100))
      b.insert(str(i), columns)
  b.send()
  tps2 = time.time()


  print("execution time: " + str(tps2 - tps1) + " seconds")
 *

 what I'm doing rong ?



Re: Replica data distributing between racks

2011-05-03 Thread Jonathan Ellis
Right, when you are computing balanced RP tokens for NTS you need to
compute the tokens for each DC independently.
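
A minimal sketch of that per-DC calculation (the node counts and the offset are
illustrative; the only hard requirement is that no two nodes in the cluster end up
with exactly the same token):

# Balanced RandomPartitioner tokens, computed independently for each data center.
RP_RANGE = 2 ** 127   # RandomPartitioner tokens live in 0 .. 2**127

def balanced_tokens(node_count, offset=0):
    return [(i * RP_RANGE // node_count + offset) % RP_RANGE
            for i in range(node_count)]

dc1_tokens = balanced_tokens(10)           # 0, 17014118346046923173168730371588410572, ...
dc2_tokens = balanced_tokens(5, offset=1)  # same even spacing, nudged so tokens stay unique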

On Tue, May 3, 2011 at 6:23 AM, aaron morton aa...@thelastpickle.com wrote:
 I've been digging into this and worked was able to reproduce something, not 
 sure if it's a fault and I can't work on it any more tonight.


 To reproduce:
 - 2 node cluster on my mac book
 - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g. node 
 1 with 85070591730234615865843651857942052864 and node 2 
 127605887595351923798765477786913079296
 - set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2
 - create a keyspace using NTS and strategy_options = [{DC1:1}]

 Inserted 10 rows they were distributed as
 - node 1 - 9 rows
 - node 2 - 1 row

 I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It 
 often says the closest token to a key is the node 1 because in effect...

 - node 1 is responsible for 0 to 85070591730234615865843651857942052864
 - node 2 is responsible for 85070591730234615865843651857942052864 to 
 127605887595351923798765477786913079296
 - AND node 1 does the wrap around from 
 127605887595351923798765477786913079296 to 0 as keys that would insert past 
 the last token in the ring array wrap to 0 because  insertMin is false.

 Thoughts ?

 Aaron


 On 3 May 2011, at 10:29, Eric tamme wrote:

 On Mon, May 2, 2011 at 5:59 PM, aaron morton aa...@thelastpickle.com wrote:
 My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() 
 work.

 Eric, can you show the output from nodetool ring ?



 Sorry if the previous paste was way to unformatted, here is a
 pastie.org link with nicer formatting of nodetool ring output than
 plain text email allows.

 http://pastie.org/private/50khpakpffjhsmgf66oetg





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Write performance help needed

2011-05-03 Thread Jonathan Ellis
You don't give many details, but I would guess:

- your benchmark is not multithreaded
- mongodb is not configured for durable writes, so you're really only
measuring the time for it to buffer it in memory
- you haven't loaded enough data to hit the "mongo's index doesn't fit in
memory anymore" point

On Tue, May 3, 2011 at 8:24 AM, Steve Smith stevenpsmith...@gmail.com wrote:
 I am working for client that needs to persist 100K-200K records per second
 for later querying.  As a proof of concept, we are looking at several
 options including nosql (Cassandra and MongoDB).
 I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz,
 Dual Core/4 logical cores) and have not been happy with the results.
 The best I have been able to accomplish is 100K records in approximately 30
 seconds.  Each record has 30 columns, mostly made up of integers.  I have
 tried both the Hector and Pelops APIs, and have tried writing in batches
 versus one at a time.  The times have not varied much.
 I am using the out of the box configuration for Cassandra, and while I know
 using 1 disk will have an impact on performance, I would expect to see
 better write numbers than I am.
 As a point of reference, the same test using MongoDB I was able to
 accomplish 100K records in 3.5 seconds.
 Any tips would be appreciated.

 - Steve




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


RE: Replica data distributing between racks

2011-05-03 Thread Jeremiah Jordan
So we are currently running a 10 node ring in one DC, and we are going to be 
adding 5 more nodes
in another DC.  To keep the rings in each DC balanced, should I really 
calculate the tokens independently
and just make sure none of them are the same? Something like:

DC1 (RF 5):
1:  0
2:  17014118346046923173168730371588410572
3:  34028236692093846346337460743176821144
4:  51042355038140769519506191114765231716
5:  68056473384187692692674921486353642288
6:  85070591730234615865843651857942052860
7:  102084710076281539039012382229530463432
8:  119098828422328462212181112601118874004
9:  136112946768375385385349842972707284576
10: 153127065114422308558518573344295695148

DC2 (RF 3):
1:  1 (one off from DC1 node 1)
2:  34028236692093846346337460743176821145 (one off from DC1 node 3)
3:  68056473384187692692674921486353642290 (two off from DC1 node 5)
4:  102084710076281539039012382229530463435 (three off from DC1 node 7)
5:  136112946768375385385349842972707284580 (four off from DC1 node 9)

Originally I was thinking I should spread the DC2 nodes evenly in between every 
other DC1 node.
Or does it not matter where they are with respect to the DC1 nodes, as long as 
they fall somewhere
after every other DC1 node? So it is DC1-1, DC2-1, DC1-2, DC1-3, DC2-2, DC1-4, 
DC1-5...

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Tuesday, May 03, 2011 9:14 AM
To: user@cassandra.apache.org
Subject: Re: Replica data distributing between racks

Right, when you are computing balanced RP tokens for NTS you need to compute 
the tokens for each DC independently.

On Tue, May 3, 2011 at 6:23 AM, aaron morton aa...@thelastpickle.com wrote:
 I've been digging into this and worked was able to reproduce something, not 
 sure if it's a fault and I can't work on it any more tonight.


 To reproduce:
 - 2 node cluster on my mac book
 - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, 
 e.g. node 1 with 85070591730234615865843651857942052864 and node 2 
 127605887595351923798765477786913079296
 - set cassandra-topology.properties to put the nodes in DC1 on RAC1 
 and RAC2
 - create a keyspace using NTS and strategy_options = [{DC1:1}]

 Inserted 10 rows they were distributed as
 - node 1 - 9 rows
 - node 2 - 1 row

 I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It 
 often says the closest token to a key is the node 1 because in effect...

 - node 1 is responsible for 0 to 
 85070591730234615865843651857942052864
 - node 2 is responsible for 85070591730234615865843651857942052864 to 
 127605887595351923798765477786913079296
 - AND node 1 does the wrap around from 
 127605887595351923798765477786913079296 to 0 as keys that would insert past 
 the last token in the ring array wrap to 0 because  insertMin is false.

 Thoughts ?

 Aaron


 On 3 May 2011, at 10:29, Eric tamme wrote:

 On Mon, May 2, 2011 at 5:59 PM, aaron morton aa...@thelastpickle.com wrote:
 My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() 
 work.

 Eric, can you show the output from nodetool ring ?



 Sorry if the previous paste was way to unformatted, here is a 
 pastie.org link with nicer formatting of nodetool ring output than 
 plain text email allows.

 http://pastie.org/private/50khpakpffjhsmgf66oetg





--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support 
http://www.datastax.com


IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
Running a 3 node cluster with cassandra-0.8.0-beta1 

I'm seeing the first node log lines like the following many (thousands of) times:


Caused by: java.io.IOException: Unable to create hard link
from 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db
 to 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db
 (errno 17)


This seems to happen for all column families (including system).
It happens a lot during startup.

The hardlinks do exist. Stopping, deleting the hardlinks, and starting
again does not help.

But I haven't seen it once on the other nodes...

~mck


ps the stacktrace


java.io.IOError: java.io.IOException: Unable to create hard link from 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 to 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 (errno 17)
at 
org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1629)
at 
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1654)
at org.apache.cassandra.db.Table.snapshot(Table.java:198)
at 
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:504)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Unable to create hard link from 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 to 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 (errno 17)
at org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:155)
at 
org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:713)
at 
org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1622)
... 10 more





Re: Replica data distributing between racks

2011-05-03 Thread Eric tamme
On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis jbel...@gmail.com wrote:
 Right, when you are computing balanced RP tokens for NTS you need to
 compute the tokens for each DC independently.

I am confused ... sorry.  Are you saying that ... I need to change how
my keys are calculated to fix this problem?  Or are you talking about
the implementation of how replication selects a token?

-Eric


Re: low performance inserting

2011-05-03 Thread charles THIBAULT
Hi Sylvain,

thanks for your answer.

I did a test with the stress utility, inserting 100 000 rows with 10
columns per row.
I used these options: -o insert -t 5 -n 100000 -c 10 -d
192.168.1.210,192.168.1.211,...
result: 161 seconds

with MySQL using inserts (after a dump): 1.79 seconds

Charles

2011/5/3 Sylvain Lebresne sylv...@datastax.com

 There is probably a fair number of things you'd have to make sure you do to
 improve the write performance on the Cassandra side (starting by using
 multiple
 threads to do the insertion), but the first thing is probably to start
 comparing things
 that are at least mildly comparable. If you do inserts in Cassandra,
 you should try
 to do inserts in MySQL too, not load data infile (which really is
 just a bulk loading
 utility). And as stated here
 http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html:
 When loading a table from a text file, use LOAD DATA INFILE. This is
 usually 20 times
 faster than using INSERT statements.

 --
 Sylvain

 On Tue, May 3, 2011 at 12:30 PM, charles THIBAULT
 charl.thiba...@gmail.com wrote:
  Hello everybody,
 
  first: sorry for my english in advance!!
 
  I'm getting started with Cassandra on a 5 nodes cluster inserting data
  with the pycassa API.
 
  I've read everywere on internet that cassandra's performance are better
 than
  MySQL
  because of the writes append's only into commit logs files.
 
  When i'm trying to insert 100 000 rows with 10 columns per row with batch
  insert, I'v this result: 27 seconds
  But with MySQL (load data infile) this take only 2 seconds (using
 indexes)
 
  Here my configuration
 
  cassandra version: 0.7.5
  nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213,
  192.168.1.214
  seed: 192.168.1.210
 
  My script
 
 *
  #!/usr/bin/env python
 
  import pycassa
  import time
  import random
  from cassandra import ttypes
 
  pool = pycassa.connect('test', ['192.168.1.210:9160'])
  cf = pycassa.ColumnFamily(pool, 'test')
  b = cf.batch(queue_size=50,
  write_consistency_level=ttypes.ConsistencyLevel.ANY)
 
  tps1 = time.time()
  for i in range(100000):
      columns = dict()
      for j in range(10):
          columns[str(j)] = str(random.randint(0, 100))
      b.insert(str(i), columns)
  b.send()
  tps2 = time.time()


  print("execution time: " + str(tps2 - tps1) + " seconds")
 
 *
 
  what I'm doing rong ?
 



Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

2011-05-03 Thread Henrik Schröder
Hey everyone,

We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
just to make sure that the change in how keys are encoded wouldn't cause us
any dataloss. Unfortunately it seems that rows stored under a unicode key
couldn't be retrieved after the upgrade. We're running everything on
Windows, and we're using the generated thrift client in C# to access it.

I managed to make a minimal test to reproduce the error consistently:

First, I started up Cassandra 0.6.13 with an empty data directory, and a
really simple config with a single keyspace with a single bytestype
columnfamily.
I wrote two rows, each with a single column with a simple column name and a
1-byte value of 1. The first row had a key using only ascii chars ('foo'),
and the second row had a key using unicode chars ('ドメインウ').

Using multi_get, and both those keys, I got both columns back, as expected.
Using multi_get_slice and both those keys, I got both columns back, as
expected.
I also did a get_range_slices to get all rows in the columnfamily, and I got
both columns back, as expected.

So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
Cassandra 0.7.5, pointing to the same data directory, with a config
containing the same keyspace, and I run the schematool import command.

I then start up my test program that uses the new thrift api, and run some
commands.

Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
only get back one column, the one under the key 'foo'. The other row I
simply can't retrieve.

However, when I use get_range_slices to get all rows, I get back two rows,
with the correct column values, and the byte-array keys are identical to my
encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back
my two original keys. This means that both my rows are still there, the keys
as output by Cassandra are identical to the original string keys I used when
I created the rows in 0.6.13, but it's just impossible to retrieve the
second row.

To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8
again, and gave it a similar column as the original, but with a 1-byte value
of 2.

Now, when I use multi_get_slice with my two encoded keys, I get back two
rows, the 'foo' row has the old value as expected, and the other row has the
new value as expected.

However, when I use get_range_slices to get all rows, I get back *three*
rows, two of which have the *exact same* byte-array key, one has the old
column, one has the new column.


How is this possible? How can there be two different rows with the exact
same key? I'm guessing that it's related to the encoding of string keys in
0.6, and that the internal representation is off somehow. I checked the
generated thrift client for 0.6, and it UTF8-encodes all keys before sending
them to the server, so it should be UTF8 all the way, but apparently it
isn't.

Has anyone else experienced the same problem? Is it a platform-specific
problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
lose any rows? I would also really like to know which byte-array I should
send in to get back that second row, there's gotta be some key that can be
used to get it, the row is still there after all.
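
One way to see exactly which key bytes Cassandra has stored is to dump them from a
range slice with repr() and compare them with the UTF-8 encoding. A small sketch
against the raw Thrift API (it assumes an already opened 0.7 client, as in the earlier
sketch in this digest, and an illustrative column family name):

# -*- coding: utf-8 -*-
from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              KeyRange, ConsistencyLevel)

expected = u'ドメインウ'.encode('utf-8')
parent = ColumnParent(column_family='MyCF')
predicate = SlicePredicate(slice_range=SliceRange(start='', finish='', count=10))
key_range = KeyRange(start_key='', end_key='', count=1000)

for key_slice in client.get_range_slices(parent, predicate, key_range,
                                         ConsistencyLevel.QUORUM):
    # repr() shows the raw bytes, so any invisible difference stands out
    print("%r -> %s" % (key_slice.key, key_slice.key == expected))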


/Henrik Schröder


Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

2011-05-03 Thread Henrik Schröder
The way we solved this problem is that it turned out we had only a few
hundred rows with unicode keys, so we simply extracted them, upgraded to
0.7, and wrote them back. However, this means that among the rows, there are
a few hundred weird duplicate rows with identical keys.

Is this going to be a problem in the future? Is there a chance that the good
duplicate is cleaned out in favour of the bad duplicate, so that we suddenly
lose those rows again?


/Henrik Schröder


Re: One cluster or many?

2011-05-03 Thread Jonathan Ellis
I would add that running one cluster is operationally less work than
running multiple.

On Tue, May 3, 2011 at 4:15 AM, David Boxenhorn da...@taotown.com wrote:
 If I have a database that partitions naturally into non-overlapping
 datasets, in which there are no references between datasets, where each
 dataset is quite large (i.e. large enough to merit its own cluster from the
 point of view of quantity of data), should I set up one cluster per database
 or one large cluster for everything together?

 As I see it:

 The primary advantage of separate clusters is total isolation: if I have a
 problem with one dataset, my application will continue working normally for
 all other datasets.

 The primary advantage of one big cluster is usage pooling: when one server
 goes down in a large cluster it's much less important than when one server
 goes down in a small cluster. Also, different temporal usage patterns of the
 different datasets (i.e. there will be different peak hours on different
 datasets) can be combined to ease capacity requirements.

 Any thoughts?




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Experiences with MapReduce Stress Tests

2011-05-03 Thread Jeremy Hanna
Writing to Cassandra from map/reduce jobs over HDFS shouldn't be a problem.  
We're doing it in our cluster and I know of others doing the same thing.  You 
might just make sure the number of reducers (or mappers) writing to cassandra 
don't overwhelm it.  There's no data locality for writes, though a cassandra 
specific partitioner might help with that in the future.  See CASSANDRA-1473 - 
https://issues.apache.org/jira/browse/CASSANDRA-1473.

I apologize that I misspoke about one of the settings.  The batch size is in 
fact the number of rows it gets each time.  The input splits just affects how 
many mappers it splits the data into.

As far as recommending this solution, it really depends on the problem.  The 
people I know doing what you're thinking of doing typically store raw data in 
HDFS, perform mapreduce jobs over that data and output the results into 
Cassandra for realtime queries.

We're using it where I work for storage and analytics both.  We store raw data 
into S3/HDFS, mapreduce over that data and output into cassandra, then perform 
realtime queries as well as analytics over that data.  If you want to run 
analytics over Cassandra data, you'll want to partition your cluster so that 
mapreduce jobs don't affect the realtime performance.

On May 3, 2011, at 3:19 AM, Subscriber wrote:

 Hi Jeremy, 
 
 yes, the setup on the data-nodes is:
   - Hadoop DataNode
   - Hadoop TaskTracker
   - CassandraDaemon
 
 However - the map-input is not read from Cassandra. I am running a writing 
 stress test - no reads (well from time to time I check the produced items 
 using cassandra-cli).
 Is it possible to achieve data-locality on writes? Well I think that this is 
 (in practice) not possible (one could create some artificial data that 
 correlates with the hashed row-key values or so ... ;-)
 
 Thanks for all your tips and hints! It's good see that someone worries about 
 my problems :-)
 But - to be honest - my number one priority is not to get this test running 
 but to answer the question whether the setup Cassandra+Hadoop with massive 
 parallel writes (using map/reduce) meets the demands of our customer.
 
 I found out that the following configuration helps a lot. 
 * disk_access_mode: standard 
 * MAX_HEAP_SIZE=4G
 * HEAP_NEWSIZE=400M
 * rpc_timeout_in_ms: 20000
 
 Now the stress test runs through, but there are still timeouts (Hadoop 
 reschedules the failing mapper tasks on another node and so the test runs 
 through).
 But what causes this timeouts? 20 seconds are a long time for a modern cpu 
 (and an eternity for an android ;-) 
 
 It seems to me that it's not only the massive amount of data or to many 
 parallel mappers, because Cassandra can handle this huge write rate over one 
 hour! 
 I found in the system.logs that the ConcurrentMarkSweeps take quite long (up 
 to 8 seconds). The heap size didn't grow much about 3GB so there was still 
 enough air to breath.
 
 So the question remains: can I recommend this setup?
 
 Thanks again and best regards
 Udo
 
 
 Am 02.05.2011 um 20:21 schrieb Jeremy Hanna:
 
 Udo,
 
 One thing to get out of the way - you're running task trackers on all of 
 your cassandra nodes, right?  That is the first and foremost way to get good 
 performance.  Otherwise you don't have data locality, which is really the 
 point of map/reduce, co-locating your data and your processes operating over 
 that data.  You're probably already doing that, but I had forgotten to ask 
 that before.
 
 Besides that...
 
 You might try messing with those values a bit more as well as the input 
 split size - cassandra.input.split.size which defaults to ~65k.  So you 
 might try rpc timeout of 30s just to see if that helps and try reducing the 
 input split size significantly to see if that helps.
 
 For your setup I don't see the range batch size as being meaningful at all 
 with your narrow rows, so don't worry about that.
 
 Also, the capacity of your nodes and the number of mappers/reducers you're 
 trying to use will also have an effect on whether it has to timeout.  
 Essentially it's getting overwhelmed for some reason.  You might lower the 
 number of mappers and reducers you're hitting your cassandra cluster with to 
 see if that helps.
 
 Jeremy
 
 On May 2, 2011, at 6:25 AM, Subscriber wrote:
 
 Hi Jeremy, 
 
 thanks for the link.
 I doubled the rpc_timeout (20 seconds) and reduced the range-batch-size to 
 2048, but I still get timeouts...
 
 Udo
 
 Am 29.04.2011 um 18:53 schrieb Jeremy Hanna:
 
 It sounds like there might be some tuning you can do to your jobs - take a 
 look at the wiki's HadoopSupport page, specifically the Troubleshooting 
 section:
 http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting
 
 On Apr 29, 2011, at 11:45 AM, Subscriber wrote:
 
 Hi all, 
 
 We want to share our experiences we got during our Cassandra plus Hadoop 
 Map/Reduce evaluation.
 Our question was whether Cassandra is suitable for massive distributed 
 data 

Re: Range Slice Issue

2011-05-03 Thread Jonathan Ellis
Do you still see this behavior if you disable dynamic snitch?

On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
adam.sered...@serialssolutions.com wrote:
 We appear to have encountered an issue with cassandra 0.7.5 after upgrading
 from 0.7.2. While doing a batch read using a get_range_slice against the
 ranges an individual node is master for we are able to reproduce
 consistently that the last two nodes in the ring, regardless of the ring
 size (we have a 60 node production cluster and a 12 node test cluster)
 perform this read over the network using replicas instead of executing locally.
 Every other node in the ring successfully reads locally.
 To be sure there were no data consistency issues we performed a nodetool
 repair against both of these nodes and the issue persists. We also tried
 truncating the column family and repopulating, but the issue remains.
 This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
 data locally if it is available there. We
 use Cassandra.Client.describe_ring() to figure out which machine in the
 ring is master for which TokenRange. I then compare the master for
 each TokenRange against the localhost to find out which token ranges
 are owned by the local machine (remote reads are too slow for this type
 of batch processing). Once I know which TokenRanges are on
 each machine locally I get evenly sized splits using
 Cassandra.Client.describe_splits().

 Adam
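
A rough sketch of the ring/splits bookkeeping described above, in Python against the
raw Thrift API (it assumes an already opened client, treats endpoints[0] as the range's
primary replica, uses a 0.7-style describe_splits signature, and all names are
illustrative):

import socket

# Addresses this host answers to; describe_ring returns endpoint IPs.
local_ips = set(['127.0.0.1', socket.gethostbyname(socket.gethostname())])

local_splits = []
for token_range in client.describe_ring('MyKeyspace'):
    if token_range.endpoints[0] not in local_ips:
        continue  # another node is the primary replica for this range
    # describe_splits returns a list of boundary tokens; consecutive pairs
    # delimit roughly evenly sized sub-ranges of ~keys_per_split keys each.
    boundaries = client.describe_splits('MyCF',
                                        token_range.start_token,
                                        token_range.end_token,
                                        65536)
    local_splits.extend(zip(boundaries[:-1], boundaries[1:]))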




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Range Slice Issue

2011-05-03 Thread Serediuk, Adam
I just ran a test and we do not see that behavior with dynamic snitch disabled. 
All nodes appear to be doing local reads as expected.


On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote:

 Do you still see this behavior if you disable dynamic snitch?
 
 On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
 adam.sered...@serialssolutions.com wrote:
 We appear to have encountered an issue with cassandra 0.7.5 after upgrading
 from 0.7.2. While doing a batch read using a get_range_slice against the
 ranges an individual node is master for we are able to reproduce
 consistently that the last two nodes in the ring, regardless of the ring
 size (we have a 60 node production cluster and a 12 node test cluster)
 perform this read over the network using replicas of executing locally.
 Every other node in the ring successfully reads locally.
 To be sure there were no data consistency issues we performed a nodetool
 repair against both of these nodes and the issue persists. We also tried
 truncating the column family and repopulating, but the issue remains.
 This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
 data locally if it is available there. We
 use Cassandra.Client.describe_ring() to figure out which machine in the
 ring is master for which TokenRange. I then compare the master for
 each TokenRange against the localhost to find out which token ranges
 are owned by the local machine (remote reads are too slow for this type
 of batch processing). Once I know which TokenRanges are on
 each machine locally I get evenly sized splits using
 Cassandra.Client.describe_splits().
 
 Adam
 
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 




Re: Range Slice Issue

2011-05-03 Thread Jonathan Ellis
So either (a) dynamic snitch is wrong or (b) those nodes really are
more heavily loaded than the others, and are correctly pushing queries
to other replicas.

On Tue, May 3, 2011 at 12:47 PM, Serediuk, Adam
adam.sered...@serialssolutions.com wrote:
 I just ran a test and we do not see that behavior with dynamic snitch 
 disabled. All nodes appear to be doing local reads as expected.


 On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote:

 Do you still see this behavior if you disable dynamic snitch?

 On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
 adam.sered...@serialssolutions.com wrote:
 We appear to have encountered an issue with cassandra 0.7.5 after upgrading
 from 0.7.2. While doing a batch read using a get_range_slice against the
 ranges an individual node is master for we are able to reproduce
 consistently that the last two nodes in the ring, regardless of the ring
 size (we have a 60 node production cluster and a 12 node test cluster)
 perform this read over the network using replicas of executing locally.
 Every other node in the ring successfully reads locally.
 To be sure there were no data consistency issues we performed a nodetool
 repair against both of these nodes and the issue persists. We also tried
 truncating the column family and repopulating, but the issue remains.
 This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
 data locally if it is available there. We
 use Cassandra.Client.describe_ring() to figure out which machine in the
 ring is master for which TokenRange. I then compare the master for
 each TokenRange against the localhost to find out which token ranges
 are owned by the local machine (remote reads are too slow for this type
 of batch processing). Once I know which TokenRanges are on
 each machine locally I get evenly sized splits using
 Cassandra.Client.describe_splits().

 Adam




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com







-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Cassandra 0.8 beta trunk from about 1 week ago:

Pool Name                    Active   Pending      Completed
ReadStage                         0         0              5
RequestResponseStage              0         0          87129
MutationStage                     0         0         187298
ReadRepairStage                   0         0              0
ReplicateOnWriteStage             0         0              0
GossipStage                       0         0        1353524
AntiEntropyStage                  0         0              0
MigrationStage                    0         0             10
MemtablePostFlusher               1       190            108
StreamStage                       0         0              0
FlushWriter                       0         0            302
FILEUTILS-DELETE-POOL             0         0             26
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              0
HintedHandoff                     1         4              7


Anyone with nice theories about the pending value on the memtable post
flusher?

Regards,
Terje


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
Does it resolve down to 0 eventually if you stop doing writes?

On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Cassandra 0.8 beta trunk from about 1 week ago:
 Pool Name                    Active   Pending      Completed
 ReadStage                         0         0              5
 RequestResponseStage              0         0          87129
 MutationStage                     0         0         187298
 ReadRepairStage                   0         0              0
 ReplicateOnWriteStage             0         0              0
 GossipStage                       0         0        1353524
 AntiEntropyStage                  0         0              0
 MigrationStage                    0         0             10
 MemtablePostFlusher               1       190            108
 StreamStage                       0         0              0
 FlushWriter                       0         0            302
 FILEUTILS-DELETE-POOL             0         0             26
 MiscStage                         0         0              0
 FlushSorter                       0         0              0
 InternalResponseStage             0         0              0
 HintedHandoff                     1         4              7

 Anyone with nice theories about the pending value on the memtable post
 flusher?
 Regards,
 Terje



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
... and are there any exceptions in the log?

On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote:
 Does it resolve down to 0 eventually if you stop doing writes?

 On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:
 Cassandra 0.8 beta trunk from about 1 week ago:
 Pool Name                    Active   Pending      Completed
 ReadStage                         0         0              5
 RequestResponseStage              0         0          87129
 MutationStage                     0         0         187298
 ReadRepairStage                   0         0              0
 ReplicateOnWriteStage             0         0              0
 GossipStage                       0         0        1353524
 AntiEntropyStage                  0         0              0
 MigrationStage                    0         0             10
 MemtablePostFlusher               1       190            108
 StreamStage                       0         0              0
 FlushWriter                       0         0            302
 FILEUTILS-DELETE-POOL             0         0             26
 MiscStage                         0         0              0
 FlushSorter                       0         0              0
 InternalResponseStage             0         0              0
 HintedHandoff                     1         4              7

 Anyone with nice theories about the pending value on the memtable post
 flusher?
 Regards,
 Terje



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Range Slice Issue

2011-05-03 Thread Serediuk, Adam
Both data and system load are equal across all nodes, and the smaller test 
cluster also exhibits the same issue. Tokens are balanced and total node size 
is equivalent.

On May 3, 2011, at 10:51 AM, Jonathan Ellis wrote:

 So either (a) dynamic snitch is wrong or (b) those nodes really are
 more heavily loaded than the others, and are correctly pushing queries
 to other replicas.
 
 On Tue, May 3, 2011 at 12:47 PM, Serediuk, Adam
 adam.sered...@serialssolutions.com wrote:
 I just ran a test and we do not see that behavior with dynamic snitch 
 disabled. All nodes appear to be doing local reads as expected.
 
 
 On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote:
 
 Do you still see this behavior if you disable dynamic snitch?
 
 On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
 adam.sered...@serialssolutions.com wrote:
 We appear to have encountered an issue with cassandra 0.7.5 after upgrading
 from 0.7.2. While doing a batch read using a get_range_slice against the
 ranges an individual node is master for we are able to reproduce
 consistently that the last two nodes in the ring, regardless of the ring
 size (we have a 60 node production cluster and a 12 node test cluster)
 perform this read over the network using replicas of executing locally.
 Every other node in the ring successfully reads locally.
 To be sure there were no data consistency issues we performed a nodetool
 repair against both of these nodes and the issue persists. We also tried
 truncating the column family and repopulating, but the issue remains.
 This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
 data locally if it is available there. We
 use Cassandra.Client.describe_ring() to figure out which machine in the
 ring is master for which TokenRange. I then compare the master for
 each TokenRange against the localhost to find out which token ranges
 are owned by the local machine (remote reads are too slow for this type
 of batch processing). Once I know which TokenRanges are on
 each machine locally I get evenly sized splits using
 Cassandra.Client.describe_splits().
 
 Adam
 
 
 
 
 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 
 
 
 
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 




Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Just a very tiny amount of writes in the background here (some hints
spooled up on another node slowly coming in).
No new data.

I thought there were no exceptions, but I did not look far enough back in the
log at first.

Going back a bit further now however, I see that about 50 hours ago:
ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
AbstractCassandraDaemon.java (line 112) Fatal exception in thread
Thread[CompactionExecutor:387,1,main]
java.io.IOException: No space left on device
at java.io.RandomAccessFile.writeBytes(Native Method)
at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
at
org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
at
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
at
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[followed by a few more of those...]

and then a bunch of these:
ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java
(line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
space to flush 40009184 bytes
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Insufficient disk space to flush
40009184 bytes
at
org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
at
org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
at
org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more

Seems like compactions stopped after this (a bunch of tmp tables are still there
from when those errors were generated), and I can only suspect the post
flusher may have stopped at the same time.

There is 890GB of disk for data; sstables are currently using 604GB (139GB of which is
old tmp tables from when it ran out of disk), and ring tells me the load on
the node is 313GB.

Terje



On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote:

 ... and are there any exceptions in the log?

 On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote:
  Does it resolve down to 0 eventually if you stop doing writes?
 
  On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
  tmarthinus...@gmail.com wrote:
  Cassandra 0.8 beta trunk from about 1 week ago:
  Pool Name                    Active   Pending      Completed
  ReadStage                         0         0              5
  RequestResponseStage              0         0          87129
  MutationStage                     0         0         187298
  ReadRepairStage                   0         0              0
  ReplicateOnWriteStage             0         0              0
  GossipStage                       0         0        1353524
  AntiEntropyStage                  0         0              0
  MigrationStage                    0         0             10
  MemtablePostFlusher               1       190            108
  StreamStage                       0         0              0
  FlushWriter                       0         0            302
  FILEUTILS-DELETE-POOL             0         0             26
  MiscStage                         0         0              0
  FlushSorter                       0         0              0
  InternalResponseStage             0         0              0
  HintedHandoff                     1         4              7
 
  Anyone with nice 

Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
So yes, there is currently some 200GB of empty disk.

On Wed, May 4, 2011 at 3:20 AM, Terje Marthinussen
tmarthinus...@gmail.comwrote:

 Just some very tiny amount of writes in the background here (some hints
 spooled up on another node slowly coming in).
 No new data.

 I thought there was no exceptions, but I did not look far enough back in
 the log at first.

 Going back a bit further now however, I see that about 50 hours ago:
 ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
 AbstractCassandraDaemon.java (line 112) Fatal exception in thread
 Thread[CompactionExecutor:387,1,main]
 java.io.IOException: No space left on device
 at java.io.RandomAccessFile.writeBytes(Native Method)
 at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
 at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
 at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
 at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
 at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
 at
 org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
 at
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
 at
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 [followed by a few more of those...]

 and then a bunch of these:
 ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
 AbstractCassandraDaemon.java (line 112) Fatal exception in thread
 Thread[FlushWriter:123,5,main]
 java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
 space to flush 40009184 bytes
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.RuntimeException: Insufficient disk space to flush
 40009184 bytes
 at
 org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
 at
 org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
 at
 org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
 at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
 at
 org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 ... 3 more

 Seems like compactions stopped after this (a bunch of tmp tables there
 still from when those errors where generated), and I can only suspect the
 post flusher may have stopped at the same time.

 There is 890GB of disk for data, sstables are currently using 604G (139GB
 is old tmp tables from when it ran out of disk) and ring tells me the load
 on the node is 313GB.

 Terje



 On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote:

 ... and are there any exceptions in the log?

 On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote:
  Does it resolve down to 0 eventually if you stop doing writes?
 
  On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
  tmarthinus...@gmail.com wrote:
  Cassandra 0.8 beta trunk from about 1 week ago:
   Pool Name                    Active   Pending      Completed
   ReadStage                         0         0              5
   RequestResponseStage              0         0          87129
   MutationStage                     0         0         187298
   ReadRepairStage                   0         0              0
   ReplicateOnWriteStage             0         0              0
   GossipStage                       0         0        1353524
   AntiEntropyStage                  0         0              0
   MigrationStage                    0         0             10
   MemtablePostFlusher               1       190            108
   StreamStage                       0         0              0
   FlushWriter                       0         0            302
   FILEUTILS-DELETE-POOL             0         0             26
   MiscStage                         0         0              0
  

Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
On Tue, 2011-05-03 at 16:52 +0200, Mck wrote:
 Running a 3 node cluster with cassandra-0.8.0-beta1 
 
 I'm seeing the first node logging many (thousands) times 

The only special thing about this first node is that it receives all the writes
from our sybase-to-cassandra import job.
This process migrates an existing 60 million rows into cassandra (before
the cluster is /turned on/ for normal operations). The import job runs
over ~20 minutes.

I wiped everything and started from scratch, this time running the
import job with cassandra configured instead with:

incremental_backups: false
snapshot_before_compaction: false

This then produced the problem on another node instead.
So changing to these settings on all nodes and running the import again
fixed it: no more "Unable to create hard link ..."

After the import I could turn both incremental_backups and
snapshot_before_compaction back to true without problems so far.

To me this says something is broken with incremental_backups and
snapshot_before_compaction under heavy write load?

~mck




Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
post flusher is responsible for updating commitlog header after a
flush; each task waits for a specific flush to complete, then does its
thing.

so when you had a flush catastrophically fail, its corresponding
post-flush task will be stuck.
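
To picture why the queue backs up, here is a minimal, purely illustrative
sketch -- none of these names come from the Cassandra source; the latch and
the two executors are invented just to show the waiting pattern:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PostFlushSketch
{
    public static void main(String[] args)
    {
        final CountDownLatch sstableWritten = new CountDownLatch(1);
        ExecutorService flushWriter = Executors.newSingleThreadExecutor();
        ExecutorService postFlusher = Executors.newSingleThreadExecutor();

        // The flush is supposed to signal the latch once its data is on disk.
        flushWriter.submit(new Runnable()
        {
            public void run()
            {
                // Pretend the disk is full: the flush dies before signalling.
                throw new RuntimeException("Insufficient disk space to flush");
            }
        });

        // The post-flush task waits for "its" flush before it would touch the
        // commitlog header. With the flush dead, it waits forever, and every
        // later post-flush task queues up behind it.
        postFlusher.submit(new Runnable()
        {
            public void run()
            {
                try
                {
                    sstableWritten.await();
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            }
        });
        // The program never exits -- which is exactly the point.
    }
}

That is consistent with the tpstats output above: MemtablePostFlusher stuck at
Active 1 with Pending growing, while nothing else is running.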

On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Just some very tiny amount of writes in the background here (some hints
 spooled up on another node slowly coming in).
 No new data.

 I thought there was no exceptions, but I did not look far enough back in the
 log at first.
 Going back a bit further now however, I see that about 50 hours ago:
 ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
 AbstractCassandraDaemon.java (line 112) Fatal exception in thread
 Thread[CompactionExecutor:387,1,main]
 java.io.IOException: No space left on device
         at java.io.RandomAccessFile.writeBytes(Native Method)
         at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
         at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
         at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
         at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
         at
 org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
         at
 org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
         at
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
         at
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
         at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
         at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
         at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
 [followed by a few more of those...]
 and then a bunch of these:
 ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java
 (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
 java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
 space to flush 40009184 bytes
         at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
         at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.RuntimeException: Insufficient disk space to flush
 40009184 bytes
         at
 org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
         at
 org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
         at
 org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
         at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
         at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
         ... 3 more
 Seems like compactions stopped after this (a bunch of tmp tables there still
 from when those errors where generated), and I can only suspect the post
 flusher may have stopped at the same time.
 There is 890GB of disk for data, sstables are currently using 604G (139GB is
 old tmp tables from when it ran out of disk) and ring tells me the load on
 the node is 313GB.
 Terje


 On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote:

 ... and are there any exceptions in the log?

 On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote:
  Does it resolve down to 0 eventually if you stop doing writes?
 
  On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
  tmarthinus...@gmail.com wrote:
  Cassandra 0.8 beta trunk from about 1 week ago:
  Pool Name                    Active   Pending      Completed
  ReadStage                         0         0              5
  RequestResponseStage              0         0          87129
  MutationStage                     0         0         187298
  ReadRepairStage                   0         0              0
  ReplicateOnWriteStage             0         0              0
  GossipStage                       0         0        1353524
  AntiEntropyStage                  0         0              0
  MigrationStage                    0         0             10
  MemtablePostFlusher               1       190            108
  StreamStage                       0         0              0
  

Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
On Tue, 2011-05-03 at 13:52 -0500, Jonathan Ellis wrote:
 you should probably look to see what errno 17 means for the link
 system call on your system. 

That the file already exists.
It seems cassandra is trying to make the same hard link in parallel
(under heavy write load) ?

I see now I can also reproduce the problem with hadoop and
ColumnFamilyOutputFormat.
Turning off snapshot_before_compaction seems to be enough to prevent
it. 

~mck




Re: Using snapshot for backup and restore

2011-05-03 Thread Jonathan Ellis
You're right, this is an oversight.  Created
https://issues.apache.org/jira/browse/CASSANDRA-2596 to fix.

As for a workaround, you can drop the index + recreate. (Upgrade to
0.7.5 first, if you haven't yet.)

On Tue, May 3, 2011 at 3:22 AM, Arsene Lee
arsene@ruckuswireless.com wrote:
 Hi,



 We are trying to use snapshot for backup and restore. We found out that
 snapshot doesn’t take secondary indexes.

We are wondering why that is, and whether there is any way we can rebuild the
secondary index.



 Regards,



 Arsene



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Jonathan Ellis
Ah, that makes sense.  snapshot_before_compaction is trying to
snapshot, but incremental_backups already created one (for newly
flushed sstables).  You're probably the only one running with both
options on. :)
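
A toy illustration of that kind of race -- not the actual snapshot/backup
code; java.io.File.createNewFile() just stands in for the hard-link call,
since both are "create, and report if the path already exists" (errno 17 /
EEXIST for link(2)):

import java.io.File;
import java.io.IOException;

public class LinkRaceSketch
{
    // Stand-in for the hard-link call: creates the path atomically and
    // reports when it already existed instead of failing hard.
    static void createOrIgnore(File target) throws IOException
    {
        if (!target.createNewFile())
            System.out.println("already exists, ignoring: " + target);
    }

    public static void main(String[] args) throws IOException
    {
        File target = new File(System.getProperty("java.io.tmpdir"), "sstable-link-sketch");
        createOrIgnore(target);   // e.g. the incremental-backup path
        createOrIgnore(target);   // e.g. snapshot-before-compaction, a moment later
        target.delete();          // clean up the toy file
    }
}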

Can you create a ticket?

On Tue, May 3, 2011 at 2:05 PM, Mck m...@apache.org wrote:
 On Tue, 2011-05-03 at 13:52 -0500, Jonathan Ellis wrote:
 you should probably look to see what errno 17 means for the link
 system call on your system.

 That the file already exists.
 It seems cassandra is trying to make the same hard link in parallel
 (under heavy write load) ?

 I see now i can also reproduce the problem with hadoop and
 ColumnFamilyOutputFormat.
 Turning off snapshot_before_compaction seems to be enough to prevent
 it.

 ~mck






-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Yes, I realize that.

I am a bit curious why it ran out of disk, or rather, why I have 200GB of empty
disk now, but unfortunately it seems like we may not have had monitoring
enabled on this node to tell me what happened in terms of disk usage.

I also thought that compaction was supposed to resume (try again with less
data) if it fails?

Terje

On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis jbel...@gmail.com wrote:

 post flusher is responsible for updating commitlog header after a
 flush; each task waits for a specific flush to complete, then does its
 thing.

 so when you had a flush catastrophically fail, its corresponding
 post-flush task will be stuck.

 On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:
  Just some very tiny amount of writes in the background here (some hints
  spooled up on another node slowly coming in).
  No new data.
 
  I thought there was no exceptions, but I did not look far enough back in
 the
  log at first.
  Going back a bit further now however, I see that about 50 hours ago:
  ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
  AbstractCassandraDaemon.java (line 112) Fatal exception in thread
  Thread[CompactionExecutor:387,1,main]
  java.io.IOException: No space left on device
  at java.io.RandomAccessFile.writeBytes(Native Method)
  at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
  at
 
 org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
  at
 
 org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
  at
 
 org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
  at
 
 org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
  at
  org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
  at
 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
  at
 
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
  at
 
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
  at
 
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
  at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
  [followed by a few more of those...]
  and then a bunch of these:
  ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
 AbstractCassandraDaemon.java
  (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
  java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
  space to flush 40009184 bytes
  at
  org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
  Caused by: java.lang.RuntimeException: Insufficient disk space to flush
  40009184 bytes
  at
 
 org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
  at
 
 org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
  at
  org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
  at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
  at
 org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
  at
  org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
  ... 3 more
  Seems like compactions stopped after this (a bunch of tmp tables there
 still
  from when those errors where generated), and I can only suspect the post
  flusher may have stopped at the same time.
  There is 890GB of disk for data, sstables are currently using 604G (139GB
 is
  old tmp tables from when it ran out of disk) and ring tells me the load
 on
  the node is 313GB.
  Terje
 
 
  On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  ... and are there any exceptions in the log?
 
  On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com
 wrote:
   Does it resolve down to 0 eventually if you stop doing writes?
  
   On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
   tmarthinus...@gmail.com wrote:
   Cassandra 0.8 beta trunk from about 1 week ago:
    Pool Name                    Active   Pending      Completed
    ReadStage                         0         0              5
    RequestResponseStage              0 

Re: Replica data distributing between racks

2011-05-03 Thread aaron morton
Jonathan, 
I think you are saying each DC should have its own (logical) token 
ring, which makes sense as the only way to balance the load in each DC. I think 
most people (including me) assumed there was a single token ring for the entire 
cluster. 

But currently two endpoints cannot have the same token regardless of 
the DC they are in. Or should people just bump the tokens in the extra DCs to 
avoid the collision?  

Cheers
Aaron

On 4 May 2011, at 03:03, Eric tamme wrote:

 On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis jbel...@gmail.com wrote:
 Right, when you are computing balanced RP tokens for NTS you need to
 compute the tokens for each DC independently.
 
 I am confused ... sorry.  Are you saying that ... I need to change how
 my keys are calculated to fix this problem?  Or are you talking about
 the implementation of how replication selects a token?
 
 -Eric



Re: Problems recovering a dead node

2011-05-03 Thread aaron morton
When you say it's clean does that mean the node has no data files ?

After you replaced the disk what process did you use to recover  ?

Also what version are you running and what's the recent upgrade history ?

Cheers
Aaron

On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:

 Hi everyone. One of the nodes in my 6 node cluster died with disk
 failures. I have replaced the disks, and it's clean. It has the same
 configuration (same ip, same token).
 
 When I try to restart the node it starts to throw mmap underflow
 exceptions till it closes again.
 
 I tried setting io to standard, but it still fails. It gives errors
 about two decorated keys being different, and the EOFException.
 
 Here is an excerpt of the log
 
 http://pastebin.com/ZXW1wY6T
 
 I can provide more info if needed. I'm at a loss here so any help is
 appreciated.
 
 Thanks all for your time
 
 Héctor Izquierdo
 



Re: Write performance help needed

2011-05-03 Thread aaron morton
To give an idea, last March (2010) I ran a much older Cassandra on 10 HP 
blades (dual socket, 4 core, 16GB, 2.5" laptop HDDs) and was writing around 250K 
columns per second with 500 python processes loading the data from Wikipedia 
running on another 10 HP blades. 

This was my first out-of-the-box, no-tuning test (other than using sensible batch 
updates). Since then Cassandra has gotten much faster.
  
Hope that helps
Aaron

On 4 May 2011, at 02:22, Jonathan Ellis wrote:

 You don't give many details, but I would guess:
 
 - your benchmark is not multithreaded (see the sketch after this list)
 - mongodb is not configured for durable writes, so you're really only
 measuring the time for it to buffer it in memory
 - you haven't loaded enough data to hit the point where mongo's index no
 longer fits in memory
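
On the first point, a minimal shape of a multithreaded loader, with
insertBatch() as a stand-in for whatever Hector or Pelops batch mutate call the
real test uses (thread and batch counts below are arbitrary):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoadSkeleton
{
    // Placeholder: in the real test this would build and send one batch of rows.
    static void insertBatch(int batchId) { }

    public static void main(String[] args) throws InterruptedException
    {
        int threads = 32, batches = 10000;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.currentTimeMillis();
        for (int i = 0; i < batches; i++)
        {
            final int batchId = i;
            pool.submit(new Runnable() { public void run() { insertBatch(batchId); } });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
    }
}

Single-threaded, blocking inserts mostly measure round-trip latency rather than
what the cluster can absorb.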
 
 On Tue, May 3, 2011 at 8:24 AM, Steve Smith stevenpsmith...@gmail.com wrote:
 I am working for a client that needs to persist 100K-200K records per second
 for later querying.  As a proof of concept, we are looking at several
 options including nosql (Cassandra and MongoDB).
 I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz,
 Dual Core/4 logical cores) and have not been happy with the results.
 The best I have been able to accomplish is 100K records in approximately 30
 seconds.  Each record has 30 columns, mostly made up of integers.  I have
 tried both the Hector and Pelops APIs, and have tried writing in batches
 versus one at a time.  The times have not varied much.
 I am using the out of the box configuration for Cassandra, and while I know
 using 1 disk will have an impact on performance, I would expect to see
 better write numbers than I am.
 As a point of reference, the same test using MongoDB I was able to
 accomplish 100K records in 3.5 seconds.
 Any tips would be appreciated.
 
 - Steve
 
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
Compaction does, but flush didn't until
https://issues.apache.org/jira/browse/CASSANDRA-2404

On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Yes, I realize that.
 I am bit curious why it ran out of disk, or rather, why I have 200GB empty
 disk now, but unfortunately it seems like we may not have had monitoring
 enabled on this node to tell me what happened in terms of disk usage.
 I also thought that compaction was supposed to resume (try again with less
 data) if it fails?
 Terje

 On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis jbel...@gmail.com wrote:

 post flusher is responsible for updating commitlog header after a
 flush; each task waits for a specific flush to complete, then does its
 thing.

 so when you had a flush catastrophically fail, its corresponding
 post-flush task will be stuck.

 On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:
  Just some very tiny amount of writes in the background here (some hints
  spooled up on another node slowly coming in).
  No new data.
 
  I thought there was no exceptions, but I did not look far enough back in
  the
  log at first.
  Going back a bit further now however, I see that about 50 hours ago:
  ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
  AbstractCassandraDaemon.java (line 112) Fatal exception in thread
  Thread[CompactionExecutor:387,1,main]
  java.io.IOException: No space left on device
          at java.io.RandomAccessFile.writeBytes(Native Method)
          at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
          at
 
  org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
          at
 
  org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
          at
 
  org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
          at
 
  org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
          at
  org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
          at
 
  org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
          at
 
  org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
          at
 
  org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
          at
 
  org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
          at
  java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at
 
  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at
 
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:662)
  [followed by a few more of those...]
  and then a bunch of these:
  ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
  AbstractCassandraDaemon.java
  (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
  java.lang.RuntimeException: java.lang.RuntimeException: Insufficient
  disk
  space to flush 40009184 bytes
          at
  org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
          at
 
  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at
 
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:662)
  Caused by: java.lang.RuntimeException: Insufficient disk space to flush
  40009184 bytes
          at
 
  org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
          at
 
  org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
          at
  org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
          at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
          at
  org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
          at
  org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
          ... 3 more
  Seems like compactions stopped after this (a bunch of tmp tables there
  still
  from when those errors where generated), and I can only suspect the post
  flusher may have stopped at the same time.
  There is 890GB of disk for data, sstables are currently using 604G
  (139GB is
  old tmp tables from when it ran out of disk) and ring tells me the
  load on
  the node is 313GB.
  Terje
 
 
  On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com
  wrote:
 
  ... and are there any exceptions in the log?
 
  On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com
  wrote:
   Does it resolve down to 0 eventually if you stop doing writes?
  
   On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
   tmarthinus...@gmail.com wrote:
   Cassandra 0.8 beta 

Re: Replica data distributing between racks

2011-05-03 Thread Jonathan Ellis
On Tue, May 3, 2011 at 2:46 PM, aaron morton aa...@thelastpickle.com wrote:
 Jonathan,
        I think you are saying each DC should have it's own (logical) token 
 ring.

Right. (Only with NTS, although you'd usually end up with a similar
effect if you alternate DC locations for nodes in a ONTS cluster.)

        But currently two endpoints cannot have the same token regardless of 
 the DC they are in.

Also right.

 Or should people just bump the tokens in extra DC's to avoid the collision?

Yes.
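
For illustration, a tiny sketch of "compute balanced tokens per DC, then bump
the extra DCs" for RandomPartitioner -- the node and DC counts are made up:

import java.math.BigInteger;

public class TokenSketch
{
    public static void main(String[] args)
    {
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        int nodesPerDc = 4;   // assumed layout, adjust to your cluster
        int dcCount = 2;
        for (int dc = 0; dc < dcCount; dc++)
        {
            for (int i = 0; i < nodesPerDc; i++)
            {
                BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                           .divide(BigInteger.valueOf(nodesPerDc))
                                           .add(BigInteger.valueOf(dc)); // the "bump"
                System.out.println("DC" + dc + " node" + i + ": " + token);
            }
        }
    }
}

Each DC gets its own evenly spaced ring, and the +dc offset keeps any two
endpoints from ending up with exactly the same token.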

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

2011-05-03 Thread aaron morton
Can you provide some details of the data returned when you do the
get_range()? It will be interesting to see the raw bytes returned for
the keys. The likely culprit is a change in the encoding. Can you also
try to grab the bytes sent for the key when doing the single select that
fails.

You can grab these either on the client and/or by turning on DEBUG
logging in conf/log4j-server.properties.
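
For reference, the server-side change is roughly this (the exact appender
names in your conf/log4j-server.properties may differ; only the level matters):

# conf/log4j-server.properties -- raise the root level from INFO to DEBUG
log4j.rootLogger=DEBUG,stdout,R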

Thanks
Aaron

On 4 May 2011, at 03:19, Henrik Schröder wrote:

 The way we solved this problem is that it turned out we had only a few 
 hundred rows with unicode keys, so we simply extracted them, upgraded to 0.7, 
 and wrote them back. However, this means that among the rows, there are a few 
 hundred weird duplicate rows with identical keys.
 
 Is this going to be a problem in the future? Is there a chance that the good 
 duplicate is cleaned out in favour of the bad duplicate so that we suddenly 
 lose those rows again?
 
 
 /Henrik Schröder



Re: Replica data distributing between racks

2011-05-03 Thread Eric tamme
On Tue, May 3, 2011 at 4:08 PM, Jonathan Ellis jbel...@gmail.com wrote:
 On Tue, May 3, 2011 at 2:46 PM, aaron morton aa...@thelastpickle.com wrote:
 Jonathan,
        I think you are saying each DC should have it's own (logical) token 
 ring.

 Right. (Only with NTS, although you'd usually end up with a similar
 effect if you alternate DC locations for nodes in a ONTS cluster.)

        But currently two endpoints cannot have the same token regardless of 
 the DC they are in.

 Also right.

 Or should people just bump the tokens in extra DC's to avoid the collision?

 Yes.



I am sorry, but I do not understand fully.  I would appreciate it if
someone could explain this in more detail for me.

I do not understand why data insertion is even, but replication is not.

I do not understand how to solve the problem.  What does bumping
tokens entail - is that going to change my insertion distribution?  I
had no idea you could create different logical token rings ... and I am
not sure what that exactly means... or that I even want to do it.  Is
there a clear solution to fixing the problem I laid out, and getting
replication data evenly distributed between racks in each DC?

Sorry again for needing more verbosity - I am learning as I go with
this stuff.  I appreciate everyone's help.

-Eric


Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
On Tue, 2011-05-03 at 14:22 -0500, Jonathan Ellis wrote:
 Can you create a ticket?

CASSANDRA-2598



Backup full cluster

2011-05-03 Thread A J
Snapshot runs on a local node. How do I ensure I have a 'point in
time' snapshot of the full cluster ? Do I have to stop the writes on
the full cluster and then snapshot all the nodes individually ?
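
Concretely, "snapshot all the nodes individually" would be something like the
following, run against each node at roughly the same time (hostnames and the
snapshot name are just placeholders; nodetool snapshot accepts an optional
name):

nodetool -h node1.example.com snapshot backup-20110504
nodetool -h node2.example.com snapshot backup-20110504
nodetool -h node3.example.com snapshot backup-20110504

Even so, writes that arrive between the individual snapshots will still differ
from node to node, which is the heart of the question.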

Thanks.


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Hm... peculiar.

Post flush is not involved in compactions, right?

May 2nd
01:06 - Out of disk
01:51 - Starts a mix of major and minor compactions on different column
families
It then starts a few extra minor compactions over the day, but given that
there are more than 1000 sstables, and we are talking about 3 minor compactions
started, that does not look normal to me.
On May 3rd, 1 minor compaction started.

When I checked today, there was a bunch of tmp files on the disk with a last
modified time of 01:something on May 2nd, and 200GB of empty disk...

Definitely no compaction going on.
Guess I will add some debug logging and see if I get lucky and run out of
disk again.

Terje

On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Compaction does, but flush didn't until
 https://issues.apache.org/jira/browse/CASSANDRA-2404

 On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:
  Yes, I realize that.
  I am bit curious why it ran out of disk, or rather, why I have 200GB
 empty
  disk now, but unfortunately it seems like we may not have had monitoring
  enabled on this node to tell me what happened in terms of disk usage.
  I also thought that compaction was supposed to resume (try again with
 less
  data) if it fails?
  Terje
 
  On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  post flusher is responsible for updating commitlog header after a
  flush; each task waits for a specific flush to complete, then does its
  thing.
 
  so when you had a flush catastrophically fail, its corresponding
  post-flush task will be stuck.
 
  On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
  tmarthinus...@gmail.com wrote:
   Just some very tiny amount of writes in the background here (some
 hints
   spooled up on another node slowly coming in).
   No new data.
  
   I thought there was no exceptions, but I did not look far enough back
 in
   the
   log at first.
   Going back a bit further now however, I see that about 50 hours ago:
   ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
   AbstractCassandraDaemon.java (line 112) Fatal exception in thread
   Thread[CompactionExecutor:387,1,main]
   java.io.IOException: No space left on device
   at java.io.RandomAccessFile.writeBytes(Native Method)
   at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
   at
  
  
 org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
   at
  
  
 org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
   at
  
  
 org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
   at
  
  
 org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
   at
  
 org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
   at
  
  
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
   at
  
  
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
   at
  
  
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
   at
  
  
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
   at
   java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at
  
  
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
  
  
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
   [followed by a few more of those...]
   and then a bunch of these:
   ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
   AbstractCassandraDaemon.java
   (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
   java.lang.RuntimeException: java.lang.RuntimeException: Insufficient
   disk
   space to flush 40009184 bytes
   at
  
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
   at
  
  
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
  
  
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
   Caused by: java.lang.RuntimeException: Insufficient disk space to
 flush
   40009184 bytes
   at
  
  
 org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
   at
  
  
 org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
   at
  
 org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
   at
 org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
   at
   org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
   at
  
 

Re: Performance tests using stress testing tool

2011-05-03 Thread Baskar Duraikannu
Thanks Peter. 

I believe I found the root cause: the switch that we used was bad.
Now, on a 4 node cluster (each node has 1 quad-core CPU and 16 GB of RAM),
I was able to get around 11,000 writes and 10,050 reads per second
simultaneously (CPU usage is around 45% on all nodes; disk queue size is in the
neighbourhood of 10).

Is this in line with what you usually see with Cassandra? 


- Original Message - 
From: Peter Schuller 
To: user@cassandra.apache.org 
Sent: Friday, April 29, 2011 12:21 PM
Subject: Re: Performance tests using stress testing tool


 Thanks Peter. I am using the java version of the stress testing tool from the
 contrib folder. Is there any issue that I should be aware of? Do you recommend
 using pystress?

I just saw Brandon file this:
https://issues.apache.org/jira/browse/CASSANDRA-2578

Maybe that's it.

-- 
/ Peter Schuller


Decommissioning node is causing broken pipe error

2011-05-03 Thread tamara.alexander
Hi all,

I ran decommission on a node in my 32 node cluster. After about an hour of 
streaming files to another node, I got this error on the node being 
decommissioned:
INFO [MiscStage:1] 2011-05-03 21:49:00,235 StreamReplyVerbHandler.java (line 
58) Need to re-stream file /raiddrive/MDR/MeterRecords-f-2283-Data.db to 
/10.206.63.208
ERROR [Streaming:1] 2011-05-03 21:49:01,580 DebuggableThreadPoolExecutor.java 
(line 103) Error in ThreadPoolExecutor
java.lang.RuntimeException: java.io.IOException: Broken pipe
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:415)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:516)
at 
org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:105)
at 
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:67)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more
ERROR [Streaming:1] 2011-05-03 21:49:01,581 AbstractCassandraDaemon.java (line 
112) Fatal exception in thread Thread[Streaming:1,1,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:415)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:516)
at 
org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:105)
at 
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:67)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more

And this message on the node that it was streaming to:
INFO [Thread-333] 2011-05-03 21:49:00,234 StreamInSession.java (line 121) 
Streaming of file 
/raiddrive/MDR/MeterRecords-f-2283-Data.db/(98605680685,197932763967)
 progress=49016107008/99327083282 - 49% from 
org.apache.cassandra.streaming.StreamInSession@33721219 failed: requesting a 
retry.

I tried running decommission again (and running scrub + decommission), but I 
keep getting this error on the same file.

I checked out the file and saw that it is a lot bigger than all the other 
sstables... 184GB instead of about 74MB. I haven't run a major compaction for a 
bit, so I'm trying to stream 658 sstables.

I'm using Cassandra 0.7.4, I have two data directories (I know that's not good 
practice...), and all my nodes are on Amazon EC2.

Any thoughts on what could be going on or how to prevent this?

Thanks!
Tamara






Re: Problems recovering a dead node

2011-05-03 Thread Héctor Izquierdo Seliva

Hi Aaron

It has no data files whatsoever. The upgrade path is 0.7.4 to 0.7.5. It
turns out the initial problem was the software RAID failing silently because
of another faulty disk.

Now that the storage is working, I brought up the node again, same IP,
same token and tried doing nodetool repair. 

All adjacent nodes have finished the streaming session, and now the node
has a total of 248 GB of data. Is this normal when the load per node is
about 18GB? 

Also, there are 1245 pending tasks. It's been compacting or rebuilding
sstables for the last 8 hours non-stop. There are 2057 sstables in the
data folder.

Should I have done things differently, or is this the normal behaviour?

Thanks!

On Wed, 04-05-2011 at 07:54 +1200, aaron morton wrote:
 When you say it's clean does that mean the node has no data files ?
 
 After you replaced the disk what process did you use to recover  ?
 
 Also what version are you running and what's the recent upgrade history ?
 
 Cheers
 Aaron
 
 On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:
 
  Hi everyone. One of the nodes in my 6 node cluster died with disk
  failures. I have replaced the disks, and it's clean. It has the same
  configuration (same ip, same token).
  
  When I try to restart the node it starts to throw mmap underflow
  exceptions till it closes again.
  
  I tried setting io to standard, but it still fails. It gives errors
  about two decorated keys being different, and the EOFException.
  
  Here is an excerpt of the log
  
  http://pastebin.com/ZXW1wY6T
  
  I can provide more info if needed. I'm at a loss here so any help is
  appreciated.
  
  Thanks all for your time
  
  Héctor Izquierdo