Unable to add columns to empty row in Column family: Cassandra
Hello, I am using Cassandra for my application. My Cassandra client uses the Thrift APIs directly. The problem I am currently facing is as follows:
1) I added a row and columns to it dynamically via the Thrift API client.
2) Next, I used the command-line client to delete the row, which deleted all the columns in it, leaving an empty row with the original row key.
3) Now, I am trying to add columns dynamically into this empty row, with the same row key, using the client program.
However, the columns are not being inserted. But when I tried the same thing from the command-line client, it worked correctly. Any pointers on this would be of great use. Thanks in advance, Regards, Anuya
Re: Building from source from behind firewall since Maven switch?
-autoproxy worked for me when I wrote the original patch, but as I no longer work for the company where I wrote it, I don't have a firewall to deal with. Worst case, you might have to create a ~/.m2/settings.xml with the proxy details... If that is the case, can you raise a JIRA in MANTTASKS (which is at jira.codehaus.org for hysterical reasons)? - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using swype to type on the screen On 3 May 2011 01:06, Suan-Aik Yeo s...@enovafinancial.com wrote:
Re: Unable to add columns to empty row in Column family: Cassandra
Hi Anuya, "However, columns are not being inserted." Do you mean that after the insert operation you couldn't retrieve the same data? If so, please check the timestamp you used when you reinserted after the delete operation. Your second insertion's timestamp has to be greater than the previous insertion's. Thank you, Jaydeep From: anuya joshi anu...@gmail.com To: user@cassandra.apache.org Sent: Monday, 2 May 2011 11:34 PM Subject: Re: Unable to add columns to empty row in Column family: Cassandra Hello, I am using Cassandra for my application.My Cassandra client uses Thrift APIs directly. The problem I am facing currently is as follows: 1) I added a row and columns in it dynamically via Thrift API Client 2) Next, I used command line client to delete row which actually deleted all the columns in it, leaving empty row with original row id. 3) Now, I am trying to add columns dynamically using client program into this empty row with same row key However, columns are not being inserted. But, when tried from command line client, it worked correctly. Any pointer on this would be of great use Thanks in advance, Regards, Anuya
Re: Unable to add columns to empty row in Column family: Cassandra
One small correction in my mail below. Second insertion time-stamp has to be greater than delete time-stamp in-order to retrieve the data. Thank you, Jaydeep From: chovatia jaydeep chovatia_jayd...@yahoo.co.in To: user@cassandra.apache.org user@cassandra.apache.org Sent: Monday, 2 May 2011 11:52 PM Subject: Re: Unable to add columns to empty row in Column family: Cassandra Hi Anuya, However, columns are not being inserted. Do you mean to say that after insert operation you couldn't retrieve the same data? If so, then please check the time-stamp when you reinserted after delete operation. Your second insertion time-stamp has to be greater than the previous insertion. Thank you, Jaydeep From: anuya joshi anu...@gmail.com To: user@cassandra.apache.org Sent: Monday, 2 May 2011 11:34 PM Subject: Re: Unable to add columns to empty row in Column family: Cassandra Hello, I am using Cassandra for my application.My Cassandra client uses Thrift APIs directly. The problem I am facing currently is as follows: 1) I added a row and columns in it dynamically via Thrift API Client 2) Next, I used command line client to delete row which actually deleted all the columns in it, leaving empty row with original row id. 3) Now, I am trying to add columns dynamically using client program into this empty row with same row key However, columns are not being inserted. But, when tried from command line client, it worked correctly. Any pointer on this would be of great use Thanks in advance, Regards, Anuya
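For illustration, here is a minimal sketch of what Jaydeep describes, using the raw Thrift API from Python; the host, keyspace, column family and framed transport are assumptions, not Anuya's actual setup. The point is that an insert following a row delete must carry a timestamp strictly greater than the delete's timestamp, otherwise the row tombstone masks it.

import time
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import Column, ColumnParent, ColumnPath, ConsistencyLevel

# Hypothetical connection details; adjust host, keyspace and column family.
sock = TSocket.TSocket('localhost', 9160)
transport = TTransport.TFramedTransport(sock)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()
client.set_keyspace('Keyspace1')

usec = lambda: int(time.time() * 1000000)   # microsecond timestamps

key = 'row1'
# The row-level delete writes a tombstone at some timestamp T...
client.remove(key, ColumnPath(column_family='Standard1'), usec(), ConsistencyLevel.ONE)

# ...so a later insert is only visible if its timestamp is greater than T.
client.insert(key,
              ColumnParent(column_family='Standard1'),
              Column(name='col1', value='val1', timestamp=usec()),
              ConsistencyLevel.ONE)

This is presumably also why the command-line client appeared to work: it stamps each insert with the current time, whereas a client that reuses an old or fixed timestamp will have its insert silently masked by the delete.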
Using snapshot for backup and restore
Hi, We are trying to use snapshots for backup and restore. We found out that a snapshot doesn't include secondary indexes. We are wondering why that is? And is there any way we can rebuild the secondary indexes? Regards, Arsene
One cluster or many?
If I have a database that partitions naturally into non-overlapping datasets, in which there are no references between datasets, where each dataset is quite large (i.e. large enough to merit its own cluster from the point of view of quantity of data), should I set up one cluster per database or one large cluster for everything together? As I see it: The primary advantage of separate clusters is total isolation: if I have a problem with one dataset, my application will continue working normally for all other datasets. The primary advantage of one big cluster is usage pooling: when one server goes down in a large cluster it's much less important than when one server goes down in a small cluster. Also, different temporal usage patterns of the different datasets (i.e. there will be different peak hours on different datasets) can be combined to ease capacity requirements. Any thoughts?
low performance inserting
Hello everybody, first: sorry for my English in advance!! I'm getting started with Cassandra on a 5-node cluster, inserting data with the pycassa API. I've read everywhere on the internet that Cassandra's write performance is better than MySQL's because writes are only appended to the commit log files. When I try to insert 100 000 rows with 10 columns per row with a batch insert, I get this result: 27 seconds. But with MySQL (load data infile) this takes only 2 seconds (using indexes).
Here is my configuration:
cassandra version: 0.7.5
nodes: 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213, 192.168.1.214
seed: 192.168.1.210
My script:

#!/usr/bin/env python
import pycassa
import time
import random
from cassandra import ttypes

pool = pycassa.connect('test', ['192.168.1.210:9160'])
cf = pycassa.ColumnFamily(pool, 'test')
b = cf.batch(queue_size=50, write_consistency_level=ttypes.ConsistencyLevel.ANY)

tps1 = time.time()
for i in range(100000):
    columns = dict()
    for j in range(10):
        columns[str(j)] = str(random.randint(0, 100))
    b.insert(str(i), columns)
b.send()
tps2 = time.time()
print("execution time: " + str(tps2 - tps1) + " seconds")

What am I doing wrong?
Problems recovering a dead node
Hi everyone. One of the nodes in my 6 node cluster died with disk failures. I have replaced the disks, and it's clean. It has the same configuration (same ip, same token). When I try to restart the node it starts to throw mmap underflow exceptions till it closes again. I tried setting io to standard, but it still fails. It gives errors about two decorated keys being different, and the EOFException. Here is an excerpt of the log http://pastebin.com/ZXW1wY6T I can provide more info if needed. I'm at a loss here so any help is appreciated. Thanks all for your time Héctor Izquierdo
Re: Replica data distributing between racks
I've been digging into this and was able to reproduce something; not sure if it's a fault, and I can't work on it any more tonight. To reproduce:
- 2 node cluster on my MacBook
- set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g. node 1 with 85070591730234615865843651857942052864 and node 2 with 127605887595351923798765477786913079296
- set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2
- create a keyspace using NTS and strategy_options = [{DC1:1}]
I inserted 10 rows and they were distributed as:
- node 1 - 9 rows
- node 2 - 1 row
I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It often says the closest token to a key is node 1 because, in effect...
- node 1 is responsible for 0 to 85070591730234615865843651857942052864
- node 2 is responsible for 85070591730234615865843651857942052864 to 127605887595351923798765477786913079296
- AND node 1 does the wrap-around from 127605887595351923798765477786913079296 to 0, as keys that would insert past the last token in the ring array wrap to 0 because insertMin is false.
Thoughts? Aaron On 3 May 2011, at 10:29, Eric tamme wrote: On Mon, May 2, 2011 at 5:59 PM, aaron morton aa...@thelastpickle.com wrote: My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() work. Eric, can you show the output from nodetool ring ? Sorry if the previous paste was way to unformatted, here is a pastie.org link with nicer formatting of nodetool ring output than plain text email allows. http://pastie.org/private/50khpakpffjhsmgf66oetg
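As a toy illustration of the wrap-around Aaron describes (this is not Cassandra's actual code): a key is assigned to the first ring token greater than or equal to the key's token, and anything past the last token wraps back to index 0, i.e. node 1.

from bisect import bisect_left

def first_token_index(ring_tokens, key_token):
    i = bisect_left(ring_tokens, key_token)     # first token >= key_token
    return 0 if i == len(ring_tokens) else i    # past the end wraps to node 1

ring = [85070591730234615865843651857942052864,    # node 1
        127605887595351923798765477786913079296]   # node 2

# Key tokens in (node 2's token, 2**127) fall past the end of the array and
# wrap to index 0, so node 1 ends up owning the wrap-around range as well,
# which is consistent with the 9-vs-1 skew reported above.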
Re: Using snapshot for backup and restore
Looking at the code for the snapshot it looks like it does not include secondary indexes. And I cannot see a way to manually trigger an index rebuild (via CFS.buildSecondaryIndexes()) Looking at this it's probably handy to snapshot them https://issues.apache.org/jira/browse/CASSANDRA-2470 I'm not sure if there is a reason for excluding them. Is this causing a problem right now ? Aaron On 3 May 2011, at 20:22, Arsene Lee wrote: Hi, We are trying to use snapshot for backup and restore. We found out that snapshot doesn’t take secondary indexes. We are wondering why is that? And is there any way we can rebuild the secondary index? Regards, Arsene
Write performance help needed
I am working for client that needs to persist 100K-200K records per second for later querying. As a proof of concept, we are looking at several options including nosql (Cassandra and MongoDB). I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz, Dual Core/4 logical cores) and have not been happy with the results. The best I have been able to accomplish is 100K records in approximately 30 seconds. Each record has 30 columns, mostly made up of integers. I have tried both the Hector and Pelops APIs, and have tried writing in batches versus one at a time. The times have not varied much. I am using the out of the box configuration for Cassandra, and while I know using 1 disk will have an impact on performance, I would expect to see better write numbers than I am. As a point of reference, the same test using MongoDB I was able to accomplish 100K records in 3.5 seconds. Any tips would be appreciated. - Steve
Re: low performance inserting
Hi, Not sure this is the cause of your bad performance, but you are measuring data creation and insertion together. Your data creation involves lots of class casts, which are probably quite slow. Try timing only the b.send part and see how long that takes. Roland On 03.05.2011 at 12:30, charles THIBAULT charl.thiba...@gmail.com wrote: Hello everybody, first: sorry for my english in advance!! I'm getting started with Cassandra on a 5 nodes cluster inserting data with the pycassa API. I've read everywere on internet that cassandra's performance are better than MySQL because of the writes append's only into commit logs files. When i'm trying to insert 100 000 rows with 10 columns per row with batch insert, I'v this result: 27 seconds But with MySQL (load data infile) this take only 2 seconds (using indexes) Here my configuration cassandra version: 0.7.5 nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213, 192.168.1.214 seed: 192.168.1.210 My script * #!/usr/bin/env python import pycassa import time import random from cassandra import ttypes pool = pycassa.connect('test', ['192.168.1.210:9160']) cf = pycassa.ColumnFamily(pool, 'test') b = cf.batch(queue_size=50, write_consistency_level=ttypes.ConsistencyLevel.ANY) tps1 = time.time() for i in range(10): columns = dict() for j in range(10): columns[str(j)] = str(random.randint(0,100)) b.insert(str(i), columns) b.send() tps2 = time.time() print(execution time: + str(tps2 - tps1) + seconds) * what I'm doing rong ?
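Following Roland's suggestion, here is a small variation of the script above that builds all the data first and times only the batched insert/send; connection details and row/column counts are taken from the original mail.

import time, random
import pycassa
from cassandra import ttypes

pool = pycassa.connect('test', ['192.168.1.210:9160'])
cf = pycassa.ColumnFamily(pool, 'test')
b = cf.batch(queue_size=50, write_consistency_level=ttypes.ConsistencyLevel.ANY)

rows = {}
for i in range(100000):                      # data creation, not timed
    rows[str(i)] = dict((str(j), str(random.randint(0, 100))) for j in range(10))

tps1 = time.time()                           # time only the batched inserts + final send
for key, columns in rows.items():
    b.insert(key, columns)
b.send()
tps2 = time.time()
print("insert/send time: " + str(tps2 - tps1) + " seconds")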
Re: Write performance help needed
Use more nodes to increase your write throughput. Testing on a single machine is not really a viable benchmark for what you can achieve with cassandra.
RE: Using snapshot for backup and restore
If the snapshot doesn't include secondary indexes then we can't use it for our backup and restore procedure. This means we need to stop our service when we want to do backups, and this would cause longer system downtime. If there is no particular reason for excluding them, it is probably a good idea to also include secondary indexes when taking the snapshot. Arsene From: aaron morton [aa...@thelastpickle.com] Sent: Tuesday, May 03, 2011 7:28 PM To: user@cassandra.apache.org Subject: Re: Using snapshot for backup and restore Looking at the code for the snapshot it looks like it does not include secondary indexes. And I cannot see a way to manually trigger an index rebuild (via CFS.buildSecondaryIndexes()) Looking at this it's probably handy to snapshot them https://issues.apache.org/jira/browse/CASSANDRA-2470 I'm not sure if there is a reason for excluding them. Is this causing a problem right now ? Aaron On 3 May 2011, at 20:22, Arsene Lee wrote: Hi, We are trying to use snapshot for backup and restore. We found out that snapshot doesn’t take secondary indexes. We are wondering why is that? And is there any way we can rebuild the secondary index? Regards, Arsene
Re: low performance inserting
There is probably a fair number of things you'd have to make sure you do to improve the write performance on the Cassandra side (starting by using multiple threads to do the insertion), but the first thing is probably to start comparing things that are at least mildly comparable. If you do inserts in Cassandra, you should try to do inserts in MySQL too, not load data infile (which really is just a bulk loading utility). And as stated here http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html: When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements. -- Sylvain On Tue, May 3, 2011 at 12:30 PM, charles THIBAULT charl.thiba...@gmail.com wrote: Hello everybody, first: sorry for my english in advance!! I'm getting started with Cassandra on a 5 nodes cluster inserting data with the pycassa API. I've read everywere on internet that cassandra's performance are better than MySQL because of the writes append's only into commit logs files. When i'm trying to insert 100 000 rows with 10 columns per row with batch insert, I'v this result: 27 seconds But with MySQL (load data infile) this take only 2 seconds (using indexes) Here my configuration cassandra version: 0.7.5 nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213, 192.168.1.214 seed: 192.168.1.210 My script * #!/usr/bin/env python import pycassa import time import random from cassandra import ttypes pool = pycassa.connect('test', ['192.168.1.210:9160']) cf = pycassa.ColumnFamily(pool, 'test') b = cf.batch(queue_size=50, write_consistency_level=ttypes.ConsistencyLevel.ANY) tps1 = time.time() for i in range(10): columns = dict() for j in range(10): columns[str(j)] = str(random.randint(0,100)) b.insert(str(i), columns) b.send() tps2 = time.time() print(execution time: + str(tps2 - tps1) + seconds) * what I'm doing rong ?
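A rough sketch of Sylvain's first suggestion (use multiple client threads), reusing the pycassa calls from Charles's script; the host list, keyspace and column family come from that script, while the thread count and the one-connection-per-thread layout are my own assumptions.

import threading, random
import pycassa

HOSTS = ['192.168.1.210:9160', '192.168.1.211:9160', '192.168.1.212:9160',
         '192.168.1.213:9160', '192.168.1.214:9160']
THREADS = 8
TOTAL_ROWS = 100000

def insert_range(start, stop):
    pool = pycassa.connect('test', HOSTS)        # one connection per thread
    cf = pycassa.ColumnFamily(pool, 'test')
    b = cf.batch(queue_size=50)
    for i in range(start, stop):
        columns = dict((str(j), str(random.randint(0, 100))) for j in range(10))
        b.insert(str(i), columns)
    b.send()                                     # flush whatever is left in the batch

chunk = TOTAL_ROWS // THREADS
threads = [threading.Thread(target=insert_range, args=(t * chunk, (t + 1) * chunk))
           for t in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

With CPython most of the gain comes from overlapping network round-trips rather than CPU parallelism; separate processes would sidestep the GIL entirely.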
Re: Replica data distributing between racks
Right, when you are computing balanced RP tokens for NTS you need to compute the tokens for each DC independently. On Tue, May 3, 2011 at 6:23 AM, aaron morton aa...@thelastpickle.com wrote: I've been digging into this and worked was able to reproduce something, not sure if it's a fault and I can't work on it any more tonight. To reproduce: - 2 node cluster on my mac book - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g. node 1 with 85070591730234615865843651857942052864 and node 2 127605887595351923798765477786913079296 - set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2 - create a keyspace using NTS and strategy_options = [{DC1:1}] Inserted 10 rows they were distributed as - node 1 - 9 rows - node 2 - 1 row I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It often says the closest token to a key is the node 1 because in effect... - node 1 is responsible for 0 to 85070591730234615865843651857942052864 - node 2 is responsible for 85070591730234615865843651857942052864 to 127605887595351923798765477786913079296 - AND node 1 does the wrap around from 127605887595351923798765477786913079296 to 0 as keys that would insert past the last token in the ring array wrap to 0 because insertMin is false. Thoughts ? Aaron On 3 May 2011, at 10:29, Eric tamme wrote: On Mon, May 2, 2011 at 5:59 PM, aaron morton aa...@thelastpickle.com wrote: My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() work. Eric, can you show the output from nodetool ring ? Sorry if the previous paste was way to unformatted, here is a pastie.org link with nicer formatting of nodetool ring output than plain text email allows. http://pastie.org/private/50khpakpffjhsmgf66oetg -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Write performance help needed
You don't give many details, but I would guess:
- your benchmark is not multithreaded
- mongodb is not configured for durable writes, so you're really only measuring the time for it to buffer the write in memory
- you haven't loaded enough data to hit the point where mongo's index doesn't fit in memory anymore
On Tue, May 3, 2011 at 8:24 AM, Steve Smith stevenpsmith...@gmail.com wrote: I am working for client that needs to persist 100K-200K records per second for later querying. As a proof of concept, we are looking at several options including nosql (Cassandra and MongoDB). I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz, Dual Core/4 logical cores) and have not been happy with the results. The best I have been able to accomplish is 100K records in approximately 30 seconds. Each record has 30 columns, mostly made up of integers. I have tried both the Hector and Pelops APIs, and have tried writing in batches versus one at a time. The times have not varied much. I am using the out of the box configuration for Cassandra, and while I know using 1 disk will have an impact on performance, I would expect to see better write numbers than I am. As a point of reference, the same test using MongoDB I was able to accomplish 100K records in 3.5 seconds. Any tips would be appreciated. - Steve -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
RE: Replica data distributing between racks
So we are currently running a 10 node ring in one DC, and we are going to be adding 5 more nodes in another DC. To keep the rings in each DC balanced, should I really calculate the tokens independently and just make sure none of them are the same? Something like:
DC1 (RF 5):
1: 0
2: 17014118346046923173168730371588410572
3: 34028236692093846346337460743176821144
4: 51042355038140769519506191114765231716
5: 68056473384187692692674921486353642288
6: 85070591730234615865843651857942052860
7: 102084710076281539039012382229530463432
8: 119098828422328462212181112601118874004
9: 136112946768375385385349842972707284576
10: 153127065114422308558518573344295695148
DC2 (RF 3):
1: 1 (one off from DC1 node 1)
2: 34028236692093846346337460743176821145 (one off from DC1 node 3)
3: 68056473384187692692674921486353642290 (two off from DC1 node 5)
4: 102084710076281539039012382229530463435 (three off from DC1 node 7)
5: 136112946768375385385349842972707284580 (four off from DC1 node 9)
Originally I was thinking I should spread the DC2 nodes evenly in between every other DC1 node. Or does it not matter where they are with respect to the DC1 nodes, as long as they fall somewhere after every other DC1 node? So it is DC1-1, DC2-1, DC1-2, DC1-3, DC2-2, DC1-4, DC1-5...
-Original Message- From: Jonathan Ellis [mailto:jbel...@gmail.com] Sent: Tuesday, May 03, 2011 9:14 AM To: user@cassandra.apache.org Subject: Re: Replica data distributing between racks Right, when you are computing balanced RP tokens for NTS you need to compute the tokens for each DC independently. On Tue, May 3, 2011 at 6:23 AM, aaron morton aa...@thelastpickle.com wrote: I've been digging into this and worked was able to reproduce something, not sure if it's a fault and I can't work on it any more tonight. To reproduce: - 2 node cluster on my mac book - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g. node 1 with 85070591730234615865843651857942052864 and node 2 127605887595351923798765477786913079296 - set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2 - create a keyspace using NTS and strategy_options = [{DC1:1}] Inserted 10 rows they were distributed as - node 1 - 9 rows - node 2 - 1 row I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It often says the closest token to a key is the node 1 because in effect... - node 1 is responsible for 0 to 85070591730234615865843651857942052864 - node 2 is responsible for 85070591730234615865843651857942052864 to 127605887595351923798765477786913079296 - AND node 1 does the wrap around from 127605887595351923798765477786913079296 to 0 as keys that would insert past the last token in the ring array wrap to 0 because insertMin is false. Thoughts ? Aaron On 3 May 2011, at 10:29, Eric tamme wrote: On Mon, May 2, 2011 at 5:59 PM, aaron morton aa...@thelastpickle.com wrote: My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() work. Eric, can you show the output from nodetool ring ? Sorry if the previous paste was way to unformatted, here is a pastie.org link with nicer formatting of nodetool ring output than plain text email allows. http://pastie.org/private/50khpakpffjhsmgf66oetg -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
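Jonathan's recipe above (balanced tokens per DC, offset so no two nodes collide) can be written as a small calculation; the 10- and 5-node counts match the layout described in this thread, and the offset of 1 for the second DC is just one arbitrary way to keep the tokens unique.

RING_SIZE = 2 ** 127   # RandomPartitioner token space

def balanced_tokens(node_count, offset=0):
    return [i * RING_SIZE // node_count + offset for i in range(node_count)]

dc1 = balanced_tokens(10, offset=0)   # existing 10-node DC
dc2 = balanced_tokens(5, offset=1)    # new 5-node DC, bumped by 1 to avoid collisions

for n, token in enumerate(dc1, 1):
    print("DC1 node %d: %d" % (n, token))
for n, token in enumerate(dc2, 1):
    print("DC2 node %d: %d" % (n, token))

The offsets only need to make the tokens unique across the whole cluster; any small, distinct per-DC offset will do.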
IOException: Unable to create hard link ... /snapshots/ ... (errno 17)
Running a 3 node cluster with cassandra-0.8.0-beta1, I'm seeing the first node logging lines like the following many (thousands of) times: Caused by: java.io.IOException: Unable to create hard link from /iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db to /iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db (errno 17) This seems to happen for all column families (including system). It happens a lot during startup. The hardlinks do exist. Stopping, deleting the hardlinks, and starting again does not help. But I haven't seen it once on the other nodes... ~mck
PS: the stacktrace:
java.io.IOError: java.io.IOException: Unable to create hard link from /iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db to /iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db (errno 17)
    at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1629)
    at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1654)
    at org.apache.cassandra.db.Table.snapshot(Table.java:198)
    at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:504)
    at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
    at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Unable to create hard link from /iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db to /iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db (errno 17)
    at org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:155)
    at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:713)
    at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1622)
    ... 10 more
Re: Replica data distributing between racks
On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis jbel...@gmail.com wrote: Right, when you are computing balanced RP tokens for NTS you need to compute the tokens for each DC independently. I am confused ... sorry. Are you saying that ... I need to change how my keys are calculated to fix this problem? Or are you talking about the implementation of how replication selects a token? -Eric
Re: low performance inserting
Hi Sylvain, thanks for your answer. I'd make a test with the stress utility inserting 100 000 rows with 10 columns per row I use these options: -o insert -t 5 -n 10 -c 10 -d 192.168.1.210,192.168.1.211,... result: 161 seconds with MySQL using inserts (after a dump): 1.79 second Charles 2011/5/3 Sylvain Lebresne sylv...@datastax.com There is probably a fair number of things you'd have to make sure you do to improve the write performance on the Cassandra side (starting by using multiple threads to do the insertion), but the first thing is probably to start comparing things that are at least mildly comparable. If you do inserts in Cassandra, you should try to do inserts in MySQL too, not load data infile (which really is just a bulk loading utility). And as stated here http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html: When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements. -- Sylvain On Tue, May 3, 2011 at 12:30 PM, charles THIBAULT charl.thiba...@gmail.com wrote: Hello everybody, first: sorry for my english in advance!! I'm getting started with Cassandra on a 5 nodes cluster inserting data with the pycassa API. I've read everywere on internet that cassandra's performance are better than MySQL because of the writes append's only into commit logs files. When i'm trying to insert 100 000 rows with 10 columns per row with batch insert, I'v this result: 27 seconds But with MySQL (load data infile) this take only 2 seconds (using indexes) Here my configuration cassandra version: 0.7.5 nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213, 192.168.1.214 seed: 192.168.1.210 My script * #!/usr/bin/env python import pycassa import time import random from cassandra import ttypes pool = pycassa.connect('test', ['192.168.1.210:9160']) cf = pycassa.ColumnFamily(pool, 'test') b = cf.batch(queue_size=50, write_consistency_level=ttypes.ConsistencyLevel.ANY) tps1 = time.time() for i in range(10): columns = dict() for j in range(10): columns[str(j)] = str(random.randint(0,100)) b.insert(str(i), columns) b.send() tps2 = time.time() print(execution time: + str(tps2 - tps1) + seconds) * what I'm doing rong ?
Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Hey everyone, We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, just to make sure that the change in how keys are encoded wouldn't cause us any dataloss. Unfortunately it seems that rows stored under a unicode key couldn't be retrieved after the upgrade. We're running everything on Windows, and we're using the generated thrift client in C# to access it. I managed to make a minimal test to reproduce the error consistently: First, I started up Cassandra 0.6.13 with an empty data directory, and a really simple config with a single keyspace with a single bytestype columnfamily. I wrote two rows, each with a single column with a simple column name and a 1-byte value of 1. The first row had a key using only ascii chars ('foo'), and the second row had a key using unicode chars ('ドメインウ'). Using multi_get, and both those keys, I got both columns back, as expected. Using multi_get_slice and both those keys, I got both columns back, as expected. I also did a get_range_slices to get all rows in the columnfamily, and I got both columns back, as expected. So far so good. Then I drain and shut down Cassandra 0.6.13, and start up Cassandra 0.7.5, pointing to the same data directory, with a config containing the same keyspace, and I run the schematool import command. I then start up my test program that uses the new thrift api, and run some commands. Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I only get back one column, the one under the key 'foo'. The other row I simply can't retrieve. However, when I use get_range_slices to get all rows, I get back two rows, with the correct column values, and the byte-array keys are identical to my encoded keys, and when I decode the byte-arrays as UTF8 drings, I get back my two original keys. This means that both my rows are still there, the keys as output by Cassandra are identical to the original string keys I used when I created the rows in 0.6.13, but it's just impossible to retrieve the second row. To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8 again, and gave it a similar column as the original, but with a 1-byte value of 2. Now, when I use multi_get_slice with my two encoded keys, I get back two rows, the 'foo' row has the old value as expected, and the other row has the new value as expected. However, when I use get_range_slices to get all rows, I get back *three* rows, two of which have the *exact same* byte-array key, one has the old column, one has the new column. How is this possible? How can there be two different rows with the exact same key? I'm guessing that it's related to the encoding of string keys in 0.6, and that the internal representation is off somehow. I checked the generated thrift client for 0.6, and it UTF8-encodes all keys before sending them to the server, so it should be UTF8 all the way, but apparently it isn't. Has anyone else experienced the same problem? Is it a platform-specific problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not lose any rows? I would also really like to know which byte-array I should send in to get back that second row, there's gotta be some key that can be used to get it, the row is still there after all. /Henrik Schröder
Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
The way we solved this problem is that it turned out we had only a few hundred rows with unicode keys, so we simply extracted them, upgraded to 0.7, and wrote them back. However, this means that among the rows there are a few hundred weird duplicate rows with identical keys. Is this going to be a problem in the future? Is there a chance that the good duplicate is cleaned out in favour of the bad duplicate so that we suddenly lose those rows again? /Henrik Schröder
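For reference, a rough sketch of the workaround Henrik describes (copy out the rows whose keys contain non-ASCII characters, then re-insert them after the upgrade); the pycassa usage, keyspace and column family names are assumptions on my part, and the re-inserted columns naturally get new timestamps.

import pycassa

pool = pycassa.connect('Keyspace1', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'Users')

unicode_rows = {}
for key, columns in cf.get_range():            # full scan of the column family
    if any(ord(ch) > 127 for ch in key):       # key is not plain ASCII
        unicode_rows[key] = columns

# ... upgrade the cluster from 0.6 to 0.7 ...

for key, columns in unicode_rows.items():
    cf.insert(key, columns)                    # rewritten under a clean UTF-8 key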
Re: One cluster or many?
I would add that running one cluster is operationally less work than running multiple. On Tue, May 3, 2011 at 4:15 AM, David Boxenhorn da...@taotown.com wrote: If I have a database that partitions naturally into non-overlapping datasets, in which there are no references between datasets, where each dataset is quite large (i.e. large enough to merit its own cluster from the point of view of quantity of data), should I set up one cluster per database or one large cluster for everything together? As I see it: The primary advantage of separate clusters is total isolation: if I have a problem with one dataset, my application will continue working normally for all other datasets. The primary advantage of one big cluster is usage pooling: when one server goes down in a large cluster it's much less important than when one server goes down in a small cluster. Also, different temporal usage patterns of the different datasets (i.e. there will be different peak hours on different datasets) can be combined to ease capacity requirements. Any thoughts? -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Experiences with MapReduce Stress Tests
Writing to Cassandra from map/reduce jobs over HDFS shouldn't be a problem. We're doing it in our cluster and I know of others doing the same thing. You might just make sure the number of reducers (or mappers) writing to cassandra don't overwhelm it. There's no data locality for writes, though a cassandra specific partitioner might help with that in the future. See CASSANDRA-1473 - https://issues.apache.org/jira/browse/CASSANDRA-1473. I apologize that I misspoke about one of the settings. The batch size is in fact the number of rows it gets each time. The input splits just affects how many mappers it splits the data into. As far as recommending this solution, it really depends on the problem. The people I know doing what you're thinking of doing typically store raw data in HDFS, perform mapreduce jobs over that data and output the results into Cassandra for realtime queries. We're using it where I work for storage and analytics both. We store raw data into S3/HDFS, mapreduce over that data and output into cassandra, then perform realtime queries as well as analytics over that data. If you want to do run analytics over Cassandra data, you'll want to partition your cluster so that mapreduce jobs don't affect the realtime performance. On May 3, 2011, at 3:19 AM, Subscriber wrote: Hi Jeremy, yes, the setup on the data-nodes is: - Hadoop DataNode - Hadoop TaskTracker - CassandraDaemon However - the map-input is not read from Cassandra. I am running a writing stress test - no reads (well from time to time I check the produced items using cassandra-cli). Is it possible to achieve data-locality on writes? Well I think that this is (in practice) not possible (one could create some artificial data that correlates with the hashed row-key values or so ... ;-) Thanks for all your tips and hints! It's good see that someone worries about my problems :-) But - to be honest - my number one priority is not to get this test running but to answer the question whether the setup Cassandra+Hadoop with massive parallel writes (using map/reduce) meets the demands of our customer. I found out that the following configuration helps a lot. * disk_access_mode: standard * MAX_HEAP_SIZE=4G * HEAP_NEWSIZE=400M * rpc_timeout_in_ms: 2 Now the stress test runs through, but there are still timeouts (Hadoop reschedules the failing mapper tasks on another node and so the test runs through). But what causes this timeouts? 20 seconds are a long time for a modern cpu (and an eternity for an android ;-) It seems to me that it's not only the massive amount of data or to many parallel mappers, because Cassandra can handle this huge write rate over one hour! I found in the system.logs that the ConcurrentMarkSweeps take quite long (up to 8 seconds). The heap size didn't grow much about 3GB so there was still enough air to breath. So the question remains: can I recommend this setup? Thanks again and best regards Udo Am 02.05.2011 um 20:21 schrieb Jeremy Hanna: Udo, One thing to get out of the way - you're running task trackers on all of your cassandra nodes, right? That is the first and foremost way to get good performance. Otherwise you don't have data locality, which is really the point of map/reduce, co-locating your data and your processes operating over that data. You're probably already doing that, but I had forgotten to ask that before. Besides that... You might try messing with those values a bit more as well as the input split size - cassandra.input.split.size which defaults to ~65k. 
So you might try rpc timeout of 30s just to see if that helps and try reducing the input split size significantly to see if that helps. For your setup I don't see the range batch size as being meaningful at all with your narrow rows, so don't worry about that. Also, the capacity of your nodes and the number of mappers/reducers you're trying to use will also have an effect on whether it has to timeout. Essentially it's getting overwhelmed for some reason. You might lower the number of mappers and reducers you're hitting your cassandra cluster with to see if that helps. Jeremy On May 2, 2011, at 6:25 AM, Subscriber wrote: Hi Jeremy, thanks for the link. I doubled the rpc_timeout (20 seconds) and reduced the range-batch-size to 2048, but I still get timeouts... Udo Am 29.04.2011 um 18:53 schrieb Jeremy Hanna: It sounds like there might be some tuning you can do to your jobs - take a look at the wiki's HadoopSupport page, specifically the Troubleshooting section: http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting On Apr 29, 2011, at 11:45 AM, Subscriber wrote: Hi all, We want to share our experiences we got during our Cassandra plus Hadoop Map/Reduce evaluation. Our question was whether Cassandra is suitable for massive distributed data
Re: Range Slice Issue
Do you still see this behavior if you disable dynamic snitch? On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam adam.sered...@serialssolutions.com wrote: We appear to have encountered an issue with cassandra 0.7.5 after upgrading from 0.7.2. While doing a batch read using a get_range_slice against the ranges an individual node is master for, we are able to reproduce consistently that the last two nodes in the ring, regardless of the ring size (we have a 60 node production cluster and a 12 node test cluster), perform this read over the network using replicas instead of executing locally. Every other node in the ring successfully reads locally. To be sure there were no data consistency issues we performed a nodetool repair against both of these nodes and the issue persists. We also tried truncating the column family and repopulating, but the issue remains. This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read data locally if it is available there. We use Cassandra.Client.describe_ring() to figure out which machine in the ring is master for which TokenRange. I then compare the master for each TokenRange against the localhost to find out which token ranges are owned by the local machine (remote reads are too slow for this type of batch processing). Once I know which TokenRanges are on each machine locally I get evenly sized splits using Cassandra.Client.describe_splits(). Adam -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
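For reference, a rough sketch of the describe_ring/describe_splits flow Adam describes, using the Thrift-generated Python client; the host, keyspace and column family names, the way the local address is determined, and the exact 0.7 method signatures are assumptions on my part.

import socket
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra

# Simplification: assumes the node's listen address resolves from its hostname.
local_ip = socket.gethostbyname(socket.gethostname())

sock = TSocket.TSocket('localhost', 9160)
transport = TTransport.TFramedTransport(sock)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()
client.set_keyspace('Keyspace1')

local_splits = []
for token_range in client.describe_ring('Keyspace1'):
    # Keep only the ranges where this machine is listed first (the "master").
    if token_range.endpoints and token_range.endpoints[0] == local_ip:
        split_tokens = client.describe_splits('MyColumnFamily',
                                              token_range.start_token,
                                              token_range.end_token,
                                              65536)
        # Consecutive split tokens delimit evenly sized sub-ranges to scan.
        local_splits.extend(zip(split_tokens, split_tokens[1:]))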
Re: Range Slice Issue
I just ran a test and we do not see that behavior with dynamic snitch disabled. All nodes appear to be doing local reads as expected. On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote: Do you still see this behavior if you disable dynamic snitch? On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam adam.sered...@serialssolutions.com wrote: We appear to have encountered an issue with cassandra 0.7.5 after upgrading from 0.7.2. While doing a batch read using a get_range_slice against the ranges an individual node is master for we are able to reproduce consistently that the last two nodes in the ring, regardless of the ring size (we have a 60 node production cluster and a 12 node test cluster) perform this read over the network using replicas of executing locally. Every other node in the ring successfully reads locally. To be sure there were no data consistency issues we performed a nodetool repair against both of these nodes and the issue persists. We also tried truncating the column family and repopulating, but the issue remains. This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read data locally if it is available there. We use Cassandra.Client.describe_ring() to figure out which machine in the ring is master for which TokenRange. I then compare the master for each TokenRange against the localhost to find out which token ranges are owned by the local machine (remote reads are too slow for this type of batch processing). Once I know which TokenRanges are on each machine locally I get evenly sized splits using Cassandra.Client.describe_splits(). Adam -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Range Slice Issue
So either (a) dynamic snitch is wrong or (b) those nodes really are more heavily loaded than the others, and are correctly pushing queries to other replicas. On Tue, May 3, 2011 at 12:47 PM, Serediuk, Adam adam.sered...@serialssolutions.com wrote: I just ran a test and we do not see that behavior with dynamic snitch disabled. All nodes appear to be doing local reads as expected. On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote: Do you still see this behavior if you disable dynamic snitch? On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam adam.sered...@serialssolutions.com wrote: We appear to have encountered an issue with cassandra 0.7.5 after upgrading from 0.7.2. While doing a batch read using a get_range_slice against the ranges an individual node is master for we are able to reproduce consistently that the last two nodes in the ring, regardless of the ring size (we have a 60 node production cluster and a 12 node test cluster) perform this read over the network using replicas of executing locally. Every other node in the ring successfully reads locally. To be sure there were no data consistency issues we performed a nodetool repair against both of these nodes and the issue persists. We also tried truncating the column family and repopulating, but the issue remains. This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read data locally if it is available there. We use Cassandra.Client.describe_ring() to figure out which machine in the ring is master for which TokenRange. I then compare the master for each TokenRange against the localhost to find out which token ranges are owned by the local machine (remote reads are too slow for this type of batch processing). Once I know which TokenRanges are on each machine locally I get evenly sized splits using Cassandra.Client.describe_splits(). Adam -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
MemtablePostFlusher with high number of pending calls?
Cassandra 0.8 beta trunk from about 1 week ago:

Pool Name                Active   Pending   Completed
ReadStage                     0         0           5
RequestResponseStage          0         0       87129
MutationStage                 0         0      187298
ReadRepairStage               0         0           0
ReplicateOnWriteStage         0         0           0
GossipStage                   0         0     1353524
AntiEntropyStage              0         0           0
MigrationStage                0         0          10
MemtablePostFlusher           1       190         108
StreamStage                   0         0           0
FlushWriter                   0         0         302
FILEUTILS-DELETE-POOL         0         0          26
MiscStage                     0         0           0
FlushSorter                   0         0           0
InternalResponseStage         0         0           0
HintedHandoff                 1         4           7

Anyone with nice theories about the pending value on the MemtablePostFlusher? Regards, Terje
Re: MemtablePostFlusher with high number of pending calls?
Does it resolve down to 0 eventually if you stop doing writes? On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Cassandra 0.8 beta trunk from about 1 week ago: Pool Name Active Pending Completed ReadStage 0 0 5 RequestResponseStage 0 0 87129 MutationStage 0 0 187298 ReadRepairStage 0 0 0 ReplicateOnWriteStage 0 0 0 GossipStage 0 0 1353524 AntiEntropyStage 0 0 0 MigrationStage 0 0 10 MemtablePostFlusher 1 190 108 StreamStage 0 0 0 FlushWriter 0 0 302 FILEUTILS-DELETE-POOL 0 0 26 MiscStage 0 0 0 FlushSorter 0 0 0 InternalResponseStage 0 0 0 HintedHandoff 1 4 7 Anyone with nice theories about the pending value on the memtable post flusher? Regards, Terje -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: MemtablePostFlusher with high number of pending calls?
... and are there any exceptions in the log? On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote: Does it resolve down to 0 eventually if you stop doing writes? On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Cassandra 0.8 beta trunk from about 1 week ago: Pool Name Active Pending Completed ReadStage 0 0 5 RequestResponseStage 0 0 87129 MutationStage 0 0 187298 ReadRepairStage 0 0 0 ReplicateOnWriteStage 0 0 0 GossipStage 0 0 1353524 AntiEntropyStage 0 0 0 MigrationStage 0 0 10 MemtablePostFlusher 1 190 108 StreamStage 0 0 0 FlushWriter 0 0 302 FILEUTILS-DELETE-POOL 0 0 26 MiscStage 0 0 0 FlushSorter 0 0 0 InternalResponseStage 0 0 0 HintedHandoff 1 4 7 Anyone with nice theories about the pending value on the memtable post flusher? Regards, Terje -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Range Slice Issue
Both data and system load are equal across all nodes and the smaller test cluster also exhibits the same issue. tokens are balanced and total node size is equivalent. On May 3, 2011, at 10:51 AM, Jonathan Ellis wrote: So either (a) dynamic snitch is wrong or (b) those nodes really are more heavily loaded than the others, and are correctly pushing queries to other replicas. On Tue, May 3, 2011 at 12:47 PM, Serediuk, Adam adam.sered...@serialssolutions.com wrote: I just ran a test and we do not see that behavior with dynamic snitch disabled. All nodes appear to be doing local reads as expected. On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote: Do you still see this behavior if you disable dynamic snitch? On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam adam.sered...@serialssolutions.com wrote: We appear to have encountered an issue with cassandra 0.7.5 after upgrading from 0.7.2. While doing a batch read using a get_range_slice against the ranges an individual node is master for we are able to reproduce consistently that the last two nodes in the ring, regardless of the ring size (we have a 60 node production cluster and a 12 node test cluster) perform this read over the network using replicas of executing locally. Every other node in the ring successfully reads locally. To be sure there were no data consistency issues we performed a nodetool repair against both of these nodes and the issue persists. We also tried truncating the column family and repopulating, but the issue remains. This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read data locally if it is available there. We use Cassandra.Client.describe_ring() to figure out which machine in the ring is master for which TokenRange. I then compare the master for each TokenRange against the localhost to find out which token ranges are owned by the local machine (remote reads are too slow for this type of batch processing). Once I know which TokenRanges are on each machine locally I get evenly sized splits using Cassandra.Client.describe_splits(). Adam -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: MemtablePostFlusher with high number of pending calls?
Just some very tiny amount of writes in the background here (some hints spooled up on another node slowly coming in). No new data. I thought there was no exceptions, but I did not look far enough back in the log at first. Going back a bit further now however, I see that about 50 hours ago: ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:387,1,main] java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160) at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225) at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356) at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335) at org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [followed by a few more of those...] and then a bunch of these: ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597) at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50) at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more Seems like compactions stopped after this (a bunch of tmp tables there still from when those errors where generated), and I can only suspect the post flusher may have stopped at the same time. There is 890GB of disk for data, sstables are currently using 604G (139GB is old tmp tables from when it ran out of disk) and ring tells me the load on the node is 313GB. Terje On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote: ... and are there any exceptions in the log? On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote: Does it resolve down to 0 eventually if you stop doing writes? 
On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Cassandra 0.8 beta trunk from about 1 week ago: Pool NameActive Pending Completed ReadStage 0 0 5 RequestResponseStage 0 0 87129 MutationStage 0 0 187298 ReadRepairStage 0 0 0 ReplicateOnWriteStage 0 0 0 GossipStage 0 01353524 AntiEntropyStage 0 0 0 MigrationStage0 0 10 MemtablePostFlusher 1 190108 StreamStage 0 0 0 FlushWriter 0 0302 FILEUTILS-DELETE-POOL 0 0 26 MiscStage 0 0 0 FlushSorter 0 0 0 InternalResponseStage 0 0 0 HintedHandoff 1 4 7 Anyone with nice
Re: MemtablePostFlusher with high number of pending calls?
So yes, there is currently some 200GB empty disk. On Wed, May 4, 2011 at 3:20 AM, Terje Marthinussen tmarthinus...@gmail.comwrote: Just some very tiny amount of writes in the background here (some hints spooled up on another node slowly coming in). No new data. I thought there was no exceptions, but I did not look far enough back in the log at first. Going back a bit further now however, I see that about 50 hours ago: ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:387,1,main] java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160) at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225) at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356) at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335) at org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [followed by a few more of those...] and then a bunch of these: ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597) at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50) at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more Seems like compactions stopped after this (a bunch of tmp tables there still from when those errors where generated), and I can only suspect the post flusher may have stopped at the same time. There is 890GB of disk for data, sstables are currently using 604G (139GB is old tmp tables from when it ran out of disk) and ring tells me the load on the node is 313GB. Terje On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote: ... and are there any exceptions in the log? 
On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote: Does it resolve down to 0 eventually if you stop doing writes? On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Cassandra 0.8 beta trunk from about 1 week ago: Pool NameActive Pending Completed ReadStage 0 0 5 RequestResponseStage 0 0 87129 MutationStage 0 0 187298 ReadRepairStage 0 0 0 ReplicateOnWriteStage 0 0 0 GossipStage 0 01353524 AntiEntropyStage 0 0 0 MigrationStage0 0 10 MemtablePostFlusher 1 190108 StreamStage 0 0 0 FlushWriter 0 0302 FILEUTILS-DELETE-POOL 0 0 26 MiscStage 0 0 0
Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)
On Tue, 2011-05-03 at 16:52 +0200, Mck wrote: Running a 3 node cluster with cassandra-0.8.0-beta1 I'm seeing the first node logging many (thousands) times The only special thing about this first node is that it receives all the writes from our sybase-cassandra import job. This process migrates an existing 60 million rows into cassandra (before the cluster is /turned on/ for normal operations). The import job runs over ~20 minutes. I wiped everything and started from scratch, this time running the import job with cassandra configured instead with: incremental_backups: false snapshot_before_compaction: false This then created the problem on another node. So changing to these settings on all nodes and running the import again fixed it: no more Unable to create hard link ... After the import I could turn both incremental_backups and snapshot_before_compaction to true again without problems so far. To me this says something is broken with incremental_backups and snapshot_before_compaction under heavy writing? ~mck
Re: MemtablePostFlusher with high number of pending calls?
post flusher is responsible for updating commitlog header after a flush; each task waits for a specific flush to complete, then does its thing. so when you had a flush catastrophically fail, its corresponding post-flush task will be stuck. On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Just some very tiny amount of writes in the background here (some hints spooled up on another node slowly coming in). No new data. I thought there was no exceptions, but I did not look far enough back in the log at first. Going back a bit further now however, I see that about 50 hours ago: ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:387,1,main] java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160) at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225) at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356) at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335) at org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [followed by a few more of those...] and then a bunch of these: ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597) at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50) at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more Seems like compactions stopped after this (a bunch of tmp tables there still from when those errors where generated), and I can only suspect the post flusher may have stopped at the same time. 
There is 890GB of disk for data, sstables are currently using 604G (139GB is old tmp tables from when it ran out of disk) and ring tells me the load on the node is 313GB. Terje On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote: ... and are there any exceptions in the log? On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote: Does it resolve down to 0 eventually if you stop doing writes? On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Cassandra 0.8 beta trunk from about 1 week ago: Pool Name Active Pending Completed ReadStage 0 0 5 RequestResponseStage 0 0 87129 MutationStage 0 0 187298 ReadRepairStage 0 0 0 ReplicateOnWriteStage 0 0 0 GossipStage 0 0 1353524 AntiEntropyStage 0 0 0 MigrationStage 0 0 10 MemtablePostFlusher 1 190 108 StreamStage 0 0 0
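To make the explanation above concrete, here is a toy model of the flush / post-flush hand-off (a sketch for illustration only, not Cassandra's actual code): the flush signals its latch only on success, so the post-flush task queued behind a failed flush blocks forever, and everything submitted after it shows up as a growing Pending count on MemtablePostFlusher.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model, not Cassandra source: a flush signals a latch only on success,
// and the corresponding post-flush task waits on that latch before updating
// the commitlog header. If the flush throws ("No space left on device"),
// the latch never opens, the post-flush task is stuck, and later post-flush
// work queues up behind it.
public class StuckPostFlushDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService flushWriter = Executors.newSingleThreadExecutor();
        ExecutorService postFlusher = Executors.newSingleThreadExecutor();
        CountDownLatch flushDone = new CountDownLatch(1);
        boolean diskFull = true; // simulate the failed flush

        flushWriter.submit(() -> {
            if (diskFull) {
                throw new RuntimeException("Insufficient disk space to flush");
            }
            flushDone.countDown(); // only reached on success
            return null;
        });

        postFlusher.submit(() -> {
            flushDone.await(); // blocks forever: the flush never succeeded
            System.out.println("commitlog header updated");
            return null;
        });

        // Anything submitted now just waits behind the stuck task,
        // i.e. MemtablePostFlusher shows Active=1 and Pending growing.
        postFlusher.submit(() -> System.out.println("next post-flush task"));

        Thread.sleep(500);
        flushWriter.shutdownNow();
        postFlusher.shutdownNow(); // interrupt the demo so the JVM can exit
    }
}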
Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)
On Tue, 2011-05-03 at 13:52 -0500, Jonathan Ellis wrote: you should probably look to see what errno 17 means for the link system call on your system. It means that the file already exists. It seems Cassandra is trying to make the same hard link in parallel (under heavy write load)? I see now that I can also reproduce the problem with hadoop and ColumnFamilyOutputFormat. Turning off snapshot_before_compaction seems to be enough to prevent it. ~mck
Re: Using snapshot for backup and restore
You're right, this is an oversight. Created https://issues.apache.org/jira/browse/CASSANDRA-2596 to fix. As for a workaround, you can drop the index + recreate. (Upgrade to 0.7.5 first, if you haven't yet.) On Tue, May 3, 2011 at 3:22 AM, Arsene Lee arsene@ruckuswireless.com wrote: Hi, We are trying to use snapshot for backup and restore. We found out that snapshot doesn’t take secondary indexes. We are wondering why that is. And is there any way we can rebuild the secondary index? Regards, Arsene -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)
Ah, that makes sense. snapshot_before_compaction is trying to snapshot, but incremental_backups already created one (for newly flushed sstables). You're probably the only one running with both options on. :) Can you create a ticket? On Tue, May 3, 2011 at 2:05 PM, Mck m...@apache.org wrote: On Tue, 2011-05-03 at 13:52 -0500, Jonathan Ellis wrote: you should probably look to see what errno 17 means for the link system call on your system. That the file already exists. It seems cassandra is trying to make the same hard link in parallel (under heavy write load) ? I see now i can also reproduce the problem with hadoop and ColumnFamilyOutputFormat. Turning off snapshot_before_compaction seems to be enough to prevent it. ~mck -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
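For readers hitting the same thing: errno 17 is EEXIST, so the snapshot path is simply losing a race to create a hard link that the incremental-backup path already made for the same sstable. As an illustration only (Cassandra 0.7 creates links via JNA/exec, not java.nio.file), here is the shape of a link helper that treats the collision as benign:

import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: create a hard link, treating "already exists" as success. This is
// an analogy for the race described above, not the project's snapshot code.
public final class HardLinks {
    private HardLinks() {}

    public static void createHardLinkIfAbsent(Path link, Path existing) throws IOException {
        try {
            Files.createLink(link, existing);
        } catch (FileAlreadyExistsException e) {
            // Another writer (e.g. incremental_backups racing with
            // snapshot_before_compaction) linked the same file first;
            // the link exists either way, so ignore the collision.
        }
    }
}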
Re: MemtablePostFlusher with high number of pending calls?
Yes, I realize that. I am bit curious why it ran out of disk, or rather, why I have 200GB empty disk now, but unfortunately it seems like we may not have had monitoring enabled on this node to tell me what happened in terms of disk usage. I also thought that compaction was supposed to resume (try again with less data) if it fails? Terje On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis jbel...@gmail.com wrote: post flusher is responsible for updating commitlog header after a flush; each task waits for a specific flush to complete, then does its thing. so when you had a flush catastrophically fail, its corresponding post-flush task will be stuck. On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Just some very tiny amount of writes in the background here (some hints spooled up on another node slowly coming in). No new data. I thought there was no exceptions, but I did not look far enough back in the log at first. Going back a bit further now however, I see that about 50 hours ago: ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:387,1,main] java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160) at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225) at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356) at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335) at org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [followed by a few more of those...] 
and then a bunch of these: ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597) at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50) at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more Seems like compactions stopped after this (a bunch of tmp tables there still from when those errors where generated), and I can only suspect the post flusher may have stopped at the same time. There is 890GB of disk for data, sstables are currently using 604G (139GB is old tmp tables from when it ran out of disk) and ring tells me the load on the node is 313GB. Terje On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote: ... and are there any exceptions in the log? On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote: Does it resolve down to 0 eventually if you stop doing writes? On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Cassandra 0.8 beta trunk from about 1 week ago: Pool NameActive Pending Completed ReadStage 0 0 5 RequestResponseStage 0
Re: Replica data distributing between racks
Jonathan, I think you are saying each DC should have its own (logical) token ring, which makes sense as the only way to balance the load in each DC. I think most people (including me) assumed there was a single token ring for the entire cluster. But currently two endpoints cannot have the same token regardless of the DC they are in. Or should people just bump the tokens in extra DCs to avoid the collision? Cheers Aaron On 4 May 2011, at 03:03, Eric tamme wrote: On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis jbel...@gmail.com wrote: Right, when you are computing balanced RP tokens for NTS you need to compute the tokens for each DC independently. I am confused ... sorry. Are you saying that ... I need to change how my keys are calculated to fix this problem? Or are you talking about the implementation of how replication selects a token? -Eric
Re: Problems recovering a dead node
When you say it's clean, does that mean the node has no data files? After you replaced the disk, what process did you use to recover? Also, what version are you running and what's the recent upgrade history? Cheers Aaron On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote: Hi everyone. One of the nodes in my 6 node cluster died with disk failures. I have replaced the disks, and it's clean. It has the same configuration (same IP, same token). When I try to restart the node it starts to throw mmap underflow exceptions until it closes again. I tried setting io to standard, but it still fails. It gives errors about two decorated keys being different, and an EOFException. Here is an excerpt of the log: http://pastebin.com/ZXW1wY6T I can provide more info if needed. I'm at a loss here, so any help is appreciated. Thanks all for your time Héctor Izquierdo
Re: Write performance help needed
To give an idea, last March (2010) I ran a much older Cassandra on 10 HP blades (dual socket, 4 core, 16GB, 2.5" laptop HDDs) and was writing around 250K columns per second with 500 python processes loading the data from wikipedia running on another 10 HP blades. This was my first out-of-the-box, no-tuning test (other than using sensible batch updates). Since then Cassandra has gotten much faster. Hope that helps Aaron On 4 May 2011, at 02:22, Jonathan Ellis wrote: You don't give many details, but I would guess: - your benchmark is not multithreaded - mongodb is not configured for durable writes, so you're really only measuring the time for it to buffer it in memory - you haven't loaded enough data to hit the point where mongo's index doesn't fit in memory anymore On Tue, May 3, 2011 at 8:24 AM, Steve Smith stevenpsmith...@gmail.com wrote: I am working for a client that needs to persist 100K-200K records per second for later querying. As a proof of concept, we are looking at several options including nosql (Cassandra and MongoDB). I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz, Dual Core/4 logical cores) and have not been happy with the results. The best I have been able to accomplish is 100K records in approximately 30 seconds. Each record has 30 columns, mostly made up of integers. I have tried both the Hector and Pelops APIs, and have tried writing in batches versus one at a time. The times have not varied much. I am using the out-of-the-box configuration for Cassandra, and while I know using 1 disk will have an impact on performance, I would expect to see better write numbers than I am. As a point of reference, with the same test using MongoDB I was able to accomplish 100K records in 3.5 seconds. Any tips would be appreciated. - Steve -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
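Since the single-threaded-benchmark point above is the most common culprit, a rough sketch of the shape of a multithreaded loader may help; insertBatch() below is a placeholder for whatever Hector/Pelops/Thrift batch call is actually being benchmarked, not a real client API:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a multithreaded write benchmark. The client call is deliberately
// abstracted away: insertBatch(...) stands in for your Hector/Pelops/Thrift
// batch mutation; it is NOT a real library method.
public class ParallelWriteBench {
    static final int THREADS = 16;            // more than one client thread
    static final int BATCH_SIZE = 100;        // rows per batch-style call
    static final int BATCHES_PER_THREAD = 1000;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        AtomicLong rowsWritten = new AtomicLong();
        long start = System.nanoTime();

        for (int t = 0; t < THREADS; t++) {
            pool.submit(() -> {
                for (int b = 0; b < BATCHES_PER_THREAD; b++) {
                    insertBatch(BATCH_SIZE);          // placeholder client call
                    rowsWritten.addAndGet(BATCH_SIZE);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d rows in %.1fs = %.0f rows/s%n",
                rowsWritten.get(), secs, rowsWritten.get() / secs);
    }

    // Placeholder: wire this up to your client's batch insert (e.g. a Hector
    // mutator or a Thrift batch mutation). Here it just simulates the work.
    static void insertBatch(int rows) {
        // no-op in this sketch
    }
}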
Re: MemtablePostFlusher with high number of pending calls?
Compaction does, but flush didn't until https://issues.apache.org/jira/browse/CASSANDRA-2404 On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Yes, I realize that. I am bit curious why it ran out of disk, or rather, why I have 200GB empty disk now, but unfortunately it seems like we may not have had monitoring enabled on this node to tell me what happened in terms of disk usage. I also thought that compaction was supposed to resume (try again with less data) if it fails? Terje On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis jbel...@gmail.com wrote: post flusher is responsible for updating commitlog header after a flush; each task waits for a specific flush to complete, then does its thing. so when you had a flush catastrophically fail, its corresponding post-flush task will be stuck. On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Just some very tiny amount of writes in the background here (some hints spooled up on another node slowly coming in). No new data. I thought there was no exceptions, but I did not look far enough back in the log at first. Going back a bit further now however, I see that about 50 hours ago: ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:387,1,main] java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160) at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225) at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356) at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335) at org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [followed by a few more of those...] 
and then a bunch of these: ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597) at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50) at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more Seems like compactions stopped after this (a bunch of tmp tables there still from when those errors where generated), and I can only suspect the post flusher may have stopped at the same time. There is 890GB of disk for data, sstables are currently using 604G (139GB is old tmp tables from when it ran out of disk) and ring tells me the load on the node is 313GB. Terje On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis jbel...@gmail.com wrote: ... and are there any exceptions in the log? On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis jbel...@gmail.com wrote: Does it resolve down to 0 eventually if you stop doing writes? On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Cassandra 0.8 beta
Re: Replica data distributing between racks
On Tue, May 3, 2011 at 2:46 PM, aaron morton aa...@thelastpickle.com wrote: Jonathan, I think you are saying each DC should have its own (logical) token ring. Right. (Only with NTS, although you'd usually end up with a similar effect if you alternate DC locations for nodes in an ONTS cluster.) But currently two endpoints cannot have the same token regardless of the DC they are in. Also right. Or should people just bump the tokens in extra DCs to avoid the collision? Yes. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
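To put numbers on "compute the tokens for each DC independently and bump them", here is a small sketch for the RandomPartitioner (token space 0..2^127); the node counts and the +1 offset for the second DC are illustrative, not prescriptive. Offsetting by 1 does not meaningfully change which keys land where; it only works around the rule that two endpoints cannot share the exact same token.

import java.math.BigInteger;

// Sketch: balanced RandomPartitioner tokens computed per data center.
// Each DC gets its own evenly spaced "logical ring"; nodes in the second,
// third, ... DC get a small offset (+1, +2, ...) so that no two endpoints
// in the cluster end up with exactly the same token.
public class DcTokens {
    private static final BigInteger RING = BigInteger.valueOf(2).pow(127);

    public static void main(String[] args) {
        printTokens("DC1", 4, 0);   // 4 nodes in DC1, no offset
        printTokens("DC2", 4, 1);   // 4 nodes in DC2, bumped by +1
    }

    static void printTokens(String dc, int nodes, int offset) {
        for (int i = 0; i < nodes; i++) {
            BigInteger token = RING.multiply(BigInteger.valueOf(i))
                                   .divide(BigInteger.valueOf(nodes))
                                   .add(BigInteger.valueOf(offset));
            System.out.println(dc + " node " + i + ": " + token);
        }
    }
}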
Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5
Can you provide some details of the data returned when you do the get_range()? It will be interesting to see the raw bytes returned for the keys. The likely culprit is a change in the encoding. Can you also try to grab the bytes sent for the key when doing the single select that fails. You can grab these either on the client and/or by turning on DEBUG logging in conf/log4j-server.properties. Thanks Aaron On 4 May 2011, at 03:19, Henrik Schröder wrote: The way we solved this problem is that it turned out we had only a few hundred rows with unicode keys, so we simply extracted them, upgraded to 0.7, and wrote them back. However, this means that among the rows, there are a few hundred weird duplicate rows with identical keys. Is this going to be a problem in the future? Is there a chance that the good duplicate is cleaned out in favour of the bad duplicate so that we suddenly lose those rows again? /Henrik Schröder
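Aaron's "change in the encoding" suspicion is easy to demonstrate on the client side: the same unicode key serializes to different raw bytes under different charsets, so a 0.6-era client using the platform default encoding and a 0.7 client using UTF-8 effectively write two different keys. The charsets below are just examples for illustration:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: the same String key yields different raw bytes under
// different charsets. If an old client serialized keys with the platform
// default encoding and the new client uses UTF-8, the "same" key no longer
// maps to the same row, which looks like missing or duplicate rows.
public class KeyBytesDemo {
    public static void main(String[] args) {
        String key = "Schröder";
        dump(key, StandardCharsets.UTF_8);
        dump(key, StandardCharsets.ISO_8859_1);
        dump(key, Charset.defaultCharset());   // whatever the old client used
    }

    static void dump(String key, Charset cs) {
        StringBuilder hex = new StringBuilder();
        for (byte b : key.getBytes(cs)) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(cs.name() + ": " + hex.toString().trim());
    }
}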
Re: Replica data distributing between racks
On Tue, May 3, 2011 at 4:08 PM, Jonathan Ellis jbel...@gmail.com wrote: On Tue, May 3, 2011 at 2:46 PM, aaron morton aa...@thelastpickle.com wrote: Jonathan, I think you are saying each DC should have its own (logical) token ring. Right. (Only with NTS, although you'd usually end up with a similar effect if you alternate DC locations for nodes in an ONTS cluster.) But currently two endpoints cannot have the same token regardless of the DC they are in. Also right. Or should people just bump the tokens in extra DCs to avoid the collision? Yes. I am sorry, but I do not understand fully. I would appreciate it if someone could explain with more verbosity for me. I do not understand why data insertion is even, but replication is not. I do not understand how to solve the problem. What does bumping tokens entail? Is that going to change my insertion distribution? I had no idea you can create different logical keyspaces ... and I am not sure what that exactly means... or that I even want to do it. Is there a clear solution to fixing the problem I laid out, and getting replication data evenly distributed between racks in each DC? Sorry again for needing more verbosity - I am learning as I go with this stuff. I appreciate everyone's help. -Eric
Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)
On Tue, 2011-05-03 at 14:22 -0500, Jonathan Ellis wrote: Can you create a ticket? CASSANDRA-2598
Backup full cluster
Snapshot runs on a local node. How do I ensure I have a 'point in time' snapshot of the full cluster? Do I have to stop the writes on the full cluster and then snapshot all the nodes individually? Thanks.
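Since snapshots are per node, a "point in time" across the cluster in practice means triggering a snapshot with the same tag on every node as close together as possible; writes in flight during that window may land in some nodes' snapshots and not others, which timestamps normally let you reconcile, so stopping writes is usually not required. Below is a rough sketch of driving that over JMX (which is what nodetool uses); the host list, the JMX port, and the exact takeSnapshot operation name and signature are assumptions to verify against the StorageServiceMBean of your Cassandra version:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: ask every node to snapshot under the same tag, back to back.
// The operation name/signature ("takeSnapshot") is an assumption here and
// should be checked against the StorageServiceMBean of your version.
public class ClusterSnapshot {
    public static void main(String[] args) throws Exception {
        String[] hosts = { "10.0.0.1", "10.0.0.2", "10.0.0.3" }; // your nodes
        int jmxPort = 8080; // assumed 0.7-era default; later versions use 7199
        String tag = "pit-" + System.currentTimeMillis();

        for (String host : hosts) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
                // Empty keyspace list = snapshot everything on this node.
                conn.invoke(ss, "takeSnapshot",
                        new Object[] { tag, new String[0] },
                        new String[] { String.class.getName(), String[].class.getName() });
                System.out.println("snapshot '" + tag + "' taken on " + host);
            } finally {
                jmxc.close();
            }
        }
    }
}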
Re: MemtablePostFlusher with high number of pending calls?
Hm... peculiar. Post flush is not involved in compactions, right? May 2nd 01:06 - Out of disk 01:51 - Starts a mix of major and minor compactions on different column families It then starts a few minor compactions extra over the day, but given that there are more than 1000 sstables, and we are talking 3 minor compactions started, it is not normal I think. May 3rd 1 minor compaction started. When I checked today, there was a bunch of tmp files on the disk with last modify time from 01:something on may 2nd and 200GB empty disk... Definitely no compaction going on. Guess I will add some debug logging and see if I get lucky and run out of disk again. Terje On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis jbel...@gmail.com wrote: Compaction does, but flush didn't until https://issues.apache.org/jira/browse/CASSANDRA-2404 On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Yes, I realize that. I am bit curious why it ran out of disk, or rather, why I have 200GB empty disk now, but unfortunately it seems like we may not have had monitoring enabled on this node to tell me what happened in terms of disk usage. I also thought that compaction was supposed to resume (try again with less data) if it fails? Terje On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis jbel...@gmail.com wrote: post flusher is responsible for updating commitlog header after a flush; each task waits for a specific flush to complete, then does its thing. so when you had a flush catastrophically fail, its corresponding post-flush task will be stuck. On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Just some very tiny amount of writes in the background here (some hints spooled up on another node slowly coming in). No new data. I thought there was no exceptions, but I did not look far enough back in the log at first. Going back a bit further now however, I see that about 50 hours ago: ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:387,1,main] java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160) at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225) at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356) at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335) at org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [followed by a few more of those...] 
and then a bunch of these: ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main] java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Insufficient disk space to flush 40009184 bytes at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597) at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100) at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239) at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50) at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263) at
Re: Performance tests using stress testing tool
Thanks Peter. I believe I found the root cause: the switch that we used was bad. Now, on a 4 node cluster (each node has 1 quad-core CPU and 16 GB of RAM), I was able to get around 11,000 writes and 10,050 reads per second simultaneously (CPU usage is around 45% on all nodes; disk queue size is in the neighbourhood of 10). Is this in line with what you usually see with Cassandra? - Original Message - From: Peter Schuller To: user@cassandra.apache.org Sent: Friday, April 29, 2011 12:21 PM Subject: Re: Performance tests using stress testing tool Thanks Peter. I am using the Java version of the stress testing tool from the contrib folder. Is there any issue that I should be aware of? Do you recommend using pystress? I just saw Brandon file this: https://issues.apache.org/jira/browse/CASSANDRA-2578 Maybe that's it. -- / Peter Schuller
Decommissioning node is causing broken pipe error
Hi all, I ran decommission on a node in my 32 node cluster. After about an hour of streaming files to another node, I got this error on the node being decommissioned: INFO [MiscStage:1] 2011-05-03 21:49:00,235 StreamReplyVerbHandler.java (line 58) Need to re-stream file /raiddrive/MDR/MeterRecords-f-2283-Data.db to /10.206.63.208 ERROR [Streaming:1] 2011-05-03 21:49:01,580 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor java.lang.RuntimeException: java.io.IOException: Broken pipe at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:415) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:516) at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:105) at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:67) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more ERROR [Streaming:1] 2011-05-03 21:49:01,581 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[Streaming:1,1,main] java.lang.RuntimeException: java.io.IOException: Broken pipe at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:415) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:516) at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:105) at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:67) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more And this message on the node that it was streaming to: INFO [Thread-333] 2011-05-03 21:49:00,234 StreamInSession.java (line 121) Streaming of file /raiddrive/MDR/MeterRecords-f-2283-Data.db/(98605680685,197932763967) progress=49016107008/99327083282 - 49% from org.apache.cassandra.streaming.StreamInSession@33721219 failed: requesting a retry. I tried running decommission again (and running scrub + decommission), but I keep getting this error on the same file. I checked out the file and saw that it is a lot bigger than all the other sstables... 184GB instead of about 74MB. I haven't run a major compaction for a bit, so I'm trying to stream 658 sstables. I'm using Cassandra 0.7.4, I have two data directories (I know that's not good practice...), and all my nodes are on Amazon EC2. Any thoughts on what could be going on or how to prevent this? Thanks! Tamara This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.
Re: Problems recovering a dead node
Hi Aaron It has no data files whatsoever. The upgrade path is 0.7.4 to 0.7.5. It turns out the initial problem was the software RAID failing silently because of another faulty disk. Now that the storage is working, I brought up the node again, same IP, same token, and tried doing nodetool repair. All adjacent nodes have finished the streaming session, and now the node has a total of 248 GB of data. Is this normal when the load per node is about 18GB? Also, there are 1245 pending tasks. It's been compacting or rebuilding sstables for the last 8 hours non-stop. There are 2057 sstables in the data folder. Should I have done things differently, or is this the normal behaviour? Thanks! On Wed, 04-05-2011 at 07:54 +1200, aaron morton wrote: When you say it's clean, does that mean the node has no data files? After you replaced the disk, what process did you use to recover? Also, what version are you running and what's the recent upgrade history? Cheers Aaron On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote: Hi everyone. One of the nodes in my 6 node cluster died with disk failures. I have replaced the disks, and it's clean. It has the same configuration (same IP, same token). When I try to restart the node it starts to throw mmap underflow exceptions until it closes again. I tried setting io to standard, but it still fails. It gives errors about two decorated keys being different, and an EOFException. Here is an excerpt of the log: http://pastebin.com/ZXW1wY6T I can provide more info if needed. I'm at a loss here, so any help is appreciated. Thanks all for your time Héctor Izquierdo