Re: Get all keys from the cluster

2012-01-23 Thread aaron morton
If you want to keep the load out of the cassandra process and do the join 
against SQL offline, take a look at the bin/sstablekeys utility. It will let you 
output the keys in an sstable. You will need to do it for every sstable on 
every node, create the unique list and then check it against your SQL db thingy. 

The output will be hex encoded, and the process scans through the index file. 
So it will add some IO load, but it does ask the OS to skip the IO 
cache. 
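
For illustration, a rough sketch of that pipeline in Python -- not from the 
original mail, and the data path, MySQL credentials, table and column names 
are all placeholders for whatever your setup uses:

# Dump keys with bin/sstablekeys, hex-decode them, build the unique set,
# then check which ones are missing on the SQL side.
import glob
import subprocess
import binascii
import MySQLdb

unique_keys = set()
for sstable in glob.glob('/var/lib/cassandra/data/MyKeyspace/MyCF-*-Data.db'):
    out = subprocess.check_output(['bin/sstablekeys', sstable])
    for line in out.splitlines():
        hex_key = line.strip()
        if hex_key:
            unique_keys.add(binascii.unhexlify(hex_key))

db = MySQLdb.connect(host='localhost', user='user', passwd='pw', db='mydb')
cur = db.cursor()

keys_to_delete = []
for key in unique_keys:
    cur.execute('SELECT 1 FROM keys_to_keep WHERE cass_key = %s', (key,))
    if cur.fetchone() is None:
        keys_to_delete.append(key)   # not in SQL, so it is a delete candidate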

Alternatively just write a script that iterates over all the rows in the CF and 
does the lookup.
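
And a minimal sketch of that iterate-the-CF alternative, assuming pycassa as 
the client; keyspace, CF, MySQL table and column names are again placeholders:

import pycassa
import MySQLdb

pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
cf = pycassa.ColumnFamily(pool, 'MyCF')
cur = MySQLdb.connect(host='localhost', user='user', passwd='pw', db='mydb').cursor()

# get_range() pages over all rows; column_count=1 because only the keys matter here
for key, columns in cf.get_range(column_count=1):
    cur.execute('SELECT 1 FROM keys_to_keep WHERE cass_key = %s', (key,))
    if cur.fetchone() is None:
        cf.remove(key)   # or collect the keys and run the deletes in a later pass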

hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/01/2012, at 1:11 AM, Eric Czech wrote:

 Great!  I'm glad at least one of those ideas was helpful for you.
 
 That's a road we've travelled before and as one last suggestion that might 
 help, you could alter all client writers to cassandra beforehand so that they 
 write to BOTH keyspaces BEFORE beginning the SQL based transfer.  This might 
 help keep you from having to make multiple passes, unless I'm missing 
 something.
 
 On Sat, Jan 21, 2012 at 4:53 AM, Marcel Steinbach marcel.steinb...@chors.de 
 wrote:
 Thanks for your suggestions, Eric!
 
 One of the applications uses 1.5TB out of 1.8TB
 
 I'm sorry, maybe that statement was slightly ambiguous. I meant to say that 
 one application uses 1.5TB, while the others use 300GB, totalling 1.8TB of 
 data. Our total disk capacity, however, is at about 7 TB, so we're still far 
 from running out of disk space.
 
 Is there any way that you could do that lookup in reverse where you pull the 
 records from your SQL database, figure out which keys aren't necessary, and 
 then delete any unnecessary keys that may or may not exist in cassandra? 
 
 Unfortunately, that won't work since the SQL db only contains the keys 
 that we want to _keep_ in cassandra.
 
 If that's not a possibility, then what about creating the same Cassandra 
 schema in a different keyspace and copying all the relevant records from the 
 current keyspace to the new keyspace using the SQL database records as a 
 basis for what is actually relevant within the new keyspace.  
 
 I like that idea. So instead of iterating over all cassandra rows, I would 
 iterate over the SQL DB, which would indeed save me a lot of IO. However, 
 rows inserted into my CF while iterating over the SQL DB might not be copied 
 into the new keyspace. But maybe we could arrange to do that 
 during low-demand hours to minimize the amount of new inserts and 
 additionally run the copy a second time with a select on newly inserted SQL 
 rows. So we'll probably go with that.
 
 Thanks again for your help!
 
 Cheers
 Marcel
 
 On 21.01.2012, at 11:52, Eric Czech wrote:
 
 Is there any way that you could do that lookup in reverse where you pull the 
 records from your SQL database, figure out which keys aren't necessary, and 
 then delete any unnecessary keys that may or may not exist in cassandra?  
 
 If that's not a possibility, then what about creating the same Cassandra 
 schema in a different keyspace and copying all the relevant records from the 
 current keyspace to the new keyspace using the SQL database records as a 
 basis for what is actually relevant within the new keyspace.  If you could 
 perform that transfer, then you could just delete the old 1.5TB keyspace 
 altogether, leaving only the data you need.  If that sort of duplication 
 would put you over the 1.8TB limit during the transfer, then maybe you could 
 consider CF compression upfront.
 
 Short of that, I can tell you from experience that doing these sorts of 
 left-join deletes from cassandra to SQL really suck.  We have had to resort to 
 using hadoop to do this but since our hadoop/cassandra clusters are much 
 larger than our single SQL instances, keeping all the hadoop processes from 
 basically DDoSing our SQL servers while still making the process faster 
 than thrift iterations over all the rows (via custom programs) in cassandra 
 hasn't been a convincing solution.
 
 I'd say that the first solution I proposed is definitely the best, but also 
 the most unrealistic.  If that's really not a possibility for you, then I'd 
 seriously look at trying to make my second suggestion work even if it means 
 bringing up new hardware or increasing the capacity of existing resources.  
 That second suggestion also has the added benefit of likely minimizing I/O 
 since it's the only solution that doesn't require reading or deleting any of 
 the unnecessary data (beyond wholesale keyspace or CF deletions) assuming 
 that the actually relevant portion of your data is significantly less than 
 1.5TB.  
 
 I hope that helps!
 
 And in the future, you should really try to avoid letting your data size get 
 beyond 40 - 50 % of your actual on-disk capacity.  Let me know if anyone in 
 the community disagrees, but I'd say you're about 600 GB past the point at 
 which you have a lot of easy outs -- but I hope you find one anyways!

Get all keys from the cluster

2012-01-21 Thread Marcel Steinbach
We're running an 8-node cluster with different CFs for different applications. 
One of the applications uses 1.5TB out of 1.8TB in total, but only because we 
started out without a deletion mechanism and implemented one later on. So there is 
probably a high amount of old data in there that we don't even use anymore. 

Now we want to delete that data. To know which rows we may delete, we have to 
look up a SQL database. If the key is not in there anymore, we may delete that 
row in cassandra, too. 

This basically means we have to iterate over all the rows in that CF. This 
kind of begs for hadoop, but that doesn't seem to be an option currently. I tried.

So we figured we could run over the sstable files (maybe only the index), 
check the keys in MySQL, and later run the deletes on the cluster. This 
way, we could iterate on each node in parallel. 
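
For what it's worth, the later delete pass could look roughly like this with 
pycassa, assuming each node has produced a plain-text file of keys that were 
not found in MySQL; the keyspace, CF name and file name are placeholders:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', server_list=['node1:9160'])
cf = pycassa.ColumnFamily(pool, 'MyCF')

# batch() buffers mutations client-side and sends them in chunks
batch = cf.batch(queue_size=200)
with open('keys_to_delete.txt') as f:
    for line in f:
        key = line.strip()
        if key:
            batch.remove(key)
batch.send()   # flush whatever is still buffered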

Does that sound reasonable? Any pros/cons, maybe a killer argument to use 
hadoop for that?

Cheers
Marcel

Re: Get all keys from the cluster

2012-01-21 Thread Eric Czech
Is there any way that you could do that lookup in reverse where you pull
the records from your SQL database, figure out which keys aren't necessary,
and then delete any unnecessary keys that may or may not exist in
cassandra?

If that's not a possibility, then what about creating the same Cassandra
schema in a different keyspace and copying all the relevant records from
the current keyspace to the new keyspace using the SQL database records as
a basis for what is actually relevant within the new keyspace.  If you
could perform that transfer, then you could just delete the old 1.5TB
keyspace altogether, leaving only the data you need.  If that sort of
duplication would put you over the 1.8TB limit during the transfer, then
maybe you could consider CF compression upfront.
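
For concreteness, a hedged sketch of that SQL-driven copy with pycassa; the 
keyspace, CF, MySQL table and column names are assumptions, not anything from 
this thread:

# Copy only the rows whose keys are listed in the SQL db from the old
# keyspace into the new one.
import pycassa
import MySQLdb

old_cf = pycassa.ColumnFamily(
    pycassa.ConnectionPool('OldKeyspace', server_list=['node1:9160']), 'MyCF')
new_cf = pycassa.ColumnFamily(
    pycassa.ConnectionPool('NewKeyspace', server_list=['node1:9160']), 'MyCF')

cur = MySQLdb.connect(host='localhost', user='user', passwd='pw', db='mydb').cursor()
cur.execute('SELECT cass_key FROM keys_to_keep')

for (key,) in cur:
    try:
        # default column_count is 100, so very wide rows would need column paging
        columns = old_cf.get(key)
    except pycassa.NotFoundException:
        continue   # key listed in SQL but already gone from cassandra
    new_cf.insert(key, columns)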

Short of that, I can tell you from experience that doing these sorts of
left-join deletes from cassandra to SQL really suck.  We have had to resort to
using hadoop to do this but since our hadoop/cassandra clusters are much
larger than our single SQL instances, keeping all the hadoop processes from
basically DDoSing our SQL servers while still making the process faster
than thrift iterations over all the rows (via custom programs) in cassandra
hasn't been a convincing solution.

I'd say that the first solution I proposed is definitely the best, but also
the most unrealistic.  If that's really not a possibility for you, then I'd
seriously look at trying to make my second suggestion work even if it means
bringing up new hardware or increasing the capacity of existing resources.
 That second suggestion also has the added benefit of likely minimizing I/O
since it's the only solution that doesn't require reading or deleting any
of the unnecessary data (beyond wholesale keyspace or CF deletions)
assuming that the actually relevant portion of your data is significantly
less than 1.5TB.

I hope that helps!

And in the future, you should really try to avoid letting your data size
get beyond 40 - 50 % of your actual on-disk capacity.  Let me know if
anyone in the community disagrees, but I'd say you're about 600 GB past the
point at which you have a lot of easy outs -- but I hope you find one
anyways!


On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach marcel.steinb...@chors.de
 wrote:

 We're running an 8-node cluster with different CFs for different
 applications. One of the applications uses 1.5TB out of 1.8TB in total, but
 only because we started out without a deletion mechanism and implemented one
 later on. So there is probably a high amount of old data in there that we
 don't even use anymore.

 Now we want to delete that data. To know which rows we may delete, we
 have to look up a SQL database. If the key is not in there anymore, we may
 delete that row in cassandra, too.

 This basically means we have to iterate over all the rows in that CF.
 This kind of begs for hadoop, but that doesn't seem to be an option
 currently. I tried.

 So we figured we could run over the sstable files (maybe only the
 index), check the keys in MySQL, and later run the deletes on the
 cluster. This way, we could iterate on each node in parallel.

 Does that sound reasonable? Any pros/cons, maybe a killer argument to
 use hadoop for that?

 Cheers
 Marcel



Re: Get all keys from the cluster

2012-01-21 Thread Marcel Steinbach
Thanks for your suggestions, Eric!

 One of the applications uses 1.5TB out of 1.8TB

I'm sorry, maybe that statement was slightly ambiguous. I meant to say that one 
application uses 1.5TB, while the others use 300GB, totalling 1.8TB of data. 
Our total disk capacity, however, is at about 7 TB, so we're still far from 
running out of disk space.

 Is there any way that you could do that lookup in reverse where you pull the 
 records from your SQL database, figure out which keys aren't necessary, and 
 then delete any unnecessary keys that may or may not exist in cassandra? 
Unfortunately, that won't work since the SQL db only contains the keys 
that we want to _keep_ in cassandra.

 If that's not a possibility, then what about creating the same Cassandra 
 schema in a different keyspace and copying all the relevant records from the 
 current keyspace to the new keyspace using the SQL database records as a 
 basis for what is actually relevant within the new keyspace.  
I like that idea. So instead of iterating over all cassandra rows, I would 
iterate over the SQL DB, which would indeed save me a lot of IO. However, rows 
inserted into my CF while iterating over the SQL DB might not be copied into 
the new keyspace. But maybe we could arrange to do that 
during low-demand hours to minimize the amount of new inserts and additionally 
run the copy a second time with a select on newly inserted SQL rows. So we'll 
probably go with that.

Thanks again for your help!

Cheers
Marcel

On 21.01.2012, at 11:52, Eric Czech wrote:

 Is there any way that you could do that lookup in reverse where you pull the 
 records from your SQL database, figure out which keys aren't necessary, and 
 then delete any unnecessary keys that may or may not exist in cassandra?  
 
 If that's not a possibility, then what about creating the same Cassandra 
 schema in a different keyspace and copying all the relevant records from the 
 current keyspace to the new keyspace using the SQL database records as a 
 basis for what is actually relevant within the new keyspace.  If you could 
 perform that transfer, then you could just delete the old 1.5TB keyspace 
 altogether, leaving only the data you need.  If that sort of duplication 
 would put you over the 1.8TB limit during the transfer, then maybe you could 
 consider CF compression upfront.
 
 Short of that, I can tell you from experience that doing these sorts of 
 left-join deletes from cassandra to SQL really suck.  We have had to resort to 
 using hadoop to do this but since our hadoop/cassandra clusters are much 
 larger than our single SQL instances, keeping all the hadoop processes from 
 basically DDoSing our SQL servers while still making the process faster 
 than thrift iterations over all the rows (via custom programs) in cassandra 
 hasn't been a convincing solution.
 
 I'd say that the first solution I proposed is definitely the best, but also 
 the most unrealistic.  If that's really not a possibility for you, then I'd 
 seriously look at trying to make my second suggestion work even if it means 
 bringing up new hardware or increasing the capacity of existing resources.  
 That second suggestion also has the added benefit of likely minimizing I/O 
 since it's the only solution that doesn't require reading or deleting any of 
 the unnecessary data (beyond wholesale keyspace or CF deletions) assuming 
 that the actually relevant portion of your data is significantly less than 
 1.5TB.  
 
 I hope that helps!
 
 And in the future, you should really try to avoid letting your data size get 
 beyond 40 - 50 % of your actual on-disk capacity.  Let me know if anyone in 
 the community disagrees, but I'd say you're about 600 GB past the point at 
 which you have a lot of easy outs -- but I hope you find one anyways!
 
 
 On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach marcel.steinb...@chors.de 
 wrote:
 We're running an 8-node cluster with different CFs for different applications. 
 One of the applications uses 1.5TB out of 1.8TB in total, but only because we 
 started out without a deletion mechanism and implemented one later on. So there 
 is probably a high amount of old data in there that we don't even use 
 anymore.
 
 Now we want to delete that data. To know which rows we may delete, we have 
 to look up a SQL database. If the key is not in there anymore, we may delete 
 that row in cassandra, too.
 
 This basically means we have to iterate over all the rows in that CF. This 
 kind of begs for hadoop, but that doesn't seem to be an option currently. I 
 tried.
 
 So we figured we could run over the sstable files (maybe only the index), 
 check the keys in MySQL, and later run the deletes on the cluster. This 
 way, we could iterate on each node in parallel.
 
 Does that sound reasonable? Any pros/cons, maybe a killer argument to use 
 hadoop for that?
 
 Cheers
 Marcel

Re: Get all keys from the cluster

2012-01-21 Thread Eric Czech
Great!  I'm glad at least one of those ideas was helpful for you.

That's a road we've travelled before and as one last suggestion that might
help, you could alter all client writers to cassandra beforehand so that
they write to BOTH keyspaces BEFORE beginning the SQL based transfer.  This
might help keep you from having to make multiple passes, unless I'm
missing something.
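
A minimal sketch of that dual-write idea, assuming pycassa and placeholder
keyspace/CF names (again, not something from the thread):

import pycassa

class DualWriter(object):
    """Thin wrapper the client code writes through, so every insert lands in
    both the old and the new keyspace while the SQL-driven copy is running."""

    def __init__(self, servers=None):
        servers = servers or ['node1:9160']
        old_pool = pycassa.ConnectionPool('OldKeyspace', server_list=servers)
        new_pool = pycassa.ConnectionPool('NewKeyspace', server_list=servers)
        self.old_cf = pycassa.ColumnFamily(old_pool, 'MyCF')
        self.new_cf = pycassa.ColumnFamily(new_pool, 'MyCF')

    def insert(self, key, columns):
        self.old_cf.insert(key, columns)   # existing write path
        self.new_cf.insert(key, columns)   # the new keyspace gets the write too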

On Sat, Jan 21, 2012 at 4:53 AM, Marcel Steinbach marcel.steinb...@chors.de
 wrote:

Thanks for your suggestions, Eric!

 One of the applications uses 1.5TB out of 1.8TB

 I'm sorry, maybe that statement was slightly ambiguous. I meant to say
 that one application uses 1.5TB, while the others use 300GB, totalling
 1.8TB of data. Our total disk capacity, however, is at about 7 TB, so we're
 still far from running out of disk space.

 Is there any way that you could do that lookup in reverse where you pull
 the records from your SQL database, figure out which keys aren't necessary,
 and then delete any unnecessary keys that may or may not exist in
 cassandra?

 Unfortunately, that won't work since the SQL db only contains the
 keys that we want to _keep_ in cassandra.

 If that's not a possibility, then what about creating the same Cassandra
 schema in a different keyspace and copying all the relevant records from
 the current keyspace to the new keyspace using the SQL database records as
 a basis for what is actually relevant within the new keyspace.

 I like that idea. So instead of iterating over all cassandra rows, I would
 iterate over the SQL DB, which would indeed save me a lot of IO. However,
 rows inserted into my CF while iterating over the SQL DB might not be
 copied into the new keyspace. But maybe we could arrange to do that
 during low-demand hours to minimize the amount of new inserts and
 additionally run the copy a second time with a select on newly inserted
 SQL rows. So we'll probably go with that.

 Thanks again for your help!

 Cheers
 Marcel

 On 21.01.2012, at 11:52, Eric Czech wrote:

 Is there any way that you could do that lookup in reverse where you pull
 the records from your SQL database, figure out which keys aren't necessary,
 and then delete any unnecessary keys that may or may not exist in
 cassandra?

 If that's not a possibility, then what about creating the same Cassandra
 schema in a different keyspace and copying all the relevant records from
 the current keyspace to the new keyspace using the SQL database records as
 a basis for what is actually relevant within the new keyspace.  If you
 could perform that transfer, then you could just delete the old 1.5TB
 keyspace altogether, leaving only the data you need.  If that sort of
 duplication would put you over the 1.8TB limit during the transfer, then
 maybe you could consider CF compression upfront.

 Short of that, I can tell you from experience that doing these sorts of
 left-join deletes from cassandra to SQL really suck.  We have had to resort to
 using hadoop to do this but since our hadoop/cassandra clusters are much
 larger than our single SQL instances, keeping all the hadoop processes from
 basically DDoSing our SQL servers while still making the process faster
 than thrift iterations over all the rows (via custom programs) in cassandra
 hasn't been a convincing solution.

 I'd say that the first solution I proposed is definitely the best, but
 also the most unrealistic.  If that's really not a possibility for you,
 then I'd seriously look at trying to make my second suggestion work even if
 it means bringing up new hardware or increasing the capacity of existing
 resources.  That second suggestion also has the added benefit of likely
 minimizing I/O since it's the only solution that doesn't require reading or
 deleting any of the unnecessary data (beyond wholesale keyspace or CF
 deletions) assuming that the actually relevant portion of your data is
 significantly less than 1.5TB.

 I hope that helps!

 And in the future, you should really try to avoid letting your data size
 get beyond 40 - 50 % of your actual on-disk capacity.  Let me know if
 anyone in the community disagrees, but I'd say you're about 600 GB past the
 point at which you have a lot of easy outs -- but I hope you find one
 anyways!


 On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach 
 marcel.steinb...@chors.de wrote:

 We're running an 8-node cluster with different CFs for different
 applications. One of the applications uses 1.5TB out of 1.8TB in total, but
 only because we started out without a deletion mechanism and implemented one
 later on. So there is probably a high amount of old data in there that we
 don't even use anymore.

 Now we want to delete that data. To know which rows we may delete, we
 have to look up a SQL database. If the key is not in there anymore, we may
 delete that row in cassandra, too.

 This basically means we have to iterate over all the rows in that CF.
 This kind of begs for hadoop, but that doesn't seem to be an option
 currently. I tried.