Thanks for your suggestions, Eric!

> One of the applications uses 1.5TB out of 1.8TB

I'm sorry, maybe that statement was slightly ambiguous. I meant to say that one 
application uses 1.5TB while the others use 300GB, totalling 1.8TB of data. 
Our total disk capacity, however, is about 7TB, so we're still far from 
running out of disk space.

> Is there any way that you could do that lookup in reverse where you pull the 
> records from your SQL database, figure out which keys aren't necessary, and 
> then delete any unnecessary keys that may or may not exist in cassandra? 
Unfortunately, that won't work, since the SQL DB only contains the keys that 
we want to _keep_ in Cassandra.
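
To illustrate the problem: the SQL DB gives us the keep-set, but the delete-set 
is the set difference taken from the other side, which can't be computed without 
first enumerating every key in Cassandra. A toy sketch (all data is made up):

```python
# The SQL DB only tells us which keys to KEEP.
keep_keys = {"a", "c"}

# The delete-set is everything in Cassandra that is NOT in the keep-set.
# That difference can't be computed without first walking all of
# Cassandra's keys -- which is exactly the expensive iteration we want
# to avoid.
cassandra_keys = {"a", "b", "c", "d"}
to_delete = cassandra_keys - keep_keys
assert to_delete == {"b", "d"}
```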

> If that's not a possibility, then what about creating the same Cassandra 
> schema in a different keyspace and copying all the relevant records from the 
> current keyspace to the new keyspace using the SQL database records as a 
> basis for what is actually "relevant" within the new keyspace.  
I like that idea. Instead of iterating over all Cassandra rows, I would 
iterate over the SQL DB, which would indeed save me a lot of IO. However, rows 
inserted into my CF while I'm iterating over the SQL DB might not be copied 
into the new keyspace. But maybe we could arrange to do that during 
low-demand hours to minimize the number of new inserts, and additionally 
run the "copy" a second time with a select on newly inserted SQL rows. So 
we'll probably go with that.
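
The two-pass copy could be sketched roughly like this. Plain dicts stand in 
for the old/new keyspaces here, and the names (copy_pass, old_cf, new_cf) are 
illustrative, not real client APIs:

```python
def copy_pass(keep_keys, old_cf, new_cf):
    """Copy every row whose key the SQL DB says we want to keep."""
    for key in keep_keys:
        if key in old_cf and key not in new_cf:
            new_cf[key] = old_cf[key]

# --- first pass, run during low-demand hours ---
old_cf = {"a": 1, "b": 2, "stale": 99}   # stand-in for the old keyspace
new_cf = {}                              # stand-in for the new keyspace
copy_pass(["a", "b"], old_cf, new_cf)    # keys selected from the SQL DB

# a row inserted while the first pass was running
old_cf["c"] = 3

# --- second pass: select only SQL rows inserted since the first pass ---
copy_pass(["c"], old_cf, new_cf)

assert new_cf == {"a": 1, "b": 2, "c": 3}   # "stale" was never copied
```

In the real version, the second pass would select on an insertion timestamp or 
auto-increment ID in the SQL table to find rows added after the first pass 
started.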

Thanks again for your help!

Cheers
Marcel

On 21.01.2012, at 11:52, Eric Czech wrote:

> Is there any way that you could do that lookup in reverse where you pull the 
> records from your SQL database, figure out which keys aren't necessary, and 
> then delete any unnecessary keys that may or may not exist in cassandra?  
> 
> If that's not a possibility, then what about creating the same Cassandra 
> schema in a different keyspace and copying all the relevant records from the 
> current keyspace to the new keyspace using the SQL database records as a 
> basis for what is actually "relevant" within the new keyspace.  If you could 
> perform that transfer, then you could just delete the old 1.5TB keyspace 
> altogether, leaving only the data you need.  If that sort of duplication 
> would put you over the 1.8TB limit during the transfer, then maybe you could 
> consider CF compression upfront.
> 
> Short of that, I can tell you from experience that doing this sort of "left 
> join" delete from cassandra to SQL really sucks.  We have had to resort to 
> using hadoop to do this but since our hadoop/cassandra clusters are much 
> larger than our single SQL instances, keeping all the hadoop processes from 
> basically "DDoS"ing our SQL servers while still making the process faster 
> than thrift iterations over all the rows (via custom programs) in cassandra 
> hasn't been a convincing solution.
> 
> I'd say that the first solution I proposed is definitely the best, but also 
> the most unrealistic.  If that's really not a possibility for you, then I'd 
> seriously look at trying to make my second suggestion work even if it means 
> bringing up new hardware or increasing the capacity of existing resources.  
> That second suggestion also has the added benefit of likely minimizing I/O 
> since it's the only solution that doesn't require reading or deleting any of 
> the unnecessary data (beyond wholesale keyspace or CF deletions) assuming 
> that the actually relevant portion of your data is significantly less than 
> 1.5TB.  
> 
> I hope that helps!
> 
> And in the future, you should really try to avoid letting your data size get 
> beyond 40 - 50 % of your actual on-disk capacity.  Let me know if anyone in 
> the community disagrees, but I'd say you're about 600 GB past the point at 
> which you have a lot of easy outs -- but I hope you find one anyways!
> 
> 
> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <marcel.steinb...@chors.de> 
> wrote:
> We're running an 8 node cluster with different CFs for different applications. 
> One of the applications uses 1.5TB out of 1.8TB in total, but only because we 
> started out without a deletion mechanism and implemented one later on. So 
> there is probably a large amount of old data in there that we don't even use 
> anymore.
> 
> Now we want to delete that data. To know which rows we may delete, we have 
> to look them up in a SQL database. If a key is no longer in there, we may 
> delete that row in cassandra, too.
> 
> This basically means we have to iterate over all the rows in that CF. This 
> kind of begs for hadoop, but that doesn't seem to be an option currently. I 
> tried.
> 
> So we figured we could run over the sstable files (maybe only the index), 
> check the keys against the MySQL DB, and later run the deletes on the 
> cluster. This way, we could iterate on each node in parallel.
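
[The per-node filtering step of this plan could be sketched as follows. Assume 
the key list per node comes from something like Cassandra's `sstablekeys` tool; 
here both the node's keys and the MySQL keep-set are faked as plain lists, and 
the key format is made up:]

```python
def keys_to_delete(sstable_keys, keep_keys):
    """Return the keys present on this node but absent from the SQL DB."""
    keep = set(keep_keys)           # O(1) membership checks
    return [k for k in sstable_keys if k not in keep]

# keys dumped from one node's sstables (stand-in data)
node_keys = ["user:1", "user:2", "user:3", "user:4"]
# keys the SQL DB still references (stand-in data)
mysql_keys = ["user:2", "user:4"]

doomed = keys_to_delete(node_keys, mysql_keys)
assert doomed == ["user:1", "user:3"]
```

[Each node would produce its own delete list in parallel, and the actual 
deletes would then be issued as batched remove operations against the 
cluster.]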
> 
> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use 
> hadoop for that?
> 
> Cheers
> Marcel
> 
