Is there any way that you could do that lookup in reverse where you pull
the records from your SQL database, figure out which keys aren't necessary,
and then delete any unnecessary keys that may or may not exist in
cassandra?

If that's not a possibility, then what about creating the same Cassandra
schema in a different keyspace and copying all the relevant records from
the current keyspace to the new keyspace using the SQL database records as
a basis for what is actually "relevant" within the new keyspace.  If you
could perform that transfer, then you could just delete the old 1.5TB
keyspace altogether, leaving only the data you need.  If that sort of
duplication would put you over the 1.8TB limit during the transfer, then
maybe you could consider CF compression upfront.

Short of that, I can tell from experience that doing these sort of "left
join" deletes from cassandra to SQL really suck.  We have had to resort to
using hadoop to do this but since our hadoop/cassandra clusters are much
larger than our single SQL instances, keeping all the hadoop processes from
basically "DDoS"ing our SQL servers while still making the process faster
than thrift iterations over all the rows (via custom programs) in cassandra
hasn't been a convincing solution.

I'd say that the first solution I proposed is definitely the best, but also
the most unrealistic.  If that's really not a possibility for you, then I'd
seriously look at trying to make my second suggestion work even if it means
brining up new hardware or increasing the capacity of existing resources.
 That second suggestion also has the added benefit of likely minimizing I/O
since it's the only solution that doesn't require reading or deleting any
of the unnecessary data (beyond wholesale keyspace or CF deletions)
assuming that the actually relevant portion of your data is significantly
less than 1.5TB.

I hope that helps!

And in the future, you should really try to avoid letting your data size
get beyond 40 - 50 % of your actual on-disk capacity.  Let me know if
anyone in the community disagrees, but I'd say you're about 600 GB past the
point at which you have a lot of easy outs -- but I hope you find one
anyways!


On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <marcel.steinb...@chors.de
> wrote:

> We're running a 8 node cluster with different CFs for different
> applications. One of the application uses 1.5TB out of 1.8TB in total, but
> only because we started out with a deletion mechanism and implemented one
> later on. So there is probably a high amount of old data in there, that we
> don't even use anymore.
>
> Now we want to delete that data. To know, which rows we may delete, we
> have to lookup a SQL database. If the key is not in there anymore, we may
> delete that row in cassandra, too.
>
> This basically means, we have to iterate over all the rows in that CF.
> This kind of begs for hadoop, but that seems not to be an option,
> currently. I tried.
>
> So we figured, we could run over the sstables files (maybe only the
> index), check the keys in the mysql, and later run the deletes on the
> cluster. This way, we could iterate on each node in parallel.
>
> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to
> use hadoop for that?
>
> Cheers
> Marcel
> <hr style="border-color:blue">
> <p>chors GmbH
> <br><hr style="border-color:blue">
> <p>specialists in digital and direct marketing solutions<br>
> Haid-und-Neu-Straße 7<br>
> 76131 Karlsruhe, Germany<br>
> www.chors.com</p>
> <p>Managing Directors: Dr. Volker Hatz, Markus Plattner<br>Amtsgericht
> Montabaur, HRB 15029</p>
> <p style="font-size:9px">This e-mail is for the intended recipient only
> and may contain confidential or privileged information. If you have
> received this e-mail by mistake, please contact us immediately and
> completely delete it (and any attachments) and do not forward it or inform
> any other person of its contents. If you send us messages by e-mail, we
> take this as your authorization to correspond with you by e-mail. E-mail
> transmission cannot be guaranteed to be secure or error-free as information
> could be intercepted, amended, corrupted, lost, destroyed, arrive late or
> incomplete, or contain viruses. Neither chors GmbH nor the sender accept
> liability for any errors or omissions in the content of this message which
> arise as a result of its e-mail transmission. Please note that all e-mail
> communications to and from chors GmbH may be monitored.</p>
>

Reply via email to