Hi,

I am going to do the $subject. The plan is to use map-reduce jobs for
the purging and archival process, so that it can handle purging/archiving
large amounts of data. This feature will allow users to purge/archive
data manually by specifying a duration, and also to purge/archive older
data automatically (keeping only the data of the last N days). The user
can select the stream name, the version, and the duration of the data to
be archived.

Here is the model we came up with to achieve the above functionality:

1. Select the data within the specified time duration and insert it into
an archive column family (using a Hive query; see the first sketch after
this list).
2. Insert the row keys of the selected data into a temporary CF (using a
Hive query; see the second sketch). If there is a large amount of data to
archive, we can't keep all the row keys in memory, so we write them to a
temporary CF instead.
3. Write a class analyzer that reads the row keys from the temporary CF
and deletes the corresponding data in the original CF (using custom
map-reduce jobs); finally, we drop the temporary CF (see the third
sketch).
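
To make the steps concrete, here are rough sketches, one per step,
assuming the Hive-Cassandra storage handler that BAM's Hive setup uses;
every keyspace, CF, column, and variable name below is a made-up
placeholder, not the final design. For step 1, map the original CF and
an archive CF as Hive external tables, then copy the rows that fall
inside the user-given duration:

    -- Original CF (names/columns are illustrative placeholders).
    CREATE EXTERNAL TABLE IF NOT EXISTS PhoneSalesStream
      (rowkey STRING, payload STRING, event_ts BIGINT)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.host" = "127.0.0.1", "cassandra.port" = "9160",
        "cassandra.ks.name" = "EVENT_KS",
        "cassandra.cf.name" = "phone_sales",
        "cassandra.columns.mapping" = ":key,payload_data,Timestamp");

    -- Archive CF with the same layout.
    CREATE EXTERNAL TABLE IF NOT EXISTS PhoneSalesArchive
      (rowkey STRING, payload STRING, event_ts BIGINT)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.host" = "127.0.0.1", "cassandra.port" = "9160",
        "cassandra.ks.name" = "EVENT_KS",
        "cassandra.cf.name" = "phone_sales_archive",
        "cassandra.columns.mapping" = ":key,payload_data,Timestamp");

    -- Step 1: copy the selected duration into the archive CF.
    INSERT OVERWRITE TABLE PhoneSalesArchive
      SELECT rowkey, payload, event_ts FROM PhoneSalesStream
      WHERE event_ts >= ${hiveconf:archive_from}
        AND event_ts <  ${hiveconf:archive_to};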
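
For step 2, the same selection writes only the row keys into the
temporary CF, so they never have to be held in memory:

    -- Temporary CF that holds just the row keys to delete.
    CREATE EXTERNAL TABLE IF NOT EXISTS PhoneSalesTmpKeys
      (rowkey STRING, marker STRING)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES (
        "cassandra.host" = "127.0.0.1", "cassandra.port" = "9160",
        "cassandra.ks.name" = "EVENT_KS",
        "cassandra.cf.name" = "phone_sales_tmp_keys",
        "cassandra.columns.mapping" = ":key,marker");

    -- Step 2: record the row keys of the rows being archived.
    INSERT OVERWRITE TABLE PhoneSalesTmpKeys
      SELECT rowkey, 'x' FROM PhoneSalesStream
      WHERE event_ts >= ${hiveconf:archive_from}
        AND event_ts <  ${hiveconf:archive_to};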
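
For step 3, the class analyzer itself would be a Java class that scans
the temporary CF and issues the deletes through a custom map-reduce
job; inside the Hive script it would appear roughly as follows (both
the directive syntax and the class name here are my assumptions, not a
finalized API):

    -- Assumed class-analyzer directive; the Java class behind it reads
    -- the row keys from phone_sales_tmp_keys, deletes those rows from
    -- phone_sales via a custom map-reduce job, then drops the temp CF.
    class 'org.wso2.carbon.bam.archive.CassandraPurgeAnalyzer';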


Since a class analyzer can be included in a Hive script, the whole
purge/archive operation can be done with a single Hive script that is
generated programmatically (a skeleton is sketched below). A further
advantage is that we can reuse the scheduling functionality already
implemented for Hive scripts.
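
As an illustration, the generated script would just chain the pieces
above together (again with placeholder names, and assuming the
external-table mappings from the earlier sketches are declared at the
top of the script):

    -- One script generated per stream name/version/duration.
    INSERT OVERWRITE TABLE PhoneSalesArchive
      SELECT rowkey, payload, event_ts FROM PhoneSalesStream
      WHERE event_ts >= ${hiveconf:archive_from}
        AND event_ts <  ${hiveconf:archive_to};

    INSERT OVERWRITE TABLE PhoneSalesTmpKeys
      SELECT rowkey, 'x' FROM PhoneSalesStream
      WHERE event_ts >= ${hiveconf:archive_from}
        AND event_ts <  ${hiveconf:archive_to};

    -- Assumed class-analyzer directive (see step 3 above).
    class 'org.wso2.carbon.bam.archive.CassandraPurgeAnalyzer';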

This is the model we came up with for archiving/purging Cassandra data
in BAM. If you have any concerns, please raise them.

Thanks,
KasunW.