Hi,

I am going to work on $subject. The plan is to use map-reduce jobs for the purging and archival process, so that it can handle purging/archiving large amounts of data. This feature will be able to purge/archive data manually, by specifying a duration, and also to purge/archive older data automatically (keeping only the data of the last N days). The user can select the stream name, the version, and the duration of the data to be archived.
Here is the model we came up with to achieve the above functionality:

1. Select the data in the specified time duration and insert it into an archive column family (using a Hive query).
2. Insert the row keys of the selected data into a temporary CF (using a Hive query). If there is a large amount of data to archive, we cannot keep all the row keys in memory, so we persist them in a temporary CF instead.
3. Write a class analyzer that reads the row keys from the temporary CF and deletes the corresponding data from the original CF (using custom map-reduce jobs). Finally, the temporary CF is deleted.

Since the class analyzer can be included in a Hive script, the whole purge/archive operation can be done with a single Hive script that is generated programmatically. As an added advantage, we can reuse the scheduling functionality already implemented for Hive scripts.

This is the model we came up with for archiving/purging Cassandra data in BAM. If you have any concerns, please raise them.

Thanks,
KasunW.
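P.S. For concreteness, a generated script for steps 1-3 might look roughly like the sketch below. All table/CF names, column names, and timestamp values here are illustrative assumptions, not the actual BAM schema; step 3 is shown only as a comment, since it is performed by the custom class analyzer rather than by plain HiveQL.

```sql
-- Illustrative sketch only: names and columns are assumptions.

-- Step 1: copy the rows in the requested time range into an archive CF.
INSERT INTO TABLE stream_data_archive
SELECT * FROM stream_data
WHERE version = '1.0.0'
  AND payload_timestamp BETWEEN 1356998400000 AND 1359676799000;

-- Step 2: store only the row keys of the same rows in a temporary CF,
-- so the delete phase never needs to hold them all in memory.
INSERT INTO TABLE stream_data_purge_tmp
SELECT rowkey FROM stream_data
WHERE version = '1.0.0'
  AND payload_timestamp BETWEEN 1356998400000 AND 1359676799000;

-- Step 3: a custom class analyzer, invoked from the same generated script,
-- streams the keys from stream_data_purge_tmp, deletes those rows from the
-- original CF via a map-reduce job, and then drops stream_data_purge_tmp.
```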
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev