Hi,

I've been seeing certain versions of data go "missing", or less data than expected, in my tables. I've read through quite a few old threads and I think I understand what is actually happening, but I wanted to confirm my thinking. I hope I am not rehashing too many old topics here.
First, I am using the timestamp as an added dimension to my data. Basically, it's log data, and the timestamp of each log entry is used as the timestamp of its cell. I want to keep all the data, so I have max versions for the column set to Long.MAX_VALUE.

For my main table there should be at most one cell per second, so the versioning there is working as expected. However, in my other tables I could have many cells per second, possibly thousands. My understanding now is that if two cells have exactly the same timestamp, one of them will be removed at major compaction (which would explain why data had seemed to be "missing" to me). The values are different, so I end up losing data when the duplicate-timestamped cells are removed. To solve this, I plan to use microseconds, which should avoid duplicate timestamps so that no cells are removed at major compaction.

However, I'm wondering if I have another problem. To insert data I run a MapReduce job, and the data is not inserted in chronological order. If timestamps are inserted in random order, could this result in cells being removed at major compaction? I think I recall something to the effect that if a cell with an earlier timestamp is inserted after a cell with a later timestamp, the earlier-timestamped cell is ignored (or removed at major compaction). Is that the case, or have I totally made that part up? If it is the case, it would seem that I should not use the timestamp/version at all, but rather create another column for this time data and just use the current time in microseconds as the timestamp at insertion.

Finally, with regard to the number of versions: I could possibly have tens or hundreds of millions of versions per row. If I pull back the row and try to loop over the KVs, is this going to read the whole thing into memory at once and possibly cause an OOME?

Apologies for the long mail.

Thanks,
J
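
P.S. In case it helps, here is a toy sketch of how I currently picture the duplicate-timestamp case. It is plain Java and does not use the HBase client API at all. The class VersionCollisionSketch and the helper toMicros are names I made up for illustration, and the sorted map is only my mental model of the versions within a single row/column, not how HBase actually stores them.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy model only, NOT the HBase API: the versions of one row/column,
// keyed by timestamp and kept newest first.
class VersionCollisionSketch {

    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.<Long>reverseOrder());

    // A write to a timestamp that already exists replaces the old value,
    // which is how I assume duplicate-timestamped cells end up "missing".
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    String get(long timestamp) {
        return versions.get(timestamp);
    }

    int versionCount() {
        return versions.size();
    }

    // Hypothetical helper: epoch seconds plus a microsecond offset taken
    // from the log entry, so entries within the same second stay distinct.
    static long toMicros(long epochSeconds, int microsWithinSecond) {
        return epochSeconds * 1_000_000L + microsWithinSecond;
    }

    public static void main(String[] args) {
        VersionCollisionSketch bySecond = new VersionCollisionSketch();
        bySecond.put(1300000000L, "log entry A");
        bySecond.put(1300000000L, "log entry B"); // same second as A
        System.out.println(bySecond.versionCount()); // prints 1

        VersionCollisionSketch byMicro = new VersionCollisionSketch();
        byMicro.put(toMicros(1300000000L, 250), "log entry A");
        byMicro.put(toMicros(1300000000L, 900), "log entry B");
        // An older entry written last, as my MapReduce job might do.
        byMicro.put(toMicros(1299999999L, 10), "older entry");
        System.out.println(byMicro.versionCount()); // prints 3
    }
}
```

In this toy model the order of insertion makes no difference, only the timestamp does. Whether real HBase behaves the same way for out-of-order writes is exactly the part I am unsure about.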