Hi, I'm looking for advice on the best way to structure my row IDs. Monotonically increasing IDs have the very appealing property that I can quickly scan all recently ingested, unprocessed rows: I maintain a "checkpoint" of the most recently processed row, so I only need to scan from that row onward.
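To make the checkpoint idea concrete, here's a rough sketch in plain Python (no Accumulo calls at all). The sorted list stands in for a table whose rows are kept in lexicographic order, and `row_id` and `unprocessed` are names I made up for illustration. It assumes the IDs are zero-padded to a fixed width so that string order matches numeric order.

```python
import bisect

def row_id(n: int, width: int = 12) -> str:
    """Zero-pad the counter so lexicographic order matches numeric order."""
    return f"{n:0{width}d}"

def unprocessed(sorted_rows, checkpoint):
    """Return every row ID that sorts strictly after the checkpoint."""
    start = bisect.bisect_right(sorted_rows, checkpoint)
    return sorted_rows[start:]

# Stand-in for a sorted table holding rows 1..10.
rows = sorted(row_id(n) for n in range(1, 11))

# Rows up to and including 7 have been processed already.
checkpoint = row_id(7)
pending = unprocessed(rows, checkpoint)
```

The point is just that everything I still need to process sits in one contiguous range after the checkpoint.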
Of course, the problem with increasing IDs is that only the lowest-order bits change from one row to the next, so (I think?) new rows always sort to the end of the table and land on the same tablet, which is far from ideal for distributing data across my cluster.

I guess the ways around this are either to reverse the ID, or to define partitions and use the partition ID as the high-order bits of the row ID. Reversing the ID would destroy the property I describe above. Using partitions might preserve it, as long as I use a BatchScanner to read the range after the checkpoint in every partition, but would a BatchScanner play nicely with AccumuloInputFormat?

So many questions. Anyway, there's a pretty good chance that I'm missing something obvious in this analysis. For instance, if it's easy to periodically "rebalance" the data across my tablet servers, then I'd probably just stick with increasing IDs.

I'm very interested to hear your advice, or the pros and cons of any of these approaches.

Thanks,
-Russ
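
P.S. In case it helps, here's a rough sketch of what I mean by the partition approach, again in plain Python with no Accumulo calls. Everything here is hypothetical: `NUM_PARTITIONS`, `partitioned_row_id`, and `pending_ranges` are names I made up, and picking the partition with `n % NUM_PARTITIONS` is just one assumed scheme. The idea is that a single checkpoint turns into one range per partition, which is roughly what I'd expect to hand to a BatchScanner as a collection of ranges.

```python
NUM_PARTITIONS = 4  # hypothetical partition count

def partitioned_row_id(n: int, width: int = 12) -> str:
    """Prefix the zero-padded counter with a two-digit partition ID."""
    partition = n % NUM_PARTITIONS  # assumed partitioning scheme
    return f"{partition:02d}_{n:0{width}d}"

def pending_ranges(checkpoint_n: int, width: int = 12):
    """Build one (start_exclusive, end_exclusive) range per partition."""
    ranges = []
    for p in range(NUM_PARTITIONS):
        start = f"{p:02d}_{checkpoint_n:0{width}d}"
        end = f"{p:02d}_~"  # '~' sorts after every digit
        ranges.append((start, end))
    return ranges

# Stand-in for a sorted table holding rows 1..10.
rows = sorted(partitioned_row_id(n) for n in range(1, 11))

# Scan every partition's range and collect rows newer than checkpoint 7.
found = [
    r
    for start, end in pending_ranges(7)
    for r in rows
    if start < r < end
]
pending_ids = sorted(int(r.split("_")[1]) for r in found)
```

The rows come back grouped by partition rather than in global ID order, so I'd have to sort them (or not care about order) when processing.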