Hello, I need to sample 1 million rows from a large HBase table. What is an efficient way of doing this?
I thought about using a RandomRowFilter on a scan of the source table, combined with a Mapper, to get approximately the right number of rows. However, since MapReduce counters cannot be reliably retrieved while a job is running, I would need an external counter to keep track of the number of sampled records and stop the job at 1 million.

A variation would be to apply both a RandomRowFilter and a KeyOnlyFilter on the scan, and then open a connection to the source table inside each mapper to retrieve the values for the row keys.

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,
/David
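For context on why the count is only approximate: RandomRowFilter takes a `chance` parameter and passes each row independently with that probability (the HBase side would look roughly like `scan.setFilter(new RandomRowFilter(chance))`, with `chance = targetSampleSize / estimatedTableSize`), so the resulting sample size follows a binomial distribution rather than hitting the target exactly. The sketch below is a plain-Java simulation of that behavior, not HBase code; the table size and target are made-up numbers for illustration:

```java
import java.util.Random;

public class SampleEstimate {

    // Models RandomRowFilter: each row passes independently with
    // probability `chance`, so the kept count is only approximate.
    static int simulate(long totalRows, float chance, long seed) {
        Random rng = new Random(seed);
        int kept = 0;
        for (long i = 0; i < totalRows; i++) {
            if (rng.nextFloat() < chance) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long totalRows = 1_000_000L;               // hypothetical table size
        long target = 10_000L;                     // desired sample size
        float chance = (float) target / totalRows; // 0.01f
        int kept = simulate(totalRows, chance, 42L);
        // Binomial standard deviation is sqrt(n * p * (1 - p)) ~ 100 here,
        // so the count should land close to, but rarely exactly on, target.
        System.out.println(kept);
    }
}
```

The spread around the target illustrates why an exact cutoff at 1 million would still need the external counter mentioned above (or a slightly over-sampled scan that is trimmed afterwards).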