Although I have no idea of your use case, I would be surprised if, during
sampling, you wanted to stop exactly at the 1M mark.

Here is one approach you might use:
Maybe if you store the total count of rows separately, say 90M, then you can
randomly pick 1 in 90 rows in your MR job while doing a global scan. If your
keys are uniformly distributed, you can use mod-ranges and prefix filters to
achieve that. This way, you don't have to instrument your MR jobs to monitor
their progress.
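
Here is a minimal sketch of that job setup, assuming a source table named
"bigtable" and a mapper class SampleMapper (both placeholder names);
RandomRowFilter lets each row through independently with the given
probability, so 90M rows yield roughly 1M in expectation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class SampleJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-1-in-90-sample");
        job.setJarByClass(SampleJob.class);
        job.setNumReduceTasks(0);   // map-only; output setup omitted here

        // Each row passes the filter with p = 1/90, so a 90M-row table
        // yields ~1M sampled rows in expectation.
        Scan scan = new Scan();
        scan.setCaching(500);        // batch rows per RPC
        scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan
        scan.setFilter(new RandomRowFilter(1.0f / 90));

        TableMapReduceUtil.initTableMapperJob(
            "bigtable",              // source table -- placeholder name
            scan,
            SampleMapper.class,      // placeholder mapper, e.g. identity emit
            ImmutableBytesWritable.class,
            Result.class,
            job);

        job.waitForCompletion(true);
    }
}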

A drawback with this approach, though, is that it is a full scan. But you may
use the basic idea above and restrict the global scan to a more limited key
range for efficiency, at the cost of some sampling randomness.
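
For instance, reusing the job setup above and changing only the Scan,
assuming row keys are uniformly distributed hex strings (an assumption
about your schema; adjust the range to your key format):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Scan only keys starting with '0' -- about 1/16 of a uniform hex key
// space -- and sample 16x more densely so the expected total stays ~1M.
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("0"));
scan.setStopRow(Bytes.toBytes("1"));   // exclusive; covers the '0...' prefix
scan.setFilter(new RandomRowFilter(16.0f / 90));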

hth,
Abhishek

-----Original Message-----
From: David Koch [mailto:ogd...@googlemail.com] 
Sent: Friday, October 12, 2012 8:05 AM
To: user@hbase.apache.org
Subject: Efficient way to sample from large HBase table.

Hello,

I need to sample 1 million rows from a large HBase table. What is an efficient
way of doing this?

I thought about a RandomRowFilter on a scan of the source table to get
approximately the right number of rows, in combination with a Mapper.
However, since MapReduce counters cannot be reliably retrieved while the job
is running, I would need an external counter to keep track of the number of
sampled records and stop the job at 1 million.
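
A bare-bones sketch of such a mapper (class and counter names are
illustrative); the counter totals sampled rows but is only dependable once
the job finishes, hence the need for external tracking to stop at 1 million:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class SampleMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result row, Context context)
            throws IOException, InterruptedException {
        // The RandomRowFilter on the scan has already thinned the rows;
        // just count them and emit.
        context.getCounter("sampling", "rows").increment(1);
        context.write(key, row);
    }
}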

A variation would be to apply a RandomRowFilter as well as a KeyOnlyFilter on
the scan, and then open a connection to the source table inside each mapper to
retrieve the values for each sampled row key, as sketched below.
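
A sketch of this variation, again with "bigtable" as a placeholder table
name; the scan strips values via KeyOnlyFilter, and each mapper re-fetches
the full rows:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class KeyFetchMapper extends TableMapper<ImmutableBytesWritable, Result> {

    // Scan setup for the job driver: sample at ~1/90 and return keys only.
    public static Scan keyOnlyScan() {
        Scan scan = new Scan();
        scan.setFilter(new FilterList(
            new RandomRowFilter(1.0f / 90),
            new KeyOnlyFilter()));
        return scan;
    }

    private HTable source;  // one connection per mapper task

    @Override
    protected void setup(Context context) throws IOException {
        source = new HTable(context.getConfiguration(), "bigtable"); // placeholder
    }

    @Override
    protected void map(ImmutableBytesWritable key, Result keyOnly, Context context)
            throws IOException, InterruptedException {
        // Fetch the full row for the sampled key from the source table.
        Result full = source.get(new Get(keyOnly.getRow()));
        context.write(key, full);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        source.close();
    }
}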

If there is a simpler, more efficient way, I would be glad to hear about it.

Thank you,

/David
