Re: Grabbing Random Sample of rows based on a column

Ted Dunning Mon, 28 Nov 2016 09:45:39 -0800

The answer is probably yes. (Get it?)

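If you just want a uniform random sample of the rows, try random() < p as a 
qualifier in the where clause, where p is the fraction of rows you want to 
keep. For 100,000 rows out of a billion, that is p = 0.0001.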
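
As a rough illustration of what that filter does, here is a minimal Python 
sketch (not Drill SQL). The function name bernoulli_sample and its parameters 
are made up for this example. Each row is kept independently with probability 
p, so the sample size is only approximately p times the row count.

```python
import random

def bernoulli_sample(rows, p, rng=None):
    """Keep each row independently with probability p.

    Mirrors the effect of a `random() < p` filter: the result size is
    random, with an expected value of p * len(rows).
    """
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform on [0, 1), so p = 1 keeps every row
    # and p = 0 keeps none.
    return [row for row in rows if rng.random() < p]

if __name__ == "__main__":
    rows = list(range(1_000_000))
    sample = bernoulli_sample(rows, 0.0001)
    print(len(sample))  # roughly 100, varies from run to run
```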

If you want samples where the likelihood varies with the value of a column, the 
answer is slightly more elaborate. For instance, suppose you want about a 
thousand samples from each city in the data. This means that you should have 
p=1 for every city that has fewer than a thousand rows in total, and p=1000/n 
otherwise, where n is the number of rows for the current city. So what you 
want is a two-pass query that counts the rows for each city and then uses these 
counts to get probabilities. I am not up for typing that on a phone, but it 
should be straightforward. 
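
The two-pass logic can be sketched in Python (again, not Drill SQL, and not a 
tested query). The names sample_per_group, key and target are assumptions made 
for this example. The first pass counts rows per city, the second keeps each 
row with probability min(1, target/n).

```python
import random
from collections import Counter

def sample_per_group(rows, key, target=1000, rng=None):
    """Two-pass sampling: about `target` rows per group.

    `rows` must be re-iterable (e.g. a list), since it is scanned twice.
    `key` extracts the grouping value (e.g. the city) from a row.
    """
    if rng is None:
        rng = random.Random()
    # Pass 1: count the rows in each group.
    counts = Counter(key(row) for row in rows)
    # Pass 2: keep each row with probability min(1, target / n),
    # where n is the row count of that row's group.
    return [
        row for row in rows
        if rng.random() < min(1.0, target / counts[key(row)])
    ]
```

Small groups come back whole, and large groups come back with roughly, not 
exactly, `target` rows, because every row is an independent coin flip.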

This same task can be done in a single pass by using what is called reservoir 
sampling. You can use two levels of reservoir sampling with a counter to bias 
the results, but that will require a user-defined aggregator that can work on 
two levels, and I don't think that is possible/easy yet with Drill. 
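
For reference, here is a minimal single-pass sketch in Python. It is a 
simplification, not the two-level scheme mentioned above: it just keeps one 
ordinary reservoir (Algorithm R) per city. The names reservoir_per_group, key 
and k are assumptions made for this example, and nothing here is a Drill API.

```python
import random
from collections import defaultdict

def reservoir_per_group(rows, key, k=1000, rng=None):
    """Single-pass sampling: a uniform sample of up to k rows per group.

    Keeps one reservoir per group, so memory grows with
    k * (number of distinct groups).
    """
    if rng is None:
        rng = random.Random()
    reservoirs = defaultdict(list)  # group -> sampled rows
    seen = defaultdict(int)         # group -> rows seen so far
    for row in rows:
        group = key(row)
        seen[group] += 1
        reservoir = reservoirs[group]
        if len(reservoir) < k:
            # The first k rows of a group always go in.
            reservoir.append(row)
        else:
            # Pick a slot uniformly among all rows seen so far in this
            # group; the new row replaces an old one only if the slot
            # falls inside the reservoir (probability k / seen).
            j = rng.randrange(seen[group])
            if j < k:
                reservoir[j] = row
    return dict(reservoirs)
```

Unlike the random() < p filter, this returns exactly min(n, k) rows for a city 
with n rows, at the cost of holding every reservoir in memory.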

Sent from my iPhone

> On Nov 28, 2016, at 8:27, John Omernik <[email protected]> wrote:
> 
> Is there a way to grab a random return of data from Drill?
> 
> For example, let's say I have a table with 1 billion rows, and I want to
> return 100,000 at random based on a sampling of a specific column... is
> that possible?
> 
> Thanks
> 
> John
