The answer is probably yes. (Get it?) If you just want a random sample of one column, try random() < p as a qualifier in the WHERE clause, where p is the fraction of rows you want to keep.
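To make the random() < p idea concrete, here is a minimal sketch in plain Python rather than Drill SQL. The function name bernoulli_sample and the demo numbers are made up for illustration. Each row is kept independently with probability p, so the sample size is only approximately p times the row count, not exact. For the question as asked (100,000 rows out of 1 billion), p would be 100,000 / 1,000,000,000 = 0.0001.

```python
import random


def bernoulli_sample(rows, p, rng):
    """Keep each row independently with probability p.

    This mirrors a `random() < p` filter: one random draw per row,
    no knowledge of the total row count required.
    """
    return [row for row in rows if rng.random() < p]


# Hypothetical demo: one million row ids, keep roughly 1% of them.
rng = random.Random(42)  # seeded only so the demo is repeatable
sample = bernoulli_sample(range(1_000_000), 0.01, rng)
print(len(sample))  # close to 10,000, but not exactly 10,000
```

If you need exactly 100,000 rows, one option is to pick p slightly too large and trim the result with a LIMIT.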
If you want samples where the likelihood varies with the value of a column, the answer is slightly more elaborate. For instance, suppose you want about a thousand samples from each city in the data. That means p = 1 for every city with fewer than a thousand rows in total, and p = 1000/n otherwise, where n is the number of rows for the current city. So what you want is a two-pass query that counts the rows per city and then uses those counts to get the probabilities. I am not up for typing that on a phone, but it should be straightforward.

The same task can be done in a single pass using what is called reservoir sampling. You can use two levels of reservoir sampling with a counter to bias the results, but that will require a user-defined aggregator that can work on two levels, and I don't think that is possible/easy yet with Drill.

Sent from my iPhone

> On Nov 28, 2016, at 8:27, John Omernik <[email protected]> wrote:
>
> Is there a way to grab a random return of data from Drill?
>
> For example, let's say I have a table with 1 billion rows, and I want to
> return 100,000 at random based on a sampling of a specific column... is
> that possible?
>
> Thanks
>
> John
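For reference, the per-city scheme in the reply can be sketched in plain Python. This is an illustration only, not Drill code: the function names, the (city, id) row layout, and the demo data are all assumptions. The first function is the two-pass version (count, then keep each row with p = min(1, k/n)). The second is a simpler single-pass variant that keeps one ordinary reservoir per city, not the two-level reservoir mentioned in the reply.

```python
import random
from collections import Counter, defaultdict


def two_pass_sample(rows, key, k, rng):
    """Pass 1 counts rows per group; pass 2 keeps each row with p = min(1, k/n)."""
    rows = list(rows)  # materialized so it can be scanned twice
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if rng.random() < min(1.0, k / counts[key(r)])]


def reservoir_sample_by_group(rows, key, k, rng):
    """Single pass: keep a size-k reservoir (Algorithm R) for every group."""
    reservoirs = defaultdict(list)
    seen = Counter()
    for r in rows:
        g = key(r)
        seen[g] += 1
        res = reservoirs[g]
        if len(res) < k:
            res.append(r)  # reservoir not full yet: always keep
        else:
            j = rng.randrange(seen[g])  # uniform over rows seen so far
            if j < k:
                res[j] = r  # replace a slot with probability k / seen
    return dict(reservoirs)


# Hypothetical data: one large city and one small one, rows are (city, id).
rows = [("big", i) for i in range(50_000)] + [("small", i) for i in range(300)]
city = lambda r: r[0]

approx = Counter(city(r) for r in two_pass_sample(rows, city, 1000, random.Random(7)))
exact = {c: len(v) for c, v in reservoir_sample_by_group(rows, city, 1000, random.Random(7)).items()}
print(approx)  # "small" has all 300 rows; "big" has roughly 1000
print(exact)   # "small" has 300 rows; "big" has exactly 1000
```

The two-pass version gives about a thousand rows per large city, while the reservoir version gives exactly a thousand, at the cost of holding up to k rows in memory for every city.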
