This is less about randomly sampling a single column and more about picking rows at random based on the values in a column, while still returning the whole row...

On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <[email protected]> wrote:

>
> The answer is probably yes. (Get it?)
>
> If you just want a random sample of one column, try random() < p as a
> qualifier in the where clause.
>
> If you want samples where the likelihood varies with the value of a
> column, the answer is slightly more elaborate. For instance, suppose you
> want about a thousand samples from each city in the data. This means that
> you should have p=1 for all cities where there are less than a thousand
> samples at all and p=1000/n where n is the number of samples for the
> current city. So what you want is a two pass query that counts the cities
> and then uses these counts to get probabilities. I am not up for typing
> that on a phone, but it should be straightforward.
>
> This same task can be done in a single pass by using what is called
> reservoir sampling. You can use two levels of reservoir sampling with a
> counter to bias the results but that will require a user defined aggregator
> that can work on two levels and I don't think that is possible/easy yet
> with drill.
>
> Sent from my iPhone
>
> > On Nov 28, 2016, at 8:27, John Omernik <[email protected]> wrote:
> >
> > Is there a way to grab a random return of data from Drill?
> >
> > For example, let's say I have a table with 1 billion rows, and I want to
> > return 100,000 at random based on a sampling of a specific column... is
> > that possible?
> >
> > Thanks
> >
> > John
>
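
For anyone reading along, here is a rough sketch of the two approaches described in the quoted message, written in plain Python rather than as a Drill query. It is only an illustration of the logic, not something Drill provides: the function names, the "city" key, and the dict-per-row layout are all assumptions made up for the example. The first function is the two-pass version (count per city, then keep each row with probability p = min(1, target/n)); the second is a one-pass reservoir sample kept separately for each city. Both return whole rows.

```python
import random
from collections import Counter, defaultdict


def two_pass_sample(rows, key, target, rng=None):
    """Two passes: count rows per group, then keep each row with
    probability p = min(1, target / n) for its group.
    Yields roughly `target` rows per group (the count is not exact).
    `rows` must be a sequence that can be iterated twice."""
    rng = rng or random.Random()
    counts = Counter(row[key] for row in rows)          # pass 1: n per group
    return [
        row for row in rows                             # pass 2: filter
        if rng.random() < min(1.0, target / counts[row[key]])
    ]


def reservoir_sample_per_group(rows, key, k, rng=None):
    """One pass: classic reservoir sampling (Algorithm R), with a
    separate reservoir and counter per group. Returns at most k whole
    rows per group, and exactly k when the group has at least k rows."""
    rng = rng or random.Random()
    reservoirs = defaultdict(list)   # group value -> sampled rows
    seen = defaultdict(int)          # group value -> rows seen so far
    for row in rows:
        group = row[key]
        seen[group] += 1
        if len(reservoirs[group]) < k:
            reservoirs[group].append(row)
        else:
            # Replace a random slot with probability k / seen[group].
            j = rng.randrange(seen[group])
            if j < k:
                reservoirs[group][j] = row
    return dict(reservoirs)


# Hypothetical data: one large city and one small one.
rows = [{"city": "A", "id": i} for i in range(5000)]
rows += [{"city": "B", "id": i} for i in range(30)]

approx = two_pass_sample(rows, "city", 1000, random.Random(1))
exact = reservoir_sample_per_group(rows, "city", 1000, random.Random(1))
```

The two-pass version maps naturally onto a count-then-join query, while the reservoir version keeps memory bounded at k rows per group, which is the part that would need a custom aggregator.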
