If you have large FlowFiles and want to sample records from each
one, you can use the SampleRecord processor. It supports Interval
Sampling, Probabilistic Sampling, and Reservoir Sampling strategies,
and I have a PR [1] up to add Range Sampling [2].
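
For intuition about the reservoir strategy, here is a minimal Python
sketch of the classic reservoir sampling algorithm (Algorithm R). It
is not NiFi's implementation, and the function name and parameters
are made up for illustration. It draws a uniform random sample of k
records from a stream in a single pass without knowing the stream's
length in advance:

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from `records`.

    Hypothetical helper for illustration only; not part of NiFi.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Pick a slot in [0, i]; the record is kept only if the
            # slot falls inside the reservoir, i.e. with probability
            # k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Draw 1,032 records from a stand-in stream of 100,000 record ids.
sample = reservoir_sample(range(100_000), 1032, seed=42)
```

If the input holds fewer than k records, the sketch simply returns
all of them.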

Regards,
Matt

[1] https://github.com/apache/nifi/pull/5878
[2] https://issues.apache.org/jira/browse/NIFI-9814

On Thu, May 19, 2022 at 6:20 AM James McMahon <jsmcmah...@gmail.com> wrote:
>
> I have been tasked to draw samples from very large raw data sets for triage 
> analysis. I am to provide multiple sampling methods. Drawing a random sample 
> of N records is one method. A second method is to draw a fixed sample of 
> 1,032 records from stratified defined date boundaries in a set. The latter is 
> of interest because raw data can substantially change structure or even 
> format at points in time, and we need to be able to sample within those data 
> boundaries.
>
> Can anyone offer a link to an example of how NiFi may be used to draw samples 
> randomly and/or in a systematic way from raw data collections?
