Re: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Joe Gresock
James, This sounds like an interesting project. I would recommend RouteOnAttribute with a "sample" property with value "${random():mod(1032):equals(100)}" (the second number could be anything between 0 and 1031), and then routing the "sample" relationship to your sampling path. I'm not sure I un
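As a rough illustration of the RouteOnAttribute suggestion above, the following Python sketch simulates the selection rule outside NiFi. It assumes that `random()` yields a non-negative integer that is approximately uniform, so the expression matches about one flowfile in 1032. The function names `is_sampled` and `sample_rate` are hypothetical and are not part of NiFi.

```python
import random

def is_sampled(rng, modulus=1032, target=100):
    """Mimic ${random():mod(1032):equals(100)} for one flowfile.

    Assumption: the random draw is a non-negative integer, here 63 bits wide.
    """
    return rng.getrandbits(63) % modulus == target

def sample_rate(n_flowfiles, modulus=1032, target=100, seed=0):
    """Return the fraction of simulated flowfiles routed to 'sample'."""
    rng = random.Random(seed)
    hits = sum(is_sampled(rng, modulus, target) for _ in range(n_flowfiles))
    return hits / n_flowfiles

if __name__ == "__main__":
    rate = sample_rate(500_000)
    print(f"observed rate: {rate:.6f}, expected about {1 / 1032:.6f}")
```

Any target between 0 and 1031 gives the same expected rate. A target outside that range would never match, so nothing would be routed to the sampling path.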

Re: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Joe Gresock
Also, I just realized I misread your sampling requirement. You would use the approach above if you wanted to sample *every 1032nd flowfile*, but you want a sample size of 1032 total. You can still use a randomizing selection approach as I described (though your mod value would depend on what freq

Re: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Matt Burgess
If you have large FlowFiles and are trying to sample records from each, you can use SampleRecord. It has Interval Sampling, Probabilistic Sampling, and Reservoir Sampling strategies, and I have a PR [1] up to add Range Sampling [2]. Regards, Matt [1] https://github.com/apache/nifi/pull/5878 [2] h
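For readers unfamiliar with reservoir sampling, here is a minimal Python sketch of the textbook technique (Algorithm R), which draws a fixed-size uniform sample from a stream of unknown length in one pass. It is not the SampleRecord implementation and makes no claim about how that processor works internally. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return k items chosen uniformly at random from an iterable.

    Textbook Algorithm R, shown for illustration only.
    If the input has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 1032, seed=42)
    print(len(sample))
```

The memory cost is proportional to the sample size rather than to the input size, which is why this family of methods suits very large record sets.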

RE: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Hendrik Ruijter
Drawing a random sample of N records depends, of course, on the distribution you use. The Expression Language Guide documents a uniform distribution; for example, ${random():mod(10):plus(1)} returns a random number between 1 and 10 inclusive. There are numerous algorithms
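The range claim for `${random():mod(10):plus(1)}` can be checked with a short Python sketch. It assumes the underlying draw is a non-negative integer, and the helper name `uniform_1_to_10` is hypothetical.

```python
import random
from collections import Counter

def uniform_1_to_10(rng):
    """Mimic ${random():mod(10):plus(1)}: reduce mod 10, then add 1."""
    return rng.getrandbits(63) % 10 + 1

def histogram(n_draws, seed=0):
    """Count how often each value appears over n_draws simulated draws."""
    rng = random.Random(seed)
    return Counter(uniform_1_to_10(rng) for _ in range(n_draws))

if __name__ == "__main__":
    counts = histogram(100_000)
    for value in sorted(counts):
        print(value, counts[value])
```

Each of the ten values should appear in roughly one tenth of the draws, with none below 1 or above 10.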

Re: NiFi to draw samples from very large raw data sets

2022-05-23 Thread James McMahon
These replies have all been very helpful and I wanted to get back to you and say thanks. We will have both situations to contend with: large numbers of flowfiles, each representing an atomic record or object, and smaller numbers of very large flowfiles from which we will draw a sample of records. I