How you draw a random sample of N records depends, of course, on the 
distribution you use. The expression language guide documents a uniform 
distribution, e.g.
 ${random():mod(10):plus(1)} returns a random number between 1 and 10 inclusive.
There are numerous algorithms for generating normal (Gaussian) distributions 
from uniform distributions, e.g. Box-Muller. You can create lots of other 
interesting distributions too.
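
To make the uniform-to-normal step concrete, here is a minimal Python sketch (not NiFi code). It assumes random() yields a non-negative 63-bit integer, which is what the modulo expression above relies on; the function names uniform_1_to_n and box_muller are made up for illustration.

```python
import math
import random

def uniform_1_to_n(n, rng):
    # Mirrors ${random():mod(n):plus(1)}: non-negative random integer, mod n, plus 1.
    return rng.randrange(2**63) % n + 1

def box_muller(rng):
    # Box-Muller transform: two independent uniform draws become two
    # independent standard normal draws.
    u1 = 1.0 - rng.random()  # shifted into (0, 1] so log(u1) is always defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
draws = [uniform_1_to_n(10, rng) for _ in range(10_000)]
normals = [z for _ in range(5_000) for z in box_muller(rng)]

mean = sum(normals) / len(normals)
variance = sum((z - mean) ** 2 for z in normals) / len(normals)
print(min(draws), max(draws), round(mean, 3), round(variance, 3))
```

The printed mean and variance should land close to 0 and 1, the parameters of the standard normal distribution.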

A RouteOnAttribute processor with two legs, one for your random sample and 
one for the entire flow, should work. For example, set the attribute 
vodafone.items from the uniform distribution ${random():mod(1000):plus(1)} 
and test it with ${vodafone.items:equals(1)}. That passes roughly one flowfile 
in a thousand, i.e. a 1 per mille sample.
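
A rough Python simulation of that routing rule, to show what kind of sample it gives. The flowfiles are stand-in integers and sample_per_mille is a hypothetical helper, not a NiFi API.

```python
import random

def sample_per_mille(flowfiles, rng):
    sampled = []
    for flowfile in flowfiles:
        # Mirrors ${random():mod(1000):plus(1)}: a value from 1 to 1000.
        items = rng.randrange(2**63) % 1000 + 1
        # Mirrors ${vodafone.items:equals(1)}: keep about 1 in 1000.
        if items == 1:
            sampled.append(flowfile)
    return sampled

rng = random.Random(7)  # fixed seed only to make the sketch repeatable
flowfiles = list(range(200_000))
sampled = sample_per_mille(flowfiles, rng)
print(len(sampled), "of", len(flowfiles), "flowfiles sampled")
```

Note that the sample size itself is random: it averages one per thousand but is not exactly N for any given run.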

I would use an UpdateAttribute processor with state to sample a fixed number of 
flowfiles, but in my humble opinion your use case is not yet detailed enough to 
give a definite answer. A typical pattern is to increment an index with
${getStateValue("vodafone.items"):plus(${vodafone.pagesize})} in the 
UpdateAttribute processor, followed by a RouteOnAttribute processor with two legs:
loop.reset
${vodafone.items:equals(0)}
and loop.next
${vodafone.items:gt(0):and(${vodafone.items:lt(${vodafone.total})})}
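
The counter pattern can be sketched in Python as follows. This is only a model of the two expressions, under the assumptions that the stored value starts at 0 and that a flowfile matching neither leg goes to an "unmatched" route; the dict stands in for processor state and the function names are hypothetical.

```python
def update_attribute(state, pagesize):
    # Mirrors ${getStateValue("vodafone.items"):plus(${vodafone.pagesize})}.
    # Assumption: the stored value starts at 0 when no state exists yet.
    items = state.get("vodafone.items", 0) + pagesize
    state["vodafone.items"] = items
    return items

def route(items, total):
    # Mirrors the two RouteOnAttribute legs.
    if items == 0:
        return "loop.reset"
    if 0 < items < total:
        return "loop.next"
    return "unmatched"  # assumption: neither leg matched

state = {}
# With a page size of 1 and a total of 6, eight flowfiles arrive in turn.
routes = [route(update_attribute(state, 1), 6) for _ in range(8)]
print(routes)
```

With these values the first five flowfiles go to loop.next and the rest fall through, so the counter caps the number of flowfiles taken. That is a "first N" sample rather than a random one.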

From: James McMahon <jsmcmah...@gmail.com>
Sent: 19 May 2022 12:21
To: users@nifi.apache.org
Subject: NiFi to draw samples from very large raw data sets

________________________________
I have been tasked to draw samples from very large raw data sets for triage 
analysis. I am to provide multiple sampling methods. Drawing a random sample of 
N records is one method. A second method is to draw a fixed sample of 1,032 
records from stratified defined date boundaries in a set. The latter is of 
interest because raw data can substantially change structure or even format at 
points in time, and we need to be able to sample within those data boundaries.

Can anyone offer a link to an example of how nifi may be used to draw samples 
randomly and/or in a systematic way from raw data collections?
