I have been tasked with drawing samples from very large raw data sets for
triage analysis. I am to provide multiple sampling methods. Drawing a random
sample of N records is one method. A second method is to draw a fixed
sample of 1,032 records from strata defined by date boundaries in a set.
The latte
James,
This sounds like an interesting project. I would recommend
RouteOnAttribute with a "sample" property with value
"${random():mod(1032):equals(100)}" (the second number could be anything
between 0 and 1031), and then routing the "sample" relationship to your
sampling path. I'm not sure I un
Also, I just realized I misread your sampling requirement. You would use
the approach above if you wanted to sample roughly *one in every 1032
flowfiles*, but you want a sample size of 1032 total. You can still use a
randomizing
selection approach as I described (though your mod value would depend on
what freq
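
To make the dependence on the mod value concrete, here is a minimal Python
sketch (not NiFi code) of what an expression such as
"${random():mod(1032):equals(100)}" does: each flowfile is kept with
probability 1/modulus, so the expected sample size is the total count divided
by the modulus. The helper names is_sampled and modulus_for_target are
hypothetical, and the sketch assumes random() yields a non-negative whole
number and that the total number of flowfiles is known in advance.

```python
import random


def is_sampled(modulus: int, target: int, rng: random.Random) -> bool:
    """Mimic random():mod(modulus):equals(target) for one flowfile.

    Assumption: the random value is a non-negative 63-bit whole number.
    """
    return rng.randrange(2**63) % modulus == target


def modulus_for_target(total_flowfiles: int, sample_size: int) -> int:
    """Pick a modulus so the expected number of hits is about sample_size."""
    return max(1, total_flowfiles // sample_size)


if __name__ == "__main__":
    rng = random.Random()
    total = 1_032_000
    modulus = modulus_for_target(total, 1032)
    hits = sum(is_sampled(modulus, 0, rng) for _ in range(total))
    # The count varies from run to run; it is only close to 1032 on average.
    print(modulus, hits)
```

Because the selection is probabilistic, the number of sampled flowfiles only
approximates the target, which is why a fixed sample size needs a different
strategy.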
If you have large FlowFiles and are trying to sample records from
each, you can use SampleRecord. It has Interval Sampling,
Probabilistic Sampling, and Reservoir Sampling strategies, and I have
a PR [1] up to add Range Sampling [2].
Regards,
Matt
[1] https://github.com/apache/nifi/pull/5878
[2] h
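
For a fixed sample size such as 1,032 records, reservoir sampling is the
strategy that returns exactly that many records without knowing the input size
up front. Below is a minimal Python sketch of the classic technique
(Algorithm R). It is an illustration only, not the SampleRecord
implementation, and the function name reservoir_sample is hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return k records chosen uniformly at random from a stream.

    If the stream holds fewer than k records, all of them are returned.
    """
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 1032, random.Random())
    print(len(sample))
```

The stream is read once and only k records are held in memory, which suits
very large inputs.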
${vodafone.items:gt(0):and(${vodafone.items:lt(${vodafone.total})})}
>
> *From:* James McMahon
> *Sent:* 19 May 2022 12:21
> *To:* users@nifi.apache.org
> *Subject:* NiFi to draw samples from very large raw data sets