NiFi to draw samples from very large raw data sets

2022-05-19 Thread James McMahon
I have been tasked to draw samples from very large raw data sets for triage analysis. I am to provide multiple sampling methods. Drawing a random sample of N records is one method. A second method is to draw a fixed sample of 1,032 records from stratified defined date boundaries in a set. The latte

Re: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Joe Gresock
James, This sounds like an interesting project. I would recommend RouteOnAttribute with a "sample" property with value "${random():mod(1032):equals(100)}" (the second number could be anything between 0 and 1031), and then routing the "sample" relationship to your sampling path. I'm not sure I un
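To illustrate what the expression above does, here is a minimal Python sketch (not NiFi code). It assumes random() yields a uniformly distributed non-negative integer, so each flowfile reaches the "sample" relationship with probability of roughly 1/1032. The function names and the use of Python's random module are hypothetical stand-ins for illustration only.

```python
import random

def routes_to_sample(rng, modulus=1032, target=100):
    """Mimic ${random():mod(1032):equals(100)}: true for about 1 in `modulus` draws."""
    # Assumption: the random value is a uniform non-negative 63-bit integer.
    value = rng.getrandbits(63)
    return value % modulus == target

def count_sampled(n_flowfiles, seed=0):
    """Count how many of n_flowfiles simulated flowfiles would be routed to 'sample'."""
    rng = random.Random(seed)
    return sum(routes_to_sample(rng) for _ in range(n_flowfiles))
```

The number of selected flowfiles varies from run to run, so this approach gives an approximate sampling rate rather than an exact sample size.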

Re: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Joe Gresock
Also, I just realized I misread your sampling requirement. You would use the approach above if you wanted to sample *every 1032nd flowfile*, but you want a sample size of 1032 total. You can still use a randomizing selection approach as I described (though your mod value would depend on what freq
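For a fixed total sample size, one commonly used alternative is reservoir sampling. This is a different technique from the mod-based selection described in the message above, and the sketch below is plain Python rather than NiFi configuration. The function name, the seed parameter, and the choice of the classic "Algorithm R" variant are assumptions made for illustration.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

# Hypothetical usage: draw 1,032 records from a stream of one million.
sample = reservoir_sample(range(1_000_000), 1032, seed=7)
```

The stream is read once and only k records are held in memory, which is why this style of sampling suits very large inputs whose size is not known in advance.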

Re: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Matt Burgess
If you have large FlowFiles and are trying to sample records from each, you can use SampleRecord. It has Interval Sampling, Probabilistic Sampling, and Reservoir Sampling strategies, and I have a PR [1] up to add Range Sampling [2]. Regards, Matt [1] https://github.com/apache/nifi/pull/5878 [2] h
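As a conceptual illustration of interval sampling, the Python sketch below keeps every Nth record of a stream. It does not reproduce SampleRecord's actual properties or its exact choice of starting record. The function name and the convention of keeping the Nth, 2Nth, ... records are assumptions.

```python
from itertools import islice

def interval_sample(records, interval):
    """Return every `interval`-th record (the Nth, 2Nth, ...) from an iterable."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    # islice skips the first interval - 1 records, then steps by interval.
    return list(islice(records, interval - 1, None, interval))

print(interval_sample(range(1, 11), 3))  # -> [3, 6, 9]
```

Interval sampling is deterministic, so the sample size depends on the input length. Probabilistic and reservoir strategies trade that predictability for randomness.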

RE: NiFi to draw samples from very large raw data sets

2022-05-19 Thread Hendrik Ruijter
e.items:gt(0):and(${vodafone.items:lt(${vodafone.total})})} From: James McMahon Sent: 19 May 2022 12:21 To: users@nifi.apache.org Subject: NiFi to draw samples from very large raw data sets

Re: NiFi to draw samples from very large raw data sets

2022-05-23 Thread James McMahon
${vodafone.items:gt(0):and(${vodafone.items:lt(${vodafone.total})})} > From: James McMahon > Sent: 19 May 2022 12:21 > To: users@nifi.apache.org > Subject: NiFi to draw samples from very large raw data sets