date:20200815

Detect duplicate records

2020-08-15 Thread Robert R. Bruno

I wanted to see if anyone knows of a clever way to detect duplicate records, much like you can with entire flow files using DetectDuplicate. I'd really rather not split my records into individual flow files, since this flow is such high volume. Thanks so much in advance.

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

In theory, I would think you could use ExecuteStreamCommand to run the operating system's built-in sort command and keep only the unique records. The Windows sort command has an undocumented unique option, and the sort command on Linux distros has a unique option as well. On Sat, Aug 15, 2020 at 5:53 AM
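A minimal sketch of the same idea as a small script, for the case where each record sits on its own line (for example CSV rows or newline-delimited JSON). The function names `unique_lines` and `main` are hypothetical, and the assumption is that ExecuteStreamCommand hands the flow file content to the command on standard input and reads the result from standard output. Unlike a sorting-based approach, this keeps the original record order.

```python
import sys
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        key = line.rstrip("\r\n")  # compare lines without their line endings
        if key not in seen:
            seen.add(key)
            yield key


def main() -> None:
    """Hypothetical entry point: filter standard input to standard output."""
    for line in unique_lines(sys.stdin):
        sys.stdout.write(line + "\n")
```

Calling `main()` at the bottom of the script would make it usable as a stream filter. Like the sort approach, it only removes duplicates inside a single flow file and holds every distinct line in memory.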

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

This looks interesting as well. https://stackoverflow.com/questions/52674532/remove-duplicates-in-nifi On Sat, Aug 15, 2020 at 10:23 AM Josh Friberg-Wyckoff wrote: > In theory I would think you could use the ExecuteStreamCommand to use the > builtin Operating System sort commands to grab unique

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

Gosh, I should have searched the NiFi resources first. There is a current JIRA for what you are wanting. https://issues.apache.org/jira/browse/NIFI-6047 On Sat, Aug 15, 2020 at 10:35 AM Josh Friberg-Wyckoff wrote: > This looks interesting as well. > https://stackoverflow.com/questions/52674532/remove-d

Re: Detect duplicate records

2020-08-15 Thread Matt Burgess

In addition to the SO answer, if you know all the fields in the record, you can use QueryRecord with SELECT DISTINCT field1,field2... FROM FLOWFILE. The SO answer might be more performant but is more complex, and QueryRecord will do the operations in-memory so it might not handle very large flowfil
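To illustrate what that query does to a batch of records, here is a small sketch that uses Python's built-in SQLite as a stand-in for the SQL engine inside QueryRecord. The table layout, the column types, and the function name `distinct_records` are assumptions made for the example; only the `SELECT DISTINCT field1, field2 FROM FLOWFILE` statement comes from the message above.

```python
import sqlite3
from typing import Iterable, List, Tuple


def distinct_records(records: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Load records into an in-memory table and return the distinct rows."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical two-field schema standing in for the record schema.
        conn.execute("CREATE TABLE FLOWFILE (field1 TEXT, field2 INTEGER)")
        conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?)", records)
        return conn.execute(
            "SELECT DISTINCT field1, field2 FROM FLOWFILE"
        ).fetchall()
    finally:
        conn.close()
```

As in the QueryRecord case, the whole batch is held in memory and duplicates are only removed within that one batch. The order of the returned rows is not guaranteed.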

Re: Detect duplicate records

2020-08-15 Thread James McMahon

If you opt to try a few of these options, please tell us which appeared to be the best from a performance perspective - with our understanding that results may vary depending on the size of the incoming data. It would be very interesting to learn what you found. On Sat, Aug 15, 2020 at 6:53 AM Rob

Re: Detect duplicate records

2020-08-15 Thread Jens M. Kofoed

Just some info about DISTINCT. In MySQL, a UNION is much, much faster than a DISTINCT. DISTINCT creates a new temp table with the result of the query, then sorts it and removes duplicates. If you make a union with a select id=-1, the result is exactly the same: all duplicates are removed. A DISTINCT
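A small sketch of the equivalence described above, run against SQLite rather than MySQL, so it says nothing about relative speed. The table `t`, its columns, and the function name `distinct_vs_union` are hypothetical. The second half of the UNION matches no rows (no row has `id = -1`), so the only effect of the UNION is to remove duplicates from the first half.

```python
import sqlite3
from typing import Iterable, List, Tuple


def distinct_vs_union(names: Iterable[str]) -> Tuple[List[tuple], List[tuple]]:
    """Return (rows via DISTINCT, rows via UNION), each sorted for comparison."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany("INSERT INTO t (name) VALUES (?)", [(n,) for n in names])
        via_distinct = conn.execute("SELECT DISTINCT name FROM t").fetchall()
        # UNION (without ALL) removes duplicate rows from the combined result.
        via_union = conn.execute(
            "SELECT name FROM t UNION SELECT name FROM t WHERE id = -1"
        ).fetchall()
        return sorted(via_distinct), sorted(via_union)
    finally:
        conn.close()
```

Whether the UNION form is actually faster depends on the database engine and its version, so it would need to be measured on the target system.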

Re: Detect duplicate records

2020-08-15 Thread Robert R. Bruno

Sorry, I should have been clearer. My need is to detect whether each record has been seen in the past. So I need a solution that can go record by record against something like a Redis cache, which would tell me whether this is the first time the record has been seen and update the cache accordingly.
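A minimal sketch of that per-record check, assuming each record can be reduced to a stable key by hashing a canonical JSON form. The names `record_key`, `SeenCache`, and `tag_duplicates` are hypothetical, and the in-memory set is only a stand-in for an external cache such as Redis, where a set-if-absent operation (for example `SET` with the `NX` option) could play the role of `add_if_absent`.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Iterator, Set, Tuple


def record_key(record: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so that field order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SeenCache:
    """In-memory stand-in for an external cache shared across batches."""

    def __init__(self) -> None:
        self._keys: Set[str] = set()

    def add_if_absent(self, key: str) -> bool:
        """Store the key and return True only if it was not already present."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def tag_duplicates(
    records: Iterable[Dict[str, Any]], cache: SeenCache
) -> Iterator[Tuple[Dict[str, Any], bool]]:
    """Yield (record, is_duplicate) pairs, updating the cache as it goes."""
    for record in records:
        first_time = cache.add_if_absent(record_key(record))
        yield record, not first_time
```

Because the cache outlives a single batch, a record seen in an earlier flow file is still reported as a duplicate later, which is the behavior asked for here. A real deployment would also need an expiry policy for the stored keys, which this sketch leaves out.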

Re: Detect duplicate records

2020-08-15 Thread Otto Fowler

I was working on something for this but, after discussion with some of the SMEs on the project, decided to shelve it. I don’t think I had gotten to the point of a JIRA. https://apachenifi.slack.com/archives/C0L9S92JY/p1589911056303500 On August 15, 2020 at 14:12:07, Robert R. Bruno (rbru...@gmail.com

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

If that is the case and this is high volume like you say, I would think it would be more efficient to offload the task to a separate program than to have a NiFi processor do it. On Sat, Aug 15, 2020, 2:52 PM Otto Fowler wrote: > I was working on something for this, but in discussion with s

Re: Detect duplicate records

2020-08-15 Thread Robert R. Bruno

Yep, we were leaning towards offloading it to an external program and then putting the data back into NiFi for final delivery. Looks like that will be best, from the sounds of it. Again, thanks all! On Sat, Aug 15, 2020, 16:24 Josh Friberg-Wyckoff wrote: > If that is the case and this is high volume l
