[GitHub] [nifi] adamfisher commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

2020-10-26 Thread GitBox


adamfisher commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-716745290


   Yeah, I like where you're headed with that. Are you thinking that hashing
   across data sets would be the hashing implementation? Then we would have a
   separate DetectDuplicateRecord processor for the Bloom filter
   implementation, which would be the in-memory one.
   
   My time to work on this is limited now, since the use case for it has come
   and gone. I was really hoping to get this pushed through last year, but
   since git jujitsu tripped me up I was never able to get it into the
   mainline.
   
   On Mon, Oct 26, 2020, 2:31 PM Mike wrote:
   
   > Broadly speaking, we have two use cases that don't overlap that much:
   > deduplication over one file vs over a data lake. Given the fact that NiFi
   > follows a Unix-like philosophy of "simple tools that chain well together,"
   > I think the solution we're headed toward may be two processors.
   >
   > I think this processor could work out if we pare it back to in-memory
   > deduplication of a single record set. That way users won't have to turn a
   > lot of knobs and dials to configure it. Then combined with my submission
   > which would require a DMC as it is focused on the entire data lake I think
   > we could hit both use cases with a targeted tool that is fairly intuitive.
   >
   > Thoughts?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [nifi] adamfisher commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

2020-10-26 Thread GitBox


adamfisher commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-716726587


   The cache key identifier could be used to identify the grouping of the data 
set and serve as a key prefix. It sounds like it's just a matter of how we store 
record-level hashes. The Bloom filter, stored in one cache record, is necessary, 
but its size is fixed regardless of the data set size; it is driven only by the 
capacity set on the Bloom filter during initial configuration. We could put a 
maximum on that value to help protect the user from corruption. Storing 
individual hash sets for each record is obviously already an option with this 
implementation, but it's good to be able to use Bloom filters because I think 
there will be scenarios where people don't care about exactness in unique 
records, because it's primarily used to eliminate a lot of duplicate data 
processing. 
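
   To illustrate the fixed-size point, here is a minimal Bloom filter sketch 
in Java. It is not the code in this PR: the class name `RecordBloomFilter`, 
the FNV-1a based double hashing, and the `prefix:hash` key layout are all 
assumptions made for illustration. The bit array is sized once from the 
configured capacity and target false positive rate, and it never grows no 
matter how many record keys are added.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter; hypothetical, not the processor's implementation. */
final class RecordBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RecordBloomFilter(int expectedInsertions, double falsePositiveRate) {
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        double ln2 = Math.log(2);
        this.numBits = Math.max(1,
                (int) Math.ceil(-expectedInsertions * Math.log(falsePositiveRate) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedInsertions * ln2));
        this.bits = new BitSet(numBits);
    }

    /** Size in bits; fixed at construction, independent of how many keys are added. */
    int sizeInBits() {
        return numBits;
    }

    /**
     * Adds the key and reports whether it was possibly seen before.
     * A false result is exact; a true result may be a false positive.
     */
    boolean putAndCheck(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // odd step for double hashing
        boolean seen = true;
        for (int i = 0; i < numHashes; i++) {
            int idx = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        return seen;
    }

    // 32-bit FNV-1a with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }
}
```

   A caller would build the key as something like `cacheKeyId + ":" + 
recordHash` (an assumed layout) and treat a `true` result from `putAndCheck` 
as a probable duplicate. 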







[GitHub] [nifi] adamfisher commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

2020-10-25 Thread GitBox


adamfisher commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-716163041


   It's been a while since we discussed this, so I'm not familiar with the UUID 
strategy. If you are talking about the UUID on a flow file, then that could 
already be specified using the cache key identifier option. User-defined record 
path values could also be used to capture a UUID that appears in the record 
data. 
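
   As a rough sketch of the exact (non-Bloom) variant, the Java below builds 
a per-record key from a prefix plus a SHA-256 hash of selected field values, 
and keeps only the first record for each key. The class `RecordKeyDedup`, its 
method names, and the use of plain maps in place of NiFi records and record 
paths are assumptions for illustration, not the PR's API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative exact deduplication; hypothetical, not the processor's implementation. */
final class RecordKeyDedup {

    /** Builds "prefix:sha256(field values)" from the chosen fields of one record. */
    static String recordKey(String prefix, Map<String, String> record, List<String> fields) {
        StringBuilder joined = new StringBuilder();
        for (String field : fields) {
            // A NUL separator keeps ("ab","c") distinct from ("a","bc").
            joined.append(record.get(field)).append('\0');
        }
        return prefix + ":" + sha256Hex(joined.toString());
    }

    static String sha256Hex(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the records whose key has not been seen earlier in the list. */
    static List<Map<String, String>> unique(String prefix,
                                            List<Map<String, String>> records,
                                            List<String> fields) {
        Set<String> seen = new HashSet<>();
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> record : records) {
            if (seen.add(recordKey(prefix, record, fields))) {
                out.add(record);
            }
        }
        return out;
    }
}
```

   Passing a UUID field name in `fields` corresponds to the case where the 
UUID appears in the record data; the prefix stands in for the cache key 
identifier. 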







[GitHub] [nifi] adamfisher commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

2020-10-23 Thread GitBox


adamfisher commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-715482458


   I'm sorry, but I don't know what to use for the patch file. I'm a total newb 
to git when it comes to these more advanced scenarios. 







[GitHub] [nifi] adamfisher commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

2020-10-22 Thread GitBox


adamfisher commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-714820284


   Last I remember, I got the tests running, but it was really hard to get the 
whole framework to build. It's super close. Please let me know if you need 
anything. I did add commit permissions to this branch if you want to work on 
this one. If I recall, it's really just some test tweaking and then it's ready 
to merge. 







[GitHub] [nifi] adamfisher commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

2020-06-30 Thread GitBox


adamfisher commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-651851517


   @MikeThomsen I tried following your steps for rebasing and I thought it all 
went OK, but I seem to have a lot more commits now. Would you be able to advise? 
I'm not a git expert.


