[ 
https://issues.apache.org/jira/browse/NIFI-11945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753453#comment-17753453
 ] 

Peter Kimberley commented on NIFI-11945:
----------------------------------------

In the process of reviewing this processor, I identified the following 
additional problems, which are resolved in the referenced PR:

h2. Other issues resolved
# Expression language attributes `field.name`, `field.value` and `field.type` 
are referenced in the documentation but not implemented. This can be confusing 
for users of this processor. These attributes are removed in favour of a 
simpler `RecordPath` syntax in dynamic properties
# Typos and confusing documentation (e.g. saying deduplication only works on a 
per-file basis in one area, while contradicting this in another)
# Reliance on checking for a map cache key and putting its value as separate 
operations. This is non-atomic, so it is not safe when run using multiple 
workers. The processor now uses the 
`DistributedMapCacheClient::putIfAbsent()` method to achieve atomicity
# NPE when attempting to reference a non-existent record field or one with a 
value of `null`. Added handling to treat this as an empty string.
# Hash set filter code path was never reachable due to incorrect equality check
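
To illustrate the atomicity point above, here is a minimal sketch of the difference between a separate check followed by a put and a single put-if-absent call. It is not the code from the PR: `DedupSketch`, `isDuplicateRacy` and `isDuplicateAtomic` are hypothetical names, and a `ConcurrentMap` stands in for the distributed map cache so the example runs without NiFi on the classpath.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the processor's cache check; not NiFi code.
class DedupSketch {

    // Racy: another worker can insert the same key between the two calls,
    // so both workers may report "not a duplicate".
    static boolean isDuplicateRacy(ConcurrentMap<String, String> cache, String key) {
        if (cache.containsKey(key)) {
            return true;
        }
        cache.put(key, "");
        return false;
    }

    // Atomic: putIfAbsent returns the previous value, or null if this call
    // inserted the key, so exactly one caller sees "not a duplicate".
    static boolean isDuplicateAtomic(ConcurrentMap<String, String> cache, String key) {
        return cache.putIfAbsent(key, "") != null;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        System.out.println(isDuplicateAtomic(cache, "record-key-1")); // false: first sighting
        System.out.println(isDuplicateAtomic(cache, "record-key-1")); // true: repeat
    }
}
```

With the atomic variant, the check and the write are one operation, so two workers handling the same record key cannot both route it to `non-duplicate`.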

h2. Other minor changes
# Removed redundant classes and constants
# Improved test coverage
# Extracted repeated strings as constant members

> DeduplicateRecord does not add keys to distributed map cache
> ------------------------------------------------------------
>
>                 Key: NIFI-11945
>                 URL: https://issues.apache.org/jira/browse/NIFI-11945
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.23.0
>         Environment: Docker
>            Reporter: Peter Kimberley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The `DeduplicateRecord` processor supports the use of a distributed map cache 
> (DMC).
> After generating the record key, it checks for the existence of that key in 
> the cache. It then calls `DistributedMapCacheClientWrapper::put()`, which, 
> in this case, is a no-op. Therefore, a cache entry is never written and records 
> are always routed to the `non-duplicate` relationship.
> The correct behaviour would be for 
> `DistributedMapCacheClientWrapper::contains()` to call 
> `DistributedMapCacheClient::putIfAbsent()`, which would atomically check/set 
> the key in the target cache.
> An additional problem is an NPE where a DMC is used and the 
> `DeduplicateRecord` property `Record Hashing Algorithm` is set to `NONE`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
