[ 
https://issues.apache.org/jira/browse/NIFI-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904753#comment-14904753
 ] 

ASF GitHub Bot commented on NIFI-988:
-------------------------------------

Github user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-142654373
  
    I have several tracking event files containing user interactions 
(e.g. user.x liked item.y) in the following format:
    
    |UserId  | Action | ItemId |
    | ------------- | ------------- | ------------- |
    | user.x | like  | item.y |
    | user.xx | like  | item.z |
    | ... | | |
    
    I need to enrich these event files, e.g. with the title of the associated 
item, from a separate item file that contains the item metadata:
    
    |ItemId  | Title |
    | ------------- | ------------- |
    | item.y | Title for item.y  |
    | item.z | Title for item.z  |
    |...||
    
    and the enriched event file should look like this:
    
    |UserId  | Action | ItemId | Title |
    | ------------- | ------------- | ------------- | ------------- |
    | user.x | like  | item.y | Title for item.y|
    | user.xx | like  | item.z | Title for item.z|
    
    My idea was to cache the item file in a distributed cache, since caching 
is typical controller service functionality, and to use the same cache to 
enrich the event files one by one, looking up the title by ItemId. That way I 
need to read the item file only once. I created a workflow that grabs the item 
file and creates a flow file for each item (each line), with the ItemId added 
as a custom flow file attribute, and then puts those flow files into the 
distributed cache using the PutDistributedMapCache processor. The cache key is 
the custom ItemId attribute, and the metadata is the cache value. During event 
file enrichment I use this item catalogue cache to look up an ItemId and get, 
e.g., the title. 
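
    The lookup described above can be sketched roughly as follows. This is 
only an illustration and not the actual NiFi workflow: a plain Python dict 
stands in for the distributed map cache, the files are assumed to be CSV, and 
the function names `load_item_cache` and `enrich_events` are hypothetical.

```python
import csv
import io


def load_item_cache(item_text):
    """Read the item file once and map ItemId -> Title.

    The returned dict is a stand-in for the distributed map cache.
    """
    reader = csv.DictReader(io.StringIO(item_text))
    return {row["ItemId"]: row["Title"] for row in reader}


def enrich_events(event_text, cache):
    """Add a Title column to each event by looking up its ItemId in the cache."""
    reader = csv.DictReader(io.StringIO(event_text))
    enriched = []
    for row in reader:
        row = dict(row)
        # A missing ItemId yields an empty title instead of an error (assumption).
        row["Title"] = cache.get(row["ItemId"], "")
        enriched.append(row)
    return enriched


# Sample data taken from the tables above, written as CSV for the sketch.
ITEMS = "ItemId,Title\nitem.y,Title for item.y\nitem.z,Title for item.z\n"
EVENTS = "UserId,Action,ItemId\nuser.x,like,item.y\nuser.xx,like,item.z\n"

cache = load_item_cache(ITEMS)
for event in enrich_events(EVENTS, cache):
    print(event["UserId"], event["Action"], event["ItemId"], event["Title"])
```

The item file is parsed once into the cache, and every event file can then be 
enriched against the same cache without re-reading the item file.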
    
    (My actual workflow is not this simple, because I also use JSON 
conversion and additional processors.)
    
    DetectDuplicate was not an appropriate processor for me, because (as its 
name suggests) it is used for duplicate detection and caches a custom flow 
file attribute, not the flow file content.
    
    I hope I was able to explain my rationale behind this new processor :-)



> PutDistributedMapCache processor
> --------------------------------
>
>                 Key: NIFI-988
>                 URL: https://issues.apache.org/jira/browse/NIFI-988
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Joe Mészáros
>            Priority: Minor
>              Labels: cache, distributed, feature, new, put
>
> There is a standard controller service, called DistributedMapCacheServer, 
> which provides a distributed cache, and an associated 
> DistributedMapCacheClientService to interact with the cache. But there is no 
> standard processor that puts data into the cache and helps the user 
> leverage the distributed cache capabilities.
> The purpose of PutDistributedMapCache is very similar to the egress 
> processors: it gets the content of a FlowFile and puts it into a distributed 
> map cache, using a cache key computed from FlowFile attributes. If the cache 
> already contains the entry and the cache update strategy is 'keep original', 
> the entry is not replaced.
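
The put semantics in the description above might be sketched as follows. This 
is a minimal illustration under stated assumptions, not the processor's code: a 
dict stands in for the map cache, the key template syntax and the names 
`cache_key` and `put` are hypothetical, and the strategy name "replace" is 
assumed as the counterpart of 'keep original'.

```python
def cache_key(attributes, template):
    """Compute the cache key from FlowFile attributes (hypothetical template syntax)."""
    return template.format(**attributes)


def put(cache, key, content, strategy="replace"):
    """Store the content under key; return True if the cache was written.

    With 'keep original', an existing entry is left untouched.
    """
    if strategy == "keep original" and key in cache:
        return False
    cache[key] = content
    return True


store = {}
key = cache_key({"ItemId": "item.y"}, "{ItemId}")
first = put(store, key, b"Title for item.y", strategy="keep original")
second = put(store, key, b"Another title", strategy="keep original")
print(first, second, store[key])
```

The second call does not overwrite the entry because the key is already 
present and the strategy is 'keep original'.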



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
