[
https://issues.apache.org/jira/browse/NIFI-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904753#comment-14904753
]
ASF GitHub Bot commented on NIFI-988:
-------------------------------------
Github user joemeszaros commented on the pull request:
https://github.com/apache/nifi/pull/92#issuecomment-142654373
I have several tracking event files containing user interactions (e.g.
user.x liked item.y) in the following format:
|UserId | Action | ItemId |
| ------------- | ------------- | ------------- |
| user.x | like | item.y |
| user.xx | like | item.z |
| ... | ... | ... |
I need to enrich these event files, e.g. with the title of the associated
item, taken from a separate item file that contains the item metadata:
|ItemId | Title |
| ------------- | ------------- |
| item.y | Title for item.y |
| item.z | Title for item.z |
| ... | ... |
and the enriched event file should look like this:
|UserId | Action | ItemId | Title |
| ------------- | ------------- | ------------- | ------------- |
| user.x | like | item.y | Title for item.y|
| user.xx | like | item.z | Title for item.z|
My idea was to cache the item file in a distributed cache, since caching is
typical controller service functionality, and to use that cache to enrich the
event files one by one, looking up the title by ItemId. That way I need to
read the item file only once. I created a workflow that fetches the item file
and creates a flow file for each item (each line), with the ItemId added as a
custom flow file attribute, and puts those flow files into the distributed
cache using the PutDistributedMapCache processor. The cache key is the custom
ItemId attribute, and the item metadata is the cache value. During event file
enrichment I use this item catalogue cache to look up an ItemId and get, e.g.,
the title.
(My actual workflow is not quite this simple, because I also use JSON
conversion and additional processors.)
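
To illustrate the lookup logic only, here is a minimal sketch in plain Java. It is not NiFi code: a `HashMap` stands in for the distributed map cache, and the class name `EventEnricher`, its method names, and the `|`-separated line format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a plain HashMap stands in for the distributed map cache.
class EventEnricher {

    // Read the item file once: each "ItemId|Title" line becomes one cache entry,
    // keyed by ItemId (the role played by the custom flow file attribute).
    static Map<String, String> loadItemCache(List<String> itemLines) {
        Map<String, String> cache = new HashMap<>();
        for (String line : itemLines) {
            String[] cols = line.split("\\|", 2);
            cache.put(cols[0].trim(), cols[1].trim());
        }
        return cache;
    }

    // Append the cached title to each "UserId|Action|ItemId" event line.
    // Unknown ItemIds get an empty title (an assumption of this sketch).
    static List<String> enrich(List<String> eventLines, Map<String, String> cache) {
        List<String> enriched = new ArrayList<>();
        for (String line : eventLines) {
            String[] cols = line.split("\\|");
            String itemId = cols[2].trim();
            enriched.add(line + "|" + cache.getOrDefault(itemId, ""));
        }
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, String> cache = loadItemCache(Arrays.asList(
                "item.y|Title for item.y",
                "item.z|Title for item.z"));
        for (String row : enrich(Arrays.asList("user.x|like|item.y", "user.xx|like|item.z"), cache)) {
            System.out.println(row);
        }
    }
}
```

The point of the design is that the item file is parsed once into the cache, and every later event file only performs key lookups against it.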
The DetectDuplicate processor was not appropriate for me, because (as its
name suggests) it is used for duplicate detection and caches a custom flow
file attribute, not the flow file content.
I hope I was able to explain the rationale behind this new processor :-)
> PutDistributedMapCache processor
> --------------------------------
>
> Key: NIFI-988
> URL: https://issues.apache.org/jira/browse/NIFI-988
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Joe Mészáros
> Priority: Minor
> Labels: cache, distributed, feature, new, put
>
> There is a standard controller service, called DistributedMapCacheServer,
> which provides a distributed cache, and an associated
> DistributedMapCacheClientService to interact with the cache. But there is no
> standard processor that puts data into the cache and helps the user
> leverage the distributed cache capabilities.
> The purpose of PutDistributedMapCache is very similar to the egress
> processors: it gets the content of a FlowFile and puts it to a distributed
> map cache, using a cache key computed from FlowFile attributes. If the cache
> already contains the entry and the cache update strategy is 'keep original',
> the entry is not replaced.
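
The 'keep original' behaviour described in the issue could be sketched roughly as follows. This is plain Java against a local `Map`, not the NiFi cache client API; the class name `CachePutSketch`, the `UpdateStrategy` enum, and the `REPLACE` strategy name are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the two update strategies; a local Map stands in for the cache.
class CachePutSketch {

    // Strategy names are assumptions; the issue only names 'keep original'.
    enum UpdateStrategy { REPLACE, KEEP_ORIGINAL }

    // Returns true if the value was stored, false if an existing entry was kept.
    static boolean put(Map<String, String> cache, String key, String value, UpdateStrategy strategy) {
        if (strategy == UpdateStrategy.KEEP_ORIGINAL) {
            // putIfAbsent returns null only when no entry existed for the key.
            return cache.putIfAbsent(key, value) == null;
        }
        cache.put(key, value);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        put(cache, "item.y", "Title for item.y", UpdateStrategy.KEEP_ORIGINAL);
        put(cache, "item.y", "Another title", UpdateStrategy.KEEP_ORIGINAL);
        System.out.println(cache.get("item.y")); // the first value is kept
    }
}
```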
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)