[jira] [Updated] (NIFI-7940) Create a ScriptedPartitionRecord

Mark Payne (Jira) Thu, 22 Oct 2020 07:05:03 -0700


     [ 
https://issues.apache.org/jira/browse/NIFI-7940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mark Payne updated NIFI-7940:
-----------------------------
    Description: 
In 1.12.0, we introduced the ScriptedTransformRecord. This has worked very well 
for many different transformations that are simple in code but very difficult 
with the DSL's that NiFi supports. In addition to transforming records, another 
use case that can be made dramatically easier with scripting is partitioning 
records.

The PartitionRecord processor is very powerful and easy to use, but RecordPath 
is somewhat limited in the functions that it provides. For example, recently in 
the Apache Slack channel, we had someone asking about how to route data based 
on whether or not the "timestamp" field matches a given regular expression. 
Attempts were made using QueryRecord with RPATH but that didn't work because 
the timestamp field is a top-level field, not a Record. Tried using 
UpdateRecord but that failed because the matchesRegex function of RecordPath is 
a predicate so can't be used to partition on. Eventually a pattern was found 
with QueryRecord using the `SIMILAR TO` but that function does not support for 
regular expressions.

A Scripted processor would likely make this far more trivial to handle. This 
processor should be focused around making it dead simple to partition (and 
subsequently route based on added attributes) records with a scripting 
language. So, it will be important, like ScriptedTransformRecord, to make the 
processor geared more toward ease of use than being given the full power of 
FlowFiles, sessions, etc.

The script should have the same bindings as ScriptedTransformRecord:
 * attributes
 * log
 * record
 * recordIndex

Unlike ScriptedTransformRecord, though, the ScriptedPartitionRecord should 
return one of three things:
 * A string (or a primitive value such as an int, that can be turned into a 
String) representing the partition for the Record
 * A collection of strings/primitives representing multiple partitions that the 
Record should go to (indicating that the Record should be added to multiple 
Record Writers)
 * A null value or an empty collection indicating that the Record should be 
dropped.

The processor should then write the Record to a FlowFile for each of the 
Partitions returned. For each outbound FlowFile, an attribute should be added 
indicating the partition for that FlowFile for easy follow-on routing via 
RouteOnAttribute, etc.

The processor should keep a counter for how many Records were dropped.

The processor should keep counters for how many Records were routed to each 
Partition.

The processor should include additionalDetails.html to provide sufficient 
documentation. Similar to ScriptedTransformRecord, the documentation should 
include several examples, spanning at least Groovy and Python (since those are 
by far the most often used languages we see used in script processors). Each 
example provided in the additionalDetails.html should also have an accompanying 
unit test to verify the behavior.

Given the similarities to the ScriptedTransformRecord processor, there's a high 
likelihood that the ScriptedTransformRecord processor could be refactored into 
an AbstractScriptedRecord processor with both ScriptedTransformRecord and 
ScriptedPartitionRecord extending from it.

  was:
In 1.12.0, we introduced the ScriptedTransformRecord. This has worked very well 
for many different transformations that are simple in code but very difficult 
with the DSL's that NiFi supports. In addition to transforming records, another 
use case that can be made dramatically easier with scripting is partitioning 
records.

The PartitionRecord processor is very powerful and easy to use, but RecordPath 
is somewhat limited in the functions that it provides. For example, recently in 
the Apache Slack channel, we had someone asking about how to route data based 
on whether or not the "timestamp" field matches a given regular expression. 
Attempts were made using QueryRecord with RPATH but that didn't work because 
the timestamp field is a top-level field, not a Record. Tried using 
UpdateRecord but that failed because the matchesRegex function of RecordPath is 
a predicate so can't be used to partition on. Eventually a pattern was found 
with QueryRecord using the `SIMILAR TO` but that function does not support for 
regular expressions.

A Scripted processor would likely make this far more trivial to handle. This 
processor should be focused around making it dead simple to partition (and 
subsequently route based on added attributes) records with a scripting 
language. So, it will be important, like ScriptedTransformRecord, to make the 
processor geared more toward ease of use than being given the full power of 
FlowFiles, sessions, etc.

The script should have the same bindings as ScriptedTransformRecord:
 * attributes
 * log
 * record
 * recordIndex

Unlike ScriptedTransformRecord, though, the ScriptedPartitionRecord should 
return one of three things:
 * A string (or a primitive value such as an int, that can be turned into a 
String) representing the partition for the Record
 * A collection of strings/primitives representing multiple partitions that the 
Record should go to (indicating that the Record should be added to multiple 
Record Writers)
 * A null value or an empty collection indicating that the Record should be 
dropped.

The processor should then write the Record to a FlowFile for each of the 
Partitions returned. For each outbound FlowFile, an attribute should be added 
indicating the partition for that FlowFile for easy follow-on routing via 
RouteOnAttribute, etc.

The processor should keep a counter for how many Records were dropped.

The processor should keep counters for how many Records were routed to each 
Partition.

The processor should include additionalDetails.html to provide sufficient 
documentation. Similar to ScriptedTransformRecord, the documentation should 
include several examples, spanning at least Groovy and Python (since those are 
by far the most often used languages we see used in script processors). Each 
example provided in the additionalDetails.html should also have an accompanying 
unit test to verify the behavior.


> Create a ScriptedPartitionRecord
> --------------------------------
>
>                 Key: NIFI-7940
>                 URL: https://issues.apache.org/jira/browse/NIFI-7940
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Mark Payne
>            Priority: Major
>
> In 1.12.0, we introduced the ScriptedTransformRecord. This has worked very 
> well for many different transformations that are simple in code but very 
> difficult with the DSL's that NiFi supports. In addition to transforming 
> records, another use case that can be made dramatically easier with scripting 
> is partitioning records.
> The PartitionRecord processor is very powerful and easy to use, but 
> RecordPath is somewhat limited in the functions that it provides. For 
> example, recently in the Apache Slack channel, we had someone asking about 
> how to route data based on whether or not the "timestamp" field matches a 
> given regular expression. Attempts were made using QueryRecord with RPATH but 
> that didn't work because the timestamp field is a top-level field, not a 
> Record. Tried using UpdateRecord but that failed because the matchesRegex 
> function of RecordPath is a predicate so can't be used to partition on. 
> Eventually a pattern was found with QueryRecord using the `SIMILAR TO` but 
> that function does not support for regular expressions.
> A Scripted processor would likely make this far more trivial to handle. This 
> processor should be focused around making it dead simple to partition (and 
> subsequently route based on added attributes) records with a scripting 
> language. So, it will be important, like ScriptedTransformRecord, to make the 
> processor geared more toward ease of use than being given the full power of 
> FlowFiles, sessions, etc.
> The script should have the same bindings as ScriptedTransformRecord:
>  * attributes
>  * log
>  * record
>  * recordIndex
> Unlike ScriptedTransformRecord, though, the ScriptedPartitionRecord should 
> return one of three things:
>  * A string (or a primitive value such as an int, that can be turned into a 
> String) representing the partition for the Record
>  * A collection of strings/primitives representing multiple partitions that 
> the Record should go to (indicating that the Record should be added to 
> multiple Record Writers)
>  * A null value or an empty collection indicating that the Record should be 
> dropped.
> The processor should then write the Record to a FlowFile for each of the 
> Partitions returned. For each outbound FlowFile, an attribute should be added 
> indicating the partition for that FlowFile for easy follow-on routing via 
> RouteOnAttribute, etc.
> The processor should keep a counter for how many Records were dropped.
> The processor should keep counters for how many Records were routed to each 
> Partition.
> The processor should include additionalDetails.html to provide sufficient 
> documentation. Similar to ScriptedTransformRecord, the documentation should 
> include several examples, spanning at least Groovy and Python (since those 
> are by far the most often used languages we see used in script processors). 
> Each example provided in the additionalDetails.html should also have an 
> accompanying unit test to verify the behavior.
> Given the similarities to the ScriptedTransformRecord processor, there's a 
> high likelihood that the ScriptedTransformRecord processor could be 
> refactored into an AbstractScriptedRecord processor with both 
> ScriptedTransformRecord and ScriptedPartitionRecord extending from it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (NIFI-7940) Create a ScriptedPartitionRecord

Reply via email to