arenger opened a new pull request #3455: NIFI-5900 Add SelectJson processor
URL: https://github.com/apache/nifi/pull/3455
 
 
   
   ### Overview
   
   The goal of this PR is to further fortify NiFi when working with large JSON 
files.  As noted in the [NiFi 
overview](https://nifi.apache.org/docs/nifi-docs/html/overview.html), systems 
will invariably receive "data that is too big, too small, too fast, too slow, 
corrupt, wrong, or in the wrong format."  In the case of "too big", NiFi (or 
any JVM) can continue just fine and handle large files with ease if it does so 
in a streaming fashion, but the current JSON processors use a DOM approach that 
is limited by available heap space.  This PR recommends the addition of a 
`SelectJson` processor that can be employed when large JSON files are expected 
or possible.
   
   The current `EvaluateJsonPath` and `SplitJson` processors both leverage the 
[Jayway JsonPath](https://github.com/json-path/JsonPath) library.  The Jayway 
implementation has excellent support for JSON Path expressions, but requires 
that the entire JSON file be loaded into memory.  It builds a document object 
model (DOM) before evaluating the targeted JSON Path.  This is already noted as 
a "System Resource Consideration" in the 
[documentation](https://github.com/apache/nifi/blob/rel/nifi-1.9.1/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/SplitJson.java#L85)
 for the `SplitJson` processor, and the same is true for `EvaluateJsonPath`.
   
   The proposed `SelectJson` processor uses an alternate library called 
[JsonSurfer](https://github.com/jsurfer/JsonSurfer) to evaluate a JSON Path 
without loading the whole document into memory all at once, similar to SAX 
implementations for XML processing. This allows for near-constant memory usage, 
independent of file size, as shown in the following test results:
   
   
![SelectJsonMemory](https://user-images.githubusercontent.com/1693576/56772330-a059db00-6787-11e9-9cd2-08d201bfb7ab.png)
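
To make the DOM-versus-streaming contrast concrete, here is a minimal, stdlib-only Python sketch. It is not JsonSurfer and not the processor's implementation; `iter_array_elements` is a hypothetical helper that only handles the elements of a top-level array and is lenient about stray commas. It reads the input in small chunks and yields each element as soon as it is complete, so memory use is bounded by the largest single element plus one chunk rather than by the whole document.

```python
import io
import json


def iter_array_elements(stream, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    Only the not-yet-consumed part of the input is kept in memory.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = False  # True once the opening "[" has been consumed
    eof = False      # True once the stream has returned no more data
    while True:
        buf = buf.lstrip()
        if not started and buf:
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON array")
            started, buf = True, buf[1:]
            continue
        if started:
            # Skip element separators (lenient: repeated commas are tolerated).
            buf = buf.lstrip(", \t\r\n")
            if buf.startswith("]"):
                return
            if buf:
                try:
                    value, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    # Probably a truncated element; fail only if no more input.
                    if eof:
                        raise
                else:
                    # A value that ends exactly at the buffer edge may be a
                    # truncated number (e.g. "12" of "123"), so wait for more.
                    if end < len(buf) or eof:
                        yield value
                        buf = buf[end:]
                        continue
        if eof:
            raise ValueError("input ended before the array was closed")
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


# Usage: a tiny chunk size forces elements to span several reads.
for element in iter_array_elements(io.StringIO('[ {"a": 1}, 2, "x" ]'), chunk_size=4):
    print(element)
```

A DOM-style parser would instead call `json.load` on the whole stream and hold every element in memory at once, which is the behaviour the heap measurements above are contrasted against.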
   
   The trade-off is between heap space usage and JSON Path functionality.  The 
`SelectJson` processor supports almost all of JSON Path, with a few limitations 
mentioned in the `@CapabilityDescription`.  For full JSON Path support and/or 
multiple JSON Path expressions, the `EvaluateJsonPath` and/or `SplitJson` 
processors should be used.  When memory conservation is important, the 
`SelectJson` processor should be used.
   
   ### Licensing
   
   The [JsonSurfer](https://github.com/jsurfer/JsonSurfer) library is covered 
by the MIT License which is [compatible with Apache 
2.0](https://www.apache.org/legal/resolved.html#category-a).  
   
   ### Testing
   
   This PR is a follow-on from #3414, in which I proposed a similar solution 
that required extensive unit testing.  Tests from that PR were adapted and 
preserved for this PR, even though many of them now exercise the `JsonSurfer` 
library.  This is a much simpler PR since the path processing is handled in a 
third-party library.
   
   As for the memory statistics noted above, they were gathered using the same 
methodology described in #3414.  For posterity, here's a Python script to 
generate JSON files of arbitrary size:
   
   ```
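   import uuid

   # I outer arrays, each holding J inner arrays of K random UUID strings.
   # Increase I or J to grow the file; these values yield roughly 1 MB.
   (I, J, K) = (1, 8737, 3)
   with open('out.json', 'w') as f:
       f.write("[")
       for i in range(I):
           f.write("[")
           for j in range(J):
               f.write("[")
               for k in range(K):
                   f.write('"' + str(uuid.uuid4()) + '"')
                   # Comma between elements, but not after the last one
                   if k < K - 1:
                       f.write(",")
               f.write("],\n" if j < J - 1 else "]\n")
           f.write("],\n" if i < I - 1 else "]\n")
       f.write("]\n")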
   ```
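
When choosing `(I, J, K)` for a target file size, the length of the script's output can be worked out in closed form. The sketch below assumes every count is at least 1, that each UUID renders as 36 characters, and that `\n` is written as a single byte (no newline translation). `estimated_size` is a hypothetical helper for illustration, not part of the PR.

```python
def estimated_size(i_count, j_count, k_count):
    """Return the number of characters the generator script writes.

    Assumes i_count, j_count and k_count are all >= 1.
    """
    quoted_uuid = 36 + 2  # 36-character UUID plus two double quotes
    # "[" + K quoted UUIDs + (K - 1) commas + "],\n"
    inner = 1 + k_count * quoted_uuid + (k_count - 1) + 3
    # "[" + J inner arrays (the last one ends in "]\n", one char shorter) + "],\n"
    middle = 1 + j_count * inner - 1 + 3
    # "[" + I middle arrays (the last one is one char shorter) + "]\n"
    return 1 + i_count * middle - 1 + 2


print(estimated_size(1, 8737, 3))  # 1048445, just under 1 MiB (1048576)
```

Since each inner array contributes `39 * K + 3` characters, scaling `J` (or `I`) scales the file size almost linearly.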
   
   ### How to use SelectJson Processor
   
   Given an incoming FlowFile and a valid JSON Path setting, `SelectJson` will 
send one or more FlowFiles to the `selected` relationship, and the original 
FlowFile will be sent to the `original` relationship.  If the JSON Path does not 
match any object or array in the document, then the document will be passed to 
the `failure` relationship.
   
   #### JSON Path Examples
   
   Here is a sample JSON file, followed by JSON Path expressions and the 
content of the FlowFiles that would be output from the `SelectJson` 
processor.
   
   Sample JSON:
   ```
   [
     {
       "name": "Seattle",
       "weather": [
         {
           "main": "Snow",
           "description": "light snow"
         }
       ]
     },
     {
       "name": "Washington, DC",
       "weather": [
         {
           "main": "Mist",
           "description": "mist"
         },
         {
           "main": "Fog",
           "description": "fog"
         }
       ]
     }
   ]
   ```
   
   * JSON Path Expression: `$[1].weather.*`
       - FlowFile 0: `{"main":"Mist","description":"mist"}`
       - FlowFile 1: `{"main":"Fog","description":"fog"}`
   * JSON Path Expression: `$[1].name`
       - FlowFile 0: `"Washington, DC"`
   * JSON Path Expression: `$[*]['weather'][*]['main']`
       - FlowFile 0: `"Snow"`
       - FlowFile 1: `"Mist"`
       - FlowFile 2: `"Fog"`
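
The expected outputs listed above can be cross-checked with a few lines of plain Python. This is only an in-memory illustration of what each expression selects from the sample; it does not use JsonSurfer or the processor, and the names `SAMPLE`, `compact`, and `results` are hypothetical.

```python
import json

SAMPLE = json.loads("""
[
  {"name": "Seattle",
   "weather": [{"main": "Snow", "description": "light snow"}]},
  {"name": "Washington, DC",
   "weather": [{"main": "Mist", "description": "mist"},
               {"main": "Fog", "description": "fog"}]}
]
""")


def compact(value):
    """Serialize a value without whitespace, matching the listing above."""
    return json.dumps(value, separators=(",", ":"))


# Each JSON Path expression is paired with the equivalent manual traversal.
results = {
    "$[1].weather.*": [compact(w) for w in SAMPLE[1]["weather"]],
    "$[1].name": [compact(SAMPLE[1]["name"])],
    "$[*]['weather'][*]['main']": [
        compact(w["main"]) for city in SAMPLE for w in city["weather"]
    ],
}

for path, flowfiles in results.items():
    print(path)
    for index, content in enumerate(flowfiles):
        print(f"  FlowFile {index}: {content}")
```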
   
   
   ### Checklist
   
   - [x] Is there a JIRA ticket associated with this PR? Is it referenced in 
the commit message?
   - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number 
you are trying to resolve? Pay particular attention to the hyphen "-" character.
   - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
   - [x] Is your initial contribution a single, squashed commit?
   
   - [ ] Have you ensured that the full suite of tests is executed via mvn 
-Pcontrib-check clean install at the root nifi folder?
         (Note: `mvn clean install` completes without error after disabling 
`FileBasedClusterNodeFirewallTest` and `DBCPServiceTest`.
          Adding `-Pcontrib-check` fails, but it appears to fail on the `master` 
branch too)
   - [x] Have you written or updated unit tests to verify your changes?
   - [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)? 
   - [x] If applicable, have you updated the LICENSE file, including the main 
LICENSE file under nifi-assembly?
   - [x] If applicable, have you updated the NOTICE file, including the main 
NOTICE file found under nifi-assembly?
   - [x] If adding new Properties, have you added .displayName in addition to 
.name (programmatic access) for each of the new properties?
   
   - [ ] Have you ensured that format looks appropriate for the output in which 
it is rendered?
   
   ### See Also
   - SplitLargeJson: #3414
   - StreamingJsonReader: #3222
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
