Ben,

I'm not sure you can reliably convert data from one format to another and
retain schema information unless both formats support explicit schema
retention (as Avro does, for instance).  JSON doesn't really offer that.
So when you say you want to convert even the unknown fields, but there is
no explicit type/schema information to follow, I'm not sure what a
non-destructive (lossless) conversion would look like.

You might still want to give the existing readers/writers a go, experiment,
and find out how far you can take them.  You could also script or write your
own reader that extracts a schema sufficient for your purposes as it reads
and places it on the flowfile.
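
To make the "extract a sufficient schema yourself" idea concrete, here is a
very rough sketch (Jackson-based, untested, and only the schema-sniffing
part rather than a full RecordReader) that infers a flat field-to-type map
from the first record of newline delimited JSON.  A custom processor or
scripted reader could then place the result on the flowfile as an attribute:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class NaiveSchemaSniffer {

        // Infer field name -> JSON node type from the first record only.
        // Nested objects, missing fields, and type drift across records
        // are deliberately ignored in this sketch.
        public static Map<String, String> sniff(InputStream flowFileContent)
                throws IOException {
            Map<String, String> fieldTypes = new LinkedHashMap<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(flowFileContent, StandardCharsets.UTF_8))) {
                String firstLine = reader.readLine();   // one JSON object per line
                if (firstLine == null) {
                    return fieldTypes;
                }
                JsonNode record = new ObjectMapper().readTree(firstLine);
                Iterator<Map.Entry<String, JsonNode>> fields = record.fields();
                while (fields.hasNext()) {
                    Map.Entry<String, JsonNode> field = fields.next();
                    fieldTypes.put(field.getKey(), field.getValue().getNodeType().name());
                }
            }
            return fieldTypes;
        }
    }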

Thanks
On Fri, Aug 10, 2018 at 4:47 PM Benjamin Janssen <bjanss...@gmail.com> wrote:
>
> I am not.  I continued googling for a bit after sending my email and stumbled
> upon a slide deck by Bryan Bende.  My initial concern from looking at it is
> that it seems to require schema knowledge.
>
> For most of our data sets, we operate in a space where we have a handful of 
> guaranteed fields and who knows what other fields the upstream provider is 
> going to send us.  We want to operate on the data in a manner that is 
> non-destructive to unanticipated fields.  Is that a blocker for using the 
> RecordReader stuff?
>
> On Fri, Aug 10, 2018 at 4:30 PM Joe Witt <joe.w...@gmail.com> wrote:
>>
>> Ben,
>>
>> Are you familiar with the record readers, writers, and associated processors?
>>
>> I suspect that if you make a record writer for your custom format at the end
>> of the flow chain, you'll get great performance and control.
>>
>> Thanks
>>
>> On Fri, Aug 10, 2018, 4:27 PM Benjamin Janssen <bjanss...@gmail.com> wrote:
>>>
>>> All, I'm seeking some advice on best practices for dealing with FlowFiles 
>>> that contain a large volume of JSON records.
>>>
>>> My flow works like this:
>>>
>>> Receive a FlowFile with millions of JSON records in it.
>>>
>>> Potentially filter out some of the records based on the values of the JSON 
>>> fields (a custom processor uses a regex and a JSON path to produce "matched" 
>>> and "not matched" output paths).
>>>
>>> Potentially split the FlowFile into multiple FlowFiles based on the value of 
>>> one of the JSON fields (a custom processor uses a JSON path and groups records 
>>> into output FlowFiles by that value).
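>>>
>>> The grouping step, in the same spirit (untested sketch, placeholder JSON
>>> path), is roughly this; each map entry would become one output FlowFile:
>>>
>>>     import com.jayway.jsonpath.JsonPath;
>>>     import java.util.ArrayList;
>>>     import java.util.LinkedHashMap;
>>>     import java.util.List;
>>>     import java.util.Map;
>>>
>>>     public class PartitionSketch {
>>>         // Group newline delimited JSON records by the value at jsonPath.
>>>         // Assumes every record carries the field; a real processor would
>>>         // handle missing values as well.
>>>         public static Map<String, List<String>> partition(List<String> ndjsonRecords,
>>>                                                           String jsonPath) {
>>>             Map<String, List<String>> groups = new LinkedHashMap<>();
>>>             for (String record : ndjsonRecords) {
>>>                 Object value = JsonPath.read(record, jsonPath);
>>>                 String key = String.valueOf(value);
>>>                 groups.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
>>>             }
>>>             return groups;
>>>         }
>>>     }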
>>>
>>> Potentially split the FlowFile into uniformly sized smaller chunks to prevent 
>>> choking downstream systems on the file size (we use SplitText when the data is 
>>> newline delimited, but don't currently have a way to do this when the data is 
>>> a JSON array of records).
>>>
>>> Strip out some of the JSON fields (using JoltTransformJSON).
>>>
>>> At the end, wrap each JSON record in a proprietary format (a custom processor 
>>> does the wrapping).
>>>
>>> This flow is roughly similar across several different unrelated data sets.
>>>
>>> The input data files are occasionally provided in a single JSON array and 
>>> occasionally as newline delimited JSON records.  In general, we've found 
>>> newline delimited JSON records far easier to work with because we can 
>>> process them one at a time without loading the entire FlowFile into memory 
>>> (which we have to do for the array variant).
>>>
>>> However, if we are to use JoltTransformJSON to strip out or modify some of 
>>> the JSON contents, it appears to only operate on an array (which is 
>>> problematic from the memory footprint standpoint).
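>>>
>>> Something like the following (an untested sketch using Jackson rather than
>>> Jolt; the field name is a placeholder) is the kind of per-record stripping
>>> a custom processor could do on newline delimited data instead:
>>>
>>>     import com.fasterxml.jackson.databind.ObjectMapper;
>>>     import com.fasterxml.jackson.databind.node.ObjectNode;
>>>
>>>     public class FieldStripSketch {
>>>         private static final ObjectMapper MAPPER = new ObjectMapper();
>>>
>>>         // Remove unwanted top-level fields from one JSON record (one line
>>>         // of the FlowFile), leaving every other field untouched.
>>>         public static String strip(String recordJson) throws Exception {
>>>             ObjectNode record = (ObjectNode) MAPPER.readTree(recordJson);
>>>             record.remove("unwantedField");   // placeholder field name
>>>             return MAPPER.writeValueAsString(record);
>>>         }
>>>     }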
>>>
>>> We don't really want to break our FlowFiles up into individual JSON records, 
>>> as the number of FlowFiles the system would have to handle would be orders 
>>> of magnitude larger than it is now.
>>>
>>> Is our approach of moving towards newline delimited JSON a good one?  If 
>>> so, is there anything that would be recommended for replacing 
>>> JoltTransformJSON?  Or should we build a custom processor?  Or is this a 
>>> reasonable feature request for the JoltTransformJSON processor to support 
>>> newline delimited JSON?
>>>
>>> Or should we be looking into ways to do lazy loading of the JSON arrays in 
>>> our custom processors (I have no clue how easy or hard this would be to 
>>> do)?  My little bit of googling suggests this would be difficult.
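>>>
>>> For what it's worth, the kind of thing I mean by lazy loading would
>>> presumably look something like Jackson's streaming API (untested sketch,
>>> just to illustrate the idea), walking the top-level array one element at
>>> a time instead of loading the whole thing:
>>>
>>>     import com.fasterxml.jackson.core.JsonFactory;
>>>     import com.fasterxml.jackson.core.JsonParser;
>>>     import com.fasterxml.jackson.core.JsonToken;
>>>     import com.fasterxml.jackson.databind.JsonNode;
>>>     import com.fasterxml.jackson.databind.ObjectMapper;
>>>
>>>     import java.io.InputStream;
>>>     import java.util.function.Consumer;
>>>
>>>     public class ArrayStreamSketch {
>>>         // Iterate the elements of a top-level JSON array one record at a
>>>         // time, without materializing the whole array in memory.
>>>         public static void forEachRecord(InputStream in, Consumer<JsonNode> handler)
>>>                 throws Exception {
>>>             ObjectMapper mapper = new ObjectMapper();
>>>             try (JsonParser parser = new JsonFactory().createParser(in)) {
>>>                 if (parser.nextToken() != JsonToken.START_ARRAY) {
>>>                     throw new IllegalStateException("Expected a top-level JSON array");
>>>                 }
>>>                 while (parser.nextToken() == JsonToken.START_OBJECT) {
>>>                     JsonNode record = mapper.readTree(parser); // reads one element
>>>                     handler.accept(record);
>>>                 }
>>>             }
>>>         }
>>>     }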
