Re: Generalizing QueryRecord to changing inferred CSV headers

Matt Burgess Tue, 18 Apr 2023 15:25:01 -0700

I should mention for aggregate values like COUNT(), check out the
CalculateRecordStats processor, not sure if it takes a `/` value (or
whatever means 'select all fields') for a RecordPath or not, if not we
should probably support if prudent. It might also be a nice
improvement to add MAX/MIN support as well, feel free to file a Jira
for that improvement if it will help.


- Matt

On Tue, Apr 18, 2023 at 6:16 PM Matt Burgess <[email protected]> wrote:
>
> Jim,
>
> QueryRecord uses Apache Calcite under the hood and is thus at the
> mercy of the SQL standard (and any additional rules/dialect from
> Apache Calcite) so in general you can't select "all except X" or "all
> except change X to Y". Does it need to be SQL executed against the
> individual fields? If not, take a look at ScriptedTransformRecord doc
> (and its Additional Details page). IIRC you're a Groovy guy now ;) so
> you should be able to alter the fields as you see fit using Groovy
> rather than SQL (alternatively Jython as you've done a bunch of that
> as well).
>
> Regards,
> Matt
>
> On Tue, Apr 18, 2023 at 6:04 PM James McMahon <[email protected]> wrote:
> >
> > Hello. I recently asked the community a question about processing CSV 
> > files. I received some helpful advice about using processors such as 
> > ConvertRecord and QueryRecord, and was encouraged to employ Readers and 
> > RecordSetWriters. I've done that, and thank all who replied.
> >
> > My incoming CSV files come in with different headers because they are 
> > widely different data sets. The header structure is not known in advance. 
> > As such, I configure a QueryRecord processor with a CSVReader that employs 
> > a Schema Access Strategy that is Use String Fields From Header. I configure 
> > a CSVRecordSetWriter that sets Infer Record Schema as its Schema Access 
> > Strategy.
> >
> > Now I want to use that QueryRecord processor to characterize the various 
> > fields using SQL. Record counts, min and max values - things of that 
> > nature. But in all the examples I find in YouTube and in the open source, 
> > the authors presume a knowledge of the fields in advance. For example 
> > Property year is set by Value select "year" from FLOWFILE.
> >
> > We simply don't have that luxury, that awareness in advance. After all, 
> > that's the very reason we inferred the schema in the reader and writer 
> > configuration. The fields are more often than not going to be very 
> > different. Hard wiring them into QueryRecord is not a flow solution that is 
> > flexible enough. We need to grab them from the inferred schema the Reader 
> > and Writer services identified.
> >
> > What syntax or notation can we use in the QueryRecord sql to say "for each 
> > field found in the header, execute this sql against that field"? I guess 
> > what I'm looking for is iteration through all the inferred schema fields, 
> > and dynamic assignment of the field name in the SQL.
> >
> > Has anyone faced this same challenge? How did you solve it?
> > Is there another way to approach this problem?
> >
> > Thank you in advance,
> > Jim

Re: Generalizing QueryRecord to changing inferred CSV headers

Reply via email to