Re: Generalizing QueryRecord to changing inferred CSV headers

Matt Burgess Tue, 18 Apr 2023 15:17:45 -0700

Jim,

QueryRecord uses Apache Calcite under the hood and is thus at the
mercy of the SQL standard (and any additional rules/dialect from
Apache Calcite) so in general you can't select "all except X" or "all
except change X to Y". Does it need to be SQL executed against the
individual fields? If not, take a look at the ScriptedTransformRecord
documentation (and its Additional Details page). IIRC you're a Groovy guy now ;) so
you should be able to alter the fields as you see fit using Groovy
rather than SQL (alternatively Jython as you've done a bunch of that
as well).
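
To illustrate the "for each field found in the header" idea in script form,
here is a minimal standalone Python sketch. It is not the
ScriptedTransformRecord API and does not run inside NiFi as written; it only
shows the iteration pattern. The helper name profile_csv and the sample data
are hypothetical. All values are treated as strings (mirroring Use String
Fields From Header), so min and max are lexicographic, not numeric.

```python
import csv
import io


def profile_csv(text):
    """Return {field: {"count", "min", "max"}} for every field in the header.

    Hypothetical helper for illustration only. Field names are discovered
    from the header row at run time rather than hard-wired.
    """
    reader = csv.DictReader(io.StringIO(text))
    # One stats bucket per header field, whatever the header happens to be.
    stats = {
        name: {"count": 0, "min": None, "max": None}
        for name in (reader.fieldnames or [])
    }
    for row in reader:
        for name, value in row.items():
            # Skip overflow columns (key None) and empty or missing values.
            if name not in stats or value is None or value == "":
                continue
            s = stats[name]
            s["count"] += 1
            # String comparison: lexicographic min and max.
            s["min"] = value if s["min"] is None else min(s["min"], value)
            s["max"] = value if s["max"] is None else max(s["max"], value)
    return stats


if __name__ == "__main__":
    sample = "year,city\n2020,Austin\n1999,\n2005,Boston\n"
    for field, s in profile_csv(sample).items():
        print(field, s)
```

The same loop would need to be adapted to the record and schema objects that
the scripted processor actually exposes; see its Additional Details page for
those.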


Regards,
Matt

On Tue, Apr 18, 2023 at 6:04 PM James McMahon <[email protected]> wrote:
>
> Hello. I recently asked the community a question about processing CSV files. 
> I received some helpful advice about using processors such as ConvertRecord 
> and QueryRecord, and was encouraged to employ Readers and RecordSetWriters. 
> I've done that, and thank all who replied.
>
> My incoming CSV files come in with different headers because they are widely 
> different data sets. The header structure is not known in advance. As such, I 
> configure a QueryRecord processor with a CSVReader that employs a Schema 
> Access Strategy that is Use String Fields From Header. I configure a 
> CSVRecordSetWriter that sets Infer Record Schema as its Schema Access 
> Strategy.
>
> Now I want to use that QueryRecord processor to characterize the various 
> fields using SQL. Record counts, min and max values - things of that nature. 
> But in all the examples I find on YouTube and in the open source, the authors 
> presume a knowledge of the fields in advance. For example, a property named 
> year is set to the value: select "year" from FLOWFILE.
>
> We simply don't have that luxury, that awareness in advance. After all, 
> that's the very reason we inferred the schema in the reader and writer 
> configuration. The fields are more often than not going to be very different. 
> Hard wiring them into QueryRecord is not a flow solution that is flexible 
> enough. We need to grab them from the inferred schema the Reader and Writer 
> services identified.
>
> What syntax or notation can we use in the QueryRecord sql to say "for each 
> field found in the header, execute this sql against that field"? I guess what 
> I'm looking for is iteration through all the inferred schema fields, and 
> dynamic assignment of the field name in the SQL.
>
> Has anyone faced this same challenge? How did you solve it?
> Is there another way to approach this problem?
>
> Thank you in advance,
> Jim
