Re: Handling CSVs dynamically with NiFi

Bryan Bende Thu, 06 Apr 2023 06:42:20 -0700

Just to clarify, I would not recommend using the InferAvroSchema
processor. It comes from an older approach and should be considered
deprecated, if it is not already. Schema inference is now built into
all record readers as a Schema Access Strategy.
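
To make the idea concrete, here is a rough Python sketch of what
"inferring" a schema from a CSV amounts to: read the header, look at
the values in each column, pick the narrowest type that fits, and then
use that schema to emit JSON. This is only an illustration, not NiFi's
actual implementation, and the function names (infer_type,
infer_schema, csv_to_records) are made up for the example. In NiFi
itself the equivalent would, as far as I know, be a ConvertRecord
processor with a CSVReader whose Schema Access Strategy is set to infer
the schema, plus a JSON record writer.

```python
import csv
import io
import json


def infer_type(values):
    """Pick the narrowest Avro primitive that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for avro_type, caster in (("long", int), ("double", float)):
        try:
            for v in non_empty:
                caster(v)
            return avro_type
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text, record_name="inferred_record"):
    """Build an Avro-style record schema from the header and column values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    fields = []
    for i, column in enumerate(header):
        values = [row[i] for row in rows if i < len(row)]
        # Assumption for this sketch: every field is nullable.
        fields.append({"name": column, "type": ["null", infer_type(values)]})
    return {"type": "record", "name": record_name, "fields": fields}


def csv_to_records(csv_text, schema):
    """Convert CSV rows to dicts, casting each value per the inferred schema."""
    casters = {"long": int, "double": float, "string": str}
    types = {f["name"]: f["type"][1] for f in schema["fields"]}
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            name: None if value in ("", None) else casters[types[name]](value)
            for name, value in row.items()
        })
    return records


if __name__ == "__main__":
    sample = "id,name,price\n1,apple,0.5\n2,pear,\n"
    schema = infer_schema(sample)
    print(json.dumps(schema, indent=2))
    print(json.dumps(csv_to_records(sample, schema)))
```

The point of the sketch is that nothing about the CSV has to be known
ahead of time; the schema is derived per file from the data that
arrives, which is the same idea the record readers apply for you.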


On Thu, Apr 6, 2023 at 9:39 AM James McMahon <jsmcmah...@gmail.com> wrote:
>
> Thank you both very much, Bryan and Mike. Mike, had you considered the 
> approach mentioned by Bryan - a Reader processor to infer schema  -  and 
> found it wasn't suitable for your use case, for some reason? For instance, 
> perhaps you were employing a version of Apache NiFi that did not afford 
> access to a CsvReader or InferAvroSchema processor?
> Jim
>
> On Thu, Apr 6, 2023 at 9:30 AM Mike Sofen <mso...@runbox.com> wrote:
>>
>> Hi James,
>>
>>
>>
>> I don’t have time to go into details, but I had nearly the same scenario and 
>> solved it by using Nifi as the file processing piece only, sending valid CSV 
>> files (valid as in CSV formatting) and leveraged Postgres to land the CSV 
>> data into pre-built staging tables and from there did content validations 
>> and packaging into jsonb for storage into a single target table.
>>
>>
>>
>> In my case, an external file source had to “register” a single file (to 
>> allow creating the matching staging table) prior to sending data.  I used 
>> Nifi for that pre-staging step to derive the schema for the staging table 
>> for a file and I used a complex stored procedure to handle a massive amount 
>> of logic around the contents of a file when processing the actual files 
>> prior to storing into the destination table.
>>
>>
>>
>> Nifi was VERY fast and efficient in this, as was Postgres.
>>
>>
>>
>> Mike Sofen
>>
>>
>>
>> From: James McMahon <jsmcmah...@gmail.com>
>> Sent: Thursday, April 06, 2023 4:35 AM
>> To: users <users@nifi.apache.org>
>> Subject: Handling CSVs dynamically with NiFi
>>
>>
>>
>> We have a task requiring that we transform incoming CSV files to JSON. The 
>> CSVs vary in schema.
>>
>>
>>
>> There are a number of interesting flow examples out there illustrating how 
>> one can set up a flow to handle the case where the CSV schema is well known 
>> and fixed, but none for the generalized case.
>>
>>
>>
>> The structure of the incoming CSV files will not be known in advance in our 
>> use case. Our nifi flow must be generalized because I cannot configure and 
>> rely on a service that defines a specific fixed Avro schema registry. An 
>> Avro schema registry seems to presume an awareness of the CSV structure in 
>> advance. We don't have that luxury in this use case, with CSVs arriving from 
>> many different providers and so characterized by schemas that are unknown.
>>
>>
>>
>> What is the best way to get around this challenge? Does anyone know of an 
>> example where NiFi builds the schema on the fly as CSVs arrive for 
>> processing, dynamically defining the Avro schema for the CSV?
>>
>>
>>
>> Thanks in advance for any thoughts.
