Thank you very much, Isha. This is helpful. Assuming I wanted to route to N different output paths, does it follow that I need N different QueryRecord processors, each tailored to filter for just one subset? I'll have to experiment with it to develop more of a feel for how it can be used. Thanks again for taking a moment to reply with the suggestion. Jim
On Wed, Apr 12, 2023 at 6:54 AM Isha Lamboo <isha.lam...@virtualsciences.nl> wrote:

> Hi James,
>
> One option you can use is the QueryRecord processor. It allows you to filter records with a SQL-like query for any combination of fields that your downstream tools require. You can add one for each different output required and send a copy of the main json file to each.
>
> This approach should work better if you have a limited number of different output files with many records each. If your goal is hundreds of different json files with a handful of records each, then splitting per row might be quicker than copying the entire json file that many times.
>
> Regards,
>
> Isha
>
> *From:* James McMahon <jsmcmah...@gmail.com>
> *Sent:* Friday, April 7, 2023 5:14 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: Handling CSVs dynamically with NiFi
>
> Hello Bryan. Thank you for your question.
>
> A downstream consumer requires the complete set in json. So that's part of why I convert.
>
> Other downstream tools require json input, but not the entire set. The data needs to be routed based on certain features: geographic location, sales data by zip codes, etc. Splitting the records out seemed to be a reasonable option to route individual records.
>
> I appreciate you taking the time to ask. You are far more familiar with NiFi best practices than me. If there is a better way than what I intended, please fire away. I'd love to march down a better path if there is one.
>
> Cheers,
>
> Jim
>
> On Fri, Apr 7, 2023 at 10:57 AM Bryan Bende <bbe...@gmail.com> wrote:
>
> James,
>
> I'm not sure what the end goal is, but why do you need to use EvaluateJsonPath and SplitJson?
>
> Generally you don't want to split a flow file of multiple records into 1 record per flow file; this is an anti-pattern that leads to poor performance in the flow.
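For what it's worth, a single QueryRecord processor can also carry multiple dynamic properties, each of which becomes its own output relationship, so N routes need not mean N processors. A minimal Python sketch of that routing behavior, with hypothetical field names and predicates standing in for QueryRecord's SQL (e.g. a dynamic property named `kansas` with the value `SELECT * FROM FLOWFILE WHERE "State" = 'KS'`):

```python
# Hypothetical predicates standing in for QueryRecord's SQL-per-property.
routes = {
    "kansas":  lambda r: r.get("State") == "KS",
    "florida": lambda r: r.get("State") == "FL",
}

def route(records):
    """Send each record to every route whose predicate matches it."""
    out = {name: [] for name in routes}
    for rec in records:
        for name, pred in routes.items():
            if pred(rec):
                out[name].append(rec)
    return out

records = [{"State": "KS"}, {"State": "FL"}, {"State": "WV"}]
by_route = route(records)
```

As in QueryRecord, a record that matches no query (here, the WV row) simply goes to no named route.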
> Thanks,
>
> Bryan
>
> On Fri, Apr 7, 2023 at 9:41 AM James McMahon <jsmcmah...@gmail.com> wrote:
> >
> > Very interesting, very helpful insights. Thank you again, Mike.
> >
> > Late last night I decided to punt on a pure NiFi solution. I knew I could do this easily with Groovy scripting, and I knew that was well within my wheelhouse. So that's what I did: Groovy from an ExecuteScript processor. I'm 90% of the way there. Just a few more refinements to get just what I want, which I'll tackle later tonight.
> >
> > Groovy is pretty cool. Flexible, easily tailored to just what you need. I like having that flexibility. And I like having options, too: your results have motivated me to look at using QueryRecord, etc.
> >
> > Jim
> >
> > On Fri, Apr 7, 2023 at 9:32 AM Mike Sofen <mso...@runbox.com> wrote:
> >>
> >> This is where I felt NiFi wasn't the right tool for the job and Postgres was. After I imported the CSV directly into a staging table in the database (using NiFi), I converted the payload part of the columns into jsonb and stored that into the final table in a column, with additional columns as relational data (timestamps, identifiers, etc.). It was an object-relational data model.
> >>
> >> THEN, using the amazingly powerful Postgres jsonb functions, I was able to extract the unique keys in an entire dataset or across multiple datasets (to build a data catalog, for example) and perform a wide range of validations on individual keys. I use the word amazing because they are not just powerful functions but they run surprisingly fast given the amount of string data they are traversing.
> >>
> >> Mike Sofen
> >>
> >> From: James McMahon <jsmcmah...@gmail.com>
> >> Sent: Thursday, April 06, 2023 2:03 PM
> >> To: users@nifi.apache.org
> >> Subject: Re: Handling CSVs dynamically with NiFi
> >>
> >> Can I ask you one follow-up? I've gotten my ConvertRecord to work.
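The jsonb key extraction Mike describes maps to Postgres functions such as `jsonb_object_keys`. As a rough illustration only (the dataset payloads here are made up), the same "data catalog" idea in Python:

```python
import json

# Made-up json payloads standing in for rows already landed in Postgres.
datasets = [
    '[{"Bank Name": "A", "State": "KS"}, {"City": "Almena"}]',
    '[{"Cert": "15426", "Fund": "10538"}]',
]

def catalog_keys(payloads):
    # Python stand-in for: SELECT DISTINCT jsonb_object_keys(doc) FROM staging;
    keys = set()
    for payload in payloads:
        for rec in json.loads(payload):
            keys.update(rec)
    return sorted(keys)
```

Running `catalog_keys(datasets)` yields the union of keys across all datasets, which is exactly the raw material for a data catalog.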
> >> I created a CsvReader service with Schema Access Strategy of Use String Fields From Header. I created a JsonRecordSetWriter service with Schema Write Strategy of Do Not Write Schema.
> >>
> >> When ConvertRecord is finished, my result looks like this sample:
> >>
> >> [ {
> >>   "Bank Name�" : "Almena State Bank",
> >>   "City�" : "Almena",
> >>   "State�" : "KS",
> >>   "Cert�" : "15426",
> >>   "Acquiring Institution�" : "Equity Bank",
> >>   "Closing Date�" : "23-Oct-20",
> >>   "Fund" : "10538"
> >> }, {
> >>   "Bank Name�" : "First City Bank of Florida",
> >>   "City�" : "Fort Walton Beach",
> >>   "State�" : "FL",
> >>   "Cert�" : "16748",
> >>   "Acquiring Institution�" : "United Fidelity Bank, fsb",
> >>   "Closing Date�" : "16-Oct-20",
> >>   "Fund" : "10537"
> >> }, {
> >>   "Bank Name�" : "The First State Bank",
> >>   "City�" : "Barboursville",
> >>   "State�" : "WV",
> >>   "Cert�" : "14361",
> >>   "Acquiring Institution�" : "MVB Bank, Inc.",
> >>   "Closing Date�" : "3-Apr-20",
> >>   "Fund" : "10536"
> >> }]
> >>
> >> I don't really have a schema. How can I use a combination of SplitJson and EvaluateJsonPath to split each json object out to its own NiFi flowfile, and to pull the json key values out to define the fields in the csv header? I've found a few examples through research that allude to this, but they all seem to have a fixed schema and they don't offer configurations for the SplitJson. In a case where my json keys definition changes depending on the flowfile, what should JsonPathExpression be set to in the SplitJson configuration?
> >>
> >> On Thu, Apr 6, 2023 at 9:59 AM Mike Sofen <mso...@runbox.com> wrote:
> >>
> >> Jim – that's exactly what I did on that "pre" step – generate a schema from the CSVReader and use that to dynamically create the DDL sql needed to build the staging table in Postgres. In my solution, there are 2 separate pipelines running – this pre step and the normal file processing.
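On the SplitJson question: a JsonPathExpression of `$.*` is commonly used to split a top-level array into one flowfile per element, and because each record carries its own keys, no fixed schema is needed to recover the field names. A small sketch of that split (sample trimmed from the output above):

```python
import json

payload = ('[{"Bank Name": "Almena State Bank", "State": "KS"}, '
           '{"Bank Name": "First City Bank of Florida", "State": "FL"}]')

def split_records(array_json):
    # What SplitJson does with a JsonPathExpression of $.* :
    # each top-level array element becomes its own flowfile content.
    return [json.dumps(rec) for rec in json.loads(array_json)]

def record_keys(record_json):
    # The keys come from the record itself, so no schema registry is needed.
    return list(json.loads(record_json).keys())

parts = split_records(payload)
```

Each element of `parts` is an independent json document whose keys can then be read per flowfile.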
> >> I used the pre step to ensure that all incoming files were from a known and valid source and that they conformed to the schema for that source – a very tidy way to ensure data quality.
> >>
> >> Mike
> >>
> >> From: James McMahon <jsmcmah...@gmail.com>
> >> Sent: Thursday, April 06, 2023 6:39 AM
> >> To: users@nifi.apache.org
> >> Subject: Re: Handling CSVs dynamically with NiFi
> >>
> >> Thank you both very much, Bryan and Mike. Mike, had you considered the approach mentioned by Bryan - a Reader processor to infer schema - and found it wasn't suitable for your use case, for some reason? For instance, perhaps you were employing a version of Apache NiFi that did not afford access to a CsvReader or InferAvroSchema processor?
> >>
> >> Jim
> >>
> >> On Thu, Apr 6, 2023 at 9:30 AM Mike Sofen <mso...@runbox.com> wrote:
> >>
> >> Hi James,
> >>
> >> I don't have time to go into details, but I had nearly the same scenario and solved it by using NiFi as the file processing piece only, sending valid CSV files (valid as in CSV formatting) and leveraging Postgres to land the CSV data into pre-built staging tables; from there I did content validations and packaging into jsonb for storage into a single target table.
> >>
> >> In my case, an external file source had to "register" a single file (to allow creating the matching staging table) prior to sending data. I used NiFi for that pre-staging step to derive the schema for the staging table for a file, and I used a complex stored procedure to handle a massive amount of logic around the contents of a file when processing the actual files prior to storing into the destination table.
> >>
> >> NiFi was VERY fast and efficient in this, as was Postgres.
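Mike's pre-step of deriving staging-table DDL from a file's header can be sketched roughly like this; the table name and the all-`text` column typing are assumptions for illustration, not his actual implementation:

```python
import csv
import io

def staging_ddl(csv_text, table_name):
    """Derive a CREATE TABLE statement from a CSV header row, typing every column as text."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = ", ".join('"{}" text'.format(h.strip()) for h in header)
    return "CREATE TABLE {} ({});".format(table_name, columns)

ddl = staging_ddl("Bank Name,City,State\nAlmena State Bank,Almena,KS\n", "staging_banks")
```

Landing everything as `text` defers type validation to the stored-procedure stage, which matches the content-validation step Mike describes.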
> >> Mike Sofen
> >>
> >> From: James McMahon <jsmcmah...@gmail.com>
> >> Sent: Thursday, April 06, 2023 4:35 AM
> >> To: users <users@nifi.apache.org>
> >> Subject: Handling CSVs dynamically with NiFi
> >>
> >> We have a task requiring that we transform incoming CSV files to JSON. The CSVs vary in schema.
> >>
> >> There are a number of interesting flow examples out there illustrating how one can set up a flow to handle the case where the CSV schema is well known and fixed, but none for the generalized case.
> >>
> >> The structure of the incoming CSV files will not be known in advance in our use case. Our NiFi flow must be generalized, because I cannot configure and rely on a service that defines a specific fixed Avro schema registry. An Avro schema registry seems to presume an awareness of the CSV structure in advance. We don't have that luxury in this use case, with CSVs arriving from many different providers and so characterized by schemas that are unknown.
> >>
> >> What is the best way to get around this challenge? Does anyone know of an example where NiFi builds the schema on the fly as CSVs arrive for processing, dynamically defining the Avro schema for the CSV?
> >>
> >> Thanks in advance for any thoughts.
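For the original question: NiFi's record readers can derive a schema per flowfile (e.g. the CsvReader's "Use String Fields From Header" strategy), so no pre-registered Avro schema is required. That inference, reduced to a sketch that treats every field as a string (the record name here is illustrative):

```python
import csv
import io

def infer_avro_schema(csv_text, record_name="csv_record"):
    """Build a minimal Avro record schema from the header row of a CSV payload."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": record_name,
        "fields": [{"name": h.strip(), "type": "string"} for h in header],
    }

schema = infer_avro_schema("Bank Name,City,State\nAlmena State Bank,Almena,KS\n")
```

Because the schema is rebuilt from each file's own header, CSVs from different providers with different columns can flow through the same generalized path.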