Thank you very much, Isha. This is helpful. Assuming I wanted to route to N different output paths, does it follow that I need N different QueryRecord processors, each tailored to filter for just one subset? I'll have to experiment with it to develop more of a feel for how it can be used. Thanks again for taking a moment to reply with the suggestion. Jim
On Wed, Apr 12, 2023 at 6:54 AM Isha Lamboo <isha.lam...@virtualsciences.nl> wrote:

> Hi James,
>
> One option you can use is the QueryRecord processor. It allows you to filter records with a SQL-like query for any combination of fields that your downstream tools require. You can add one for each different output required and send a copy of the main json file to each.
>
> This approach should work better if you have a limited number of different output files with many records each. If your goal is hundreds of different json files with a handful of records each, then splitting per row might be quicker than copying the entire json file that many times.
>
> Regards,
>
> Isha
>
> *From:* James McMahon <jsmcmah...@gmail.com>
> *Sent:* Friday, April 7, 2023 5:14 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: Handling CSVs dynamically with NiFi
>
> Hello Bryan. Thank you for your question.
>
> A downstream consumer requires the complete set in json. So that's part of why I convert.
>
> Other downstream tools require json input, but not the entire set. The data needs to be routed based on certain features: geographic location, sales data by zip codes, etc. Splitting the records out seemed to be a reasonable option to route individual records.
>
> I appreciate you taking the time to ask. You are far more familiar with NiFi best practices than me. If there is a better way than what I intended, please fire away. I'd love to march down a better path if there is one.
>
> Cheers,
>
> Jim
>
> On Fri, Apr 7, 2023 at 10:57 AM Bryan Bende <bbe...@gmail.com> wrote:
>
> James,
>
> I'm not sure what the end goal is, but why do you need to use EvaluateJsonPath and SplitJson?
>
> Generally you don't want to split a flow file of multiple records into 1 record per flow file; this is an anti-pattern that leads to poor performance in the flow.
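For what it's worth, a single QueryRecord processor can also carry multiple dynamic properties, each of which becomes its own output relationship, so N routes need not mean N processors. A minimal Python sketch of that routing behavior, with hypothetical field names and predicates standing in for QueryRecord's SQL (e.g. a dynamic property named `kansas` with the value `SELECT * FROM FLOWFILE WHERE "State" = 'KS'`):

```python
# Hypothetical predicates standing in for QueryRecord's SQL-per-property.
routes = {
    "kansas":  lambda r: r.get("State") == "KS",
    "florida": lambda r: r.get("State") == "FL",
}

def route(records):
    """Send each record to every route whose predicate matches it."""
    out = {name: [] for name in routes}
    for rec in records:
        for name, pred in routes.items():
            if pred(rec):
                out[name].append(rec)
    return out

records = [{"State": "KS"}, {"State": "FL"}, {"State": "WV"}]
by_route = route(records)
```

As in QueryRecord, a record that matches no query (here, the WV row) simply goes to no named route.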
> Thanks,
>
> Bryan
>
> On Fri, Apr 7, 2023 at 9:41 AM James McMahon <jsmcmah...@gmail.com> wrote:
> >
> > Very interesting, very helpful insights. Thank you again, Mike.
> >
> > Late last night I decided to punt on a pure NiFi solution. I knew I could do this easily with Groovy scripting, and I knew that was well within my wheelhouse. So that's what I did: Groovy from an ExecuteScript processor. I'm 90% of the way there. Just a few more refinements to get just what I want, which I'll tackle later tonight.
> >
> > Groovy is pretty cool. Flexible, easily tailored to just what you need. I like having that flexibility. And I like having options, too: your results have motivated me to look at using QueryRecord, etc.
> >
> > Jim
> >
> > On Fri, Apr 7, 2023 at 9:32 AM Mike Sofen <mso...@runbox.com> wrote:
> >>
> >> This is where I felt NiFi wasn't the right tool for the job and Postgres was. After I imported the CSV directly into a staging table in the database (using NiFi), I converted the payload part of the columns into jsonb and stored that into the final table in a column, with additional columns as relational data (timestamps, identifiers, etc.). It was an object-relational data model.
> >>
> >> THEN, using the amazingly powerful Postgres jsonb functions, I was able to extract the unique keys in an entire dataset or across multiple datasets (to build a data catalog, for example) and perform a wide range of validations on individual keys. I use the word amazing because they are not just powerful functions but they run surprisingly fast given the amount of string data they are traversing.
> >>
> >> Mike Sofen
> >>
> >> From: James McMahon <jsmcmah...@gmail.com>
> >> Sent: Thursday, April 06, 2023 2:03 PM
> >> To: users@nifi.apache.org
> >> Subject: Re: Handling CSVs dynamically with NiFi
> >>
> >> Can I ask you one follow-up? I've gotten my ConvertRecord to work.
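The jsonb key extraction Mike describes maps to Postgres functions such as `jsonb_object_keys`. As a rough illustration only (the dataset payloads here are made up), the same "data catalog" idea in Python:

```python
import json

# Made-up json payloads standing in for rows already landed in Postgres.
datasets = [
    '[{"Bank Name": "A", "State": "KS"}, {"City": "Almena"}]',
    '[{"Cert": "15426", "Fund": "10538"}]',
]

def catalog_keys(payloads):
    # Python stand-in for: SELECT DISTINCT jsonb_object_keys(doc) FROM staging;
    keys = set()
    for payload in payloads:
        for rec in json.loads(payload):
            keys.update(rec)
    return sorted(keys)
```

Running `catalog_keys(datasets)` yields the union of keys across all datasets, which is exactly the raw material for a data catalog.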
> >> I created a CsvReader service with Schema Access Strategy of Use String Fields From Header. I created a JsonRecordSetWriter service with Schema Write Strategy of Do Not Write Schema.
> >>
> >> When ConvertRecord is finished, my result looks like this sample:
> >>
> >> [ {
> >>   "Bank Name�" : "Almena State Bank",
> >>   "City�" : "Almena",
> >>   "State�" : "KS",
> >>   "Cert�" : "15426",
> >>   "Acquiring Institution�" : "Equity Bank",
> >>   "Closing Date�" : "23-Oct-20",
> >>   "Fund" : "10538"
> >> }, {
> >>   "Bank Name�" : "First City Bank of Florida",
> >>   "City�" : "Fort Walton Beach",
> >>   "State�" : "FL",
> >>   "Cert�" : "16748",
> >>   "Acquiring Institution�" : "United Fidelity Bank, fsb",
> >>   "Closing Date�" : "16-Oct-20",
> >>   "Fund" : "10537"
> >> }, {
> >>   "Bank Name�" : "The First State Bank",
> >>   "City�" : "Barboursville",
> >>   "State�" : "WV",
> >>   "Cert�" : "14361",
> >>   "Acquiring Institution�" : "MVB Bank, Inc.",
> >>   "Closing Date�" : "3-Apr-20",
> >>   "Fund" : "10536"
> >> }]
> >>
> >> I don't really have a schema. How can I use a combination of SplitJson and EvaluateJsonPath to split each json object out to its own NiFi flowfile, and to pull the json key values out to define the fields in the csv header? I've found a few examples through research that allude to this, but they all seem to have a fixed schema and they don't offer configurations for the SplitJson. In a case where my json keys definition changes depending on the flowfile, what should JsonPathExpression be set to in the SplitJson configuration?
> >>
> >> On Thu, Apr 6, 2023 at 9:59 AM Mike Sofen <mso...@runbox.com> wrote:
> >>
> >> Jim – that's exactly what I did on that "pre" step – generate a schema from the CSVReader and use that to dynamically create the DDL sql needed to build the staging table in Postgres. In my solution, there are 2 separate pipelines running – this pre step and the normal file processing.
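On the SplitJson question: a JsonPathExpression of `$.*` is commonly used to split a top-level array into one flowfile per element, and because each record carries its own keys, no fixed schema is needed to recover the field names. A small sketch of that split (sample trimmed from the output above):

```python
import json

payload = ('[{"Bank Name": "Almena State Bank", "State": "KS"}, '
           '{"Bank Name": "First City Bank of Florida", "State": "FL"}]')

def split_records(array_json):
    # What SplitJson does with a JsonPathExpression of $.* :
    # each top-level array element becomes its own flowfile content.
    return [json.dumps(rec) for rec in json.loads(array_json)]

def record_keys(record_json):
    # The keys come from the record itself, so no schema registry is needed.
    return list(json.loads(record_json).keys())

parts = split_records(payload)
```

Each element of `parts` is an independent json document whose keys can then be read per flowfile.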
> >> I used the pre step to ensure that all incoming files were from a known and valid source and that they conformed to the schema for that source – a very tidy way to ensure data quality.
> >>
> >> Mike
> >>
> >> From: James McMahon <jsmcmah...@gmail.com>
> >> Sent: Thursday, April 06, 2023 6:39 AM
> >> To: users@nifi.apache.org
> >> Subject: Re: Handling CSVs dynamically with NiFi
> >>
> >> Thank you both very much, Bryan and Mike. Mike, had you considered the approach mentioned by Bryan - a Reader processor to infer schema - and found it wasn't suitable for your use case, for some reason? For instance, perhaps you were employing a version of Apache NiFi that did not afford access to a CsvReader or InferAvroSchema processor?
> >>
> >> Jim
> >>
> >> On Thu, Apr 6, 2023 at 9:30 AM Mike Sofen <mso...@runbox.com> wrote:
> >>
> >> Hi James,
> >>
> >> I don't have time to go into details, but I had nearly the same scenario and solved it by using NiFi as the file processing piece only, sending valid CSV files (valid as in CSV formatting) and leveraging Postgres to land the CSV data into pre-built staging tables; from there I did content validations and packaging into jsonb for storage into a single target table.
> >>
> >> In my case, an external file source had to "register" a single file (to allow creating the matching staging table) prior to sending data. I used NiFi for that pre-staging step to derive the schema for the staging table for a file, and I used a complex stored procedure to handle a massive amount of logic around the contents of a file when processing the actual files prior to storing into the destination table.
> >>
> >> NiFi was VERY fast and efficient in this, as was Postgres.
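Mike's pre-step of deriving staging-table DDL from a file's header can be sketched roughly like this; the table name and the all-`text` column typing are assumptions for illustration, not his actual implementation:

```python
import csv
import io

def staging_ddl(csv_text, table_name):
    """Derive a CREATE TABLE statement from a CSV header row, typing every column as text."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = ", ".join('"{}" text'.format(h.strip()) for h in header)
    return "CREATE TABLE {} ({});".format(table_name, columns)

ddl = staging_ddl("Bank Name,City,State\nAlmena State Bank,Almena,KS\n", "staging_banks")
```

Landing everything as `text` defers type validation to the stored-procedure stage, which matches the content-validation step Mike describes.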
> >> Mike Sofen
> >>
> >> From: James McMahon <jsmcmah...@gmail.com>
> >> Sent: Thursday, April 06, 2023 4:35 AM
> >> To: users <users@nifi.apache.org>
> >> Subject: Handling CSVs dynamically with NiFi
> >>
> >> We have a task requiring that we transform incoming CSV files to JSON. The CSVs vary in schema.
> >>
> >> There are a number of interesting flow examples out there illustrating how one can set up a flow to handle the case where the CSV schema is well known and fixed, but none for the generalized case.
> >>
> >> The structure of the incoming CSV files will not be known in advance in our use case. Our NiFi flow must be generalized, because I cannot configure and rely on a service that defines a specific fixed Avro schema registry. An Avro schema registry seems to presume an awareness of the CSV structure in advance. We don't have that luxury in this use case, with CSVs arriving from many different providers and so characterized by schemas that are unknown.
> >>
> >> What is the best way to get around this challenge? Does anyone know of an example where NiFi builds the schema on the fly as CSVs arrive for processing, dynamically defining the Avro schema for the CSV?
> >>
> >> Thanks in advance for any thoughts.
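For the original question: NiFi's record readers can derive a schema per flowfile (e.g. the CsvReader's "Use String Fields From Header" strategy), so no pre-registered Avro schema is required. That inference, reduced to a sketch that treats every field as a string (the record name here is illustrative):

```python
import csv
import io

def infer_avro_schema(csv_text, record_name="csv_record"):
    """Build a minimal Avro record schema from the header row of a CSV payload."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": record_name,
        "fields": [{"name": h.strip(), "type": "string"} for h in header],
    }

schema = infer_avro_schema("Bank Name,City,State\nAlmena State Bank,Almena,KS\n")
```

Because the schema is rebuilt from each file's own header, CSVs from different providers with different columns can flow through the same generalized path.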