Are you seeing the screwups happening consistently?

On Mon, Jan 6, 2020 at 6:10 PM Shawn Weeks <swe...@weeksconsulting.us> wrote:

> I’m poking around to see if I can make the csv parsers fail on a schema
> mismatch like that. A stream command would be a good option though.
>
> Thanks
> Shawn
>
> *From: *Mike Thomsen <mikerthom...@gmail.com>
> *Reply-To: *"users@nifi.apache.org" <users@nifi.apache.org>
> *Date: *Monday, January 6, 2020 at 4:35 PM
> *To: *"users@nifi.apache.org" <users@nifi.apache.org>
> *Subject: *Re: Validating CSV File
>
> We have a lot of the same issues where I work, and our solution is to use
> ExecuteStreamCommand to pass CSVs off to Python scripts that will read
> stdin line by line to check whether the export is screwed up. Some of
> our sources are good and we don't have to do that, but others are
> minefields in terms of the quality of the upstream data source, and that's
> the only way we've found where we can predictably handle such things.
>
> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <swe...@weeksconsulting.us>
> wrote:
>
> That's the challenge, the values can be null but I want to know the fields
> are missing (aka not enough delimiters). I run into a common scenario where
> line feeds end up in the data making a short row. Currently the reader just
> ignores the fact that there aren't enough delimiters and makes them null.
>
> On 1/6/20, 3:50 PM, "Matt Burgess" <mattyb...@apache.org> wrote:
>
> Shawn,
>
> Your schema indicates that the fields are optional because of the
> "type" : ["null", "string"], so IIRC they won't be marked as invalid
> because they are treated as null (I'm not sure there's a difference in
> the code between missing and null fields).
>
> You can try "type": "string" in ValidateRecord to see if that fixes
> it, or there's a "StrNotNullOrEmpty" operator in ValidateCSV.
>
> Regards,
> Matt
>
> On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <swe...@weeksconsulting.us>
> wrote:
> >
> > I’m trying to validate that a csv file has the number of fields
> > defined in its Avro schema. Consider the following schema and CSVs.
> > I would like to be able to reject the invalid csv as missing fields.
> >
> > {
> >   "type" : "record",
> >   "namespace" : "nifi",
> >   "name" : "nifi",
> >   "fields" : [
> >     { "name" : "c1" , "type" : ["null", "string"] },
> >     { "name" : "c2" , "type" : ["null", "string"] },
> >     { "name" : "c3" , "type" : ["null", "string"] }
> >   ]
> > }
> >
> > Good CSV
> > c1,c2,c3
> > hello,world,1
> > hello,world,
> > hello,,
> >
> > Bad CSV
> > c1,c2,c3
> > hello,world,1
> > hello,world
> > hello
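
For anyone finding this thread later, the stdin-checking approach mentioned above could look roughly like the sketch below. It is not the script we actually run; the function name find_bad_rows is made up for illustration, and it assumes the first line is a header whose field count defines what every later row must have. It uses Python's csv module, so a line feed inside a properly quoted field is not counted as a short row, but an unquoted one is.

```python
import csv


def find_bad_rows(lines, delimiter=","):
    """Return (line_number, field_count) for each row whose field count
    differs from the header row. `lines` is any iterable of text lines,
    e.g. sys.stdin. (Hypothetical helper, not part of NiFi.)"""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to compare against
    expected = len(header)
    # reader.line_num is the number of source lines consumed so far,
    # so it points at the line where the offending row ended.
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != expected
    ]


# The two samples from the original question.
GOOD_CSV = ["c1,c2,c3", "hello,world,1", "hello,world,", "hello,,"]
BAD_CSV = ["c1,c2,c3", "hello,world,1", "hello,world", "hello"]

print(find_bad_rows(GOOD_CSV))  # -> []
print(find_bad_rows(BAD_CSV))   # -> [(3, 2), (4, 1)]
```

In a script called from ExecuteStreamCommand you would pass sys.stdin to the function and exit with a non-zero status when the returned list is not empty. How that status gets routed in the flow depends on your processor configuration.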