Are you seeing the screwups happening consistently?

On Mon, Jan 6, 2020 at 6:10 PM Shawn Weeks <swe...@weeksconsulting.us>
wrote:

> I’m poking around to see if I can make the csv parsers fail on a schema
> mismatch like that. A stream command would be a good option though.
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From: *Mike Thomsen <mikerthom...@gmail.com>
> *Reply-To: *"users@nifi.apache.org" <users@nifi.apache.org>
> *Date: *Monday, January 6, 2020 at 4:35 PM
> *To: *"users@nifi.apache.org" <users@nifi.apache.org>
> *Subject: *Re: Validating CSV File
>
>
>
> We have a lot of the same issues where I work, and our solution is to use
> ExecuteStreamCommand to pass CSVs off to Python scripts that read stdin
> line by line and check whether the export is malformed. Some of our sources
> are good and we don't have to do that, but others are minefields in terms
> of upstream data quality, and that's the only way we've found to handle
> such things predictably.
>
>
>
> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <swe...@weeksconsulting.us>
> wrote:
>
> That's the challenge: the values can be null, but I want to know when the
> fields are missing (i.e., there aren't enough delimiters). I run into a
> common scenario where line feeds end up in the data, making a short row.
> Currently the reader just ignores the fact that there aren't enough
> delimiters and sets the missing fields to null.
>
> On 1/6/20, 3:50 PM, "Matt Burgess" <mattyb...@apache.org> wrote:
>
>     Shawn,
>
>     Your schema indicates that the fields are optional because of the
>     "type" :  ["null", "string"] , so IIRC they won't be marked as invalid
>     because they are treated as null (I'm not sure there's a difference in
>     the code between missing and null fields).
>
>     You can try "type": "string" in ValidateRecord to see if that fixes
>     it, or there's a "StrNotNullOrEmpty" operator in ValidateCSV.
>
>     Regards,
>     Matt
>
>     On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <swe...@weeksconsulting.us>
>     wrote:
>     >
>     > I’m trying to validate that a CSV file has the number of fields
>     > defined in its Avro schema. Consider the following schema and CSVs. I
>     > would like to be able to reject the invalid CSV as missing fields.
>     >
>     >
>     >
>     > {
>     >    "type" : "record",
>     >    "namespace" : "nifi",
>     >    "name" : "nifi",
>     >    "fields" : [
>     >       { "name" : "c1", "type" : ["null", "string"] },
>     >       { "name" : "c2", "type" : ["null", "string"] },
>     >       { "name" : "c3", "type" : ["null", "string"] }
>     >    ]
>     > }
>     >
>     >
>     >
>     > Good CSV
>     >
>     > c1,c2,c3
>     > hello,world,1
>     > hello,world,
>     > hello,,
>     >
>     >
>     >
>     > Bad CSV
>     >
>     > c1,c2,c3
>     > hello,world,1
>     > hello,world
>     > hello
>     >
>     >
>
>
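For anyone landing on this thread later, here is a rough sketch of the kind of stdin check Mike describes, applied to the good and bad CSVs from the original question. It is not Mike's actual script. The names find_short_rows and main are made up for illustration, and it assumes the header row defines the expected field count and that any embedded line feeds you want to tolerate are inside quoted fields.

```python
import csv
import io
import sys


def find_short_rows(stream, delimiter=","):
    """Return (record_number, field_count) for each record whose field
    count differs from the header's. The header is record 1."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to compare against
    expected = len(header)
    bad = []
    for record_number, row in enumerate(reader, start=2):
        # "hello,world," parses to 3 fields (last one empty), while
        # "hello,world" parses to 2, so missing delimiters show up here.
        if len(row) != expected:
            bad.append((record_number, len(row)))
    return bad


def main(stream=None):
    """Report mismatched records on stderr; return 1 if any were found."""
    stream = sys.stdin if stream is None else stream
    bad = find_short_rows(stream)
    for record_number, count in bad:
        print(f"record {record_number}: {count} field(s)", file=sys.stderr)
    return 1 if bad else 0


# The sample files from the original question.
GOOD_CSV = "c1,c2,c3\nhello,world,1\nhello,world,\nhello,,\n"
BAD_CSV = "c1,c2,c3\nhello,world,1\nhello,world\nhello\n"

print(find_short_rows(io.StringIO(GOOD_CSV)))  # []
print(find_short_rows(io.StringIO(BAD_CSV)))   # [(3, 2), (4, 1)]
```

To run it as a stream command, drop the sample lines at the bottom and end the script with sys.exit(main()), so the exit status tells the flow whether the file passed. How the flow routes on that status is up to the flow design.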
