[ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334226#comment-16334226 ]
Burak Yavuz commented on SPARK-23173:
-------------------------------------

In terms of usability, I prefer option 1. From the viewpoint of a data engineer, I would like option 2 as well, if that's not too hard. Basically, if I expect that my data doesn't have nulls but it suddenly contains them, I would rather have the query fail up front (or have the record written out to the \_corrupt\_record column). In an ideal world, I should be able to either permit nullable fields (option 1) or have the record written out as corrupt.

> from_json can produce nulls for fields which are marked as non-nullable
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23173
>                 URL: https://issues.apache.org/jira/browse/SPARK-23173
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Herman van Hovell
>            Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL struct. This schema can contain non-nullable fields. The underlying {{JsonToStructs}} expression does not check whether a resulting struct respects the nullability of the schema. This leads to very weird problems in consuming expressions. In our case, Parquet writing produced an illegal Parquet file.
> There are roughly two solutions here:
> # Assume that each field in the schema passed to {{from_json}} is nullable, and ignore the nullability information set in the passed schema.
> # Validate the object at runtime, and fail execution if the data is null where we are not expecting it.
> I am currently slightly in favor of option 1, since it is the more performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)