MaxGekk commented on issue #23325: [SPARK-26376][SQL] Skip inputs without tokens by JSON datasource
URL: https://github.com/apache/spark/pull/23325#issuecomment-448545158

> always return null if the token is empty, no matter it's row or array or map.

In general, returning `null`s for malformed rows is fine. The problem is "embedding" the corrupt column into the user's schema if the root type is `StructType`, and ignoring the corrupt column for other types. I think extending the user's schema with an additional column wasn't the right design decision. I would wrap the user's root type in a struct with 2 columns: a parsed column of the user-specified type (struct, array or map) and a column of string type with the unparsed text. If the input wasn't parsed, just put `null` in the first column, independently of its type.

> never return null if the token is empty. For struct type, return a row with all null fields. For array/map, return empty array/map.

This looks like a more consistent approach across the supported types, but even it raises some questions:
- how to distinguish a row with all nulls, or an empty array/map, in the input from unparsed input?
- performance penalty. Probably it could be avoided.
- if the schema is not flat, let's say `a array<...>, b map<...>`, why do we return a row with nulls instead of a row with an empty array and an empty map?
- the corrupt column feature is still not supported for arrays and maps

If I need to choose between the 2 approaches above, I would probably prefer the first one. And I would re-implement the `columnNameOfCorruptRecord` feature to support other types.
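
To make the wrapping idea concrete, here is a minimal sketch in plain Python using only the standard `json` module. It is not Spark code and not a proposed patch: the names `Wrapped` and `parse_wrapped` are hypothetical, and Python's `list` and `dict` stand in for the array and struct/map root types. It only illustrates that, with a two-column wrapper, an empty array in the input stays distinguishable from input that could not be parsed.

```python
import json
from typing import Any, NamedTuple, Optional


class Wrapped(NamedTuple):
    """Hypothetical two-column wrapper around the user's root type."""
    parsed: Optional[Any]   # value of the user's root type, or None if unparsed
    corrupt: Optional[str]  # raw input text when parsing failed, else None


def parse_wrapped(line: str, root_type: type) -> Wrapped:
    """Parse one JSON record without adding columns to the user's schema.

    root_type is list (array root) or dict (struct/map root) in this sketch.
    """
    try:
        value = json.loads(line)
    except ValueError:
        # Empty or malformed input: null in the first column, raw text in the second.
        return Wrapped(None, line)
    if not isinstance(value, root_type):
        # Valid JSON, but not of the requested root type: also treated as unparsed.
        return Wrapped(None, line)
    return Wrapped(value, None)
```

With this shape, `parse_wrapped("[]", list)` yields an empty list in the first column, while `parse_wrapped("", list)` yields `None` there and keeps the raw text in the second column, regardless of the root type.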