MaxGekk commented on issue #23325: [SPARK-26376][SQL] Skip inputs without 
tokens by JSON datasource
URL: https://github.com/apache/spark/pull/23325#issuecomment-448545158
 
 
   > always return null if the token is empty, no matter it's row or array or 
map.
   
   In general, returning `null`s for malformed rows is fine. The problem is 
"embedding" the corrupt column into the user's schema if the root type is 
`StructType`, and ignoring the corrupt column for other types. I think 
extending the user's schema with an additional column wasn't the right design 
decision. I would wrap the user's root type in a struct with 2 columns: a 
parsed column of the user-specified type (struct, array, or map) and a string 
column with the unparsed text. If the input wasn't parsed, just put `null` in 
the first column regardless of its type.
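
As a rough illustration of the wrapping idea, here is a minimal Python sketch. It is not Spark code: `Wrapped`, `parse_wrapped`, and the field names `parsed` and `unparsed` are hypothetical, and the standard `json` module stands in for the JSON datasource parser. The point is only that the raw text gets its own column next to the parsed value, whatever the root type is.

```python
import json
from typing import Any, NamedTuple, Optional


class Wrapped(NamedTuple):
    """Hypothetical two-column wrapper around the user's root type."""
    parsed: Optional[Any]    # value of the user's root type, None if unparsed
    unparsed: Optional[str]  # original text, set only when parsing failed


def parse_wrapped(text: str, root_type: type) -> Wrapped:
    """Parse `text` as JSON whose root must be `root_type` (dict or list)."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        # Empty or malformed input: null parsed column, raw text preserved.
        return Wrapped(None, text)
    if not isinstance(value, root_type):
        # Valid JSON, but the root does not match the requested type.
        return Wrapped(None, text)
    return Wrapped(value, None)
```

With this shape, `parse_wrapped("", list)` and `parse_wrapped("", dict)` behave the same way: the first column is `None` and the second carries the input text.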
   
   > never return null if the token is empty. For struct type, return a row 
with all null fields. For array/map, return empty array/map.
   
   This looks like a more consistent approach across the supported types, but 
even it raises some questions:
   - How do we distinguish a row with all nulls, or an empty array/map, in the 
input from unparsed input?
   - There is a performance penalty, though it could probably be avoided.
   - If the schema is not flat, say `a array<...>, b map<...>`, why do we 
return a row with nulls instead of a row with an empty array and an empty map?
   - The corrupt column feature is still not supported for arrays and maps.
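
The first question can be made concrete with a small Python sketch. It models only the "never return null" behavior for arrays and maps and is not Spark code: `parse_never_null` is a hypothetical helper, and the standard `json` module stands in for the datasource parser.

```python
import json


def parse_never_null(text: str, root_type: type):
    """Fall back to an empty value of `root_type` instead of returning None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return root_type()  # empty list or dict for unparsed input
    return value if isinstance(value, root_type) else root_type()


# A valid empty array and an empty (unparsed) input give the same result,
# so the caller cannot tell them apart without a separate corrupt column.
assert parse_never_null("[]", list) == parse_never_null("", list) == []
```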
   
   If I had to choose between the 2 approaches above, I would probably prefer 
the first one. I would also re-implement the `columnNameOfCorruptRecord` 
feature to support other types.
