Brian Schaefer created SPARK-39168:
--------------------------------------

             Summary: Consider all values in a python list when inferring schema
                 Key: SPARK-39168
                 URL: https://issues.apache.org/jira/browse/SPARK-39168
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.2.1
            Reporter: Brian Schaefer
Schema inference fails on the following case:
{code:python}
>>> data = [{"a": [1, None], "b": [None, 2]}]
>>> spark.createDataFrame(data)
ValueError: Some of types cannot be determined after inferring
{code}

This is because only the first value in the array is used to infer the element type for the array: [https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/types.py#L1260]. The element type of the "b" array is inferred as {{NullType}}, but I think it makes sense to infer the element type as {{LongType}}.

One approach to address the above would be to infer the type from the first non-null value in the array. However, consider a case with structs:
{code:python}
>>> spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True)
>>> data = [{"a": [{"b": 1}, {"c": 2}]}]
>>> spark.createDataFrame(data).schema
StructType([StructField('a', ArrayType(StructType([StructField('b', LongType(), True)]), True), True)])
{code}

The element type of the "a" array is inferred as a struct with only the field "b". However, it would be convenient to infer the element type as a struct with both fields "b" and "c", where the fields omitted from each dictionary become null values in each struct:
{code:java}
+----------------------+
|                     a|
+----------------------+
|[{1, null}, {null, 2}]|
+----------------------+
{code}

To support both of these cases, the type of each array element could be inferred and those types merged, similar to the approach [here|https://github.com/apache/spark/blob/b63674ea5f746306a96ab8c39c23a230a6cb9566/python/pyspark/sql/session.py#L574-L576].
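For illustration, a minimal sketch of that merge (the helper name {{_infer_array_type}} is hypothetical; it assumes the internal {{_infer_type}} and {{_merge_type}} helpers in {{pyspark.sql.types}} keep their current signatures and is not how Spark currently behaves):
{code:python}
from functools import reduce

from pyspark.sql.types import ArrayType, NullType, _infer_type, _merge_type


def _infer_array_type(arr, infer_dict_as_struct=False):
    """Hypothetical sketch: infer an ArrayType by merging the inferred
    types of *all* elements instead of only arr[0]."""
    if not arr:
        return ArrayType(NullType(), True)
    element_types = (_infer_type(v, infer_dict_as_struct) for v in arr)
    # _merge_type resolves NullType against a concrete type and unions
    # struct fields, so [1, None] merges to LongType and
    # [{"b": 1}, {"c": 2}] merges to a struct with both fields.
    return ArrayType(reduce(_merge_type, element_types), True)
{code}
Under this sketch, both arrays in the first example would infer as {{ArrayType(LongType())}}, and the "a" array in the second example would infer as an array of structs with both fields "b" and "c".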