[ https://issues.apache.org/jira/browse/SPARK-40820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620099#comment-17620099 ]
Anthony Wainer Cachay Guivin commented on SPARK-40820: ------------------------------------------------------ Here an example, many dataframes are being created from a schema, this schema is created from a Json. The input parameters to create a schema is StructType.fromJson(json), this internally uses StructField.fromJson(). The issue is when the StructField parses the Json, which forces to define the nullable and metadata attributes inside. ![image]([https://user-images.githubusercontent.com/7476964/196637396-d437278c-f462-41dd-8323-3d613c05214b.png]) it is understandable that name and type are mandatory, but the others should be optional. The current parsing does not allow this. If more than 1000 fields are defined, this would be a headache and unnecessary metadata. > Creating StructType from Json > ----------------------------- > > Key: SPARK-40820 > URL: https://issues.apache.org/jira/browse/SPARK-40820 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 3.3.0 > Reporter: Anthony Wainer Cachay Guivin > Priority: Minor > > When create a StructType from a Python dictionary you utilize the > [StructType.fromJson|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L569-L571] > method. > To create a schema can be created as follows from the code below, but it > requires to put inside the json: Nullable and Metadata, this is inconsistent > because within the DataType class this by default. > {code:python} > json = { > "name": "name", > "type": "string" > } > StructField.fromJson(json) > {code} > Error: > {code:python} > from pyspark.sql.types import StructField > json = { > "name": "name", > "type": "string" > } > StructField.fromJson(json) > >> > Traceback (most recent call last): > File "code.py", line 90, in runcode > exec(code, self.locals) > File "<input>", line 1, in <module> > File "pyspark/sql/types.py", line 583, in fromJson > json["nullable"], > KeyError: 'nullable' {code} > > Proposed coding solution: > Instead use indexes for getting from a dictionary, it would be better to use > .get > {code:python} > def fromJson(cls, json: Dict[str, Any]) -> "StructField": > return StructField( > json["name"], > _parse_datatype_json_value(json["type"]), > json.get("nullable"), > json.get("metadata"), > ) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org