[ 
https://issues.apache.org/jira/browse/SPARK-40820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620099#comment-17620099
 ] 

Anthony Wainer Cachay Guivin commented on SPARK-40820:
------------------------------------------------------

Here is an example: many DataFrames are created from a schema, and this schema 
is built from JSON.
The schema is created with StructType.fromJson(json), which internally calls 
StructField.fromJson() for each field.

The issue arises when StructField parses the JSON: it forces the nullable and 
metadata attributes to be defined for every field.

![image](https://user-images.githubusercontent.com/7476964/196637396-d437278c-f462-41dd-8323-3d613c05214b.png)

It is understandable that name and type are mandatory, but the others should be 
optional.
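
To illustrate the difference, here is a minimal sketch that does not use 
PySpark at all. The two functions parse_field_strict and parse_field_lenient 
are hypothetical stand-ins, not Spark APIs, and the fallback values (True for 
nullable, None for metadata) are assumed to match the StructField constructor 
defaults.

```python
from typing import Any, Dict, Optional, Tuple

ParsedField = Tuple[str, str, bool, Optional[dict]]


def parse_field_strict(field: Dict[str, Any]) -> ParsedField:
    # Index lookups: every key must be present, otherwise KeyError.
    return field["name"], field["type"], field["nullable"], field["metadata"]


def parse_field_lenient(field: Dict[str, Any]) -> ParsedField:
    # name and type stay mandatory; the other two fall back to defaults
    # (assumed here to be True and None).
    return (
        field["name"],
        field["type"],
        field.get("nullable", True),
        field.get("metadata"),
    )


minimal = {"name": "name", "type": "string"}

try:
    parse_field_strict(minimal)
    strict_error = None
except KeyError as exc:
    # The first missing key is reported.
    strict_error = exc.args[0]

lenient_result = parse_field_lenient(minimal)
```

With the strict variant the minimal dictionary fails on the first missing key, 
while the lenient variant accepts it and fills in the defaults.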

The current parsing does not allow this. With more than 1000 fields defined, 
spelling out this unnecessary metadata for each one becomes a headache.
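
Until the parsing is relaxed, one possible workaround is to fill in the 
missing keys before calling StructType.fromJson. The helper below, 
fill_field_defaults, is a hypothetical name and only a sketch: it assumes 
True and an empty dict are acceptable defaults, and it handles nested structs 
only, not structs inside arrays or maps.

```python
from copy import deepcopy
from typing import Any, Dict


def fill_field_defaults(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a struct schema dict with nullable/metadata filled in."""
    result = deepcopy(schema)
    for field in result.get("fields", []):
        # Only add the keys when they are missing; explicit values win.
        field.setdefault("nullable", True)
        field.setdefault("metadata", {})
        # A nested struct is a dict with its own "fields" list under "type".
        nested = field.get("type")
        if isinstance(nested, dict) and nested.get("type") == "struct":
            field["type"] = fill_field_defaults(nested)
    return result


compact = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long"},
        {
            "name": "address",
            "type": {
                "type": "struct",
                "fields": [{"name": "city", "type": "string"}],
            },
        },
    ],
}

full = fill_field_defaults(compact)
# With PySpark installed, the expanded dict can then be passed on:
#   StructType.fromJson(full)
```

The input dictionary is left untouched, so the compact form can stay the 
source of truth and the expansion happens only at load time.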

> Creating StructType from Json
> -----------------------------
>
>                 Key: SPARK-40820
>                 URL: https://issues.apache.org/jira/browse/SPARK-40820
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Anthony Wainer Cachay Guivin
>            Priority: Minor
>
> When you create a StructType from a Python dictionary, you use the 
> [StructType.fromJson|https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L569-L571]
>  method.
> A schema can be created as in the code below, but the JSON is required to 
> include nullable and metadata. This is inconsistent, because these attributes 
> already have default values within the DataType class.
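> {code:python}
> from pyspark.sql.types import StructField
>
> # Only the mandatory keys are given; nullable and metadata are omitted.
> json = {
>     "name": "name",
>     "type": "string"
> }
> StructField.fromJson(json)
> {code}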
> Error:
> {code:python}
> from pyspark.sql.types import StructField
> json = {
>             "name": "name",
>             "type": "string"
>         }
> StructField.fromJson(json)
> >>
> Traceback (most recent call last):
>   File "code.py", line 90, in runcode
>     exec(code, self.locals)
>   File "<input>", line 1, in <module>
>   File "pyspark/sql/types.py", line 583, in fromJson
>     json["nullable"],
> KeyError: 'nullable'
> {code}
>  
> Proposed solution:
> Instead of using index lookups on the dictionary, it would be better to use 
> .get for the optional keys:
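> {code:python}
> @classmethod
> def fromJson(cls, json: Dict[str, Any]) -> "StructField":
>     return StructField(
>         json["name"],
>         _parse_datatype_json_value(json["type"]),
>         # Fall back to the constructor defaults when the keys are absent;
>         # a bare .get("nullable") would pass None instead of True.
>         json.get("nullable", True),
>         json.get("metadata"),
>     )
> {code}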
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
