Hi,
You could rearrange the DataFrame so that writing it as-is
produces your structure:
df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+
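The trick is struct from pyspark.sql.functions, which packs one or
more columns into a single nested struct column: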
from pyspark.sql.functions import struct

df2 = df.select(df.id, struct(df.datA).alias("stuff"))
df2.printSchema()
root
 |-- id: integer (nullable = true)
 |-- stuff: struct (nullable = false)
 |    |-- datA: string (nullable = true)
+---+-----+
| id|stuff|
+---+-----+
|  1| {a1}|
|  2| {a2}|
|  3| {a3}|
+---+-----+
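Structs also nest, by the way, so deeper structures work the same way.
A quick sketch, with a made-up "inner" field just for illustration:

df3 = df.select(df.id, struct(struct(df.datA).alias("inner")).alias("stuff"))

This would serialize as {"id":1,"stuff":{"inner":{"datA":"a1"}}}.
Writing df2 as-is then gives your target output: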
df2.write.json("data.json")
{"id":1,"stuff":{"datA":"a1"}}
{"id":2,"stuff":{"datA":"a2"}}
{"id":3,"stuff":{"datA":"a3"}}
Looks pretty much like what you described.
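One caveat: DataFrameWriter.json writes a directory of part files
(here, a directory named data.json), not a single file. If you need a
single file and the data is small, you can coalesce to one partition
first:

df2.coalesce(1).write.mode("overwrite").json("data.json")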
Enrico
On 04.05.23 at 06:37, Marco Costantini wrote:
Hello,
Let's say I have a very simple DataFrame, as below.
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+
Let's say I have a requirement to write this to a bizarre JSON
structure. For example:
{
  "id": 1,
  "stuff": {
    "datA": "a1"
  }
}
How can I achieve this with PySpark? I have only seen the following:
- writing the DataFrame as-is (doesn't meet requirement)
- using a UDF (seems frowned upon)
I have tried doing this within a `foreach`. I had some success, but
also ran into problems with other requirements (serializing other
things).
Any advice? Please and thank you,
Marco.