you can try
df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")
hdfs dfs -ls /tmp/pairs.json
Found 2 items
-rw-r--r-- 3 hduser supergroup 0 2023-05-04 22:21
/tmp/pairs.json/_SUCCESS
-rw-r--r-- 3 hduser supergroup 96 2023-05-04 22:21
Hi Mich,
Thank you.
Are you saying this satisfies my requirement?
On the other hand, I am smelling something going on. Perhaps the Spark
'part' files should not be thought of as files, but rather pieces of a
conceptual file. If that is true, then your approach (of which I'm well
aware) makes
AWS S3 and Google Cloud Storage (gs) are Hadoop-compatible file systems (HCFS), so Spark shards the output (the 'part' files) to improve read performance when writing to them.
Let us take your code for a drive
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
Hello,
I am testing writing my DataFrame to S3 using the DataFrame `write` method.
It mostly does a great job; however, it fails one of my requirements. Here
they are:
- Write to S3
- use `partitionBy` to automatically make folders based on my chosen
partition columns
- control the
Hi Enrico,
What a great answer. Thank you. Seems like I need to get comfortable with
the 'struct' and then I will be golden. Thank you again, friend.
Marco.
On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote:
> Hi,
>
> You could rearrange the DataFrame so that writing the DataFrame as-is
>
Hi,
You could rearrange the DataFrame so that writing the DataFrame as-is
produces your structure:
df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+
df2 = df.select(df.id,