Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
you can try

    df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")

    hdfs dfs -ls /tmp/pairs.json
    Found 2 items
    -rw-r--r--   3 hduser supergroup    0 2023-05-04 22:21 /tmp/pairs.json/_SUCCESS
    -rw-r--r--   3 hduser supergroup   96 2023-05-04 22:21
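For the other half of the thread's question (choosing the filename), a minimal sketch of one common workaround, assuming an HDFS-style path: write the single part file as above, then rename it through the Hadoop FileSystem API exposed by Spark's JVM gateway. The target filename here is an assumption for illustration, not from the thread.

    out_dir = "/tmp/pairs.json"
    df2.coalesce(1).write.mode("overwrite").json(out_dir)

    # Rename the lone part-* file to a chosen name via the Hadoop FS API.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
    Path = jvm.org.apache.hadoop.fs.Path
    for status in fs.listStatus(Path(out_dir)):
        if status.getPath().getName().startswith("part-"):
            # Assumed target filename, for illustration only.
            fs.rename(status.getPath(), Path(out_dir + "/pairs-0001.json"))

Note that coalesce(1) funnels all data through a single task, so this only suits small outputs.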

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hi Mich, Thank you. Are you saying this satisfies my requirement? On the other hand, I suspect something else is going on. Perhaps the Spark 'part' files should not be thought of as files, but rather as pieces of a conceptual file. If that is true, then your approach (of which I'm well aware) makes

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
AWS S3 and Google GCS are Hadoop-compatible file systems (HCFS), so Spark shards its output to improve read performance when writing to them. Let us take your code for a drive:

    import findspark
    findspark.init()
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct
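The preview cuts off at the imports; a guessed continuation of the test drive, with sample rows assumed from the columns seen elsewhere in the thread, might be:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct

    spark = SparkSession.builder.appName("pairs").getOrCreate()

    # Assumed sample data; only the shape matters for the demo.
    df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")],
                               "id int, datA string")
    df2 = df.select(df.id, struct(df.datA).alias("datA"))
    df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")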

Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hello, I am testing writing my DataFrame to S3 using the DataFrame `write` method. It mostly does a great job. However, it fails one of my requirements. Here are my requirements:
- Write to S3
- use `partitionBy` to automatically make folders based on my chosen partition columns
- control the
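A minimal sketch of the first two requirements, with a hypothetical bucket and partition columns; the third (controlling the filename) is exactly where the writer API stops short, since Spark always names the output files part-<task>-<uuid> itself:

    # Hypothetical bucket and column names, for illustration only.
    df.write \
        .partitionBy("year", "month") \
        .mode("overwrite") \
        .json("s3a://my-bucket/reports/")
    # Produces e.g. s3a://my-bucket/reports/year=2023/month=5/part-....json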

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Marco Costantini
Hi Enrico, What a great answer, thank you. It seems I need to get comfortable with 'struct', and then I will be golden. Thank you again, friend. Marco. On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote:

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Enrico Minack
Hi, You could rearrange the DataFrame so that writing the DataFrame as-is produces your structure:

    df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")

    +---+----+
    | id|datA|
    +---+----+
    |  1|  a1|
    |  2|  a2|
    |  3|  a3|
    +---+----+

    df2 = df.select(df.id,
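A guessed completion of the truncated select, showing the rearrangement Enrico describes: nest the data column inside a struct so that writing the DataFrame as-is yields the desired JSON shape. The outer field name "data" is an assumption based on the visible columns:

    from pyspark.sql.functions import struct

    # Nest datA under a struct column to add one level of JSON nesting.
    df2 = df.select(df.id, struct(df.datA).alias("data"))
    df2.write.mode("overwrite").json("/tmp/custom.json")
    # Each output line then looks like: {"id":1,"data":{"datA":"a1"}}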