Re: Getting PySpark Partitions Locations

2020-06-25 Thread Sean Owen
You can always list the S3 output path, of course.

On Thu, Jun 25, 2020 at 7:52 AM Tzahi File wrote:
> Hi,
>
> I'm using pyspark to write df to s3, using the following command:
> "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)".
>
> Is there any way to get the partitions created?
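Along the lines of listing the output path, a minimal sketch using boto3 to recover the partition directories after the write. The bucket and prefix are hypothetical stand-ins for the real s3_output, and AWS credentials are assumed to be configured:

    # Minimal sketch: list partition directories under the S3 output path.
    # Bucket and prefix names below are hypothetical placeholders.
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"       # hypothetical
    prefix = "output/table/"   # hypothetical, corresponds to s3_output

    partitions = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Keys look like: output/table/day=2020-06-20/hour=1/country=US/part-....parquet
            relative = obj["Key"][len(prefix):]
            partition_dir = "/".join(relative.split("/")[:-1])  # drop the file name
            if partition_dir:
                partitions.add(partition_dir)

    for p in sorted(partitions):
        print(p)   # e.g. day=2020-06-20/hour=1/country=US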

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Sanjeev Mishra
You can use the catalog APIs, see the following:
https://stackoverflow.com/questions/54268845/how-to-check-the-number-of-partitions-of-a-spark-dataframe-without-incurring-the/54270537

On Thu, Jun 25, 2020 at 6:19 AM Tzahi File wrote:
> I don't want to query with a distinct on the partitioned columns, the
> df contains over 1 billion records. I just want to know the partitions
> that were created.
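If the output is registered as a partitioned table in a metastore, the catalog route could look like the sketch below; it reads partition values from the metastore rather than scanning the data. The table name "mydb.events" is a hypothetical placeholder:

    # Minimal sketch: read partition values from the catalog instead of the data.
    # Assumes the S3 output is registered as a partitioned table ("mydb.events"
    # is a hypothetical placeholder).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("list-partitions").getOrCreate()

    # SHOW PARTITIONS queries only the metastore, not the underlying files.
    for row in spark.sql("SHOW PARTITIONS mydb.events").collect():
        print(row.partition)   # e.g. day=2020-06-20/hour=1/country=US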

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
I don't want to query with a distinct on the partitioned columns; the df contains over 1 billion records. I just want to know the partitions that were created.

On Thu, Jun 25, 2020 at 4:04 PM Jörn Franke wrote:
> By doing a select on the df?

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Jörn Franke
By doing a select on the df?

On 25.06.2020 at 14:52, Tzahi File wrote:
> Hi,
>
> I'm using pyspark to write df to s3, using the following command:
> "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)".
>
> Is there any way to get the partitions created?
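A minimal sketch of the kind of select being suggested here, assuming df is the DataFrame from the original message; note that this scans the underlying data rather than just listing metadata:

    # Sketch of the "select on the df" approach: derive partition values from df.
    # Assumes `df` is the DataFrame from the original message.
    partition_values = (
        df.select("day", "hour", "country")
          .distinct()
          .collect()
    )
    for row in partition_values:
        print(f"day={row['day']}/hour={row['hour']}/country={row['country']}")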

Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
Hi,

I'm using pyspark to write a df to s3, using the following command:
"df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)".

Is there any way to get the partitions created? e.g.
day=2020-06-20/hour=1/country=US
day=2020-06-20/hour=2/country=US
..

--
Tzahi File
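For reference, a minimal runnable version of the write described above, with a toy DataFrame and a hypothetical output path (the column names match the partitioning in the question):

    # Minimal sketch of the partitioned write; the data and S3 path are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

    df = spark.createDataFrame(
        [("2020-06-20", 1, "US", 42), ("2020-06-20", 2, "US", 17)],
        ["day", "hour", "country", "value"],
    )

    s3_output = "s3a://my-bucket/output/table"  # hypothetical path

    (df.write
       .partitionBy("day", "hour", "country")
       .mode("overwrite")
       .parquet(s3_output))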