Metrics Problem

2020-06-25 Thread Bryan Jeffrey
Hello. I am running Spark 2.4.4. I have implemented a custom metrics producer. It works well when I run locally, or when I specify the metrics producer only for the driver. When I ask for executor metrics I run into ClassNotFoundExceptions. Is it possible to pass a metrics JAR via --jars? If so, what
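A commonly cited cause is that the executor's metrics system starts before jars shipped with --jars are fetched, so the sink class is not yet on the classpath when metrics are initialized; spark.executor.extraClassPath is the usual workaround. Below is a minimal PySpark sketch of that setup — the jar path and the sink class com.example.metrics.CustomSink are hypothetical placeholders, not the poster's actual code:

    from pyspark.sql import SparkSession

    # Sketch only: jar path and sink class below are hypothetical.
    spark = (
        SparkSession.builder
        .appName("custom-metrics-sketch")
        # Ship the jar with the application.
        .config("spark.jars", "/path/to/custom-metrics-sink.jar")
        # Make the class visible to the executor JVM at startup, before its
        # metrics system initializes (on YARN, "./custom-metrics-sink.jar"
        # refers to the jar in the container's working directory).
        .config("spark.executor.extraClassPath", "./custom-metrics-sink.jar")
        # Register the sink for all instances (driver and executors);
        # spark.metrics.conf.* properties mirror the metrics.properties format.
        .config("spark.metrics.conf.*.sink.custom.class",
                "com.example.metrics.CustomSink")
        .getOrCreate()
    )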

Blog: Apache Spark Window Functions

2020-06-25 Thread neeraj bhadani
Hi Team, I would like to share with the community that my blog on "Apache Spark Window Functions" has been published. Please find the link below if anyone is interested. Link: https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-window-functions-7b4e39ad3c86 Please share your thoughts and feedback.

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Sean Owen
You can always list the S3 output path, of course. On Thu, Jun 25, 2020 at 7:52 AM Tzahi File wrote: > Hi, > > I'm using pyspark to write df to s3, using the following command: > "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)". > > Is there any way to get the
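For the "just list the output path" route, a minimal sketch using boto3 (the bucket name and prefix are placeholders for whatever s3_output points at, and boto3 is assumed to be available alongside pyspark):

    import boto3

    # Placeholders: replace with the bucket/prefix behind s3_output.
    bucket = "my-bucket"
    prefix = "path/to/output/"

    s3 = boto3.client("s3")
    partitions = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Keep the directory part of each key, e.g. day=.../hour=.../country=...
            rel = obj["Key"][len(prefix):].rsplit("/", 1)[0]
            if "=" in rel:
                partitions.add(rel)

    for p in sorted(partitions):
        print(p)  # e.g. day=2020-06-20/hour=1/country=US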

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Sanjeev Mishra
You can use catalog APIs; see the following: https://stackoverflow.com/questions/54268845/how-to-check-the-number-of-partitions-of-a-spark-dataframe-without-incurring-the/54270537 On Thu, Jun 25, 2020 at 6:19 AM Tzahi File wrote: > I don't want to query with a distinct on the partitioned columns,
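If the parquet output is registered as a partitioned table in the metastore, the catalog route could look like the sketch below (the table name "events" is hypothetical). Note that the linked answer is about the number of in-memory partitions of a DataFrame, which is not the same thing as the directories created by partitionBy:

    # Sketch assuming the output is saved as a partitioned table named "events".
    for row in spark.sql("SHOW PARTITIONS events").collect():
        print(row[0])   # e.g. day=2020-06-20/hour=1/country=US

    # The linked answer concerns in-memory partitions instead:
    num_partitions = df.rdd.getNumPartitions()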

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
I don't want to query with a distinct on the partitioned columns; the df contains over 1 billion records. I just want to know the partitions that were created. On Thu, Jun 25, 2020 at 4:04 PM Jörn Franke wrote: > By doing a select on the df ? > > On 25.06.2020 at 14:52, Tzahi File wrote:

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Jörn Franke
By doing a select on the df? > On 25.06.2020 at 14:52, Tzahi File wrote: > > Hi, > > I'm using pyspark to write df to s3, using the following command: > "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)". > > Is there any way to get the partitions
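Presumably "a select on the df" means selecting the distinct partition column values, along these lines:

    # Distinct partition values straight from the DataFrame (this is the
    # approach the original poster wants to avoid, since it scans the data).
    df.select("day", "hour", "country").distinct().show(truncate=False)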

Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
Hi, I'm using pyspark to write df to s3, using the following command: "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)". Is there any way to get the partitions created? e.g. day=2020-06-20/hour=1/country=US day=2020-06-20/hour=2/country=US .. -- Tzahi File

Re: Where are all the jars gone ?

2020-06-25 Thread Anwar AliKhan
I know I can arrive at the same result with this code: val range100 = spark.range(1,101).agg((sum('id) as "sum")).first.get(0) println(f"sum of range100 = $range100") So I am not stuck; I was just curious why the code breaks using the current link libraries.

Suggested Amendment to ./dev/make-distribution.sh

2020-06-25 Thread Anwar AliKhan
May I suggest amending your ./dev/make-distribution.sh to include a check for whether these two previously mentioned packages are installed and, if not, to install them as part of the build process. The build process time will increase if the packages are not installed. A long build process is normal