Re: Standard practices for building dashboards for spark processed data

2020-02-25 Thread Roland Johann
Hi Ani, Prometheus is not well suited for ingesting explicit time-series data; its purpose is technical monitoring. If you want to monitor your Spark jobs with Prometheus, you can publish the metrics so Prometheus can scrape them. What you probably are looking for is a timeseries database that
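
A minimal sketch of the "publish metrics so Prometheus can scrape them" approach, assuming the io.prometheus simpleclient and simpleclient_httpserver libraries are on the driver classpath; metric names and the port are illustrative, not from the thread:

```scala
import io.prometheus.client.{Counter, Gauge}
import io.prometheus.client.exporter.HTTPServer

object JobMetrics {
  // Exposes a /metrics endpoint on port 9091 of the driver for Prometheus to scrape.
  private val server = new HTTPServer(9091)

  val recordsProcessed: Counter = Counter.build()
    .name("pipeline_records_processed_total")
    .help("Number of records processed by the streaming job")
    .register()

  val lastBatchSize: Gauge = Gauge.build()
    .name("pipeline_last_batch_size")
    .help("Row count of the most recent micro-batch")
    .register()
}

// Inside foreachBatch of a structured streaming query:
// val count = batchDf.count()
// JobMetrics.recordsProcessed.inc(count.toDouble)
// JobMetrics.lastBatchSize.set(count.toDouble)
```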

Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-25 Thread Jianneng Li
I could be wrong, but I'm guessing that it uses a UDF as the build side of a hash join. So the hash table is inside the UDF, and the UDF is called to perform the join. There are limitations to this approach, of course; you can't do all
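
A minimal sketch of my reading of that approach (not code from the thread): replace an equi-join against a small table with a broadcast map lookup performed by a UDF. Table and column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("udf-join-sketch").getOrCreate()
import spark.implicits._

val small = Seq((1, "US"), (2, "DE")).toDF("country_id", "country_name")
val large = Seq(("a", 1), ("b", 2), ("c", 1)).toDF("user", "country_id")

// The "build side" of the join: a plain map collected on the driver and broadcast.
val lookup = spark.sparkContext.broadcast(
  small.as[(Int, String)].collect().toMap
)

// The UDF performs the probe; no join operator (and no join codegen) is involved.
val countryName = udf((id: Int) => lookup.value.get(id))

val joined = large.withColumn("country_name", countryName($"country_id"))
joined.show()
```

This only works when the build side fits in driver and executor memory, which matches the later comment in the thread that the trick likely won't help for large joins.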

Standard practices for building dashboards for spark processed data

2020-02-25 Thread Aniruddha P Tekade
Hello, I am trying to build a data pipeline that uses Spark Structured Streaming with the Delta project and runs on Kubernetes. Due to this, I get my output files only in Parquet format. Since I am asked to use Prometheus and Grafana for building the dashboard for this pipeline, I run an
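
A minimal sketch of the kind of pipeline described, for context: a structured streaming query writing to a Delta table (source, paths, and trigger interval are assumed for illustration; requires the delta-core package).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("delta-stream-sketch").getOrCreate()

val events = spark.readStream
  .format("kafka")                                    // assumed source
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Delta stores its data as Parquet files plus a transaction log,
// which is why the job's output lands on disk in Parquet format.
val query = events.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/events")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("/delta/events")
```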

Re: Integration testing Framework Spark SQL Scala

2020-02-25 Thread Ruijing Li
Just wanted to follow up on this. If anyone has any advice, I’d be interested in learning more! On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li wrote: > Hi all, > > I’m interested in hearing the community’s thoughts on best practices to do > integration testing for Spark SQL jobs. We run a lot of
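
One common approach (a sketch, not a recommendation from the thread): run the job's transformations against a local SparkSession inside a ScalaTest suite. AggregationJob.dailyTotals is a hypothetical transformation under test; assumes ScalaTest 3.1+ on the test classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class AggregationJobSpec extends AnyFunSuite {

  // Local SparkSession shared by the tests in this suite.
  private lazy val spark = SparkSession.builder()
    .master("local[2]")
    .appName("integration-test")
    .getOrCreate()

  test("daily totals are computed per user") {
    import spark.implicits._
    val input = Seq(("alice", 10), ("alice", 5), ("bob", 7)).toDF("user", "amount")

    // AggregationJob.dailyTotals is the hypothetical job logic being exercised.
    val result = AggregationJob.dailyTotals(input).as[(String, Long)].collect().toMap

    assert(result == Map("alice" -> 15L, "bob" -> 7L))
  }
}
```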

Re: What options do I have to handle third party classes that are not serializable?

2020-02-25 Thread Jeff Evans
Did you try this? https://stackoverflow.com/a/2114387/375670 On Tue, Feb 25, 2020 at 10:23 AM yeikel valdes wrote: > I am currently using a third-party library (Lucene) with Spark that is not > serializable. Due to that reason, it generates the following exception: > > Job aborted due to
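
One common workaround (a sketch; not necessarily what the linked answer describes): keep the non-serializable object out of the task closure by constructing it lazily inside a holder object, so each executor JVM builds its own instance instead of receiving one over the wire.

```scala
import org.apache.lucene.facet.FacetsConfig

object FacetsConfigHolder {
  // Initialized on first use in each JVM; never serialized with the task.
  lazy val config: FacetsConfig = new FacetsConfig()
}

// df.rdd.map { row => useConfig(FacetsConfigHolder.config, row) }  // useConfig is hypothetical
```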

What options do I have to handle third party classes that are not serializable?

2020-02-25 Thread yeikel valdes
I am currently using a third-party library (Lucene) with Spark that is not serializable. Due to that reason, it generates the following exception: Job aborted due to stage failure: Task 144.0 in stage 25.0 (TID 2122) had a not serializable result: org.apache.lucene.facet.FacetsConfig
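
Another common pattern, sketched here for illustration: instantiate the non-serializable class per partition rather than capturing it in the closure, and make sure the task result itself is a serializable type (the error above indicates the result contained a FacetsConfig). doSomethingWith is a hypothetical function that returns serializable values.

```scala
import org.apache.lucene.facet.FacetsConfig

val processed = df.rdd.mapPartitions { rows =>
  // Created on the executor, once per partition; never shipped from the driver.
  val facetsConfig = new FacetsConfig()
  rows.map(row => doSomethingWith(facetsConfig, row))
}
```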

Re: [Spark SQL] NegativeArraySizeException When Parse InternalRow to DTO Field with Type Array[String]

2020-02-25 Thread Proust (Feng Guizhou) [Travel Search & Discovery]
Also, I tried disabling Kryo reference tracking; then the problem simply changes to a Java StackOverflowError. From: Proust (Feng Guizhou) [Travel Search & Discovery] Sent: Tuesday, February 25, 2020 11:28 PM To: Sandeep Patra Cc:
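
For reference, a sketch of how the settings being experimented with in this thread are applied (values shown are the ones under discussion, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("serializer-experiment")
  // Switch between org.apache.spark.serializer.JavaSerializer and KryoSerializer.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Disabling reference tracking turned the NegativeArraySizeException into a StackOverflowError.
  .config("spark.kryo.referenceTracking", "false")
  .getOrCreate()
```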

Re: [Spark SQL] NegativeArraySizeException When Parse InternalRow to DTO Field with Type Array[String]

2020-02-25 Thread Proust (Feng Guizhou) [Travel Search & Discovery]
Thanks for the information. I tried both JavaSerializer and KryoSerializer; the same problem is encountered, and the stacktrace looks very different from the one mentioned in the Stack Overflow link. From: Sandeep Patra Sent: Sunday, February 23, 2020 8:04 PM To: Proust

Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-25 Thread yeikel valdes
Can you please explain what you mean by that? How do you use a UDF to replace a join? Thanks On Mon, 24 Feb 2020 22:06:40 -0500 jianneng...@workday.com wrote Thanks Genie. Unfortunately, the joins I'm doing in this case are large, so a UDF likely won't work. Jianneng From: Liu