Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-24 Thread Ori Popowski
Hi, Both jobs use spark.dynamicAllocation.enabled so there's no need to change the number of executors. There are 702 executors in the Dataproc cluster so this is not the problem. About the number of partitions - I didn't change this and it's still 400. While writing this now, I am realising that I
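For reference, a minimal sketch of the submit-time settings being discussed, assuming the 400 partitions refers to spark.sql.shuffle.partitions (the property names are standard Spark configs; the values are only those mentioned in the thread):

```
# Sketch only: the settings discussed above, with the values mentioned
# in the thread (dynamic allocation on, 400 shuffle partitions).
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.sql.shuffle.partitions=400 \
  main.py   # main.py is a placeholder for the actual job
```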

Re: Spark Push-Based Shuffle causing multiple stage failures

2022-05-24 Thread Ye Zhou
Hi, Han. Thanks for trying out the push-based shuffle. Please make sure you configure both the Spark client-side and the server-side configurations. The client-side configuration looks good, and from the error message it looks like you are missing the server-side configurations. Please
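For context, a hedged sketch of what the two sides of a push-based shuffle setup typically look like on YARN with Spark 3.2+; the property names are the standard ones from the Spark documentation and are not taken verbatim from this thread:

```
# Client side: set on the application at submit time.
spark-submit \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.shuffle.push.enabled=true \
  main.py   # placeholder job

# Server side: the external shuffle service running in each YARN
# NodeManager must also be configured with the merged-shuffle file
# manager, typically via the configuration the shuffle service reads:
#   spark.shuffle.push.server.mergedShuffleFileManagerImpl=\
#     org.apache.spark.network.shuffle.RemoteBlockPushResolver
```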

Re: Spark Push-Based Shuffle causing multiple stage failures

2022-05-24 Thread Mridul Muralidharan
+CC zhouye...@gmail.com On Mon, May 23, 2022 at 7:11 AM Han Altae-Tran wrote: > Hi, > > First of all, I am very thankful for all of the amazing work that goes > into this project! It has opened up so many doors for me! I am a long > time Spark user, and was very excited to start working with

GCP Dataproc - adding multiple packages (kafka, mongodb) while submitting spark jobs not working

2022-05-24 Thread karan alang
Hello All, I've a Structured Streaming job on GCP Dataproc, and I'm trying to pass multiple packages (Kafka, MongoDB) to the dataproc submit command, and that is not working. Command that is working (when I add a single dependency, e.g. Kafka): ``` gcloud dataproc jobs submit pyspark main.py \
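One commonly suggested approach for passing several packages (a sketch, not confirmed by this thread) is to put both coordinates in spark.jars.packages and use gcloud's alternate-delimiter escaping so that the commas inside the property value are not treated as property separators. The cluster name, region, and package versions below are illustrative assumptions:

```
# Sketch: '^#^' switches the --properties delimiter to '#', so the commas
# inside spark.jars.packages are kept as part of the value.
# my-cluster, us-central1 and the package versions are placeholders.
gcloud dataproc jobs submit pyspark main.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
```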

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-24 Thread Ranadip Chatterjee
Hi Ori, A single task for the final step can result from various scenarios, like an aggregate operation that results in only 1 value (e.g. count) or a key-based aggregate with only 1 key, for example. There could be other scenarios as well. However, that would be the case in both EMR and Dataproc if

Re: Problem with implementing the Datasource V2 API for Salesforce

2022-05-24 Thread Gourav Sengupta
Hi, in the spirit of not fitting the solution to the problem, would it not be better to first create a producer for your job and use a broker like Kafka or Kinesis or Pulsar? Regards, Gourav Sengupta On Sat, May 21, 2022 at 3:46 PM Rohit Pant wrote: > Hi all, > > I am trying to implement a