Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Chris Coutinho
Hi Gourav, the static table is broadcast prior to the join, so the shuffle is primarily there to avoid OOMEs during the UDF. It's not quite a Cartesian product, but yes, the join results in multiple records per input record. The number of output records varies depending on the number of duplicates in
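
A minimal sketch of the pattern being described, where streamDf, staticDf, the join key, and the UDF are illustrative stand-ins for the real job:

    import org.apache.spark.sql.functions.{broadcast, col, udf}

    // Placeholder for the heavy UDF from the thread.
    val expensiveUdf = udf((s: String) => s.length)

    val result = streamDf
      .join(broadcast(staticDf), Seq("key")) // broadcast join: no shuffle needed for the join itself
      .repartition(400)                      // explicit shuffle to keep per-task input small
      .withColumn("out", expensiveUdf(col("value")))

The repartition count here is arbitrary; the point is that the shuffle happens after the cheap broadcast join and before the fan-out plus UDF, so no single task holds too many rows.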

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Gourav Sengupta
Hi, agree with Holden, have faced quite a few issues with FUSE. Also trying to understand "spark-submit from local". Are you submitting your Spark jobs from a local laptop, or in local mode from a GCP Dataproc system? If you are submitting the job from your local laptop, there will be

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Gourav Sengupta
Hi, did you try sorting while writing out the data? All of this engineering may not be required in that case. Regards, Gourav Sengupta On Sat, Feb 12, 2022 at 8:42 PM Chris Coutinho wrote: > Setting the option in the cluster configuration solved the issue, and now we're able to
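
For reference, a minimal sketch of the suggestion; df, the column name, and the output path are illustrative:

    // Sort within each partition before writing, so related rows are
    // co-located in the output files; a global df.sort("key") would
    // additionally range-repartition the data first.
    df.sortWithinPartitions("key")
      .write
      .mode("overwrite")
      .parquet("/path/to/output")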

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Chris Coutinho
Setting the option in the cluster configuration solved the issue, and now we're able to specify the row group size based on the block size as intended. Thanks! On Fri, Feb 11, 2022 at 6:59 PM Adam Binford wrote: > Writing to Delta might not support the write.option method. We set
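
For readers landing on this thread later, a hedged sketch of the session-level equivalent; the exact key is the standard Parquet writer setting, and the size is illustrative:

    // parquet.block.size controls the Parquet row group size in bytes.
    // Setting it on the Hadoop configuration (or as
    // spark.hadoop.parquet.block.size in the cluster config, as the
    // thread describes) applies to all Parquet writes rather than
    // relying on per-write options, which the Delta writer reportedly
    // ignored here.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 16 * 1024 * 1024) // 16 MB, illustrative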

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Holden Karau
You can also put the GS access jar with your Spark jars — that’s what the class not found exception is pointing you towards. On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh wrote: > BTW I also answered you on Stack Overflow:
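
A sketch of the suggestion; the jar path is illustrative, and copying the connector into $SPARK_HOME/jars as suggested above works equally well:

    import org.apache.spark.sql.SparkSession

    // Ship the GCS connector jar with the job and register it as the
    // implementation for the gs:// scheme.
    val spark = SparkSession.builder()
      .appName("gcs-access")
      .config("spark.jars", "/opt/jars/gcs-connector-hadoop3-latest.jar")
      .config("spark.hadoop.fs.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
      .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
      .getOrCreate()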

Failed to construct kafka consumer, Failed to load SSL keystore + Spark Streaming

2022-02-12 Thread joyan sil
Hi All, I am trying to read from Kafka using Spark Streaming from spark-shell but getting the below error. Any suggestions to fix this are much appreciated. I am running from spark-shell, hence it is client mode, and the files are available on the local filesystem. I tried to access the files as
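
For context, a minimal sketch of the Kafka SSL options for Structured Streaming; the broker, topic, and keystore paths are illustrative. Consumer properties are passed with the kafka. prefix, and in client mode the keystore/truststore paths must resolve on the executors as well, not just the driver (or be shipped with --files):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")
      .option("subscribe", "my-topic")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.keystore.location", "/etc/ssl/kafka/keystore.jks")
      .option("kafka.ssl.keystore.password", sys.env("KEYSTORE_PASS"))
      .option("kafka.ssl.truststore.location", "/etc/ssl/kafka/truststore.jks")
      .option("kafka.ssl.truststore.password", sys.env("TRUSTSTORE_PASS"))
      .load()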

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Mich Talebzadeh
BTW I also answered you on Stack Overflow: https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit HTH

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Mich Talebzadeh
You are trying to access a Google storage bucket (gs://) from your local host. Spark does not see it because spark-submit assumes the path is on the host's local file system, which it is not. You need to mount the gs:// bucket as a local file system. You can use the tool called gcsfuse
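
A sketch under the assumption that the bucket has already been mounted, e.g. with gcsfuse my-bucket /mnt/my-bucket (bucket name and mount point illustrative):

    // Once gcsfuse has mounted the bucket, Spark can read it as an
    // ordinary local path instead of a gs:// URI.
    val df = spark.read.parquet("/mnt/my-bucket/data/")
    df.show(5)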