Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-23 Thread Julien Laurenceau
Hi, good question! It depends very much on your jobs and your developer team. The things that differ most, in my view, are: 1/ data locality & fast reads. If your data is stored in an HDFS cluster (not HCFS) and your Spark compute nodes are allowed to run on the Hadoop nodes, then definitely use Yarn to
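A minimal sketch of the contrast being discussed (cluster addresses, image name, and sizing below are placeholder assumptions, not from the thread): on YARN the executors can be scheduled on the HDFS DataNodes themselves, so reads benefit from node-level data locality, whereas on Kubernetes the job targets the API server and typically reads HDFS remotely.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical YARN submission of a locality-sensitive job; all values are placeholders.
val spark = SparkSession.builder()
  .appName("locality-sensitive-job")
  .master("yarn")
  .config("spark.executor.instances", "10")
  .getOrCreate()

// The same job on Kubernetes would instead point at the API server and a container image,
// usually without node-level locality for HDFS reads:
//   .master("k8s://https://k8s-apiserver:6443")
//   .config("spark.kubernetes.container.image", "my-registry/spark:2.4.3")

val events = spark.read.parquet("hdfs://namenode:8020/data/events")
println(events.count())
```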

Re: intermittent Kryo serialization failures in Spark

2019-09-20 Thread Julien Laurenceau
Hi, Did you try without the broadcast? Regards, JL. On Thu, Sep 19, 2019 at 06:41, Vadim Semenov wrote: > Pre-register your classes: > > ``` > import com.esotericsoftware.kryo.Kryo > import org.apache.spark.serializer.KryoRegistrator > > class MyKryoRegistrator extends KryoRegistrator { >
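To complete the idea quoted above, here is a minimal, self-contained sketch of a Kryo registrator; `MyCaseClass` and the configuration values are illustrative assumptions, not taken from the original thread.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Example payload class; replace with the classes actually serialized by your job.
case class MyCaseClass(id: Long, name: String)

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Pre-register every class that crosses the wire so Kryo never has to
    // fall back to writing full class names, and unregistered classes fail fast.
    kryo.register(classOf[MyCaseClass])
    kryo.register(classOf[Array[MyCaseClass]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")
  .set("spark.kryo.registrationRequired", "true") // surface missing registrations immediately
```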

Re: Parquet read performance for different schemas

2019-09-20 Thread Julien Laurenceau
Hi Tomas, Parquet tuning time! I strongly recommend reading the CERN blog posts on Spark and Parquet tuning: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example You have to check the size of the row groups in your Parquet files and maybe tweak it a
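As a rough illustration of the row-group tweak mentioned above (the paths and the 128 MB value are assumptions to tune against your own files): the Parquet row-group size is controlled at write time by the Hadoop property parquet.block.size, in bytes.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-rowgroup-tuning").getOrCreate()

// Set the target row-group size (here 128 MB) before rewriting the data.
spark.sparkContext.hadoopConfiguration
  .set("parquet.block.size", (128 * 1024 * 1024).toString)

// Rewrite the dataset; the new files will carry the larger row groups.
spark.read.parquet("hdfs://namenode:8020/data/input")
  .write
  .mode("overwrite")
  .parquet("hdfs://namenode:8020/data/input_tuned")

// The row-group layout of the result can then be inspected with parquet-tools (meta command).
```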

Re: error while connecting to azure blob storage

2019-08-23 Thread Julien Laurenceau
Hi, Did you just publicly disclose a real token to your blob storage? If so, please be aware that you have shared the data and maybe write permission as well... Refresh your token! Regards. On Fri, Aug 23, 2019 at 08:33, Krishna Chandran Nair <kcn...@qatarairways.com.qa> wrote: > Hi Team,
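On the practical side of the thread's subject (connecting to Azure blob storage), a hedged sketch of keeping the key out of code and mail: the account name, container, path, and environment variable below are placeholders, and hadoop-azure is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("azure-blob-read").getOrCreate()

// Pull the (freshly rotated) storage key from the environment instead of hard-coding it.
val accountKey = sys.env.getOrElse("AZURE_STORAGE_KEY",
  sys.error("AZURE_STORAGE_KEY is not set"))

spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.myaccount.blob.core.windows.net", accountKey)

val df = spark.read
  .option("header", "true")
  .csv("wasbs://mycontainer@myaccount.blob.core.windows.net/path/data.csv")
df.show(5)
```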

Re: Spark 2.4.3 with hadoop 3.2 docker image.

2019-07-06 Thread Julien Laurenceau
Hi, Did you try using the images built by Mesosphere? I am not sure they already build the 2.4 / 3.2 combo, but they provide a project on GitHub that can be used to generate your custom combo. It is named mesosphere/spark-build. Regards. On Thu, Jul 4, 2019 at 19:13, José Luis Pedrosa wrote: >

spark checkpoint between 2 jobs and HDFS ramfs with storage policy

2019-05-21 Thread Julien Laurenceau
Hi, I am looking for a setup that would allow me to split a single Spark processing pipeline into 2 jobs (operational constraints) without wasting too much time persisting the data between the two jobs during the Spark checkpoint/writes. I have a config with a lot of RAM and I'm willing to configure a
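A minimal sketch of the intended hand-off, under assumed paths and policies (none of this is from the original message): the checkpoint/hand-off directory sits on an HDFS path whose storage policy was set beforehand to a RAM- or SSD-backed tier, e.g. with `hdfs storagepolicies -setStoragePolicy -path /handoff -policy LAZY_PERSIST`.

```scala
import org.apache.spark.sql.SparkSession

// Job 1: produce the intermediate result and persist it on the fast-tier HDFS path.
val spark = SparkSession.builder().appName("job-1-producer").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs://namenode:8020/handoff/checkpoints")

val intermediate = spark.read.parquet("hdfs://namenode:8020/data/raw")
  .groupBy("key").count()
  .checkpoint() // eager by default, reliably materialised in the checkpoint dir

// Writing an explicit copy makes the hand-off to the second application simpler
// than reading raw checkpoint files:
intermediate.write.mode("overwrite").parquet("hdfs://namenode:8020/handoff/aggregates")

// Job 2 (a separate spark-submit) would then start from:
//   spark.read.parquet("hdfs://namenode:8020/handoff/aggregates")
```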