Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-23 Thread Julien Laurenceau
Hi, good question! It depends a lot on your jobs and your developer team. The thing that differs most, in my view: 1/ data locality & fast reads. If your data is stored in an HDFS cluster (not HCFS) and your Spark compute nodes are allowed to run on the Hadoop nodes, then definitely use Yarn to b…
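A minimal sketch of that data-locality point, assuming the same application is simply pointed at the two resource managers (the app name, paths, and Kubernetes API-server address are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Only the master differs between the two setups. On YARN, executors
// co-located with HDFS DataNodes can get NODE_LOCAL reads; on Kubernetes,
// reads from an external HDFS cluster travel over the network.
val spark = SparkSession.builder()
  .appName("locality-comparison")          // placeholder app name
  .master("yarn")                          // vs. "k8s://https://<api-server>:6443"
  .getOrCreate()

val events = spark.read.parquet("hdfs:///data/events") // hypothetical path
println(events.count())
```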

Re: intermittent Kryo serialization failures in Spark

2019-09-20 Thread Julien Laurenceau
Hi, did you try without the broadcast? Regards, JL

On Thu, Sep 19, 2019 at 06:41, Vadim Semenov wrote:
> Pre-register your classes:
>
> ```
> import com.esotericsoftware.kryo.Kryo
> import org.apache.spark.serializer.KryoRegistrator
>
> class MyKryoRegistrator extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo): Unit = {
>     // e.g. kryo.register(classOf[YourClass]) for each class you serialize
>   }
> }
> ```
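For context, a minimal sketch of wiring such a registrator into a job; `spark.serializer`, `spark.kryo.registrator`, and `spark.kryo.registrationRequired` are standard Spark settings, while the app name is a placeholder:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Point Spark at the Kryo serializer and the custom registrator.
val conf = new SparkConf()
  .setAppName("kryo-demo") // placeholder
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")
  .set("spark.kryo.registrationRequired", "true") // fail fast on unregistered classes

val spark = SparkSession.builder().config(conf).getOrCreate()
```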

Re: Parquet read performance for different schemas

2019-09-20 Thread Julien Laurenceau
Hi Tomas, Parquet tuning time! I strongly recommend reading the CERN posts on Spark and Parquet tuning: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example. You have to check the size of the row groups in your Parquet files and maybe tweak it a little…
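A sketch of that row-group tweak: `parquet.block.size` is the parquet-mr setting for row-group size in bytes; the value and paths below are placeholders to tune per workload:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-tuning").getOrCreate()

// Row-group size written by parquet-mr, in bytes (here 256 MB).
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)

val df = spark.read.parquet("hdfs:///data/in")   // hypothetical input path
df.write.parquet("hdfs:///data/out-rewritten")   // rewritten with the new row-group size
```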

Re: error while connecting to Azure blob storage

2019-08-23 Thread Julien Laurenceau
Hi, did you just publicly disclose a real token to your blob storage? If so, please be aware that you have shared the data, and maybe write permission... Refresh your token! Regards

On Fri, Aug 23, 2019 at 08:33, Krishna Chandran Nair <kcn...@qatarairways.com.qa> wrote:
> Hi Team, …
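One way to keep credentials out of code and mail, sketched with the hadoop-azure WASB connector; the storage account, container, and environment-variable names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("azure-read").getOrCreate()

// Pull the key from the environment instead of hard-coding it in the job.
val accountKey = sys.env("AZURE_STORAGE_KEY") // hypothetical variable name
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.myaccount.blob.core.windows.net", // hypothetical account
  accountKey)

val df = spark.read.csv("wasbs://mycontainer@myaccount.blob.core.windows.net/path/")
```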

Re: Spark 2.4.3 with Hadoop 3.2 Docker image.

2019-07-06 Thread Julien Laurenceau
Hi, did you try using the images built by Mesosphere? I am not sure they already build the 2.4 / 3.2 combo, but they provide a project on GitHub that can be used to generate your custom combo. It is named mesosphere/spark-build. Regards

On Thu, Jul 4, 2019 at 19:13, José Luis Pedrosa wrote:
> Hi

spark checkpoint between 2 jobs and HDFS ramfs with storage policy

2019-05-21 Thread Julien Laurenceau
Hi, I am looking for a setup that would let me split a single Spark processing pipeline into 2 jobs (operational constraints) without wasting too much time persisting the data between the two jobs during Spark checkpoint/writes. I have a config with a lot of RAM and I'm willing to configure a…
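A sketch of one way to set this up, assuming an HDFS directory whose storage policy was set to LAZY_PERSIST (RAM_DISK) beforehand with `hdfs storagepolicies -setStoragePolicy -path /handoff -policy LAZY_PERSIST`; the paths and the "value" column are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-job-handoff").getOrCreate()

// Job 1: write the intermediate result to the RAM-backed HDFS directory.
val intermediate = spark.read.parquet("hdfs:///data/raw")
  .filter("value IS NOT NULL") // hypothetical column
intermediate.write.parquet("hdfs:///handoff/stage1")

// Job 2 (run separately): read the handoff back instead of recomputing the lineage.
val resumed = spark.read.parquet("hdfs:///handoff/stage1")
println(resumed.count())
```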