Hi,
Good question!
It depends a lot on your jobs and your developer team.
The things that differ most, in my view, are:
1/ data locality & fast reads
If your data are stored in an HDFS cluster (not HCFS) and your Spark
compute nodes are allowed to run on the Hadoop nodes, then definitely use
YARN so that tasks are scheduled close to the data.
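A minimal sketch of what that setup looks like (assumptions: the app name and HDFS path are placeholders, and the master is normally supplied via spark-submit --master yarn rather than hard-coded):

```
import org.apache.spark.sql.SparkSession

// Run on YARN so executors are scheduled on the Hadoop/HDFS nodes and reads stay local
val spark = SparkSession.builder()
  .appName("locality-example")   // placeholder name
  .master("yarn")                // usually passed via spark-submit --master yarn
  .getOrCreate()

// Reading directly from HDFS lets YARN place tasks on the nodes that hold the blocks
val events = spark.read.parquet("hdfs:///data/events")  // placeholder path
```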
Hi,
Did you try without the broadcast?
Regards
JL
On Thu, Sep 19, 2019 at 06:41, Vadim Semenov wrote:
> Pre-register your classes:
>
> ```
> import com.esotericsoftware.kryo.Kryo
> import org.apache.spark.serializer.KryoRegistrator
>
> class MyKryoRegistrator extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo): Unit = {
>     // Register every class you serialize; YourCaseClass is a placeholder for your own types
>     kryo.register(classOf[YourCaseClass])
>   }
> }
> ```
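For reference, a minimal sketch of how such a registrator is usually enabled (spark.serializer and spark.kryo.registrator are the standard settings; the class and app names are placeholders):

```
import org.apache.spark.sql.SparkSession

// Switch to Kryo and point Spark at the custom registrator (names are placeholders)
val spark = SparkSession.builder()
  .appName("kryo-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "MyKryoRegistrator")
  .config("spark.kryo.registrationRequired", "true")  // optional: fail fast on unregistered classes
  .getOrCreate()
```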
Hi Tomas,
Parquet tuning time!
I strongly recommend reading the posts by CERN on Spark Parquet tuning:
https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
You have to check the size of the row groups in your Parquet files and
maybe tweak it a bit.
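A small sketch of the kind of tweak involved (assumptions: spark and df already exist, and the 256 MB value and output path are purely illustrative; parquet.block.size is the Parquet/Hadoop setting that controls row-group size):

```
// Illustrative only: raise the Parquet row-group size before writing (default is 128 MB)
spark.sparkContext.hadoopConfiguration
  .setInt("parquet.block.size", 256 * 1024 * 1024)

// Each output file will now contain larger row groups
df.write.parquet("hdfs:///out/events")  // placeholder path
```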
Hi
Did you just publicly disclose a real token for your blob storage?
If so, please be aware that you are sharing the data, and maybe write permission too...
Refresh your token!
Regards
On Fri, Aug 23, 2019 at 08:33, Krishna Chandran Nair <kcn...@qatarairways.com.qa> wrote:
> Hi Team,
Hi
Did you try using the image built by Mesosphere?
I am not sure they already build the 2.4 / 3.2 combo, but they provide a
project on GitHub that can be used to generate your own custom combo. It is
named mesosphere/spark-build.
Regards
On Thu, Jul 4, 2019 at 19:13, José Luis Pedrosa wrote:
>
Hi,
I am looking for a setup that would let me split a single Spark
processing pipeline into 2 jobs (operational constraints) without wasting too much
time persisting the data between the two jobs during Spark
checkpoints/writes.
I have a config with a lot of RAM and I'm willing to configure a