Hi,
Good question!
It depends a lot on your jobs and on your developer team.
The things that differ most, in my view, are:
1/ Data locality & fast reads
If your data are stored in an HDFS cluster (not HCFS) and your Spark
compute nodes are allowed to run on the Hadoop nodes, then definitely use
YARN to benefit from data locality.
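For what it's worth, a minimal sketch of what that buys you (the app name and
dataset path are illustrative): when the executors run on the HDFS datanodes,
the scheduler can place tasks next to their blocks.

```
import org.apache.spark.sql.SparkSession

// Assumes the job was submitted with --master yarn and the executors are
// co-located with the HDFS datanodes; the dataset path is hypothetical.
val spark = SparkSession.builder().appName("locality-check").getOrCreate()
val df = spark.read.parquet("hdfs:///data/events")
df.count() // the stage's tasks should show NODE_LOCAL in the Spark UI
```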
Hi,
Did you try without the broadcast?
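(If the broadcast in question is a broadcast join, here is a minimal sketch of
running without it; the dataframes are illustrative, and
spark.sql.autoBroadcastJoinThreshold is the standard switch:)

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("no-broadcast").getOrCreate()
// Disable automatic broadcast joins so Spark falls back to sort-merge joins:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// Illustrative join with no broadcast() hint on either side:
val facts = spark.range(1000000L).toDF("id")
val dims = spark.range(100L).toDF("id")
facts.join(dims, "id").count()
```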
Regards
JL
On Thu, Sep 19, 2019 at 06:41, Vadim Semenov
wrote:
> Pre-register your classes:
>
> ```
> import com.esotericsoftware.kryo.Kryo
> import org.apache.spark.serializer.KryoRegistrator
>
> class MyKryoRegistrator extends KryoRegistrator {
>   // register the classes your job serializes most (illustrative example)
>   override def registerClasses(kryo: Kryo): Unit = {
>     kryo.register(classOf[Array[Byte]])
>   }
> }
> ```
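>
> Then point Spark at the registrator (a minimal sketch; the config keys are
> standard Spark settings):
>
> ```
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)
>   // optional: fail fast when an unregistered class gets serialized
>   .config("spark.kryo.registrationRequired", "true")
>   .getOrCreate()
> ```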
Hi Tomas,
Parquet tuning time!
I strongly recommend reading the CERN write-ups on Spark/Parquet tuning:
https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
You should check the size of the row groups in your Parquet files and
maybe tweak it a little.
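For a quick experiment, a minimal sketch (the 256 MB value and the paths are
illustrative; parquet.block.size is the Hadoop-side knob for the row-group
size):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-tuning").getOrCreate()
// Row-group size is read from parquet.block.size (bytes) at write time:
spark.sparkContext.hadoopConfiguration
  .setInt("parquet.block.size", 256 * 1024 * 1024) // 256 MB row groups
val df = spark.read.parquet("hdfs:///data/events") // hypothetical input
df.write.parquet("hdfs:///tmp/events_tuned")       // hypothetical output
```

You can then inspect the resulting row groups with parquet-tools meta <file>.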
Hi,
Did you just publicly disclose a real token to your blob storage?
If so, please be aware that you have shared the data, and maybe write permissions too...
Refresh your token!
Regards
On Fri, Aug 23, 2019 at 08:33, Krishna Chandran Nair <
kcn...@qatarairways.com.qa> wrote:
> Hi Team,
Hi,
Did you try using the images built by Mesosphere?
I am not sure they already build the 2.4 / 3.2 combo, but they provide a
project on GitHub that can be used to generate your custom combo. It is
named mesosphere/spark-build.
Regards
On Thu, Jul 4, 2019 at 19:13, José Luis Pedrosa
wrote:
> Hi
Hi,
I am looking for a setup that would let me split a single Spark
processing pipeline into two jobs (an operational constraint) without wasting too much
time persisting the data between the two jobs during Spark
checkpoints/writes.
I have a config with a lot of RAM and I'm willing to configure a