Hi Jason,
Thanks for the reply. I think you overcame the jumbled records by sorting
(around here
https://github.com/ubiquibit-inc/sensor-failure/blob/20cda6c245022fb99685c1eec0878e0da4cc0ced/src/main/scala/com/ubiquibit/buoy/jobs/StationInterrupts.scala#L82
)
I also found that we can sort the dat
Hello,
I have a 110-node cluster where each executor has 50 GB of memory, and I
want to broadcast a 70 GB variable; each machine has 244 GB of memory. I am
having difficulty doing that. I was wondering: at what size is it unwise to
broadcast a variable? Is there a general rule of thumb?
-
The default is 10 MB. It depends on the memory available and on what the
network-transfer effects are going to be. You can set
spark.sql.autoBroadcastJoinThreshold to increase the threshold in the case of
Spark SQL. But you definitely shouldn't be broadcasting gigabytes.
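For Spark SQL joins, the threshold can be changed at runtime on an existing
session; a minimal sketch (the 100 MB value below is only an illustration, not
a recommendation, and assumes `spark` is an existing SparkSession):

// Raise the auto-broadcast-join threshold for Spark SQL.
// The value is in bytes: 104857600 = 100 MB; setting -1 disables
// automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 104857600)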
I am using spark.sparkContext.broadcast() to broadcast. Is it really true
that a 70 GB variable can't be broadcast even when our machines have 244 GB
of memory and a fast network?
--
You will probably need to do a couple of things. One, you will probably need
to increase the "spark.sql.broadcastTimeout" setting. As well, when you
broadcast a variable it gets replicated once per executor, not once per
machine, so you will need to increase your executor size and allow more
cores to
If you need to reduce the number of partitions, you could also try
df.coalesce.
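As a sketch (assuming df is an existing DataFrame), coalesce merges existing
partitions without a full shuffle, which is why it is preferred over
repartition when you are only reducing the partition count:

// Reduce to 5 partitions without a full shuffle. coalesce only merges
// existing partitions; use df.repartition(5) instead if you need the
// data rebalanced evenly across the new partitions.
val df5 = df.coalesce(5)
println(df5.rdd.getNumPartitions)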
On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerot...@gmail.com wrote
Have you tried something like this?
spark.conf.set("spark.sql.shuffle.partitions", "5" )
On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote:
H
What about a simple call to nanoTime?
val startTime = System.nanoTime()
// Spark work here
val endTime = System.nanoTime()
val duration = endTime - startTime // elapsed time in nanoseconds
println(duration)
count() recomputes the DataFrame, so it makes sense that it takes longer for
you.
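If the goal is to time only the initial load, one sketch (assuming df is an
already-defined DataFrame) is to cache it before counting, so that later
actions reuse the cached partitions instead of recomputing the lineage:

// The first count() triggers the read and populates the cache;
// subsequent actions are served from the cached partitions instead
// of recomputing the DataFrame from source, and are typically much
// faster.
val cached = df.cache()
cached.count() // materializes the cache
cached.count() // reuses the cache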
On Tue, 02 Apr 2019 07:06:30 -0700 kol
Have you tried looking at Spark GUI to see the time it takes to load from
HDFS?
The Spark UI by default runs on port 4040. However, you can set the port in
spark-submit:
${SPARK_HOME}/bin/spark-submit \
…...
--conf "spark.ui.port=<port>"
and access it through hostname:port
HTH
Dr Mich Talebzadeh
LinkedI