Hi all,
Please see below for a list of upcoming technical talks
on BigDL and Analytics Zoo (
https://github.com/intel-analytics/analytics-zoo/) in the coming weeks:
- Engineers from CERN will present a technical talk, "Deep Learning on
Apache Spark at CERN's Large Hadron Collider with
Hello there,
I am running a Java Spark application. Most of the modules write to a log
file (not the Spark log file). I can run the application either with
"java -jar" or with "spark-submit".
If I use "java -jar myApp.jar" the log file will be generated in the
directory $LOG_DIR or in a default
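If the application's own logging goes through Log4j 1.x (the stock logging backend in Spark 2.x), one hedged option is to control the file location from a custom log4j.properties and pass it to both launch modes. Everything below is illustrative, not from the original thread:

```properties
# Illustrative log4j.properties; LOG_DIR must be passed as a JVM system
# property (e.g. -DLOG_DIR=/var/log/myapp), since Log4j 1.x resolves
# ${...} against system properties, not environment variables.
log4j.rootLogger=INFO, app
log4j.appender.app=org.apache.log4j.FileAppender
log4j.appender.app.File=${LOG_DIR}/myApp.log
log4j.appender.app.layout=org.apache.log4j.PatternLayout
log4j.appender.app.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

With spark-submit the file can be supplied via --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties -DLOG_DIR=/var/log/myapp"; with "java -jar" the same -D flags go directly on the command line.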
saving/checkpointing would be preferable for a big data set because:
- the RDD gets saved to HDFS and the DAG gets truncated, so if some
partitions/executors fail it won't result in recomputing everything
- you don't use memory for caching, so the JVM heap can be smaller
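A minimal sketch of that checkpoint route (the paths and names here are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
val sc = spark.sparkContext

// Checkpoints must live on reliable storage (e.g. HDFS) so that lost
// executors can re-read partitions instead of recomputing the lineage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val bigRdd = sc.textFile("hdfs:///data/big")
  .map(line => (line.split(",")(0), line))

// Mark for checkpointing, then force materialization with an action;
// after this the DAG above bigRdd is truncated.
bigRdd.checkpoint()
bigRdd.count()
```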
Hi,
in my view, a good approach is to first persist your data with
StorageLevel.MEMORY_AND_DISK and then perform the join. This will accelerate
the computation because the data will be kept in memory and on your
local intermediate storage device.
--Iacovos
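That suggestion could look like the sketch below (bigRdd and otherRdd are illustrative pair RDDs, not names from the thread):

```scala
import org.apache.spark.storage.StorageLevel

// Partitions that fit stay in memory; the rest spill to local disk
// instead of being recomputed on reuse.
val cached = bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                 // materialize the cache once

val joined = cached.join(otherRdd)
```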
On 4/18/19 8:49 PM, Subash
Hi All,
I have a question about checkpointing versus persist/saving.
Say we have one RDD containing huge data:
1. We checkpoint it and perform the join
2. We persist it as StorageLevel.MEMORY_AND_DISK and perform the join
3. We save that intermediate RDD and perform the join (using the same RDD; saving
is to just
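Option 3 (save the intermediate RDD, then join against the re-read copy) might be sketched as follows; the path and element type are illustrative:

```scala
// Writing the RDD out and reading it back severs the original lineage,
// much like a checkpoint, at the cost of an explicit save/load step.
bigRdd.saveAsObjectFile("hdfs:///tmp/intermediate")
val reloaded = sc.objectFile[(String, String)]("hdfs:///tmp/intermediate")
val joined = reloaded.join(otherRdd)
```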
What is the size of the data? How much time does it need on HDFS and how much
on Oracle? How many partitions do you have on the Oracle side?
> Am 06.04.2019 um 16:59 schrieb Lian Jiang :
>
> Hi,
>
> My Spark job writes into an Oracle DB using:
> df.coalesce(10).write.format("jdbc").option("url", url)
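For context, a fuller version of such a JDBC write, using standard Spark JDBC data source options (the table name, credentials, and values below are illustrative):

```scala
df.coalesce(10)                        // caps concurrent Oracle connections at 10
  .write
  .format("jdbc")
  .option("url", url)                  // e.g. jdbc:oracle:thin:@host:1521/service
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("user", user)
  .option("password", password)
  .option("batchsize", 10000)          // rows per batched INSERT (default 1000)
  .mode("append")
  .save()
```

Raising batchsize and checking the partition/index layout on the Oracle side are the usual first knobs when HDFS writes are fast but JDBC writes are slow.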
Hi,
My job is writing ~10 partitions with insertInto. With the same input /
output data, the total duration of the job varies widely depending on
how many partitions the target table has.
Target table with 10s of partitions:
1 min 30 s
Target table with ~1k partitions:
13 min 0 s
It seems
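One way to narrow this down is to confirm that the slowdown comes from per-partition metadata work on the target table rather than from the data itself. A sketch of the write path, assuming a Hive-style partitioned target (the table name and settings are illustrative):

```scala
// Required for writing into partitioned tables by partition-column value
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

df.repartition(10)                 // ~10 output partitions, as in the job
  .write
  .insertInto("db.target_table")   // target must exist; its partition count
                                   // drives the metastore/listing overhead
```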
Dear all,
I'm looking into a case where, when a certain table is used in a broadcast
join, the query eventually fails with a remote block error.
First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely
10485760 bytes.
Then we proceed to run the query. In the SQL plan, we
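For reference, the join strategy Spark actually picked can be confirmed from the plan, and auto-broadcast can be disabled as a workaround if the remote block fetch keeps failing (the table names here are illustrative):

```scala
// 10 MB threshold, as configured above
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)

val q = small.join(big, Seq("id"))
q.explain()   // look for BroadcastHashJoin vs SortMergeJoin in the output

// Setting the threshold to -1 disables auto-broadcast entirely, forcing
// a sort-merge join as a fallback:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```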