BigDL and Analytics Zoo talks at upcoming Spark+AI Summit and Strata London

2019-04-18 Thread Jason Dai
Hi all, Please see below a list of technical talks on BigDL and Analytics Zoo ( https://github.com/intel-analytics/analytics-zoo/) coming up in the next few weeks: - Engineers from CERN will present the technical talk "Deep Learning on Apache Spark at CERN's Large Hadron Collider" with

Spark-submit and no java log file generated

2019-04-18 Thread Mann Du
Hello there, I am running a Java Spark application. Most of the modules write to a log file (not the Spark log file). I can run the application either with "java -jar" or with "spark-submit". If I use "java -jar myApp.jar", the log file is generated in the directory $LOG_DIR or in a default
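A common workaround, as a minimal sketch: spark-submit ignores the log4j configuration bundled in your jar and uses Spark's own conf/log4j.properties, so you ship an explicit configuration and point both driver and executors at it. This assumes the application logs through log4j 1.x (Spark 2.x's bundled logging); the file path, appender name, and main class below are placeholders.

    # log4j.properties (hypothetical; shipped alongside the job)
    log4j.rootLogger=INFO, file
    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.File=/var/log/myApp/myApp.log
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

    spark-submit \
      --files log4j.properties \
      --driver-java-options "-Dlog4j.configuration=file:log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
      --class com.example.MyApp myApp.jar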

Re: Difference between Checkpointing and Persist

2019-04-18 Thread Vadim Semenov
Saving/checkpointing would be preferable in the case of a big data set because: - the RDD gets saved to HDFS and the DAG gets truncated, so if some partitions/executors fail it won't result in recomputing everything; - you don't use memory for caching, so the JVM heap is going to be smaller
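To make the lineage truncation concrete, a minimal sketch (paths and app name are hypothetical): checkpoint() must be called before the first action, and the checkpoint is written when that action runs.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")            // hypothetical HDFS path

    val rdd = sc.textFile("hdfs:///data/huge").map(_.length)  // hypothetical input
    rdd.checkpoint()   // marks the RDD; data is written on the next action
    rdd.count()        // materializes the checkpoint and truncates the lineage
    // later stages read the checkpoint files; the upstream DAG is gone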

Re: Difference between Checkpointing and Persist

2019-04-18 Thread Jack Kolokasis
Hi, from my point of view a good approach is to first persist your data at StorageLevel.MEMORY_AND_DISK and then perform the join. This will accelerate your computation because the data will be present in memory and on your local intermediate storage device. --Iacovos On 4/18/19 8:49 PM, Subash
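As a sketch of that suggestion (the two pair RDDs here are toy stand-ins for the real, large inputs, and `sc` is an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))      // stand-in for the big RDD
    val right = sc.parallelize(Seq(("a", "x"), ("b", "y")))
    val cached = left.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()                   // materialize the cache before the join
    val joined = cached.join(right)  // the join now reads cached/spilled blocks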

Difference between Checkpointing and Persist

2019-04-18 Thread Subash Prabakar
Hi All, I have a doubt about checkpointing and persisting/saving. Say we have one RDD containing huge data: 1. We checkpoint and perform the join. 2. We persist at StorageLevel.MEMORY_AND_DISK and perform the join. 3. We save that intermediate RDD and perform the join (using the same RDD - saving is to just
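For reference, option 3 as a sketch (hypothetical paths and RDD names, assuming an existing SparkContext `sc`): write the intermediate RDD out, read it back, and join the re-read copy, so the join's lineage starts at the saved files rather than at the original computation.

    intermediate.saveAsObjectFile("hdfs:///tmp/intermediate")        // hypothetical path
    val reloaded = sc.objectFile[(String, Int)]("hdfs:///tmp/intermediate")
    val joined = reloaded.join(other)  // `other` is a hypothetical pair RDD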

Re: writing into oracle database is very slow

2019-04-18 Thread Jörn Franke
What is the size of the data? How much time does it take on HDFS and how much on Oracle? How many partitions do you have on the Oracle side? > On 06.04.2019 at 16:59, Lian Jiang wrote: > > Hi, > > My Spark job writes into an Oracle DB using: > df.coalesce(10).write.format("jdbc").option("url", url)
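For context, a sketch of the JDBC write knobs that usually matter here (table name and credentials are placeholders; whether these help depends on the Oracle side):

    df.repartition(10)
      .write
      .format("jdbc")
      .option("url", url)                       // e.g. jdbc:oracle:thin:@//host:1521/svc
      .option("dbtable", "MY_SCHEMA.MY_TABLE")  // placeholder table name
      .option("user", user)
      .option("password", password)
      .option("batchsize", "10000")        // rows per JDBC batch; default is 1000
      .option("isolationLevel", "NONE")    // skip transaction overhead if acceptable
      .mode("append")
      .save()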

[Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-18 Thread Juho Autio
Hi, My job is writing ~10 partitions with insertInto. With the same input/output data, the total duration of the job varies greatly depending on how many partitions the target table has. Target table with 10 partitions: 1 min 30 s. Target table with ~1 partitions: 13 min 0 s. It seems
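One commonly suggested setting worth checking, as a sketch (assuming Spark 2.3+ and a placeholder table name): dynamic partition overwrite touches only the partitions present in the written data rather than the whole table.

    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df.write.mode("overwrite").insertInto("target_table")  // placeholder name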

autoBroadcastJoinThreshold not working as expected

2019-04-18 Thread Mike Chan
Dear all, I'm working on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error. First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760. Then we proceed to run the query. In the SQL plan, we
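For reproducing this, a minimal sketch (`large` and `small` are hypothetical DataFrames joined on a hypothetical "id" column): set the threshold in bytes, and inspect the physical plan to confirm which join strategy was chosen; an explicit broadcast() hint forces broadcasting regardless of the estimated size.

    import org.apache.spark.sql.functions.broadcast

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")  // 10 MB
    val joined = large.join(broadcast(small), Seq("id"))
    joined.explain()  // look for BroadcastHashJoin vs SortMergeJoin in the plan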