Re: Structured streaming flatMapGroupsWithState results in out-of-order messages when reading from Kafka

2019-04-10 Thread Akila Wajirasena
Hi Jason, Thanks for the reply. I think you overcame the jumbled records by sorting (around here: https://github.com/ubiquibit-inc/sensor-failure/blob/20cda6c245022fb99685c1eec0878e0da4cc0ced/src/main/scala/com/ubiquibit/buoy/jobs/StationInterrupts.scala#L82). I also found that we can sort the data
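
A minimal sketch of that approach, not the linked project's actual code (the Reading and StationState types and their fields are illustrative): sort each group's buffered events by event time inside the state-update function passed to flatMapGroupsWithState, since Kafka only guarantees ordering within a single partition.

    import java.sql.Timestamp
    import org.apache.spark.sql.streaming.GroupState

    case class Reading(stationId: String, eventTime: Timestamp, value: Double)
    case class StationState(lastSeen: Timestamp)

    def updateStation(
        stationId: String,
        readings: Iterator[Reading],
        state: GroupState[StationState]): Iterator[Reading] = {
      // Buffer and sort this micro-batch's events before touching the state,
      // so downstream logic sees them in event-time order.
      val ordered = readings.toSeq.sortBy(_.eventTime.getTime)
      ordered.lastOption.foreach(r => state.update(StationState(r.eventTime)))
      ordered.iterator
    }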

Unable to broadcast a very large variable

2019-04-10 Thread V0lleyBallJunki3
Hello, I have a 110-node cluster where each executor has 50 GB of memory, and I want to broadcast a 70 GB variable; each machine has 244 GB of memory. I am having difficulty doing that. I was wondering at what size it becomes unwise to broadcast a variable. Is there a general rule of thumb?

Re: Unable to broadcast a very large variable

2019-04-10 Thread Ashic Mahtab
The default is 10 MB. It depends on the memory available and what the network-transfer effects are going to be. You can raise spark.sql.autoBroadcastJoinThreshold to increase the threshold for Spark SQL broadcast joins. But you definitely shouldn't be broadcasting gigabytes.
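
For reference, a hedged example of raising that threshold (the 100 MB value is illustrative, and setting it to -1 disables automatic broadcast joins entirely):

    // spark is an active SparkSession; the threshold is in bytes.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)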

Re: Unable to broadcast a very large variable

2019-04-10 Thread V0lleyBallJunki3
I am using spark.sparkContext.broadcast() to broadcast. Is it really true that a 70 GB variable can't be broadcast even when each machine has 244 GB of memory and a fast network?
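
For context, a minimal sketch of this broadcast style (the lookup map is a hypothetical stand-in for the 70 GB value): every executor receives a full copy of the broadcast value, and the driver must also hold and serialize it, which is why executor and driver memory both matter here.

    // df: an existing DataFrame whose first column is a String key.
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2) // stand-in value
    val bc = spark.sparkContext.broadcast(lookup)
    val resolved = df.rdd.map(row => bc.value.getOrElse(row.getString(0), 0))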

Re: Unable to broadcast a very large variable

2019-04-10 Thread Dillon Dukek
You will probably need to do a couple of things. First, you will likely need to increase the "spark.sql.broadcastTimeout" setting. Also, when you broadcast a variable it gets replicated once per executor, not once per machine, so you will need to increase your executor size and allow more cores to
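
A sketch of the first suggestion (the value is illustrative; the default is 300 seconds):

    // Value is in seconds; large broadcasts can easily exceed the default.
    spark.conf.set("spark.sql.broadcastTimeout", "1200")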

Re: Unable to broadcast a very large variable

2019-04-10 Thread Siddharth Reddy
unsubscribe

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-10 Thread yeikel valdes
If you need to reduce the number of partitions you could also try df.coalesce. On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerot...@gmail.com wrote: Have you tried something like this? spark.conf.set("spark.sql.shuffle.partitions", "5") On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote: H
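
A short sketch contrasting the two suggestions (df and the partition counts are illustrative): coalesce reduces an existing DataFrame's partitions without a full shuffle, while spark.sql.shuffle.partitions controls how many partitions future shuffles produce.

    spark.conf.set("spark.sql.shuffle.partitions", "5") // affects subsequent shuffles
    val fewer = df.coalesce(5)                          // narrow repartitioning of df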

Re: Load Time from HDFS

2019-04-10 Thread yeikel valdes
What about a simple call to nanoTime? val startTime = System.nanoTime(); /* Spark work here */ val endTime = System.nanoTime(); val duration = endTime - startTime; println(duration). count() recomputes the DataFrame, so it makes sense that it takes longer for you. On Tue, 02 Apr 2019 07:06:30 -0700 kol
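
Expanded into a self-contained Scala sketch (the HDFS path is hypothetical; foreach forces the load to actually execute without count()'s extra recomputation):

    val startTime = System.nanoTime()
    val df = spark.read.parquet("hdfs:///path/to/data") // hypothetical input
    df.foreach(_ => ())                                 // materialize the DataFrame
    val durationMs = (System.nanoTime() - startTime) / 1e6
    println(s"Load took $durationMs ms")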

Re: Load Time from HDFS

2019-04-10 Thread Mich Talebzadeh
Have you tried looking at the Spark GUI to see how long the load from HDFS takes? The Spark GUI runs on port 4040 by default. However, you can set it in spark-submit: ${SPARK_HOME}/bin/spark-submit ... --conf "spark.ui.port=" and access it through hostname:port. HTH Dr Mich Talebzadeh LinkedI
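
Equivalently, the UI port can be set when building the session (4050 here is an arbitrary example; 4040 remains the default):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hdfs-load-timing") // hypothetical app name
      .config("spark.ui.port", "4050")
      .getOrCreate()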