Does explode lead to more memory usage?

2020-01-18 Thread V0lleyBallJunki3
I am using a dataframe that has a structure like this:

root
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: double (nullable = true)
 |    |    |-- id: string (nullable = true)
 |-- user: string (nullable = true)
 |-- language: string
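To make the question concrete, a minimal sketch of what explode does on this schema (assuming the DataFrame is named df and an active SparkSession named spark): each element of the orders array becomes its own output row, with user and language copied onto every resulting row, so memory use grows with the fan-out.

    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    // One output row per array element; the non-exploded columns
    // (user, language) are duplicated onto every resulting row.
    val exploded = df.select($"user", $"language", explode($"orders").as("order"))
    exploded.select($"user", $"order.amount", $"order.id").show()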

Re: Can reduced parallelism lead to no shuffle spill?

2019-11-07 Thread V0lleyBallJunki3
I am just using the above example to understand how Spark handles partitions.

Can reduced parallelism lead to no shuffle spill?

2019-11-07 Thread V0lleyBallJunki3
Consider an example where I have a cluster with 5 nodes, each with 64 cores and 244 GB of memory. I decide to run 3 executors on each node, with executor-cores set to 21 and executor memory of 80 GB, so that each executor can run 21 tasks in parallel. Now consider that 315 (63 * 5)
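For reference, the configuration being described, expressed through the SparkSession builder (spark-submit flags would work equally well):

    import org.apache.spark.sql.SparkSession

    // 3 executors per node x 5 nodes = 15 executors, each with 21 cores
    // and 80 GB of memory, for 315 task slots in total.
    val spark = SparkSession.builder()
      .config("spark.executor.instances", "15")
      .config("spark.executor.cores", "21")
      .config("spark.executor.memory", "80g")
      .getOrCreate()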

spark.sql.autoBroadcastJoinThreshold not taking effect

2019-05-10 Thread V0lleyBallJunki3
Hello, I have set spark.sql.autoBroadcastJoinThreshold=1GB and I am running the Spark job. However, my application is failing with:

  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at
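For what it's worth, a sketch of setting the threshold from code; on older Spark versions the value must be a plain byte count rather than a "1GB" string, so the byte form is the safer spelling:

    // 1 GB expressed in bytes; setting -1 disables automatic broadcast joins.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1024L * 1024 * 1024)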

Re: Spark not doing a broadcast join in spite of the table being well below spark.sql.autoBroadcastJoinThreshold

2019-05-10 Thread V0lleyBallJunki3
So what I discovered was that if I write the table being joined to disk and then read it back, Spark correctly broadcasts it. I think it is because when Spark estimates the size of the smaller table, it incorrectly estimates it to be much bigger than it is, and hence decides to do a
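A sketch of that workaround (the Parquet path and join key are illustrative, not from the thread); materializing the table gives Spark exact file-based size statistics instead of an estimate:

    // Write the small side out and read it back so its size is known exactly.
    smallDF.write.mode("overwrite").parquet("/tmp/small_side")
    val smallReloaded = spark.read.parquet("/tmp/small_side")

    // With an accurate size below the threshold, Spark can choose a broadcast join.
    val joined = largeDF.join(smallReloaded, Seq("id"))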

Spark not doing a broadcast join in spite of the table being well below spark.sql.autoBroadcastJoinThreshold

2019-05-09 Thread V0lleyBallJunki3
I have a small table, well below 50 MB, that I want to broadcast join with a larger table. However, even if I set spark.sql.autoBroadcastJoinThreshold to 100 MB, Spark still decides to do a SortMergeJoin instead of a broadcast join. I have to set an explicit broadcast hint on the table for it to do the
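The explicit hint mentioned above looks like this (a minimal sketch; the DataFrame names and join key are placeholders):

    import org.apache.spark.sql.functions.broadcast

    // broadcast() overrides Spark's own size estimate and forces a
    // broadcast-hash join on the hinted side.
    val joined = largeDF.join(broadcast(smallDF), Seq("key"))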

Best notebook for developing Apache Spark programs in Scala on an Amazon EMR cluster

2019-04-30 Thread V0lleyBallJunki3
Hello. I am using Zeppelin on an Amazon EMR cluster while developing Apache Spark programs in Scala. The problem is that once the cluster is destroyed I lose all the notebooks on it. So over a period of time I have a lot of notebooks that have to be manually exported to my local disk and from
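One commonly suggested remedy (a sketch, not from this thread; the bucket name is a placeholder) is to point Zeppelin's notebook storage at S3 in zeppelin-env.sh so notebooks survive cluster teardown:

    export ZEPPELIN_NOTEBOOK_S3_BUCKET=my-notebook-bucket
    export ZEPPELIN_NOTEBOOK_S3_USER=hadoop
    export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo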

Re: Unable to broadcast a very large variable

2019-04-11 Thread V0lleyBallJunki3
I am not using PySpark. The job is written in Scala.

Re: Unable to broadcast a very large variable

2019-04-10 Thread V0lleyBallJunki3
I am using spark.sparkContext.broadcast() to broadcast. Is it really the case that, even with 244 GB of memory on each of our machines and a fast network, a 70 GB variable can't be broadcast?
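For context, the call in question in minimal form (assuming an existing SparkSession named spark; the tiny lookup map is a stand-in for the real 70 GB value):

    // Each executor receives one read-only copy of the broadcast value,
    // so the value must fit in executor memory alongside everything else.
    val lookup = Map("a" -> 1.0, "b" -> 2.0)
    val bc = spark.sparkContext.broadcast(lookup)
    val scored = spark.sparkContext.parallelize(Seq("a", "b", "c"))
      .map(k => k -> bc.value.getOrElse(k, 0.0))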

Unable to broadcast a very large variable

2019-04-10 Thread V0lleyBallJunki3
Hello, I have a 110-node cluster, with each executor having 50 GB of memory and each machine having 244 GB, and I want to broadcast a variable of 70 GB. I am having difficulty doing that. I was wondering at what size it becomes unwise to broadcast a variable. Is there a general rule of thumb?

Re: Any way to see the size of the broadcast variable?

2018-10-09 Thread V0lleyBallJunki3
Yes, each of the executors has 60 GB.

Any way to see the size of the broadcast variable?

2018-10-09 Thread V0lleyBallJunki3
Hello, I have set the value of spark.sql.autoBroadcastJoinThreshold to a very high value of 20 GB. I am joining a table that I am sure is below this threshold, however Spark does a SortMergeJoin. If I set a broadcast hint then Spark does a broadcast join and the job finishes much faster.
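Two quick ways to investigate (a sketch with placeholder names joined and smallDF; the parameterless stats method assumes Spark 2.3 or later):

    // Look for BroadcastHashJoin vs. SortMergeJoin in the physical plan.
    joined.explain()

    // The size (in bytes) that Spark itself attributes to the small table.
    println(smallDF.queryExecution.optimizedPlan.stats.sizeInBytes)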

Set can be passed in as an input argument but not as an output

2018-09-03 Thread V0lleyBallJunki3
I find that if the input is a Set then Spark doesn't try to find an encoder for it, but if the output of a method is a Set it does try to find an encoder and, if none is found, errors out. My understanding is that even the input Set has to be transferred to the executors? The first
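A common workaround (a sketch, not taken from this thread) is to supply a Kryo-based encoder for the unsupported type; the column is then stored as an opaque binary blob, but Dataset operations compile and run:

    import org.apache.spark.sql.{Encoder, Encoders}

    // Fallback encoder for Set[String], serialized via Kryo.
    implicit val setStringEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]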

java.lang.UnsupportedOperationException: No Encoder found for Set[String]

2018-08-15 Thread V0lleyBallJunki3
Hello, I am using Spark 2.2.2 with Scala 2.11.8. I wrote a short program:

    val spark = SparkSession.builder().master("local[4]").getOrCreate()
    case class TestCC(i: Int, ss: Set[String])
    import spark.implicits._
    import spark.sqlContext.implicits._
    val testCCDS = Seq(TestCC(1, Set("SS", "Salil")),
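The snippet is cut off above. A hedged completion that reproduces the error (the second list element, the .toDS() call, and the object wrapper are assumptions; Spark 2.2 has no built-in encoder for Set):

    import org.apache.spark.sql.SparkSession

    object SetEncoderRepro {
      case class TestCC(i: Int, ss: Set[String])

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[4]").getOrCreate()
        import spark.implicits._

        // On Spark 2.2 this line throws:
        // java.lang.UnsupportedOperationException: No Encoder found for Set[String]
        val testCCDS = Seq(TestCC(1, Set("SS", "Salil")), TestCC(2, Set("x"))).toDS()
        testCCDS.show()
      }
    }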