performance of IN clause

2018-10-17 Thread Jayesh Lalwani
Is there a significant difference in how an IN clause performs when compared to a JOIN? Let's say I have 2 tables, A and B. B has 50 million rows and A has 1 million. Will this query *Select * from A where join_key in (Select join_key from B)* *perform much worse than* * Select * from A* *INNER j
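
Before comparing performance, it may help to note that the two query shapes are not always equivalent. The sketch below is plain Python, not Spark, and says nothing about how Spark plans either query. It only illustrates the usual SQL semantics: an IN subquery keeps each row of A at most once, while an inner join emits one row per matching pair, so duplicate keys in B multiply the output. The function names and sample rows are hypothetical, and NULL handling is ignored.

```python
def in_subquery(a_rows, b_rows, key):
    """Rows of A whose key appears in B (semi-join semantics)."""
    b_keys = {row[key] for row in b_rows}
    return [row for row in a_rows if row[key] in b_keys]


def inner_join(a_rows, b_rows, key):
    """One (a, b) pair for every match, so duplicates in B repeat rows of A."""
    b_by_key = {}
    for b in b_rows:
        b_by_key.setdefault(b[key], []).append(b)
    return [(a, b) for a in a_rows for b in b_by_key.get(a[key], [])]


# Hypothetical sample data: key 1 appears twice in B.
table_a = [{"join_key": 1, "v": "x"}, {"join_key": 2, "v": "y"}]
table_b = [{"join_key": 1}, {"join_key": 1}, {"join_key": 3}]

semi_rows = in_subquery(table_a, table_b, "join_key")
join_rows = inner_join(table_a, table_b, "join_key")
print(len(semi_rows), len(join_rows))
```

If join_key is unique in B, both forms return the same rows of A, and the comparison becomes purely a question of execution plan.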

Re: [External Sender] Pitfalls of partitioning by host?

2018-08-28 Thread Jayesh Lalwani
If you group by the host that you have computed using the UDF, Spark is always going to shuffle your dataset, even if the end result is that all the new partitions look exactly like the old partitions, just placed on different nodes. Remember the hostname will probably hash differently than the p

Does row_number over a window cause a shuffle?

2018-08-03 Thread Jayesh Lalwani
I have some code that adds a column that contains a row_number over a window. It looks somewhat like this: val sortColumns: List[Column] = r.sortFields.map(sf => sf.map(col(_))).getOrElse(List(col(s"defaultSortCol"))) val partitionWindow = Window.partitionBy(s"groupByCol") val window = partitionWin

Re: [External Sender] re: streaming, batch / spark 2.2.1

2018-08-02 Thread Jayesh Lalwani
What is different between the 2 systems? If one system processes records faster than the other, simply because it does less processing, then you can expect the first system to have a higher throughput than the second. It's hard to say why one system has double the throughput of another without kno

Spark on Kubernetes: Kubernetes killing executors because of overallocation of memory

2018-08-02 Thread Jayesh Lalwani
We are running Spark 2.3 on a Kubernetes cluster. We have set the following Spark configuration options: "spark.executor.memory": "7g", "spark.driver.memory": "2g", "spark.memory.fraction": "0.75". What we see is a) In the Spark UI, 5G has been allocated to each executor, which makes sense
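
The 5G figure can be roughly reproduced with a little arithmetic. This is a sketch under assumptions that do not come from the thread: that Spark's unified memory manager sets aside 300 MiB of heap before applying spark.memory.fraction, and that the default per-executor memory overhead is the larger of 10% of executor memory and 384 MiB. The JVM usually reports slightly less usable heap than the configured value, so the number shown in the UI may be a bit lower than this estimate.

```python
MIB_PER_GIB = 1024

# Assumed Spark defaults, not stated in the thread.
RESERVED_MIB = 300        # heap set aside before the fraction is applied
OVERHEAD_FACTOR = 0.10    # default overhead as a share of executor memory
OVERHEAD_MIN_MIB = 384    # lower bound on the default overhead


def unified_memory_mib(heap_mib, fraction):
    """Execution + storage memory: (heap - reserved) * spark.memory.fraction."""
    return (heap_mib - RESERVED_MIB) * fraction


def pod_memory_mib(heap_mib):
    """Heap plus default overhead, i.e. what the container would need at minimum."""
    overhead = max(OVERHEAD_FACTOR * heap_mib, OVERHEAD_MIN_MIB)
    return heap_mib + overhead


executor_heap_mib = 7 * MIB_PER_GIB   # spark.executor.memory = 7g
unified = unified_memory_mib(executor_heap_mib, 0.75)
pod = pod_memory_mib(executor_heap_mib)

print(f"unified memory ~ {unified / MIB_PER_GIB:.2f} GiB")
print(f"heap + overhead ~ {pod / MIB_PER_GIB:.2f} GiB")
```

Under these assumptions the heap alone is only part of what the container consumes; off-heap usage beyond the overhead allowance is one common reason a pod gets killed for exceeding its memory limit.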