Is there a significant difference in how an IN clause performs when
compared to a JOIN?
Let's say I have 2 tables, A and B. B has 50 million rows and A has 1 million.
Will this query
*Select * from A where join_key in (Select join_key from B)*
*perform much worse than*
* Select * from A*
*INNER j
If you group by the host that you have computed using the UDF, Spark is
always going to shuffle your dataset, even if the end result is that all
the new partitions look exactly like the old partitions, just placed on
different nodes. Remember the hostname will probably hash differently
than the p
I have some code that adds a column that contains a row_number over a
window. It looks somewhat like this:
val sortColumns: List[Column] = r.sortFields.map(sf =>
sf.map(col(_))).getOrElse(List(col(s"defaultSortCol")))
val partitionWindow = Window.partitionBy(s"groupByCol")
val window = partitionWin
What is different between the 2 systems? If one system processes records
faster than the other, simply because it does less processing, then you can
expect the first system to have a higher throughput than the second. It's
hard to say why one system has double the throughput of another without
kno
We are running Spark 2.3 on a Kubernetes cluster. We have set the following
Spark configuration options:
"spark.executor.memory": "7g",
"spark.driver.memory": "2g",
"spark.memory.fraction": "0.75"
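As a rough check on what these settings imply for the memory figure reported per executor, here is a minimal sketch of the arithmetic. It assumes Spark's unified memory manager sets aside 300 MiB of the heap before applying spark.memory.fraction; the object and method names are hypothetical and only for illustration.

```scala
// Rough estimate of Spark's unified (execution + storage) memory region.
// Assumption: 300 MiB of heap is reserved before spark.memory.fraction applies.
object ExecutorMemoryEstimate {
  val ReservedMiB: Double = 300.0

  /** Unified memory in MiB for a given executor heap (MiB) and memory fraction. */
  def unifiedMiB(heapMiB: Double, memoryFraction: Double): Double =
    (heapMiB - ReservedMiB) * memoryFraction

  def main(args: Array[String]): Unit = {
    val heapMiB  = 7 * 1024.0   // spark.executor.memory = 7g
    val fraction = 0.75         // spark.memory.fraction
    val unified  = unifiedMiB(heapMiB, fraction)
    println(f"unified memory: $unified%.0f MiB (${unified / 1024}%.2f GiB)")
  }
}
```

With the values above this works out to (7168 - 300) * 0.75 = 5151 MiB, or roughly 5 GiB. The JVM usually reports a little less usable heap than the configured value, so the number shown in the UI can be slightly lower than this estimate.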
What we see is
a) In the Spark UI, 5G has been allocated to each executor, which makes
sense