Hey Enrico, it does help to understand it, thanks for explaining.
Regarding this comment:
> PySpark and Scala should behave identically here
Is it okay that the Scala and PySpark optimizations work differently in this case?
Tue, 5 Dec 2023 at 20:08, Enrico Minack:
Hi Michail,
With spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see
how Spark optimizes the query plan.
In PySpark, the plan is optimized into
Project ...
+- CollectMetrics 2, [count(1) AS count(1)#200L]
+- LocalTableScan , [col1#125, col2#126L, col3#127, col4#132L]
The en
Hi Michail,
Observations, like ordinary accumulators, only observe / process
rows that are iterated / consumed by downstream stages. If the query
plan decides to skip one side of the join, that side will be removed from
the final plan completely. Then, the Observation will not retrieve any
metrics.
Hey folks, I'm actively using the observe method in my Spark jobs and noticed
interesting behavior.
Here is an example of working and non-working code:
https://gist.github.com/Coola4kov/8aeeb05abd39794f8362a3cf1c66519c
In a few words, if I'm joining a dataframe after some filter rules and
it becomes empty,