[no subject]

2020-12-08 Thread Владимир Курятков
unsubscribe


Unsubscribe

2020-12-08 Thread David Zhou
Unsubscribe


Re:

2020-12-08 Thread William Shiel
Unsubscribe

On Tue, Dec 8, 2020 at 8:50 AM rahul c  wrote:

> Unsubscribe
>


[no subject]

2020-12-08 Thread rahul c
Unsubscribe


unsubscribe

2020-12-08 Thread ????????
unsubscribe

RE: Is there any inplict RDD cache operation for query optimizations?

2020-12-08 Thread Theodoros Gkountouvas
I think Spark leaves cache management to users because they can do it much 
more effectively than an automated approach. It is very difficult to find a 
caching strategy that fits the needs of all users. Also, although the boundary 
between execution and cache memory on executors is soft, Spark by default does 
not want to fill the cache space with unnecessary intermediate data and shrink 
the execution space for no reason.
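
For illustration, here is a minimal sketch of user-managed caching, assuming a 
local Spark session. The object name, the query, and the chosen storage level 
are hypothetical; only persist, storageLevel and unpersist are standard Dataset 
calls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ExplicitCacheDemo {
  // Persists a small Dataset, runs two actions on it, and returns whether a
  // storage level was set, together with the two counts.
  def run(spark: SparkSession): (Boolean, Long, Long) = {
    val evens = spark.range(0, 1000).filter("id % 2 = 0")

    // Nothing is cached until the user explicitly asks for it.
    evens.persist(StorageLevel.MEMORY_AND_DISK)
    val marked = evens.storageLevel != StorageLevel.NONE

    val first = evens.count()  // first action materializes the cached data
    val second = evens.count() // later actions can read from the cache

    // Release the cache space once the data is no longer needed.
    evens.unpersist()
    (marked, first, second)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-cache-sketch")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

The point of the sketch is that the user decides both when data enters the 
cache and when it leaves it.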

Some things are implicitly cached, though (e.g. shuffle files on disk), and 
you can avoid re-executing them if you reuse them.
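
A small sketch of the shuffle case, assuming a local Spark session; the object 
name and the data are made up for illustration. When the same shuffled RDD is 
used by a second action, the map stage typically shows up as skipped in the 
Spark UI because its shuffle output is still available.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleReuseDemo {
  // Runs the same action twice on an RDD that sits behind a shuffle.
  def run(spark: SparkSession): (Long, Long) = {
    val counts = spark.sparkContext
      .parallelize(1 to 1000)
      .map(i => (i % 10, 1))
      .reduceByKey(_ + _) // introduces a shuffle

    val first = counts.count()  // runs the map stage and writes shuffle files
    val second = counts.count() // the map stage can be skipped: shuffle output is reused
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("shuffle-reuse-sketch")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```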

To answer your question directly, I am not aware of any Catalyst optimization 
that does what you want, but Spark allows custom optimizations in Catalyst and 
you can implement your own caching strategy if it fits your purposes (see 
below).

sparkSession.experimental.extraOptimizations ++= Seq(CacheRule)
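
A minimal sketch of what registering such a rule could look like, assuming 
Spark SQL is on the classpath. The CacheRule body here is a hypothetical 
placeholder: it only counts invocations and returns the plan unchanged, so it 
shows the extension point and not an actual caching strategy.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical placeholder rule. A real caching strategy would inspect and
// rewrite `plan` here; this version leaves it untouched.
object CacheRule extends Rule[LogicalPlan] {
  val invocations = new AtomicInteger(0)

  override def apply(plan: LogicalPlan): LogicalPlan = {
    invocations.incrementAndGet()
    plan
  }
}

object CacheRuleDemo {
  // Registers the rule, runs one query, and reports how often the rule ran.
  def run(sparkSession: SparkSession): Int = {
    // extraOptimizations is a Seq of rules, so append with ++= (or :+=).
    sparkSession.experimental.extraOptimizations ++= Seq(CacheRule)

    // Any query that goes through the optimizer now also runs CacheRule.
    sparkSession.range(10).filter("id % 2 = 0").count()
    CacheRule.invocations.get()
  }

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("cache-rule-sketch")
      .getOrCreate()
    println(s"CacheRule was invoked: ${run(sparkSession) > 0}")
    sparkSession.stop()
  }
}
```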

I hope this helps,
Theo.

-----Original Message-----
From: marcelo.amaral  
Sent: Tuesday, December 8, 2020 4:02 AM
To: dev@spark.apache.org
Subject: Is there any inplict RDD cache operation for query optimizations?

As the documentation says, the Cache Manager is only invoked when the user 
calls a caching (i.e. persist) function in the code. Given that, as far as I 
understand, unless cache/persist operations are explicitly called, the job's 
results (including inputs and intermediate ones) will never be stored for 
reuse.

I am wondering whether there exists any optimization of the query execution 
plan that applies an implicit cache mechanism without the cache/persist 
operation being called, or whether any other mechanism can implicitly invoke 
the cache in some other situation.

If I have understood correctly, is there any strong reason why the Catalyst 
Optimizer does not apply a cache mechanism to the intermediate results between 
jobs?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Is there any inplict RDD cache operation for query optimizations?

2020-12-08 Thread marcelo.amaral
As the documentation says, the Cache Manager is only invoked when the user
calls a caching (i.e. persist) function in the code. Given that, as far as I
understand, unless cache/persist operations are explicitly called, the job's
results (including inputs and intermediate ones) will never be stored for
reuse.

I am wondering whether there exists any optimization of the query execution
plan that applies an implicit cache mechanism without the cache/persist
operation being called, or whether any other mechanism can implicitly invoke
the cache in some other situation.

If I have understood correctly, is there any strong reason why the Catalyst
Optimizer does not apply a cache mechanism to the intermediate results between
jobs?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
