This question is about using Spark with dplyr/sparklyr.
We load a lot of data from Oracle into dataframes through a JDBC connection:

dfX <- spark_read_jdbc(spConn, "myconnection",
            options = list(
                    url = urlDEVdb,
                    driver = "oracle.jdbc.OracleDriver",
                    user = dbt_schema,
                    password = dbt_password,
                    dbtable = pQuery,
                    memory = FALSE # don't cache the whole (big) table
            ))
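
For comparison, here is a minimal sketch of the same read with memory passed as the formal spark_read_jdbc argument instead of as an entry in the options list; we are not sure whether the placement matters:

dfX <- spark_read_jdbc(spConn, "myconnection",
            options = list(
                    url = urlDEVdb,
                    driver = "oracle.jdbc.OracleDriver",
                    user = dbt_schema,
                    password = dbt_password,
                    dbtable = pQuery
            ),
            memory = FALSE) # don't cache the whole (big) table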

Then we run a lot of SQL statements and use sdf_register to register the 
intermediate results. Eventually we want to write the final result back to a 
database.
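
To make that concrete, here is a minimal sketch of the kind of pipeline we run (the step names, columns and target table are illustrative, not our real ones):

library(sparklyr)
library(dplyr)

# each step is a Spark SQL query over the previously registered view
step1 <- sdf_sql(spConn, "SELECT col_a, col_b FROM myconnection WHERE col_a IS NOT NULL")
sdf_register(step1, "step1_view")   # register so the next query can refer to it

step2 <- sdf_sql(spConn, "SELECT col_a, count(*) AS n FROM step1_view GROUP BY col_a")
sdf_register(step2, "step2_view")

# write the final result back to the database over the same JDBC connection
spark_write_jdbc(step2, "TARGET_TABLE",
            options = list(
                    url = urlDEVdb,
                    driver = "oracle.jdbc.OracleDriver",
                    user = dbt_schema,
                    password = dbt_password
            ),
            mode = "overwrite")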

Although we have set memory = FALSE, we see all these tables getting cached. I 
notice that count jobs are triggered (I think this happens right before a table 
is cached) and a collect is triggered as well. We also think we see that 
registering tables with sdf_register triggers a collect action (it almost looks 
like these get cached too). This leads to a lot of actions, often on dataframes 
resulting from the same pipeline, which takes a long time.
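
For reference, a minimal sketch of how the cache status of a registered view can be checked (this assumes Spark 2.x, where the session catalog exposes isCached; "step1_view" is an illustrative view name):

# TRUE if the registered view is currently cached
spark_session(spConn) %>%
        invoke("catalog") %>%
        invoke("isCached", "step1_view")

# drop it from the cache if it is
tbl_uncache(spConn, "step1_view")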

Questions for people using dplyr/sparklyr + Spark:
1) Is it possible that memory = FALSE is ignored when reading through JDBC? 
2) Can someone confirm that a lot of automatic caching is happening (and hence 
a lot of counts and a lot of actions)?


Thanks for any input!

