Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
Thank you for the help Mich :) I have not started with a pandas DF. I have used pandas to create a dummy .csv which I dump on the disk that I intend to use to showcase my pain point. Providing pandas code was to ensure an end-to-end runnable example is provided and the effort on anyone trying to

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
You have started with a pandas DF, which won't scale outside of the driver itself. Let us put that aside.
df1.to_csv("./df1.csv",index_label = "index") ## write the dataframe to the underlying file system
starting with spark
df1 = spark.read.csv("./df1.csv", header=True, schema = schema) ## read

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
Thank you for your response, Sir. My understanding is that the final ```df3.count()``` is the only action in the code I have attached. In fact, I tried running the rest of the code (commenting out just the final ```df3.count()```) and, as I expected, no computations were triggered. On Sun, 7 May, 2023,

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
...However, in my case here I am calling just one action... OK, which line in your code do you regard as that one action? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead, Palantir Technologies Limited, London, United Kingdom

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
@Vikas Kumar I am sorry but I thought that you had answered the other question that I had raised to the same email address yesterday. It was around the SQL tab in web UI and the output of .explain showing different plans. I get how using .cache I can ensure that the data from a particular

unsubscribe

2023-05-07 Thread Utkarsh Jain

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Winston Lai
When your memory is not sufficient to keep the cached data for your jobs across two different stages, the data might be read twice, because Spark might have to clear the previous cache to make room for other jobs. In those cases, a spill may be triggered when Spark writes your data from memory to disk. One way to
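
The question in this thread (whether a source is read once or twice when two branches of a plan share the same DataFrame) can be illustrated with a toy model. The sketch below is plain Python, not Spark: `LazyFrame`, `read_source`, and `run` are hypothetical names invented for illustration, and the model ignores memory pressure, eviction, and spilling entirely. It only shows that with lazy evaluation a single action can pull on the same source once per branch, and that marking the shared frame as cached reduces this to one read.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the rows
        self._use_cache = False
        self._cached = None

    def map(self, fn):
        # Transformation: builds a new lazy frame, computes nothing yet.
        return LazyFrame(lambda: [fn(row) for row in self.collect()])

    def union(self, other):
        # Transformation: also lazy.
        return LazyFrame(lambda: self.collect() + other.collect())

    def cache(self):
        # Only marks the frame; rows are stored on first materialization.
        self._use_cache = True
        return self

    def collect(self):
        # Action: walks the lineage and materializes the rows.
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

    def count(self):
        # Action: the only call in run() that triggers any computation.
        return len(self.collect())


reads = {"n": 0}  # counts how often the "file" is read


def read_source():
    reads["n"] += 1
    return [1, 2, 3]


def run(use_cache):
    reads["n"] = 0
    df1 = LazyFrame(read_source)
    if use_cache:
        df1.cache()
    df2 = df1.map(lambda x: x * 2)             # first branch over df1
    df3 = df2.union(df1.map(lambda x: x + 1))  # second branch over df1
    n = df3.count()                            # single action
    return n, reads["n"]


uncached = run(use_cache=False)  # (row count, number of source reads)
cached = run(use_cache=True)
print(uncached, cached)
```

In PySpark itself the corresponding calls would be `df1.cache()` or `df1.persist(StorageLevel.MEMORY_AND_DISK)` on the shared DataFrame before the action; whether the cached data then survives until the second branch needs it depends on available memory, as the message above notes.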