Thank you for the help Mich :)
I have not started with a pandas DF. I used pandas to create a dummy
.csv, which I dump to disk, and that file is what I intend to use to
showcase my pain point. Providing the pandas code was to ensure an
end-to-end runnable example and to reduce the effort for anyone trying to
reproduce it.
You have started with a pandas DF, which won't scale outside of the driver
itself.
Let us put that aside.
df1.to_csv("./df1.csv", index_label="index")  ## write the dataframe to the underlying file system

Starting with Spark:
df1 = spark.read.csv("./df1.csv", header=True, schema=schema)  ## read
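For completeness, a minimal runnable sketch of that setup (the column names and values below are made up for illustration; `spark` and `schema` are assumed to already exist in your session, so the Spark line is left as a comment):

```python
import pandas as pd

# Hypothetical dummy data standing in for df1 in this thread
df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write the dataframe to the underlying (local) file system
df1.to_csv("./df1.csv", index_label="index")

# From here Spark takes over; `spark` and `schema` come from your own session:
# df1 = spark.read.csv("./df1.csv", header=True, schema=schema)
```

The point of the pandas step is only to produce a small CSV on disk so that the Spark behaviour can be reproduced end to end.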
Thank you for your response, Sir.
My understanding is that the final ```df3.count()``` is the only action in
the code I attached. In fact, I tried running the rest of the code
(commenting out just the final df3.count()) and, as I expected, no
computations were triggered.
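The lazy-evaluation behaviour described above can be mimicked in plain Python with generators (this is only an analogy, not Spark itself): transformations build up a plan, and nothing actually runs until an "action" forces evaluation.

```python
# Pure-Python analogy for Spark's lazy evaluation (not Spark itself):
# map() builds a lazy plan; sum() plays the role of an action like count().
log = []

def traced_double(x):
    log.append(x)       # record that the element was actually processed
    return x * 2

plan = map(traced_double, range(5))  # "transformation": nothing executed yet
assert log == []                     # no computation has been triggered
result = sum(plan)                   # "action": forces the whole pipeline
assert result == 20
assert log == [0, 1, 2, 3, 4]        # elements processed only at action time
```

This mirrors why commenting out the single action in the attached code leaves all the preceding transformations untriggered.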
On Sun, 7 May, 2023,
... However, in my case here I am calling just one action ...
OK, which line in your code calls that one action?
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
View my LinkedIn profile
@Vikas Kumar
I am sorry, but I thought that you had answered the other question that I
had raised to the same email address yesterday. It was about the SQL tab
in the web UI and the output of .explain showing different plans.
I get how, using .cache, I can ensure that the data from a particular
DataFrame is kept in memory and not recomputed.
When your memory is not sufficient to keep the cached data for your jobs in two
different stages, it might be read twice, because Spark might have to clear the
previous cache for other jobs. In those cases, a spill may be triggered when
Spark writes your data from memory to disk.
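The eviction behaviour described here can be illustrated with a small pure-Python analogy using `functools.lru_cache` (again, only an analogy, not Spark): when the cache is too small for everything the jobs touch, an entry gets evicted and has to be recomputed. On the Spark side, one common mitigation is to persist with `StorageLevel.MEMORY_AND_DISK` rather than relying on the memory-only default; the Spark lines are left as comments since they need a live session.

```python
from functools import lru_cache

recomputations = []

@lru_cache(maxsize=2)            # tiny cache, analogous to limited executor memory
def load_partition(part):
    recomputations.append(part)  # track every time real work actually happens
    return part * 10

load_partition(1)                # computed, cached
load_partition(2)                # computed, cached (cache is now full)
load_partition(3)                # computed; evicts partition 1 (least recently used)
load_partition(1)                # cache miss: recomputed from scratch
assert recomputations == [1, 2, 3, 1]

# In Spark, the analogous mitigation (assuming a running session) would be e.g.:
# from pyspark import StorageLevel
# df3.persist(StorageLevel.MEMORY_AND_DISK)
```

With MEMORY_AND_DISK, evicted blocks are spilled to local disk instead of being dropped outright, so they are re-read rather than recomputed.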
One way to