Hello list
I have imported the data into Spark and I can see disk I/O on every
node; memory is not overflowing.
But a query like this is quite slow:
>>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()
May I ask:
1. Since I have 3 nodes (i.e., 3 executors?), does each job get 3
partitions?
2. Can I increase the number of partitions manually to improve
performance?
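On question 1: for an uncompressed CSV on HDFS, the number of input partitions is usually driven by the file's split size (128 MB by default in most setups), not by the executor count, so 3 nodes does not imply 3 partitions. A rough back-of-the-envelope estimate (a sketch, assuming the default 128 MB split size; the helper name is made up for illustration):

```python
import math

def estimated_input_partitions(file_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # Spark splits a splittable (uncompressed) file on HDFS roughly at
    # split boundaries, so partitions ~= ceil(file_size / split_size).
    return max(1, math.ceil(file_size_bytes / split_size_bytes))

# e.g. a 1 GiB uncompressed CSV -> about 8 input partitions
print(estimated_input_partitions(1024**3))  # -> 8
```

On question 2: yes, you can check the current count with df.rdd.getNumPartitions() and change it with df.repartition(n); the shuffle side of groupBy is governed separately by spark.sql.shuffle.partitions.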
Thanks
On 2022/2/11 6:22, frakass wrote:
On 2022/2/11 6:16, Gourav Sengupta wrote:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you
reading it from (JDBC, file, etc)? What is the compression format (GZ,
BZIP, etc)? What is the SPARK version that you are using?
It's a well-formed CSV file (not compressed) stored in HDFS.
Spark 3.2.0.
Thanks.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org