Hello list

I have imported the data into Spark and I see disk I/O on every node; memory is not overflowing.

But this query is quite slow:

>>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()
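
For context, here is how I am checking the current partitioning (a minimal sketch, assuming `spark` is the active SparkSession and `df` is the DataFrame above):

>>> # number of partitions the CSV was read into
>>> df.rdd.getNumPartitions()
>>> # number of partitions used on the shuffle side of the groupBy (default 200)
>>> spark.conf.get("spark.sql.shuffle.partitions")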


May I ask:
1. Since I have 3 nodes (also known as 3 executors?), are there 3 partitions for each job?
2. Can I increase the number of partitions by hand to improve performance (e.g. something like the sketch below)?
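
For question 2, this is the kind of thing I have in mind (a rough sketch only; the partition count 12 is an assumed example, not a recommendation):

>>> # raise the read-side partition count by hand (12 is an assumed example)
>>> df2 = df.repartition(12)
>>> # or tune the shuffle partitions used by the aggregation
>>> spark.conf.set("spark.sql.shuffle.partitions", "12")
>>> df2.groupBy("rvid").agg({'rate': 'avg', 'rvid': 'count'}).show()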

Thanks



On 2022/2/11 6:22, frakass wrote:


On 2022/2/11 6:16, Gourav Sengupta wrote:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the SPARK version that you are using?

It's a well-formed CSV file (uncompressed) stored in HDFS.
Spark 3.2.0.

Thanks.



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
