Hello list

I have imported the data into Spark and I see disk I/O on every node; memory is not overflowing.

But this query is quite slow:

>>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()
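
For context, here is how I am checking the current partitioning (a minimal sketch, assuming `spark` is the active SparkSession and `df` is the DataFrame above):

>>> # number of partitions the CSV was read into
>>> df.rdd.getNumPartitions()
>>> # number of partitions used on the shuffle side of the groupBy (default 200)
>>> spark.conf.get("spark.sql.shuffle.partitions")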


May I ask:
1. Since I have 3 nodes (also known as 3 executors?), are there 3 partitions for each job?
2. Can I increase the number of partitions by hand to improve performance (e.g. something like the sketch below)?
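
For question 2, this is the kind of thing I have in mind (a rough sketch only; the partition count 12 is an assumed example, not a recommendation):

>>> # raise the read-side partition count by hand (12 is an assumed example)
>>> df2 = df.repartition(12)
>>> # or tune the shuffle partitions used by the aggregation
>>> spark.conf.set("spark.sql.shuffle.partitions", "12")
>>> df2.groupBy("rvid").agg({'rate': 'avg', 'rvid': 'count'}).show()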

Thanks



On 2022/2/11 6:22, frakass wrote:


On 2022/2/11 6:16, Gourav Sengupta wrote:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the SPARK version that you are using?

It's a well-formed CSV file (uncompressed) stored in HDFS.
Spark 3.2.0.

Thanks.



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
