Re: data size exceeds the total RAM

2022-02-11 Thread Gourav Sengupta
Hi, I am in a meeting, but you can look for a setting that tells Spark how many bytes to read from a file in one go. I use SQL, which I find far better, in case you are using DataFrames. As we still do not know which Spark version you are using, it may cause issues around skew, and…
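
A minimal sketch of such a setting, assuming the one meant here is spark.sql.files.maxPartitionBytes (an assumption; the message never names it). It caps how many bytes Spark packs into a single input partition when reading file-based sources:

    from pyspark.sql import SparkSession

    # Assumption: the setting alluded to is spark.sql.files.maxPartitionBytes,
    # which limits bytes per input partition for file sources (default 128 MB).
    spark = (SparkSession.builder
             .appName("read-large-csv")
             .config("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB splits
             .getOrCreate())

    df = spark.read.option("header", True).csv("/data/input.csv")  # hypothetical path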

Re: data size exceeds the total RAM

2022-02-11 Thread Mich Talebzadeh
Check this: https://sparkbyexamples.com/spark/spark-partitioning-understanding/
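
A minimal sketch of the repartitioning calls the linked article covers, assuming a DataFrame df is already loaded (the partition counts are illustrative):

    # Raise parallelism before a wide operation such as groupBy;
    # this hash-partitions the data on the grouping column.
    df = df.repartition(200, "rvid")

    # Shrink the partition count without a full shuffle, e.g. before writing.
    df = df.coalesce(50)

    # Inspect how many partitions the DataFrame currently has.
    print(df.rdd.getNumPartitions())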

Re: data size exceeds the total RAM

2022-02-11 Thread frakass
Hello list, I have imported the data into Spark and I see disk I/O on every node. Memory did not overflow, but this query is quite slow: >>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show() May I ask: 1. since I have 3 nodes (i.e., 3 executors?), are there…
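
The dictionary form of agg shown above is valid; an equivalent sketch with explicit functions and named output columns, assuming the same df with rvid and rate columns:

    from pyspark.sql import functions as F

    # Same aggregation, but with readable result column names.
    (df.groupBy("rvid")
       .agg(F.avg("rate").alias("avg_rate"),
            F.count("rvid").alias("cnt"))
       .show())

    # groupBy triggers a shuffle; spark.sql.shuffle.partitions (default 200)
    # controls the number of reduce-side partitions and can affect speed.
    spark.conf.set("spark.sql.shuffle.partitions", "300")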

Re: data size exceeds the total RAM

2022-02-11 Thread frakass
On 2022/2/11 6:16, Gourav Sengupta wrote: What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the Spark version that you are using? It's a well-formed CSV file (not compressed).

Re: data size exceeds the total RAM

2022-02-11 Thread Gourav Sengupta
Hi, just so that we understand the problem first: What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the Spark version that you are using? Thanks and Regards, Gourav Sengupta

Re: data size exceeds the total RAM

2022-02-11 Thread Mich Talebzadeh
Well, one experiment is worth many times more than asking what-if scenario questions.
1. Try running it first to see how Spark handles it.
2. Go to the Spark GUI (on port 4040) and look at the Storage tab to see what it says.
3. Unless you explicitly persist the data, Spark will read the… (see the sketch below)
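
A minimal sketch of point 3, assuming the data comes from a CSV file (the path is hypothetical). With MEMORY_AND_DISK, Spark caches the partitions that fit in RAM and spills the remainder to local disk instead of failing:

    from pyspark import StorageLevel

    df = spark.read.option("header", True).csv("/data/input.csv")  # hypothetical path

    # Cache what fits in memory and spill the rest to disk; later actions
    # reuse the cached partitions instead of re-reading the source.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # an action to materialize the cache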

data size exceeds the total RAM

2022-02-11 Thread frakass
Hello, I have three nodes with total memory 128 GB x 3 = 384 GB, but the input data is about 1 TB. How can Spark handle this case? Thanks.
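
For context on why this works at all: Spark streams file sources partition by partition rather than loading the whole dataset, so a 1 TB input is split into roughly 8,000 tasks of ~128 MB each (per spark.sql.files.maxPartitionBytes) and only a handful are in memory at any moment. A minimal sketch, path hypothetical:

    # Each task reads one ~128 MB slice of the 1 TB file; the 384 GB of
    # cluster RAM never needs to hold the full dataset at once.
    df = spark.read.option("header", True).csv("/data/big.csv")
    print(df.count())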