Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
Hi,

I am in a meeting, but you can look out for a setting that tells Spark how
many bytes to read from a file in one go.
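
For illustration, a minimal sketch assuming the setting meant here is
spark.sql.files.maxPartitionBytes (the cap on how many bytes of a file-based
source go into one input partition); the 256 MB value, app name and path are
placeholders only:

from pyspark.sql import SparkSession

# Assumption: the "bytes per read" setting referred to above is
# spark.sql.files.maxPartitionBytes (default 128 MB per input partition).
spark = (
    SparkSession.builder
    .appName("read-size-sketch")  # hypothetical app name
    .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))  # 256 MB, illustrative
    .getOrCreate()
)

# Smaller values give more (smaller) input partitions; larger values give fewer, bigger ones.
df = spark.read.option("header", True).csv("hdfs:///path/to/data.csv")  # hypothetical path
print(df.rdd.getNumPartitions())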

I use SQL, which I find far better, in case you are currently using dataframes.

As we still do not know which Spark version you are using: the query may be
running into skew, and there are different ways to manage that depending on
the Spark version.
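
To make the skew point concrete, here is a minimal salting sketch for the
groupBy query quoted below; it is illustrative only: the 16 salt buckets are
arbitrary, and the rvid/rate column names are simply taken from that query.

from pyspark.sql import functions as F

# Assumes an existing SparkSession and that df is the DataFrame from the
# query quoted further down the thread. Two-stage aggregation so a single
# hot "rvid" value is spread across several partitions.
salted = df.withColumn("salt", (F.rand() * 16).cast("int"))  # 16 buckets, arbitrary

partial = salted.groupBy("rvid", "salt").agg(
    F.avg("rate").alias("avg_rate"),
    F.count("rvid").alias("cnt"),
)

final = partial.groupBy("rvid").agg(
    (F.sum(F.col("avg_rate") * F.col("cnt")) / F.sum("cnt")).alias("avg(rate)"),  # weighted average
    F.sum("cnt").alias("count(rvid)"),
)
final.show()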



Thanks and Regards,
Gourav Sengupta

On Fri, Feb 11, 2022 at 11:09 AM frakass  wrote:

> Hello list
>
> I have imported the data into Spark and I found there is disk I/O on
> every node. The memory did not overflow.
>
> But this query is quite slow:
>
>  >>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()
>
>
> May I ask:
> 1. since I have 3 nodes (also known as 3 executors?), are there 3
> partitions for each job?
> 2. can I increase the number of partitions by hand to improve performance?
>
> Thanks
>
>
>
> On 2022/2/11 6:22, frakass wrote:
> >
> >
> > On 2022/2/11 6:16, Gourav Sengupta wrote:
> >> What is the source data (is it JSON, CSV, Parquet, etc)? Where are you
> >> reading it from (JDBC, file, etc)? What is the compression format (GZ,
> >> BZIP, etc)? What is the SPARK version that you are using?
> >
> > it's a well-formed CSV file (uncompressed) stored in HDFS.
> > Spark 3.2.0
> >
> > Thanks.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: data size exceeds the total ram

2022-02-11 Thread Mich Talebzadeh
check this

https://sparkbyexamples.com/spark/spark-partitioning-understanding/
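
As a quick sketch of the manual-partitioning angle in the question (the
partition count of 200 and the file path are illustrative placeholders, not
recommendations; an existing SparkSession `spark` is assumed, e.g. the
pyspark shell):

# Raise the partition count by hand before the wide aggregation.
df = spark.read.option("header", True).csv("hdfs:///path/to/data.csv")  # hypothetical path
print(df.rdd.getNumPartitions())  # what Spark chose from the file splits, usually far more than 3

df2 = df.repartition(200, "rvid")  # hash-partition on the grouping key; 200 is illustrative
spark.conf.set("spark.sql.shuffle.partitions", "200")  # partitions on the shuffle side of the groupBy
df2.groupBy("rvid").agg({'rate': 'avg', 'rvid': 'count'}).show()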



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 11 Feb 2022 at 11:09, frakass  wrote:

> Hello list
>
> I have imported the data into Spark and I found there is disk I/O on
> every node. The memory did not overflow.
>
> But this query is quite slow:
>
>  >>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()
>
>
> May I ask:
> 1. since I have 3 nodes (also known as 3 executors?), are there 3
> partitions for each job?
> 2. can I increase the number of partitions by hand to improve performance?
>
> Thanks
>
>
>
> On 2022/2/11 6:22, frakass wrote:
> >
> >
> > On 2022/2/11 6:16, Gourav Sengupta wrote:
> >> What is the source data (is it JSON, CSV, Parquet, etc)? Where are you
> >> reading it from (JDBC, file, etc)? What is the compression format (GZ,
> >> BZIP, etc)? What is the SPARK version that you are using?
> >
> > it's a well-formed CSV file (uncompressed) stored in HDFS.
> > Spark 3.2.0
> >
> > Thanks.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: data size exceeds the total ram

2022-02-11 Thread frakass

Hello list

I have imported the data into Spark and I found there is disk I/O on
every node. The memory did not overflow.


But this query is quite slow:

>>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()


May I ask:
1. since I have 3 nodes (also known as 3 executors?), are there 3
partitions for each job?

2. can I increase the number of partitions by hand to improve performance?

Thanks



On 2022/2/11 6:22, frakass wrote:



On 2022/2/11 6:16, Gourav Sengupta wrote:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you 
reading it from (JDBC, file, etc)? What is the compression format (GZ, 
BZIP, etc)? What is the SPARK version that you are using?


it's a well-formed CSV file (uncompressed) stored in HDFS.
Spark 3.2.0

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: data size exceeds the total ram

2022-02-11 Thread frakass




On 2022/2/11 6:16, Gourav Sengupta wrote:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you 
reading it from (JDBC, file, etc)? What is the compression format (GZ, 
BZIP, etc)? What is the SPARK version that you are using?


it's a well-formed CSV file (uncompressed) stored in HDFS.
Spark 3.2.0

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
Hi,

Just so that we understand the problem first:

What is the source data (is it JSON, CSV, Parquet, etc)? Where are you
reading it from (JDBC, file, etc)? What is the compression format (GZ,
BZIP, etc)? What is the SPARK version that you are using?


Thanks and Regards,
Gourav Sengupta

On Fri, Feb 11, 2022 at 9:39 AM Mich Talebzadeh 
wrote:

> Well, one experiment is worth many times more than asking a what-if scenario
> question.
>
>
>1. Try running it first to see how spark handles it
>    2. Go to the Spark GUI (on port 4040 by default) and look at the storage tab and see
>what it says
>3. Unless you explicitly persist the data, Spark will read the data
>using appropriate partitions given the memory size and cluster count. As
>long as there is sufficient disk space (not memory), Spark will handle
>    files larger than the available memory. However, if you persist more data
>    than fits in the available memory, you may get an Out of Memory error
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Feb 2022 at 09:23, frakass  wrote:
>
>> Hello
>>
>> I have three nodes with total memory 128G x 3 = 384GB
>> But the input data is about 1TB.
>> How can spark handle this case?
>>
>> Thanks.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: data size exceeds the total ram

2022-02-11 Thread Mich Talebzadeh
Well, one experiment is worth many times more than asking a what-if scenario
question.


   1. Try running it first to see how spark handles it
   2. Go to the Spark GUI (on port 4040 by default) and look at the storage tab and see
   what it says
   3. Unless you explicitly persist the data, Spark will read the data
   using appropriate partitions given the memory size and cluster count. As
   long as there is sufficient disk space (not memory), Spark will handle
   files larger than the available memory. However, if you persist more data
   than fits in the available memory, you may get an Out of Memory error (see
   the sketch just below)
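
A minimal sketch of the persist point in item 3, using the standard PySpark
storage levels (the file path is a placeholder and an existing SparkSession
`spark` is assumed):

from pyspark import StorageLevel

df = spark.read.option("header", True).csv("hdfs:///path/to/data.csv")  # hypothetical path

# MEMORY_ONLY caching of a dataset larger than cluster memory risks memory pressure
# and recomputation; MEMORY_AND_DISK spills partitions that do not fit to local disk.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # materialise the cache, then check the Storage tab in the Spark UI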

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 11 Feb 2022 at 09:23, frakass  wrote:

> Hello
>
> I have three nodes with total memory 128G x 3 = 384GB
> But the input data is about 1TB.
> How can spark handle this case?
>
> Thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


data size exceeds the total ram

2022-02-11 Thread frakass

Hello

I have three nodes with total memory 128 GB x 3 = 384 GB,
but the input data is about 1 TB.
How can Spark handle this case?

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org