Hi,

I am in a meeting, but you can look for the setting that tells Spark how
many bytes to read from a file in one go.
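
If I remember correctly, the setting is spark.sql.files.maxPartitionBytes,
which controls how many bytes Spark packs into a single input partition when
reading a file. A rough sketch (the HDFS path and header option are only
placeholders for your actual read):

    # use smaller input partitions so the read fans out into more tasks
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB
    df = spark.read.option("header", True).csv("hdfs:///path/to/your.csv")
    print(df.rdd.getNumPartitions())  # check how many input partitions you got

Note that for a splittable file like uncompressed CSV, the number of input
partitions depends on this setting and the file size, not on the number of nodes.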

I use SQL, which I find far better than the dataframe API for this kind of work.
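
The same aggregation you posted can be written against a temp view, for
example (the view name is arbitrary; rvid and rate are the columns from your
query):

    df.createOrReplaceTempView("reviews")
    spark.sql("""
        SELECT rvid, COUNT(rvid) AS cnt, AVG(rate) AS avg_rate
        FROM reviews
        GROUP BY rvid
    """).show()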

As we still do not know which Spark version you are using, it is hard to be
specific: the slowness may be caused by data skew, and there are different
ways to manage that depending on the Spark version.
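
If you are on Spark 3.x, Adaptive Query Execution is worth checking; it can
coalesce small shuffle partitions and split skewed ones. A minimal sketch
(these are standard configs, but whether they help depends entirely on your data):

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")   # mainly helps skewed joins
    spark.conf.set("spark.sql.shuffle.partitions", "200")           # shuffle parallelism for the groupBy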



Thanks and Regards,
Gourav Sengupta

On Fri, Feb 11, 2022 at 11:09 AM frakass <capitnfrak...@free.fr> wrote:

> Hello list
>
> I have imported the data into Spark and I found there is disk IO on
> every node. The memory didn't overflow.
>
> But this query is quite slow:
>
>  >>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()
>
>
> May I ask:
> 1. since I have 3 nodes (also known as 3 executors?), are there 3
> partitions for each job?
> 2. can I increase the number of partitions by hand to improve performance?
>
> Thanks
>
>
>
> On 2022/2/11 6:22, frakass wrote:
> >
> >
> > On 2022/2/11 6:16, Gourav Sengupta wrote:
> >> What is the source data (is it JSON, CSV, Parquet, etc)? Where are you
> >> reading it from (JDBC, file, etc)? What is the compression format (GZ,
> >> BZIP, etc)? What is the SPARK version that you are using?
> >
> > it's a well-formed CSV file (not compressed) stored in HDFS.
> > spark 3.2.0
> >
> > Thanks.
> >
>
>
