What is the number of part files in that big table? And what is the
distribution of request_id — is the variance of that column small or huge?
The partitionBy clause in the window function will move all rows with the
same request_id to one executor, so if that data is huge it can put a heavy
load on a single executor.
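As a first step, it may help to measure how skewed request_id actually is.
A minimal sketch of such a check in Spark SQL (raw_table is a placeholder
name, not taken from your query):

    -- count rows per request_id in the 1.5TB source table (table name is a placeholder)
    SELECT request_id, COUNT(*) AS rows_per_request
    FROM raw_table
    GROUP BY request_id
    ORDER BY rows_per_request DESC
    LIMIT 20;

If a handful of request_id values account for a large share of the rows, the
row_number() window will funnel each of those values onto a single executor
during the shuffle.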

On Sun, 25 Aug 2019 at 16:56, Tzahi File <tzahi.f...@ironsrc.com> wrote:

> Hi,
>
> I encountered an issue running a Spark SQL query and would be happy to get
> some advice.
> I'm trying to run a query on a very big data set (around 1.5TB) and it
> keeps failing on all of my tries. A template of the query is below:
> insert overwrite table partition(part)
> select  /*+ BROADCAST(c) */
>  *, row_number() over (partition by request_id order by economic_value
> DESC) row_number
> from (
> select a,b,c,d,e
> from table (raw data 1.5TB))
> left join small_table
>
> The heavy part in this query is the window function.
> I'm using 65 spot instances of type 5.4x.large.
> The spark settings:
> --conf spark.driver.memory=10g
> --conf spark.sql.shuffle.partitions=1200
> --conf spark.executor.memory=22000M
> --conf spark.shuffle.service.enabled=false
>
>
> You can see below an example of the errors that I get:
> [image: image.png]
>
>
> Any suggestions?
>
>
>
> Thanks!
> Tzahi
>
