Hi Jia,

Thank you very much for your suggestion.

I tried to repartition both input datasets df1 and df2, but unfortunately it
did not help; my cluster always got stuck at the last 3 tasks out of 100-200
when running:
        SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)

However, if I do ST_Within(df2.geom, df1.geom), it does not get stuck, although
the query takes longer than ST_Contains(df1.geom, df2.geom), as expected.

So I have a couple of follow-up questions.

a) Any idea on how to troubleshoot the ST_Contains query? I would really like
to get it working since it will be faster. Also, what could be the reason that
ST_Within works but ST_Contains does not? My two input datasets are in nested
GeoParquet format and are read from Amazon S3.

b) This is not necessarily a Sedona-specific question. Using ST_Within, the
entire spatial query plus the output to S3 in GeoParquet format takes about
2.2 hours. Yet if I simply do a count() on the result DataFrame, it takes 3.2
hours. That does not make sense to me. Did I miss something basic?
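
For reference, this is roughly what I am comparing (a sketch; the output path
is a placeholder):

        result = spark.sql(
            "SELECT * FROM df1, df2 WHERE ST_Within(df2.geom, df1.geom)")
        # Run A: write the join result to S3 as GeoParquet -- ~2.2 hours
        result.write.format("geoparquet").save("s3://.../output/")  # placeholder path
        # Run B, in a separate run: only count the rows -- ~3.2 hours
        print(result.count())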

Thank you again!

Hanxi



-----Original Message-----
From: Jia Yu <ji...@apache.org> 
Sent: April 19, 2023 9:30 PM
To: dev@sedona.apache.org
Subject: Re: Sedona workers stuck in loop: DynamicIndexLookupJudgement: [xx, 
PID=xx] [Streaming shapes] Reached a milestone: xxxxxxx

Hi,

This is likely caused by skewed data. To address that:

(1) Try to increase the number of partitions in your two input DataFrames. For
example, df = df.repartition(1000) (a short PySpark sketch follows this list).
(2) Try to switch the sides of the spatial join; this may improve the join
performance (see the rule of thumb below).
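
A minimal PySpark sketch of (1), assuming your inputs are already loaded as
df1 and df2 (1000 partitions is just an example value to tune):

        # Repartition both inputs before the spatial join
        df1 = df1.repartition(1000)
        df2 = df2.repartition(1000)
        # Re-register the repartitioned DataFrames for Spark SQL
        df1.createOrReplaceTempView("df1")
        df2.createOrReplaceTempView("df2")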

Rule of thumb:
The spatial partitioning grids (which directly affect the load balance of the
workloads) should be built on the larger dataset in a spatial join. We call 
this dataset the dominant dataset.

In Sedona 1.3.1-incubating and earlier versions:

dominant dataset is df1:
SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)

dominant dataset is df2:
SELECT * FROM df1, df2 WHERE ST_CoveredBy(df2.geom, df1.geom)

In Sedona 1.4.0 and later:

dominant dataset is df1:
SELECT * FROM df1, df2 WHERE ST_Contains(df1.geom, df2.geom)

dominant dataset is df2:
SELECT * FROM df2, df1 WHERE ST_Contains(df1.geom, df2.geom)
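
For example, to make df2 the dominant dataset in 1.4.0 and later from PySpark
(a sketch; it assumes both DataFrames are registered as temp views named df1
and df2):

        from sedona.register import SedonaRegistrator

        # Register Sedona's spatial SQL functions with the Spark session
        SedonaRegistrator.registerAll(spark)
        # Listing df2 first makes it the dominant (spatially partitioned) side
        result = spark.sql(
            "SELECT * FROM df2, df1 WHERE ST_Contains(df1.geom, df2.geom)")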

Thanks,
Jia

On Wed, Apr 19, 2023 at 12:00 PM Zhang, Hanxi (ISED/ISDE) 
<hanxi.zh...@ised-isde.gc.ca> wrote:
>
> Hello Sedona community,
>
> I am running a geospatial Sedona cluster on Amazon EMR. Specifically, my 
> cluster is based on Spark 3.3.0 and Sedona 1.3.1-incubating.
>
> In my cluster there are 10 executor nodes, and each runs two executors
> based on my configuration. I am using this cluster to run a large
> ST_Contains join between two datasets in GeoParquet format.
>
> The issue I have been experiencing is that the vast majority of the executors
> complete their tasks within about 2 minutes, while one or two executors get
> stuck on the last few tasks. From the stderr logs on the Spark History Web
> UI, this is the final state of the problematic executors:
>
> 23/04/19 05:37:10 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1600001
> 23/04/19 06:20:08 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1700001
> 23/04/19 07:02:13 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1800001
> 23/04/19 07:45:25 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 1900001
> 23/04/19 08:29:24 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2000001
> 23/04/19 09:14:25 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2100001
> 23/04/19 09:56:18 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2200001
> 23/04/19 10:38:24 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2300001
> 23/04/19 11:21:53 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2400001
> 23/04/19 12:05:49 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2500001
> 23/04/19 12:53:34 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2600001
> 23/04/19 13:36:55 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2700001
> 23/04/19 14:19:18 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2800001
> 23/04/19 15:04:06 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 2900001
> 23/04/19 16:08:55 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 3000001
> 23/04/19 16:51:07 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 3100001
> 23/04/19 17:36:03 INFO DynamicIndexLookupJudgement: [73, PID=2] [Streaming shapes] Reached a milestone: 3200001
>
> I would greatly appreciate any comments on what the above logs indicate, and
> what needs to be done with the Spark SQL query, the input datasets, or the
> Spark/Sedona configuration to alleviate this issue. Thank you very much!