Re: Two questions regarding Sedona performance tuning

2024-01-21 Thread Jia Yu
Hi Hanxi,

Thanks for your interest in Sedona. I suggest that you read my post in
Sedona Discord:
https://discord.com/channels/1034906732264181760/1038573531434328135/1038573531434328135

To join our discord, please read:
https://sedona.apache.org/1.5.1/community/contact/#discord-server

To your question:

1. GeoParquet filter pushdown makes NO contribution to the performance
of a spatial join, because filter pushdown only applies to "filter"
queries, not join queries. The only impact a GeoParquet file can make
on a join is probably the number of partitions.
2. Both the number of partitions and the choice of the dominant side
of the join have a significant impact on join performance. Currently
you need to find the best number of partitions yourself; Sedona tries
to pick a good value automatically, but it is not perfect. Sedona
1.5.1 changed the default spatial join index to R-tree, which may
speed up join query performance in some cases.
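For point 2, a reasonable starting value for the partition count can be derived with ordinary capacity arithmetic before tuning empirically. The sketch below is a generic Spark sizing heuristic, not a Sedona API; the 128 MB target partition size and the tasks-per-core factor are assumptions to adjust for your workload:

```python
# Generic Spark sizing heuristic (not a Sedona API): start from a target
# partition size of ~128 MB and keep a few tasks per core, then tune.

def suggest_partitions(dataset_bytes, total_cores,
                       target_partition_bytes=128 * 1024 * 1024,
                       tasks_per_core=3):
    """Starting partition count to tune from, never below cluster parallelism."""
    by_size = -(-dataset_bytes // target_partition_bytes)  # ceiling division
    by_parallelism = total_cores * tasks_per_core
    return max(by_size, by_parallelism)

# e.g. a ~1 TiB point dataset on a 64-core cluster:
n = suggest_partitions(1024**4, 64)  # -> 8192
```

A value like this can then be applied via `spark.sql.shuffle.partitions` or by repartitioning the dominant side, and refined by watching task skew and spill in the Spark UI.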

Thanks,
Jia



On Mon, Jan 15, 2024 at 12:01 PM Zhang, Hanxi (ISED/ISDE)
 wrote:
>
>
>
> Hello Sedona community,
>
>
>
> I was benchmarking the following geospatial range join query (and similar 
> ones) using Sedona, on serverless Spark clusters provisioned via AWS Glue 
> notebooks. I have been using Sedona v1.5.0. The PySpark dataframes 't' 
> and 'g' are both loaded from parquet files hosted on Amazon S3, and are 
> partitioned in nested folder structures by region, year, month and date. 
> Specifically, 'g' is a polygon layer with millions of rows representing 
> polygons across Canada, and 't' is a point layer with about 200 columns 
> and up to tens of billions of rows, depending on the region/year/month/date 
> chosen, representing points across Canada and the U.S.
>
>
>
> result = sedona.sql(
>     """
>     select t.device_uid, g.grid_id
>     from t, g
>     where st_contains(ST_GeomFromWKT(g.geometry), ST_GeomFromWKT(t.geometry))
>     """
> )
>
>
>
>
>
> I was able to increase the Spark cluster size in AWS Glue to handle larger 
> volumes of the 't' dataset. For instance, as a stress test, I completed a 
> query with 't' amounting to 14 billion rows in Canada in a bit over 3 
> hours, on a Spark cluster with 2 TB of memory. I am in general impressed 
> with the performance, and my questions below concern potential avenues to 
> improve performance a bit further, and to deepen my understanding of 
> Sedona performance tuning. I am attaching the result of running 'explain' 
> on my query. Any comments and suggestions are greatly appreciated! And I 
> apologize for the long message.
>
>
>
> 1) Geoparquet-level predicate pushdown
>
>
>
> Based on https://sedona.apache.org/1.5.0/api/sql/Optimizer/, my understanding 
> is that geoparquet-level predicate pushdown happens automatically. I have 
> experimented with roughly equal volumes of 't' in Canada and the U.S. 
> respectively. When 't' is on the order of hundreds of millions of objects in 
> the U.S., the above query completes in 1/6 of the time taken by the same 
> volume of 't' data in Canada, with an identical Spark cluster size and 
> configuration. That seems to confirm geoparquet-level predicate pushdown is 
> at play, causing 'polygons in Canada containing points in the U.S.' to 
> return right away.
>
>
>
> When I increased the volume of 't' 10x, to the order of billions of point 
> objects, the U.S. case was still faster than the Canada case, but only 
> marginally, on a Spark cluster 8x larger in both cases. Given this 
> seemingly diminishing benefit of geoparquet-level predicate pushdown, I 
> wonder whether I am hitting a limitation of Sedona's predicate pushdown 
> mechanism, or whether it is due to limitations in how Spark clusters are 
> scaled up in AWS Glue.
>
>
>
> In general, how can one confirm, based on Spark logs etc., that 
> geoparquet-level predicate pushdown is active? Can it be turned on and off? 
> And if so, can one see statistics on the volumes of data and files scanned?
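GeoParquet filter pushdown relies on per-file and per-row-group bounding-box statistics that the reader compares against a constant query window, skipping chunks that cannot match. A minimal pure-Python model of that pruning (an illustration only, not Sedona internals; the boxes below are rough, made-up lon/lat extents):

```python
# Minimal pure-Python model of GeoParquet-style filter pushdown (illustration
# only, not Sedona internals): each "row group" carries min/max x/y stats, and
# a spatial range filter skips any group whose bbox cannot intersect the query.

def bbox_intersects(a, b):
    """a, b are (xmin, ymin, xmax, ymax) tuples."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def row_groups_to_scan(row_group_bboxes, query_bbox):
    """Indexes of the row groups the reader must actually read."""
    return [i for i, bb in enumerate(row_group_bboxes)
            if bbox_intersects(bb, query_bbox)]

# Two row groups over Canada, one over the contiguous U.S. (rough extents):
groups = [(-141, 42, -52, 83),   # Canada, full width
          (-95, 49, -52, 62),    # Canada, interior
          (-125, 24, -66, 49)]   # contiguous U.S.
canada_query = (-80, 50, -70, 60)
scanned = row_groups_to_scan(groups, canada_query)  # -> [0, 1]
```

Because this pruning needs a query window that is constant at scan time, it helps range filters but not the join predicate itself, whose window depends on the other side of the join; that is consistent with the point that a GeoParquet file's main effect on a join is indirect.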
>
>
>
> As a related note and question, in 
> https://sedona.apache.org/1.5.0/api/sql/Optimizer/ under "Push spatial 
> predicates to GeoParquet", it states that "To maximize the performance of 
> Sedona GeoParquet filter pushdown, we suggest that you sort the data by their 
> geohash values and then save as a GeoParquet file". In my case, the large 
> points layer 't' does have a geohash6 column, and I took the following steps 
> to see whether that would help:
>
>
>
> a)  select * from 't' order by t.geohash6
>
> b)  save the result to S3 as geoparquet files, following the same 
> partitioning scheme
>
> c)  run 'g' contains 't' query against the newly ordered 't' dataset with a 
> spark cluster of the same size and configuration
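The reason sorting by geohash can help at all is that it clusters spatially nearby rows into the same row groups, so each group's bounding box shrinks and more groups can be skipped. A toy model (not Sedona code) using a Morton/Z-order key as a stand-in for geohash, which has similar locality properties:

```python
# Toy illustration (not Sedona code) of why sorting by a spatial key helps
# pushdown: grouping rows by an interleaved-bit (Morton/Z-order) key -- a
# stand-in for geohash -- yields row groups with much smaller bounding boxes.
import random

def morton_key(x, y, bits=8):
    """Interleave the bits of non-negative integer coordinates x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def group_bbox_area(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def total_bbox_area(points, rows_per_group):
    """Sum of bounding-box areas over fixed-size 'row groups'."""
    chunks = [points[i:i + rows_per_group]
              for i in range(0, len(points), rows_per_group)]
    return sum(group_bbox_area(c) for c in chunks)

# Points on a 256x256 grid, shuffled to mimic arbitrary insertion order:
pts = [(x, y) for x in range(0, 256, 16) for y in range(0, 256, 16)]
random.seed(0)
random.shuffle(pts)

unsorted_area = total_bbox_area(pts, 16)
sorted_area = total_bbox_area(sorted(pts, key=lambda p: morton_key(*p)), 16)
# sorted_area is far smaller: each group covers a compact tile, not the map.
```

Note that when the query window is large relative to the data extent (e.g., all of Canada), even tight row-group boxes mostly intersect it, which may explain why such sorting yields only marginal gains in some workloads.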
>
>
>
> Based on my results, that does help performance a little, but the gain is 
> so marginal as to be negligible. I am a little confused as to 
> whether 

Re: Reading NetCDF files with Apache Sedona

2024-01-21 Thread Jia Yu
Hi Adam,

We recently released Sedona 1.5.1, which includes RS_FromNetCDF, a
function to load NetCDF files:
https://sedona.apache.org/1.5.1/api/sql/Raster-loader/#rs_fromnetcdf

You can follow our tutorial here to get started:
https://sedona.apache.org/1.5.1/tutorial/raster/

Thanks,
Jia

On Tue, Jan 9, 2024 at 10:34 AM Jia Yu  wrote:

> Hi Adam,
>
> We recently added the NetCDF reader in our raster data support. This
> feature will be shipped in Sedona 1.5.1, which will be released this week.
>
> Thanks,
> Jia
>
> On Tue, Jan 9, 2024 at 10:29 AM Rahman, Adam  wrote:
>
>> Hello,
>>
>>
>>
>> My team has been working on a geospatial project using Apache Sedona, and
>> we have raster data stored as NetCDF files in an AWS S3 bucket which we are
>> accessing via a Databricks cluster. Does Sedona have the ability to read &
>> process the NetCDF file type? We’re hoping to avoid having to process them
>> offline, and would love any help or advice that the team could provide.
>>
>>
>>
>> Thank you!
>>
>> -Adam
>>
>>
>>
>> __
>>
>> *Adam Rahman, PhD*
>> Senior Data Scientist
>> M +1 416 993 3548
>>
>>
>>
>>
>>
>


[jira] [Commented] (SEDONA-468) Add provision to use spheroid distance in ST_DWithin

2024-01-21 Thread Jia Yu (Jira)


[ https://issues.apache.org/jira/browse/SEDONA-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809240#comment-17809240 ]

Jia Yu commented on SEDONA-468:
---

This might be related to https://github.com/apache/sedona/pull/1163

> Add provision to use spheroid distance in ST_DWithin
> 
>
> Key: SEDONA-468
> URL: https://issues.apache.org/jira/browse/SEDONA-468
> Project: Apache Sedona
>  Issue Type: New Feature
>Reporter: Nilesh Gajwani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [SEDONA-470] Distinguish between missing or null crs from the result of geoparquet.metadata [sedona]

2024-01-21 Thread via GitHub


jiayuasu merged PR #1211:
URL: https://github.com/apache/sedona/pull/1211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@sedona.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org