Re: Release manager for Apache Sedona 1.1.0
Btw, some projects have scripts that automate most of the steps there. It would be worthwhile to look into those.

On Tue, Jul 27, 2021 at 3:53 PM Jia Yu wrote:
> Hi Sedona PMC members,
>
> We are planning to release Apache Sedona 1.1.0 as soon as possible. For
> this release, I think we should find another release manager to get
> familiar with the release process.
>
> To be honest, the entire process is not very easy:
> https://sedona.apache.org/download/publish/
>
> Do any of you want to be in charge of this particular release?
>
> Thanks,
> Jia
Release manager for Apache Sedona 1.1.0
Hi Sedona PMC members,

We are planning to release Apache Sedona 1.1.0 as soon as possible. For this release, I think we should find another release manager to get familiar with the release process.

To be honest, the entire process is not very easy:
https://sedona.apache.org/download/publish/

Do any of you want to be in charge of this particular release?

Thanks,
Jia
Re: Spatial join performances
Hi Pietro,

A few tips to optimize your join:

1. Mix the DataFrame and RDD APIs together and use the RDD API for the join part. See the example here: https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try a large number of partitions (say 1000 or more).

If this approach doesn't work, consider a broadcast join instead. Broadcast the polygon side: https://sedona.apache.org/api/sql/Optimizer/#broadcast-join

Thanks,
Jia

On Tue, Jul 27, 2021 at 2:21 AM pietro greselin wrote:
> To whom it may concern,
>
> we observed the following Sedona behaviour and would like to ask your
> opinion on how we can optimize it.
>
> Our aim is to perform an inner spatial join between a points_df and a
> polygons_df when a point in points_df is contained in a polygon from
> polygons_df.
> Below you can find more details about the 2 dataframes we are considering:
> - points_df: it contains 50 million events with latitude and longitude
> approximated to the third decimal digit;
> - polygons_df: it contains 10k multi-polygons with 600 vertices on
> average.
>
> The issue we are reporting is a very long computing time: the spatial
> join query never completes, even when running on a cluster with 40 workers
> with 4 cores each.
> No error is printed by the driver but we are receiving the following
> warning:
> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
> but no index exists. Will build index on the fly.
>
> We were able to run the same spatial join successfully when only
> considering a very small sample of events.
> Do you have any suggestion on how we can achieve the same result on higher
> volumes of data, or whether there is a way we can optimize the join?
>
> Attached you can find the pseudo-code we are running.
>
> Looking forward to hearing from you.
>
> Kind regards,
> Pietro Greselin
Re: Processing a netcdf file
Hi Panos,

NetCDF is currently supported only in the Java and Scala APIs. You need to add the SerNetCDF dependency:
https://sedona.apache.org/download/maven-coordinates/#sernetcdf-010

Here is an example. Basically, you need to provide a CSV file that describes the paths of a number of NetCDF files:
https://github.com/apache/incubator-sedona/blob/master/core/src/test/java/org/apache/sedona/core/io/EarthdataHDFTest.java

Thanks,
Jia

On Fri, Jul 23, 2021 at 5:59 AM Panos Konstantinidis wrote:
> Hello, I would like to try out Sedona (we already have a Spark cluster
> installed) in order to process and read NetCDF files. Is there an example
> of how to do it? Any pointers?
> Your examples (https://sedona.apache.org/tutorial/rdd/) don't have
> that information.
> Regards
> Panos
Spatial join performances
To whom it may concern,

we observed the following Sedona behaviour and would like to ask your opinion on how we can optimize it.

Our aim is to perform an inner spatial join between a points_df and a polygons_df when a point in points_df is contained in a polygon from polygons_df.
Below you can find more details about the 2 dataframes we are considering:
- points_df: it contains 50 million events with latitude and longitude approximated to the third decimal digit;
- polygons_df: it contains 10k multi-polygons with 600 vertices on average.

The issue we are reporting is a very long computing time: the spatial join query never completes, even when running on a cluster with 40 workers with 4 cores each.
No error is printed by the driver but we are receiving the following warning:
WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true, but no index exists. Will build index on the fly.

We were able to run the same spatial join successfully when only considering a very small sample of events.
Do you have any suggestion on how we can achieve the same result on higher volumes of data, or whether there is a way we can optimize the join?

Attached you can find the pseudo-code we are running.

Looking forward to hearing from you.
Kind regards,
Pietro Greselin

def toSpatialPoints(df, lat_column_name, lon_column_name):
    df.createOrReplaceTempView("points_df")
    return (
        spark.sql("SELECT *, ST_Point({}, {}) AS point FROM points_df".format(lon_column_name, lat_column_name))
        .drop(lat_column_name, lon_column_name)
    )

def toSpatialPolygons(df, wkt_column_name):
    df.createOrReplaceTempView("polygons_df")
    return (
        spark.sql("SELECT *, ST_GeomFromWKT({}) AS polygon FROM polygons_df".format(wkt_column_name))
        .drop(wkt_column_name)
    )

def sJoin(polygons_df, points_df):
    polygons_df.createOrReplaceTempView('polygons_df')
    points_df.createOrReplaceTempView('points_df')
    return spark.sql("SELECT * FROM polygons_df, points_df WHERE ST_Contains(polygons_df.polygon, points_df.point)")

maps = spark.read.parquet(maps_path).select('AREA_ID', 'WKT')
polygons_df = toSpatialPolygons(maps, 'WKT')

events = spark.read.parquet(events_path).select('ID', 'LATITUDE', 'LONGITUDE')
points_df = toSpatialPoints(events, 'LATITUDE', 'LONGITUDE')

spatial_join = sJoin(polygons_df, points_df)
[jira] [Created] (SEDONA-55) Publish Python artifact 1.0.2
Artur Dryomov created SEDONA-55:
---
Summary: Publish Python artifact 1.0.2
Key: SEDONA-55
URL: https://issues.apache.org/jira/browse/SEDONA-55
Project: Apache Sedona
Issue Type: Task
Reporter: Artur Dryomov

As noted in the release notes for [the {{1.0.1}} release|http://sedona.apache.org/download/release-notes/#sedona-101], there was a configuration issue resulting in a PySpark version mismatch. Unfortunately, the suggested workarounds do not work with [tools like {{pip-tools}}|https://github.com/jazzband/pip-tools] which auto-generate dependencies.

This kind of change seems to fit a patch release, so it would be great to have a {{1.0.2}} release resolving this inconvenience.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)