Re: Release manager for Apache Sedona 1.1.0

2021-07-27 Thread Felix Cheung
Btw, some projects have scripts that automate most of the steps there. It
would be worthwhile to look into those.


On Tue, Jul 27, 2021 at 3:53 PM Jia Yu  wrote:

> Hi Sedona PMC members,
>
> We are planning to release Apache Sedona 1.1.0 as soon as possible. For
> this release, I think we should find another release manager, so that
> someone else gets familiar with the release process.
>
> To be honest, the entire process is not very easy:
> https://sedona.apache.org/download/publish/
>
> Do any of you want to be in charge of this particular release?
>
> Thanks,
> Jia
>


Release manager for Apache Sedona 1.1.0

2021-07-27 Thread Jia Yu
Hi Sedona PMC members,

We are planning to release Apache Sedona 1.1.0 as soon as possible. For
this release, I think we should find another release manager, so that
someone else gets familiar with the release process.

To be honest, the entire process is not very easy:
https://sedona.apache.org/download/publish/

Do any of you want to be in charge of this particular release?

Thanks,
Jia


Re: Spatial join performances

2021-07-27 Thread Jia Yu
Hi Pietro,

A few tips to optimize your join:

1. Mix the DataFrame and RDD APIs and use the RDD API for the join itself
(see the sketch below these tips). An example is here:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb

2. When calling point_rdd.spatialPartitioning(GridType.KDBTREE, 4), use a
large number of partitions (say 1000 or more).

If this approach doesn't work, consider a broadcast join and broadcast
the polygon side:
https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
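
Here is a minimal PySpark sketch of tips 1 and 2, adapted from the pattern in
the linked notebook. It assumes points_df and polygons_df already carry Sedona
geometry columns named "point" and "polygon" (as in your pseudo-code); the
column names and the partition count of 1000 are illustrative only.

from sedona.core.enums import GridType, IndexType
from sedona.core.spatialOperator import JoinQueryRaw
from sedona.utils.adapter import Adapter

# Tip 1: convert the geometry DataFrames to SpatialRDDs
points_rdd = Adapter.toSpatialRdd(points_df, "point")
polygons_rdd = Adapter.toSpatialRdd(polygons_df, "polygon")
points_rdd.analyze()
polygons_rdd.analyze()

# Tip 2: spatially partition both sides with a large partition count,
# reusing the point partitioner for the polygon side
points_rdd.spatialPartitioning(GridType.KDBTREE, 1000)
polygons_rdd.spatialPartitioning(points_rdd.getPartitioner())

# Pre-build an index on the partitioned point side so the join does not
# have to build one on the fly (this is what the warning you saw refers to)
points_rdd.buildIndex(IndexType.QUADTREE, True)

# Flat pair join: (polygon, point) pairs where the polygon contains the point
pair_rdd = JoinQueryRaw.SpatialJoinQueryFlat(points_rdd, polygons_rdd, True, True)

# Convert back to a DataFrame, keeping attribute columns from both sides
joined_df = Adapter.toDf(pair_rdd, polygons_rdd.fieldNames, points_rdd.fieldNames, spark)

If the polygon side is small enough to fit in executor memory, the broadcast
join linked above keeps everything in the DataFrame API instead.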

Thanks,
Jia


On Tue, Jul 27, 2021 at 2:21 AM pietro greselin 
wrote:

> To whom it may concern,
>
> we observed the following Sedona behaviour and would like to ask your
> opinion on how we can optimize it.
>
> Our aim is to perform an inner spatial join between a points_df and a
> polygons_df, where a point in points_df is contained in a polygon from
> polygons_df.
> Below you can find more details about the two dataframes we are considering:
> - points_df: it contains 50 million events with latitude and longitude
> approximated to the third decimal digit;
> - polygons_df: it contains 10k multi-polygons with 600 vertices on
> average.
>
> The issue we are reporting is a very long computation time: the spatial
> join query never completes, even when running on a cluster with 40 workers
> with 4 cores each.
> No error is printed by the driver, but we are receiving the following
> warning:
> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
> but no index exists. Will build index on the fly.
>
> We were able to run the same spatial join successfully when considering
> only a very small sample of events.
> Do you have any suggestions on how we can achieve the same result on higher
> volumes of data, or is there a way we can optimize the join?
>
> Attached you can find the pseudo-code we are running.
>
> Looking forward to hearing from you.
>
> Kind regards,
> Pietro Greselin
>


Re: Processing a netcdf file

2021-07-27 Thread Jia Yu
Hi Panos,

NetCDF is currently supported only in the Java and Scala APIs. You need to
add the SerNetCDF dependency:
https://sedona.apache.org/download/maven-coordinates/#sernetcdf-010

Here is an example. Basically, you need to provide a CSV file that lists
the paths of the NetCDF files.

https://github.com/apache/incubator-sedona/blob/master/core/src/test/java/org/apache/sedona/core/io/EarthdataHDFTest.java

Thanks,
Jia

On Fri, Jul 23, 2021 at 5:59 AM Panos Konstantinidis
 wrote:

> Hello, I would like to try out Sedona (we already have a Spark cluster
> installed) in order to process and read NetCDF files. Is there an example
> of how to do it? Any pointers?
> In your examples (https://sedona.apache.org/tutorial/rdd/) you don't have
> that information.
> Regards
> Panos


Spatial join performances

2021-07-27 Thread pietro greselin
To whom it may concern,

we observed the following Sedona behaviour and would like to ask your
opinion on how we can optimize it.

Our aim is to perform an inner spatial join between a points_df and a
polygons_df, where a point in points_df is contained in a polygon from
polygons_df.
Below you can find more details about the two dataframes we are considering:
- points_df: it contains 50 million events with latitude and longitude
approximated to the third decimal digit;
- polygons_df: it contains 10k multi-polygons with 600 vertices on average.

The issue we are reporting is a very long computation time: the spatial
join query never completes, even when running on a cluster with 40 workers
with 4 cores each.
No error is printed by the driver, but we are receiving the following
warning:
WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
but no index exists. Will build index on the fly.

We were able to run the same spatial join successfully when considering
only a very small sample of events.
Do you have any suggestions on how we can achieve the same result on higher
volumes of data, or is there a way we can optimize the join?

Attached you can find the pseudo-code we are running.

Looking forward to hearing from you.

Kind regards,
Pietro Greselin

# Pseudo-code (assumes an existing SparkSession `spark` with Sedona
# registered, and that maps_path and events_path are defined):

def toSpatialPoints(df, lat_column_name, lon_column_name):
    # Build a geometry column from lon/lat (ST_Point takes x=lon, y=lat)
    df.createOrReplaceTempView("points_df")
    return (
        spark.sql("SELECT *, ST_Point({}, {}) AS point FROM points_df"
                  .format(lon_column_name, lat_column_name))
        .drop(lat_column_name, lon_column_name)
    )

def toSpatialPolygons(df, wkt_column_name):
    # Parse the WKT column into a geometry column
    df.createOrReplaceTempView("polygons_df")
    return (
        spark.sql("SELECT *, ST_GeomFromWKT({}) AS polygon FROM polygons_df"
                  .format(wkt_column_name))
        .drop(wkt_column_name)
    )

def sJoin(polygons_df, points_df):
    # Inner spatial join: keep rows where the polygon contains the point
    polygons_df.createOrReplaceTempView('polygons_df')
    points_df.createOrReplaceTempView('points_df')
    return spark.sql("SELECT * FROM polygons_df, points_df "
                     "WHERE ST_Contains(polygons_df.polygon, points_df.point)")


maps = spark.read.parquet(maps_path).select('AREA_ID', 'WKT')
polygons_df = toSpatialPolygons(maps, 'WKT')

events = spark.read.parquet(events_path).select('ID', 'LATITUDE', 'LONGITUDE')
points_df = toSpatialPoints(events, 'LATITUDE', 'LONGITUDE')

spatial_join = sJoin(polygons_df, points_df)

[jira] [Created] (SEDONA-55) Publish Python artifact 1.0.2

2021-07-27 Thread Artur Dryomov (Jira)
Artur Dryomov created SEDONA-55:
---

 Summary: Publish Python artifact 1.0.2
 Key: SEDONA-55
 URL: https://issues.apache.org/jira/browse/SEDONA-55
 Project: Apache Sedona
  Issue Type: Task
Reporter: Artur Dryomov


As noted in the release notes for [the {{1.0.1}} 
release|http://sedona.apache.org/download/release-notes/#sedona-101], there was 
a configuration issue resulting in a PySpark version mismatch. Unfortunately, 
the suggested workarounds do not work with [tools like 
{{pip-tools}}|https://github.com/jazzband/pip-tools], which auto-generate 
dependencies.

This kind of change seems to fit a patch release, so it would be great to 
have a {{1.0.2}} release resolving this inconvenience.


