Re: Apache Sedona

2020-08-21 Thread Paweł Kociński
- Grant support for Scala 2.12 and Spark 3.0
Here I meant the Python API.

- Implement loading geospatial data sources (geojson, shapefile, osm, wkb,
wkt) from Dataframe API like
-- spark.read.format("geojson").load(path)

 It is possible, and I think it will make it easier for users to load the data
(I also agree it is not a priority).

- Add broadcast join for joining big and small dataframe

Agree
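
To make the idea concrete, the mechanics of a broadcast hash join can be sketched in plain Python. This is only a conceptual stand-in: the small side is collected into one in-memory map (the "broadcast") and the big side is streamed against it, avoiding a shuffle of the large dataset. In Sedona the real work would happen inside Spark's physical-plan strategies.

```python
def broadcast_hash_join(big, small, key_big, key_small):
    """Join two lists of dicts; `small` is the broadcast (in-memory) side."""
    # Build a hash map from the small side once; every "worker" reuses it.
    lookup = {}
    for row in small:
        lookup.setdefault(row[key_small], []).append(row)
    # Stream the big side and probe the map; no shuffle of `big` needed.
    joined = []
    for row in big:
        for match in lookup.get(row[key_big], []):
            joined.append({**row, **match})
    return joined

big = [{"id": 1, "geom": "POINT (0 0)"}, {"id": 2, "geom": "POINT (5 5)"}]
small = [{"id": 1, "name": "warsaw"}]
print(broadcast_hash_join(big, small, "id", "id"))
# [{'id': 1, 'geom': 'POINT (0 0)', 'name': 'warsaw'}]
```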

- Fix issue with 3D geometries while loading shapefile

Exactly
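
One possible fix, sketched as a hypothetical helper (not necessarily how Sedona will do it), is to flatten the coordinates to 2D and discard the Z/M values:

```python
def drop_zm(coords):
    """Reduce (x, y, z[, m]) coordinate tuples to plain (x, y) pairs."""
    return [(c[0], c[1]) for c in coords]

# A 3D ring as it might come out of a shapefile with Z values:
ring_3d = [(0.0, 0.0, 10.0), (1.0, 0.0, 12.0), (1.0, 1.0, 9.0), (0.0, 0.0, 10.0)]
print(drop_zm(ring_3d))
# [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
```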

- Add support for multiline geojson (I have some code on my local branch)

We would have to write our own reader in that case; it will require some work
but is doable.
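
To illustrate why this needs its own reader: Spark's JSON source expects one record per line by default, while a typical GeoJSON FeatureCollection is one pretty-printed document spanning many lines. A whole-file parse, sketched here with the standard json module, is essentially what a custom reader would have to do before flattening to one row per feature:

```python
import json

# A "multiline" (pretty-printed) GeoJSON document; a line-oriented JSON
# reader would fail on this, so it must be parsed as a whole document.
multiline_geojson = """{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [21.0, 52.2]},
     "properties": {"name": "Warsaw"}}
  ]
}"""

def read_features(text):
    doc = json.loads(text)  # whole-document parse, not line-by-line
    # Flatten: one (geometry type, name) row per feature.
    return [(f["geometry"]["type"], f["properties"].get("name"))
            for f in doc["features"]]

print(read_features(multiline_geojson))
# [('Point', 'Warsaw')]
```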

- Add direct writing to geospatial databases like PostgreSQL

I have to analyze the Spark code and will come back with a proposed solution.
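
The basic shape of such a writer, sketched with the stdlib sqlite3 module standing in for PostgreSQL so the example stays runnable: geometries are serialized to WKT (or WKB) and inserted alongside the attribute columns. A real writer would go through JDBC and convert on the database side with PostGIS functions; the table and column names here are purely illustrative.

```python
import sqlite3

# Geometry serialized as WKT next to its attribute column.
rows = [("Warsaw", "POINT (21.0 52.2)"), ("Gdansk", "POINT (18.6 54.4)")]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute("CREATE TABLE cities (name TEXT, geom_wkt TEXT)")
conn.executemany("INSERT INTO cities VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
print(count)  # 2
```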

- Remove NullPointer exception when there is null value within data or data
is wrong within some rows.

I meant the SQL functions; sometimes they should return null/Option instead of
raising a NullPointerException.

- geohash spatial join

I think in some cases it can be more suitable for users. It should not be
tough to implement, and it brings additional value.

Fri, 21 Aug 2020 at 11:08, Jia Yu wrote:

> Hi Paweł and CCed sedona-dev and other committers,
>
> Please find my opinion below.
>
> - Grant support for Scala 2.12 and Spark 3.0
> Jia: the Scala and Java code in the master branch already supports Spark 3.0 and
> Scala 2.12. We need to support the following: Sedona Scala 2.12 support for other
> Spark versions, and Scala 2.12 support in all Python APIs.
>
> - Implement loading geospatial data sources (geojson, shapefile, osm, wkb,
> wkt) from Dataframe API like
> -- spark.read.format("geojson").load(path)
> Jia: Direct DataFrame API support requires a bit more coding effort. I am
> actually not sure whether this func in DF is extensible. But if so, I am
> not against it. But it is not the top priority.
>
> - Add broadcast join for joining big and small dataframe
> Jia: Yes, we should have it here:
> https://github.com/DataSystemsLab/GeoSpark/blob/master/sql/src/main/scala/org/apache/spark/sql/geosparksql/strategy/join/TraitJoinQueryExec.scala#L67
> - Fix issue with 3D geometries while loading shapefile
> Jia: How do we fix it? Convert them to 2D geoms and discard the Z
> dimension or M dimension?
>
> - Add support for multiline geojson (I have some code on my local branch)
> Jia: This is not easy. In Spark, the DataFrame API has a read.json method:
> https://spark.apache.org/docs/latest/sql-data-sources-json.html Not sure
> whether we can leverage this.
>
> - Add direct writing to geospatial databases like PostgreSQL
> Jia: Good point. Any particular challenge on this?
>
> - Add more geospatial functions
> Jia: Agree.
>
> - Remove NullPointer exception when there is null value within data or
> data is wrong within some rows
> Jia: I believe this has been solved by "allowTopologyInvalidGeometries"
> and "skipSyntaxInvalidGeometries"
> https://datasystemslab.github.io/GeoSpark/tutorial/rdd/#create-a-generic-spatialrdd-behavoir-changed-in-v120
>
> - geohash spatial join
> Jia: Yes, we can do that. But will it bring in any benefit as opposed to
> the existing spatial join algorithm?
>
> Thanks,
> Jia
>
> On Wed, Aug 19, 2020 at 10:22 AM Paweł Kociński 
> wrote:
>
>> Hi Jia,
>> I hope you are fine. Do we have some features to add to Apache Sedona
>> after the code is merged?
>> My ideas of tasks:
>> - Grant support for Scala 2.12 and Spark 3.0
>> - Implement loading geospatial data sources (geojson, shapefile, osm,
>> wkb, wkt) from Dataframe API like
>> -- spark.read.format("geojson").load(path)
>> I have some code, but code migration is holding me back
>>
>> - Add broadcast join for joining big and small dataframe
>> - Fix issue with 3D geometries while loading shapefile
>> - Add support for multiline geojson (I have some code on my local branch)
>> - Add direct writing to geospatial databases like PostgreSQL
>> - Add more geospatial functions
>> - Remove NullPointer exception when there is null value within data or
>> data is wrong within some rows
>> - geohash spatial join
>>
>> What do you think?
>>
>> Regards,
>> Paweł
>>
>>
>> Mon, 17 Aug 2020 at 07:45, Jia Yu wrote:
>>
>>> Hello Paweł,
>>>
>>> I just posted the current situation into priv...@sedona.apache.org. The
>>> current problem is I have made everything ready to be imported to ASF
>>> GitHub repo (https://github.com/apache/incubator-sedona). But one
>>> committer (Masha from Facebook), who contributed thousands of lines of code
>>> to GeoSpark, still hasn't submitted her CLA. The entire process is currently
>>> blocked by this.
>>>
>>> Mohamed and I have been trying to reach her a couple of times in the
>>> past 3 weeks but got no reply. I have asked the champion about how we can
>>> proceed in this case. Let's see what will happen.
>>>
>>> Thanks,
>>> Jia
>>>
>>>
>>> On Sun, Aug 16, 2020 at 9:06 AM Paweł Kociński <
>>> pawel93kocin...@gmail.com> wrote:
>>>
 Hi Jia,
 Do we know when 

Re: Apache Sedona

2020-08-21 Thread Jia Yu
Hi Paweł and CCed sedona-dev and other committers,

Please find my opinion below.

- Grant support for Scala 2.12 and Spark 3.0
Jia: the Scala and Java code in the master branch already supports Spark 3.0 and
Scala 2.12. We need to support the following: Sedona Scala 2.12 support for other
Spark versions, and Scala 2.12 support in all Python APIs.

- Implement loading geospatial data sources (geojson, shapefile, osm, wkb,
wkt) from Dataframe API like
-- spark.read.format("geojson").load(path)
Jia: Direct DataFrame API support requires a bit more coding effort. I am
actually not sure whether this func in DF is extensible. But if so, I am
not against it. But it is not the top priority.

- Add broadcast join for joining big and small dataframe
Jia: Yes, we should have it here:
https://github.com/DataSystemsLab/GeoSpark/blob/master/sql/src/main/scala/org/apache/spark/sql/geosparksql/strategy/join/TraitJoinQueryExec.scala#L67
- Fix issue with 3D geometries while loading shapefile
Jia: How do we fix it? Convert them to 2D geoms and discard the Z dimension
or M dimension?

- Add support for multiline geojson (I have some code on my local branch)
Jia: This is not easy. In Spark, the DataFrame API has a read.json method:
https://spark.apache.org/docs/latest/sql-data-sources-json.html Not sure
whether we can leverage this.

- Add direct writing to geospatial databases like PostgreSQL
Jia: Good point. Any particular challenge on this?

- Add more geospatial functions
Jia: Agree.

- Remove NullPointer exception when there is null value within data or data
is wrong within some rows
Jia: I believe this has been solved by "allowTopologyInvalidGeometries" and
"skipSyntaxInvalidGeometries"
https://datasystemslab.github.io/GeoSpark/tutorial/rdd/#create-a-generic-spatialrdd-behavoir-changed-in-v120

- geohash spatial join
Jia: Yes, we can do that. But will it bring in any benefit as opposed to
the existing spatial join algorithm?

Thanks,
Jia

On Wed, Aug 19, 2020 at 10:22 AM Paweł Kociński 
wrote:

> Hi Jia,
> I hope you are fine. Do we have some features to add to Apache Sedona
> after the code is merged?
> My ideas of tasks:
> - Grant support for Scala 2.12 and Spark 3.0
> - Implement loading geospatial data sources (geojson, shapefile, osm, wkb,
> wkt) from Dataframe API like
> -- spark.read.format("geojson").load(path)
> I have some code, but code migration is holding me back
>
> - Add broadcast join for joining big and small dataframe
> - Fix issue with 3D geometries while loading shapefile
> - Add support for multiline geojson (I have some code on my local branch)
> - Add direct writing to geospatial databases like PostgreSQL
> - Add more geospatial functions
> - Remove NullPointer exception when there is null value within data or
> data is wrong within some rows
> - geohash spatial join
>
> What do you think?
>
> Regards,
> Paweł
>
>
> Mon, 17 Aug 2020 at 07:45, Jia Yu wrote:
>
>> Hello Paweł,
>>
>> I just posted the current situation into priv...@sedona.apache.org. The
>> current problem is I have made everything ready to be imported to ASF
>> GitHub repo (https://github.com/apache/incubator-sedona). But one
>> committer (Masha from Facebook), who contributed thousands of lines of code
>> to GeoSpark, still hasn't submitted her CLA. The entire process is currently
>> blocked by this.
>>
>> Mohamed and I have been trying to reach her a couple of times in the past
>> 3 weeks but got no reply. I have asked the champion about how we can
>> proceed in this case. Let's see what will happen.
>>
>> Thanks,
>> Jia
>>
>>
>> On Sun, Aug 16, 2020 at 9:06 AM Paweł Kociński 
>> wrote:
>>
>>> Hi Jia,
>>> Do we know when the first release of Apache Sedona will occur? Can I
>>> help with something to make it happen? I have a few ideas and some code which
>>> will be useful in the future.
>>>
>>> Regards,
>>> Pawel
>>>
>>


Re: Use JTS as a dependency instead of JTSPlus

2020-08-21 Thread Netanel Malka
Yes, that's right.
We are already working on that.
I hope to create the PR soon.
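
For context, the wrapper approach under discussion can be sketched roughly like this: a Python stand-in with a WKT string for the wrapped JTS Geometry, where equality and string conversion take the user data into account. All names here (GeometryWrapper, user_data) are illustrative, not the actual PR's API.

```python
class GeometryWrapper:
    """Proxy that carries user data alongside the wrapped geometry."""

    def __init__(self, geom_wkt, user_data=None):
        self.geom_wkt = geom_wkt   # stand-in for the wrapped JTS Geometry
        self.user_data = user_data  # non-spatial attributes

    def __eq__(self, other):
        # Unlike plain JTS equality, user data participates in equals.
        return (isinstance(other, GeometryWrapper)
                and self.geom_wkt == other.geom_wkt
                and self.user_data == other.user_data)

    def __hash__(self):
        return hash((self.geom_wkt, self.user_data))

    def __str__(self):
        # toString carries the non-spatial attributes next to the WKT.
        return f"{self.geom_wkt}\t{self.user_data}"

a = GeometryWrapper("POINT (1 2)", "warsaw")
b = GeometryWrapper("POINT (1 2)", "gdansk")
print(a == b)  # False: same geometry, different user data
print(a)
```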

On Fri, Aug 21, 2020, 11:35 Jia Yu  wrote:

> Hi folks,
>
> I believe the conclusion is that we should use the wrapper solution
> instead of the reflection, right? (of course, with additional care to the
> wrapper)
>
> Thanks,
> Jia
>
> On Sun, Aug 9, 2020 at 11:37 PM Paweł Kociński 
> wrote:
>
>> Hi,
>> From my point of view, the Python API needs only a few changes in that case.
>> First of all, a few type annotation names change (the Python API already has a
>> proxy object which holds the shapely geometry and user data as a
>> separate attribute). If the new object has a getter *getUserData*, the
>> change should be minimal. And those are the changes for the RDD API. The SQL API
>> should not require changes, because the translation between DataFrame and
>> RDD is hidden for Python (I assume that GeometryUDT will remain the same).
>>
>> Regards,
>> Pawel
>>
>> Mon, 10 Aug 2020 at 07:08, Georg Heiler wrote:
>>
>>> I agree with @Jia Yu  and think it is better to
>>> move forward with the wrapper.
>>>
>>> Best,
>>> Georg
>>>
>>> On Mon, 10 Aug 2020 at 01:41, Jia Yu wrote:
>>>
 Hi Netanel, CCed Pawel (GeoSpark Python), Georg (who might be also
 interested in this issue), Sedona-dev

 I think reflection would be a neat solution but it may bring
 technical debt in the future and cause problems to the python API.

 In the long run, a wrapper around JTS geometry would be a better
 solution although we may need to change many places in the code.

 Folks, what do you think?

 Thanks,
 Jia

 On Sun, Aug 9, 2020 at 7:49 AM Netanel Malka 
 wrote:

> Hi,
> Currently, we are having some problems with userData on Geometry.
> The problems are:
>
>1. Geometry toString function doesn't take userData into account
>2. Geometry equals function doesn't take userData into account
>
>
> Our proposed solution is to wrap Geometry with a proxy object that
> holds the Geometry and handles other columns instead of using Geometry
> user data.
> Another possible solution is using reflection to change methods on
> Geometry itself
>
> What do you think we should do?
>
> Thanks. Regards
>
> On Thu, Jul 23, 2020, 21:32 Jia Yu  wrote:
>
>> Hi Netanel,
>>
>> Sorry. I somehow missed this email. The only test that GeoSpark does
>> not cover for JTSplus is this one:
>> https://github.com/jiayuasu/JTSplus/blob/master/src/test/java/jtsplustest/GeometryToStringTest.java
>>
>> If you can add this back to GeoSpark, I think you are good to go.
>>
>> Thanks,
>> Jia
>>
>> On Thu, Jul 23, 2020 at 6:08 AM Netanel Malka 
>> wrote:
>>
>>> Hi,
>>> Have you had time to look at this?
>>>
>>> Best regards,
>>> Netanel Malka.
>>>
>>> -- Forwarded message -
>>> From: Netanel Malka 
>>> Date: Tue, 7 Jul 2020 at 11:06
>>> Subject: Re: Use JTS as a dependency instead of JTSPlus
>>> To: Jia Yu 
>>>
>>>
>>> OK.
>>> We saw that in Geometry the userData field changed from null to "".
>>> Is it crucial? Because this is a change that I believe JTS won't
>>> accept.
>>>
>>> Also, do the GeoSpark tests cover the JTSPlus changes? If all the
>>> GeoSpark tests pass, does it mean that we didn't break anything?
>>>
>>>
>>> On Thu, 2 Jul 2020 at 18:54, Jia Yu  wrote:
>>>
 HI Netanel,

 Thanks for your work on this.

 userData in Envelope can be ignored. We will no longer support
 userData in Envelope.

 The userData field is used to hold non-spatial attributes in GeoSpark
 core. When printing a spatial object, userData will be printed out as a
 WKT string.

 In GeoSpark, I think it only calls getUserData or setUserData,
 but the majority of the work was done in JTSplus. When checking the
 equality of two objects in JTSplus, we also check the userData, but JTS
 by default does not check that.


 We communicate via mail since this thread is gonna be long.

 Thanks,
 Jia

 

 Jia Yu

 Ph.D. in Computer Science

 Arizona State University 

 Reach me via: Homepage  | GitHub
 


 On Thu, Jul 2, 2020 at 3:01 AM Netanel Malka 
 wrote:

> Hi,
> how are you?
>
> I am working on this issue
>  which I
> and my friends trying to upgrade the JTS 

Re: Use JTS as a dependency instead of JTSPlus

2020-08-21 Thread Jia Yu
Hi folks,

I believe the conclusion is that we should use the wrapper solution instead
of the reflection, right? (of course, with additional care to the wrapper)

Thanks,
Jia

On Sun, Aug 9, 2020 at 11:37 PM Paweł Kociński 
wrote:

> Hi,
> From my point of view, the Python API needs only a few changes in that case.
> First of all, a few type annotation names change (the Python API already has a
> proxy object which holds the shapely geometry and user data as a
> separate attribute). If the new object has a getter *getUserData*, the
> change should be minimal. And those are the changes for the RDD API. The SQL API
> should not require changes, because the translation between DataFrame and
> RDD is hidden for Python (I assume that GeometryUDT will remain the same).
>
> Regards,
> Pawel
>
> Mon, 10 Aug 2020 at 07:08, Georg Heiler wrote:
>
>> I agree with @Jia Yu  and think it is better to
>> move forward with the wrapper.
>>
>> Best,
>> Georg
>>
>> On Mon, 10 Aug 2020 at 01:41, Jia Yu wrote:
>>
>>> Hi Netanel, CCed Pawel (GeoSpark Python), Georg (who might be also
>>> interested in this issue), Sedona-dev
>>>
>>> I think reflection would be a neat solution but it may bring
>>> technical debt in the future and cause problems to the python API.
>>>
>>> In the long run, a wrapper around JTS geometry would be a better
>>> solution although we may need to change many places in the code.
>>>
>>> Folks, what do you think?
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Sun, Aug 9, 2020 at 7:49 AM Netanel Malka 
>>> wrote:
>>>
 Hi,
 Currently, we are having some problems with userData on Geometry.
 The problems are:

1. Geometry toString function doesn't take userData into account
2. Geometry equals function doesn't take userData into account


 Our proposed solution is to wrap Geometry with a proxy object that
 holds the Geometry and handles other columns instead of using Geometry
 user data.
 Another possible solution is using reflection to change methods on
 Geometry itself

 What do you think we should do?

 Thanks. Regards

 On Thu, Jul 23, 2020, 21:32 Jia Yu  wrote:

> Hi Netanel,
>
> Sorry. I somehow missed this email. The only test that GeoSpark does
> not cover for JTSplus is this one:
> https://github.com/jiayuasu/JTSplus/blob/master/src/test/java/jtsplustest/GeometryToStringTest.java
>
> If you can add this back to GeoSpark, I think you are good to go.
>
> Thanks,
> Jia
>
> On Thu, Jul 23, 2020 at 6:08 AM Netanel Malka 
> wrote:
>
>> Hi,
>> Have you had time to look at this?
>>
>> Best regards,
>> Netanel Malka.
>>
>> -- Forwarded message -
>> From: Netanel Malka 
>> Date: Tue, 7 Jul 2020 at 11:06
>> Subject: Re: Use JTS as a dependency instead of JTSPlus
>> To: Jia Yu 
>>
>>
>> OK.
>> We saw that in Geometry the userData field changed from null to "".
>> Is it crucial? Because this is a change that I believe JTS won't
>> accept.
>>
>> Also, do the GeoSpark tests cover the JTSPlus changes? If all the
>> GeoSpark tests pass, does it mean that we didn't break anything?
>>
>>
>> On Thu, 2 Jul 2020 at 18:54, Jia Yu  wrote:
>>
>>> HI Netanel,
>>>
>>> Thanks for your work on this.
>>>
>>> userData in Envelope can be ignored. We will no longer support
>>> userData in Envelope.
>>>
>>> The userData field is used to hold non-spatial attributes in GeoSpark
>>> core. When printing a spatial object, userData will be printed out as a
>>> WKT string.
>>>
>>> In GeoSpark, I think it only calls getUserData or setUserData,
>>> but the majority of the work was done in JTSplus. When checking the
>>> equality of two objects in JTSplus, we also check the userData, but JTS
>>> by default does not check that.
>>>
>>>
>>> We communicate via mail since this thread is gonna be long.
>>>
>>> Thanks,
>>> Jia
>>>
>>> 
>>>
>>> Jia Yu
>>>
>>> Ph.D. in Computer Science
>>>
>>> Arizona State University 
>>>
>>> Reach me via: Homepage  | GitHub
>>> 
>>>
>>>
>>> On Thu, Jul 2, 2020 at 3:01 AM Netanel Malka 
>>> wrote:
>>>
 Hi,
 how are you?

 I am working on this issue
  in which my friends
 and I are trying to upgrade the JTS version in GeoSpark.
 We are facing the userData field on Envelope, which doesn't exist in
 JTS.
 Based on this PR  I
 saw it's deprecated, can we ignore it?

 Also, we started to search for the usage of userData for 

Confirmation of Apache Sedona (incubating) CLA

2020-08-21 Thread Jia Yu
Dear ASF Secretary,

This is Jia Yu. I am a committer of Apache Sedona (incubating). We are
working on importing our old codebase from GitHub (
https://github.com/DataSystemsLab/GeoSpark) to ASF GitHub repo (
https://github.com/apache/incubator-sedona). We have been waiting for more
than one month after being accepted to the ASF incubator.

The infra team has asked us to obtain a confirmation from you that you have
allowed us to import the code.

1. Our old codebase was under Apache License 2.0 and MIT License (before
v1.2.0)
2. 8 out of 10 contributors who made significant contributions have submitted
the CLA: https://github.com/DataSystemsLab/GeoSpark/graphs/contributors .
Please refer to the list attached.

Given that the project was in Apache License 2.0 and MIT License, I believe
the code is safe. Could you please confirm that we are allowed to import
the code to the ASF GitHub repository?

Thanks,
Jia
---

Here is a list of the top 10 significant contributors, the names are sorted
by their num of commits.

Jia Yu (#1, submitted)

Zongsi Zhang (#2, submitted)

Jinxuan Wu (#3, submitted)

Masha Basmanova (#4, not submitted)

Mohamed Sarwat (#5, submitted)

Paweł Kociński (imbruced) (#6, submitted)

Netanel Malka (#7, submitted)

Sachio Wakai (#8, submitted)

Kengo Seki (#9, submitted)

Omar Kaptan (#10, not submitted)