Re: First Sedona release

2020-11-22 Thread Paweł Kociński
Hi,
I saw some users reported need to improve Python RDD API in two scenarios:
- converting spatial flat join result to df
- saving spatial flat join result directly to external storage

Currently SerDe between jvm and Python causes additional time needed to
compute the result. I have a local branch with tests where this
functionality is available (need 3-4 days to make it 100% ready), in two
above scenarios there will be almost no difference between Python and Scala
or Java API. Should I create PR to include this feature within the first
Sedona release ?
Regards,
Paweł

pon., 16 lis 2020 o 08:29 Jia Yu  napisał(a):

> Dear all,
>
> Thanks for all your suggestions.
>
> 1. To completely solve the long-overdue JTS issue, I made a Sedona PR and
> two JTS PRs. @Jim Hughes  , @Paweł Kociński
>  , I, and probably Martin from JTS will take
> care of these PRs in the coming days.
> (1) Sedona PR: https://github.com/apache/incubator-sedona/pull/488
> (2) JTS PR: https://github.com/locationtech/jts/pull/633
> https://github.com/locationtech/jts/pull/634
>
> 2. To move forward with the first release, I have deleted the "SNAPSHOT"
> in my JTS 1.16 fork.
> Most likely, we have to move forward with my JTS 1.16 fork in the first
> Sedona release because of the conflict among JTStoGeoJSON, GeoTools, and
> JTS 1.17.
> So @Netanel Malka   could you please do another
> dry-run on the Sedona first release on this Sedona branch: sedona-1.0-doc:
> https://github.com/apache/incubator-sedona/tree/sedona-1.0-doc
>
> Thanks,
> Jia
>
> On Thu, Nov 12, 2020 at 11:36 AM Jim Hughes  wrote:
>
>> Hi Mo,
>>
>> I can definitely help.  The first step will be for Jia to push a PR for
>> the JTS changes.  (Since they are his changes, I cannot do this on his
>> behalf.)
>>
>>  From talking to the lead JTS developer, he wanted to see the previous
>> PR (from months/a year+ ago) split up.  I think the initial PR should be
>> used to discuss what changes are sensible for JTS and where we'll need
>> to push some of the changes to Sedona.
>>
>> Concretely, I noticed that the Sedona JTS fork changes the toString on
>> Geometry to include printing out the userData.  I imagine that may cause
>> trouble for downstream JTS users, so it'd be good to find an
>> alternative.  One suggestion would to be add a static method in Sedona
>> for printing a Geometry with its userData object.
>>
>> Cheers,
>>
>> Jim
>>
>> On 11/12/20 12:32 PM, Mohamed Sarwat wrote:
>> > Folks,
>> >
>> > I totally agree with Jim on that. Jim, would you like to take the lead
>> on that - I trust that you can bring this task to completion. Jia, would
>> you please let us know how we can incorporate the changes into the JTS
>> master branch?
>> >
>> > Thanks,
>> >
>> >> On Nov 12, 2020, at 10:10 AM, Jim Hughes  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> As a JTS committer, I have tried to request that the Sedona project
>> discuss the desired changes to JTS previously.  I'd still encourage that.
>> >>
>> >> JTS is an active project and I feel that maintaining a fork of JTS is
>> unnecessary and inappropriate.
>> >>
>> >> Cheers,
>> >>
>> >> Jim
>> >>
>> >>> On 11/11/20 9:04 PM, Felix Cheung wrote:
>> >>> Ah. You will need to publish it in order for the dependency chain to
>> work
>> >>> on Maven Central
>> >>>
>> >>> However, since you are not the project owner there you might need to
>> >>> publish that under a different artifact id.
>> >>>
>> >>> In general, it would be best to avoid hard forking another project
>> like
>> >>> this.
>> >>>
>> >>>
>>  On Wed, Nov 11, 2020 at 1:05 PM Jia Yu 
>> wrote:
>> 
>>  Hi Netanel,
>> 
>>  That links to this git submodule:
>>  https://github.com/jiayuasu/jts/blob/1.16.x/modules/core/pom.xml#L6
>> 
>>  I can easily fix this by changing the version number here to 1.16.2
>>  excluding "SNAPSHOT":
>>  https://github.com/jiayuasu/jts/blob/1.16.x/modules/core/pom.xml#L6
>> 
>>  Will this solve the problem?
>> 
>>  On Wed, Nov 11, 2020 at 7:40 AM Netanel Malka 
>>  wrote:
>> 
>> > Hi Folks,
>> >
>> > I tried to make a release (dry-run) following by
>> > publishing-maven-artifacts
>> > , and I
>> > encountered an issue.
>> >
>> > On sedona-core, we have jts-core as a dependency with the SNAPSHOT
>> > version.
>> > (link
>> > <
>> >
>> https://github.com/apache/incubator-sedona/blob/2e60fc07b0eae78ccae3876d970e677fc9319c40/core/pom.xml#L37
>> > )
>> >
>> > As a prerequisite to the release process, we cannot have
>> dependencies in a
>> > SNAPSHOT version.
>> >
>> >
>> > Do you have any clue about how to solve this?
>> >
>> >
>> > On Mon, 9 Nov 2020 at 21:22, Netanel Malka 
>> wrote:
>> >
>> >> OK. Thanks Felix.
>> >>
>> >>
>> >> Updates:
>> >>
>> >>*
>> >>*   Opened a ticket for INFRA to Enable Nexus Access For S

[GitHub] [incubator-sedona] jiayuasu commented on pull request #488: [SEDONA-1] Move to jts 1.16.1, minimize the dependency on JTSplus

2020-11-22 Thread GitBox


jiayuasu commented on pull request #488:
URL: https://github.com/apache/incubator-sedona/pull/488#issuecomment-731953167


   Hi all @Imbruced @netanel246 @Sarwat 
   
   I managed to fix all the issues. In a nutshell:
   
   ## Changes on JTS side
   
   This PR is relevant to two JTS PRs made by me:
   
   * Add the check of userData in equals(object o): locationtech/jts#633 : this 
PR is no longer needed.
   
   * Change the access modifiers of tree indexes and add setter/getters: 
https://github.com/locationtech/jts/pull/634: this PR has been updated. We only 
need to change a few variables to package-private. I am confident that JTS 
committers will accept this PR soon.
   
   ## Changes on Sedona side
   
   1. The following spatial partitioning methods, Equal grids, R-Tree, Voronoi, 
and Hilbert curve, have been removed because they can no longer yield correct 
results due to the change in JTS. Only Quad-Tree and KDB-Tree are kept.
   
   2. JoinQuery.SpatialJoinQuery/DistanceJoinQuery now returns > instead of > because we can no 
longer use HashSet in Sedona for duplicates removal.
   
   3. The duplicates preserving strategy in 
JoinQuery.SpatialJoinQuery/DistanceJoinQuery is changed.
   1. After this PR, duplicate geometries present in the input 
queryWindowRDD, regardless of their non-spatial attributes, will not be 
reflected in the join results. Duplicate geometries present in the input 
spatialRDD, regardless of their non-spatial attributes, will be reflected in 
the join results.
   2. Before this PR, duplicate geometries present in the input 
queryWindowRDD, if their non-spatial attributes are also same, will be 
reflected in the final result. Duplicate geometries present in the input 
SpatialRDD (if their non-spatial attributes are also same), will not be 
reflected in the final result.
   
   4. Now use Sedona Core GeomUtils to do Geometry print and equalityCheck. 
Adapter.scala in Sedona SQL has been updated accordingly but no API change 
needed.
   
   5. JoinQuery.SpatialJoinQueryFlat/DistanceJoinQueryPlat still have the same 
signature. No change in terms of API and behavior. Duplicate geometries present 
in both input datasets will be reflected in the output. The query correctness 
is still guaranteed.
   
   6. IndexSerializer of JTS Quad-Tree and R-Tree are now under 
org.locationtech.jts.index.quadtree and strtree. The purpose is to leverage 
"package private" to minimize the Sedona changes to JTS core.
   
   7. JTS version has been upgrade to JTS 1.18.0-SNAPSHOT is to get ready for 
the next JTS release which includes my JTS PR.
   
   8. GeoTools upgraded to 24.0 and Jts2GeoJSON upgraded to 1.4.3. Jts2GeoJSON 
1.4.3 (the latest version) does not support JTS 1.17+. To overcome this, I 
copied "GeoJSONWRITER" from Jts2GeoJSON to Sedona "GeoJSONWRITERNew". This 
fixed the version conflict but we may need to list Jts2GeoJSON's MIT license 
somewhere in Sedona.
   
   ## To-Dos
   
   1. Wait for JTS to accept the PR and release JTS 1.18 on Maven Central. Then 
I will change the JTS dependency in Sedona to its Maven coordinate. I expect 
this to be done in the coming days.
   2. @Imbruced Can you please fix the Python APIs based on the aforementioned 
Change 1, 2, 3, 4. I am sure that Python APIs cannot compile now. And some test 
cases may fail because 1 and 3. For Change 6,7,8, I am not sure whether Python 
APIs should be updated. Please check.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: First Sedona release

2020-11-22 Thread Jia Yu
Hi Pawel and everyone,

Let's do this in the first Sedona release. But can you please first fix the
Python API for our Move-to-JTS PR, and then work on this one? If this
Python RDD-DF Adapter PR might slow down our progress of releasing Sedona
before Christmas, we can postpone it to Sedona 1.0.1 or 1.1.0.

@everyone
Our top priority is to draw the first Sedona release ASAP. Users have been
waiting for almost six months. Let's push hard to publish the first Sedona
release to Maven Central and PyPI before Christmas. In order to make it
happen,

Finalize coding and documentation before Dec 6:
1. I believe the Move-to-JTS PR will be done in around one week.
2. Then we can accept Pawel' Python RDD-DF Adapter PR, if necessary
3. I will work on Sedona documentation.
4. @Netanel will work on Sedona support of Spark 2.4 and Scala 2.11. I will
first create a branch for it to illustrate some necessary changes in Sedona
SQL for Spark 2.4.

Final walk-through before Dec 13
1. Netanel can test the release management for Sedona.
2. Other committers can go through the docs, release notes

Community voting before Dec 20
1. Sedona community voting: before Dec 16
2. Apache Incubator voting: before Dec 20

Push to Maven Central and PyPi before Dec 24

Please feel free to comment if you have any suggestions!

Jia

On Sun, Nov 22, 2020 at 9:51 AM Paweł Kociński 
wrote:

> Hi,
> I saw some users reported need to improve Python RDD API in two scenarios:
> - converting spatial flat join result to df
> - saving spatial flat join result directly to external storage
>
> Currently SerDe between jvm and Python causes additional time needed to
> compute the result. I have a local branch with tests where this
> functionality is available (need 3-4 days to make it 100% ready), in two
> above scenarios there will be almost no difference between Python and Scala
> or Java API. Should I create PR to include this feature within the first
> Sedona release ?
> Regards,
> Paweł
>
> pon., 16 lis 2020 o 08:29 Jia Yu  napisał(a):
>
>> Dear all,
>>
>> Thanks for all your suggestions.
>>
>> 1. To completely solve the long-overdue JTS issue, I made a Sedona PR and
>> two JTS PRs. @Jim Hughes  , @Paweł Kociński
>>  , I, and probably Martin from JTS will take
>> care of these PRs in the coming days.
>> (1) Sedona PR: https://github.com/apache/incubator-sedona/pull/488
>> (2) JTS PR: https://github.com/locationtech/jts/pull/633
>> https://github.com/locationtech/jts/pull/634
>>
>> 2. To move forward with the first release, I have deleted the "SNAPSHOT"
>> in my JTS 1.16 fork.
>> Most likely, we have to move forward with my JTS 1.16 fork in the first
>> Sedona release because of the conflict among JTStoGeoJSON, GeoTools, and
>> JTS 1.17.
>> So @Netanel Malka   could you please do another
>> dry-run on the Sedona first release on this Sedona branch: sedona-1.0-doc:
>> https://github.com/apache/incubator-sedona/tree/sedona-1.0-doc
>>
>> Thanks,
>> Jia
>>
>> On Thu, Nov 12, 2020 at 11:36 AM Jim Hughes  wrote:
>>
>>> Hi Mo,
>>>
>>> I can definitely help.  The first step will be for Jia to push a PR for
>>> the JTS changes.  (Since they are his changes, I cannot do this on his
>>> behalf.)
>>>
>>>  From talking to the lead JTS developer, he wanted to see the previous
>>> PR (from months/a year+ ago) split up.  I think the initial PR should be
>>> used to discuss what changes are sensible for JTS and where we'll need
>>> to push some of the changes to Sedona.
>>>
>>> Concretely, I noticed that the Sedona JTS fork changes the toString on
>>> Geometry to include printing out the userData.  I imagine that may cause
>>> trouble for downstream JTS users, so it'd be good to find an
>>> alternative.  One suggestion would to be add a static method in Sedona
>>> for printing a Geometry with its userData object.
>>>
>>> Cheers,
>>>
>>> Jim
>>>
>>> On 11/12/20 12:32 PM, Mohamed Sarwat wrote:
>>> > Folks,
>>> >
>>> > I totally agree with Jim on that. Jim, would you like to take the lead
>>> on that - I trust that you can bring this task to completion. Jia, would
>>> you please let us know how we can incorporate the changes into the JTS
>>> master branch?
>>> >
>>> > Thanks,
>>> >
>>> >> On Nov 12, 2020, at 10:10 AM, Jim Hughes  wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> As a JTS committer, I have tried to request that the Sedona project
>>> discuss the desired changes to JTS previously.  I'd still encourage that.
>>> >>
>>> >> JTS is an active project and I feel that maintaining a fork of JTS is
>>> unnecessary and inappropriate.
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Jim
>>> >>
>>> >>> On 11/11/20 9:04 PM, Felix Cheung wrote:
>>> >>> Ah. You will need to publish it in order for the dependency chain to
>>> work
>>> >>> on Maven Central
>>> >>>
>>> >>> However, since you are not the project owner there you might need to
>>> >>> publish that under a different artifact id.
>>> >>>
>>> >>> In general, it would be best to avoid hard forking another project
>>> like
>>> >>> this.