Re: Apache Sedona contribution

2021-03-29 Thread Jia Yu
Hi Alessandro,

You cannot use Sedona's KNNQuery.SpatialKnnQuery after DistanceJoinQuery.
Instead, add your own filtering logic (in a Spark mapPartitions function)
on top of the DistanceJoinQuery result.
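
As a rough illustration of the filtering step described above (not Sedona code): once the distance join has paired a query object with every candidate within distance D, keeping the k nearest comes down to a sort and a truncate per query object. The sketch below is plain Java with no Sedona or Spark dependency. The class name KnnWithinDistance, the double[] {x, y} point representation, and the Euclidean metric are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative helper only; hypothetical, not part of Sedona. */
final class KnnWithinDistance {

    /** Euclidean distance between two 2-D points given as {x, y}. */
    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /**
     * Returns at most k candidates closest to the query point, nearest first.
     * The candidates are assumed to be the matches that a distance join
     * already restricted to radius D around the query point.
     */
    static List<double[]> kNearest(double[] query, List<double[]> candidates, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Copy so the caller's list is left untouched, then sort by distance.
        List<double[]> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((double[] c) -> distance(query, c)));
        // Fewer than k candidates within D means fewer than k results.
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

In a real job this logic would presumably run inside the mapPartitions function over the join output, with JTS geometries in place of the double[] pairs.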

1. Your contribution should cover the RDD API. For now, I cannot think of a
SQL syntax that describes the KNN join query.
2. Your contribution should cover both the Scala/Java API and Python. The
core algorithm will be implemented in Java, in KNNQuery.java, so it works
for Scala automatically. For Python support, you need a corresponding
wrapper API in Python. You can finish the Java implementation first, then
create the PR and consult Paweł Kociński, who leads the Python API.
3. You can refer to [1] and [2] for compiling and documenting your work, but
you won't be able to publish since you are not a Sedona committer.

Thanks,
Jia


On Mon, Mar 29, 2021 at 2:42 AM Alessandro Calvio wrote:

> Hi all,
>
> Thank you for your answer.
>
> It would be very interesting to understand how to implement the solution
> proposed by the paper in Apache Sedona.
>
> In any case, I think I could try to implement the simplified version you
> proposed. If I understand correctly, it would amount to running the current
> SpatialKnnQuery function on the geometries returned by DistanceJoinQuery.
> Am I right?
>
> Can I use links [1], [2] and [3] as a guide to the compilation and
> publishing process? And what about the limits on the contribution that I
> asked about in my previous message?
>
> Finally, I didn't receive Adam's email, but yes, the expected output is
> the one you described.
>
> Thanks,
> Best regards,
> Alessandro
>
> [1]: https://sedona.incubator.apache.org/community/rule/
> [2]: https://sedona.incubator.apache.org/download/compile/#compile-the-documentation
> [3]: https://sedona.incubator.apache.org/download/publish/

[GitHub] [incubator-sedona] jiayuasu commented on pull request #516: Sedona 17 shape ser de

2021-03-29 Thread GitBox


jiayuasu commented on pull request #516:
URL: https://github.com/apache/incubator-sedona/pull/516#issuecomment-809796086


   Awesome. Once you finish the alternative WKB writer, we can publish the 
Sedona 1.0.1 release. There are a few relatively critical patches to be 
released in 1.0.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-sedona] netanel246 commented on pull request #516: Sedona 17 shape ser de

2021-03-29 Thread GitBox


netanel246 commented on pull request #516:
URL: https://github.com/apache/incubator-sedona/pull/516#issuecomment-809715103


   @jiayuasu, I will add more commits soon to eliminate the code 
duplicated between ShapeGeometrySerde and WKBGeometrySerde. 
   
   You can start reviewing the rest of the code if you want to. 






[jira] [Closed] (SEDONA-17) Replace geometry serializer in RDD API with the WKB serializer

2021-03-29 Thread Netanel Malka (Jira)


 [ 
https://issues.apache.org/jira/browse/SEDONA-17?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Netanel Malka closed SEDONA-17.
---
Resolution: Won't Do

Will be part of SEDONA-28

> Replace geometry serializer in RDD API with the WKB serializer
> --
>
>                 Key: SEDONA-17
>                 URL: https://issues.apache.org/jira/browse/SEDONA-17
>             Project: Apache Sedona
>          Issue Type: Task
>            Reporter: Netanel Malka
>            Assignee: Netanel Malka
>            Priority: Normal
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the Sedona SQL module, we are using the WKB serializer instead of the 
> Shape serializer because of an old bug.
> Now we also want to replace the serializer in the RDD API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-sedona] netanel246 opened a new pull request #516: Sedona 17 shape ser de

2021-03-29 Thread GitBox


netanel246 opened a new pull request #516:
URL: https://github.com/apache/incubator-sedona/pull/516


   ## Is this PR related to a proposed Issue?
   [SEDONA-28](https://issues.apache.org/jira/browse/SEDONA-28)
   ## What changes were proposed in this PR?
   Added the WKB serializer as an optional serializer and kept the old Serde as 
the default SerDe for both Core and SQL. Users should only switch to the WKB 
serializer if they use geometries that the old Serde does not currently support.
   The user should be able to choose the serializer as follows:
   
   // Default: the old Serde
   .config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer
   .config("spark.kryo.registrator", classOf[SedonaKryoRegistrator].getName)
   
   // Optional: the WKB serde
   .config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer
   .config("spark.kryo.registrator", classOf[SedonaWKBKryoRegistrator].getName)
   
   ## How was this patch tested?
   Using the existing tests. I will add more tests soon.
   ## Did this PR include necessary documentation updates?
   Not yet. Will be added soon.






[jira] [Commented] (SEDONA-17) Replace geometry serializer in RDD API with the WKB serializer

2021-03-29 Thread Netanel Malka (Jira)


[ 
https://issues.apache.org/jira/browse/SEDONA-17?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17310987#comment-17310987
 ] 

Netanel Malka commented on SEDONA-17:
-

Closing this issue and opening SEDONA-28, as we decided to implement both 
serdes and let the user decide which one to use.
 



[jira] [Created] (SEDONA-28) Add WKB serializer in RDD and SQL API and let the user choose the SerDe

2021-03-29 Thread Netanel Malka (Jira)
Netanel Malka created SEDONA-28:
---

             Summary: Add WKB serializer in RDD and SQL API and let the user choose the SerDe
                 Key: SEDONA-28
                 URL: https://issues.apache.org/jira/browse/SEDONA-28
             Project: Apache Sedona
          Issue Type: Task
            Reporter: Netanel Malka
            Assignee: Netanel Malka


Add the WKB serializer as an optional serializer and keep the old Serde as the 
default SerDe for both Core and SQL. Users should only switch to the WKB 
serializer if they use geometries that the old Serde does not currently support.
The user should be able to choose the serializer as follows:

// Default: the old Serde
.config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer
.config("spark.kryo.registrator", classOf[SedonaKryoRegistrator].getName)

// Optional: the WKB serde
.config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer
.config("spark.kryo.registrator", classOf[SedonaWKBKryoRegistrator].getName)
 





[GitHub] [incubator-sedona] netanel246 closed pull request #510: [SEDONA-17] Use WKB serde instead of the ShapeSerde

2021-03-29 Thread GitBox


netanel246 closed pull request #510:
URL: https://github.com/apache/incubator-sedona/pull/510


   






[GitHub] [incubator-sedona] netanel246 commented on pull request #510: [SEDONA-17] Use WKB serde instead of the ShapeSerde

2021-03-29 Thread GitBox


netanel246 commented on pull request #510:
URL: https://github.com/apache/incubator-sedona/pull/510#issuecomment-809705023


   @jiayuasu I am opening a new branch for the optional WKB serde because this 
one is no longer relevant.
   I will close this PR.






Re: Apache Sedona contribution

2021-03-29 Thread Alessandro Calvio
Hi all,
Thank you for your answer.
It would be very interesting to understand how to implement the solution 
proposed by the paper in Apache Sedona.
In any case, I think I could try to implement the simplified version you 
proposed. If I understand correctly, it would amount to running the current 
SpatialKnnQuery function on the geometries returned by DistanceJoinQuery. Am I 
right?

Can I use links [1], [2] and [3] as a guide to the compilation and publishing 
process? And what about the limits on the contribution that I asked about in 
my previous message?

Finally, I didn't receive Adam's email, but yes, the expected output is the one 
you described.

Thanks,
Best regards,
Alessandro

[1]: https://sedona.incubator.apache.org/community/rule/
[2]: https://sedona.incubator.apache.org/download/compile/#compile-the-documentation
[3]: https://sedona.incubator.apache.org/download/publish/


From: Jia Yu
Sent: Monday, March 29, 2021 07:53
To: dev@sedona.apache.org; alexcal...@hotmail.it; adam...@gmail.com
Subject: Re: Apache Sedona contribution

Hi folks,

Thanks for your proposal. However, Sedona does not have a KNN join query 
because a complete and correct KNN join is very difficult to implement.

Note that the existing spatial partitioning scheme in Sedona cannot produce a 
correct KNN join: once you zip two RDDs together, there is no guarantee that, 
for each point in Partition A of RDD1, its kth nearest neighbor lies in 
Partition A of RDD2. A correct KNN join needs a partitioning mechanism that 
provides this guarantee. This research problem has been studied in this TKDE 
paper: 
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7337428_token=PZSM8VwhkwMA:slOnDt2_70HFwdu81c_7jVRiYcZPj7FPbJ3OvET_g0ApMDDEcg2Fq71CMgYxWrSCdXmjZqACew=1
We have confirmed that this is the solution we want.
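
The partitioning problem described above can be seen in a toy example. The sketch below is plain Java, not Sedona code, and it assumes a hypothetical unit-width grid on a single axis in place of Sedona's real spatial partitioners. The class and method names are made up for the illustration.

```java
/** Toy illustration only; hypothetical, not part of Sedona. */
final class PartitionLocalKnnDemo {

    /** Assigns a 1-D coordinate to a unit-width grid cell (assumed partitioner). */
    static int cellOf(double x) {
        return (int) Math.floor(x);
    }

    /**
     * Returns the candidate nearest to q, or NaN if there is none.
     * With sameCellOnly set, only candidates in q's own cell are considered,
     * which mimics searching a single zipped partition.
     */
    static double nearest(double q, double[] candidates, boolean sameCellOnly) {
        double best = Double.NaN;
        for (double c : candidates) {
            if (sameCellOnly && cellOf(c) != cellOf(q)) {
                continue; // candidate lives in another partition
            }
            if (Double.isNaN(best) || Math.abs(c - q) < Math.abs(best - q)) {
                best = c;
            }
        }
        return best;
    }
}
```

For a query point at 0.9 and candidates at 0.1 and 1.1, the partition-local search returns 0.1, because 1.1 falls in the neighbouring cell, while the true nearest neighbor is 1.1.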

Alessandro, if you want to proceed, I suggest you implement a simplified 
version of KNN join:

For each object in RDD1, find its k nearest neighbors in RDD2 among the objects 
that lie within a circle of radius D around it.

To do so, you can apply a k-nearest-neighbor map function to the result of the 
Sedona JoinQuery.DistanceJoinQuery API: 
https://github.com/apache/incubator-sedona/blob/master/core/src/main/java/org/apache/sedona/core/spatialOperator/JoinQuery.java#L289
   or 
https://github.com/apache/incubator-sedona/blob/master/core/src/main/java/org/apache/sedona/core/spatialOperator/JoinQuery.java#L253
Thanks,
Jia






On Fri, Mar 26, 2021 at 8:35 PM Adam Binford <adam...@gmail.com> wrote:
Out of curiosity and knowing next to nothing about KNN, what is the return
value supposed to represent? The K nearest geometries in spatialRDD
to any geometry in datasetPoint?

Adam

On Fri, Mar 26, 2021, 6:56 AM Alessandro Calvio <alexcal...@hotmail.it> wrote:

> Hi,
> I am a Computer Engineering graduate, and I am writing about the
> possibility of contributing to the Apache Sedona project.
> During my work I ran into a limitation: the KNNQuery operation cannot be
> performed against a dataset, only against a single point.
> The contribution would therefore add a new signature for SpatialKnnQuery:
>
> public static  List
> SpatialKnnQuery(
> SpatialRDD spatialRDD, SpatialRDD datasetPoint, Integer k, boolean
> useIndex
> )
>
> The solution I've tried is similar to the one used for the join query.
> In short, I subdivide both datasets geographically, zip the partitions
> together, and finally iterate over each partition to compute the
> nearest-neighbour query.
> I'd like to know whether this would be a good proposal for a contribution,
> and I have some questions about the idea:
>
>   1.  Can the contribution be limited to the RDD API, or should it cover
> the SQL API too?
>   2.  Can the contribution be limited to the Scala/Java API, or should it
> cover the Python API too?
>   3.  Can the tests be run locally, or should I deploy something like a
> cluster?
>
> This would be my first contribution to an open-source project, so I'm not
> very experienced with this kind of procedure. I want to be sure that I
> develop and submit my solution in the correct environment: where can I find
> a guide or documentation with all the steps to follow once the proposal is
> approved?
>
> Looking forward to your response,
> Best regards,
> Alessandro.
>