date:20230106

[jira] [Created] (SEDONA-231) Redundant Serde Removal

2023-01-06 Thread Doug Dennis (Jira)

Doug Dennis created SEDONA-231:
--

    Summary: Redundant Serde Removal
        Key: SEDONA-231
        URL: https://issues.apache.org/jira/browse/SEDONA-231
    Project: Apache Sedona
 Issue Type: Improvement
   Reporter: Doug Dennis


Currently, Geometry objects are deserialized and reserialized during every 
evaluation of a function on a row in Spark. This amounts to a great deal of 
redundant serde during query execution. At times, objects are serialized just 
to be immediately deserialized.

To demonstrate this in action, I placed print statements in the 
GeometrySerializer serialize and deserialize methods, the GeometryUDT serialize 
and deserialize methods, and in the eval methods of several functions. When the 
following is executed:
 
{noformat}
import org.locationtech.jts.geom.Geometry

val columns = Seq("input", "blade")
val data = Seq(
  ("GEOMETRYCOLLECTION ( LINESTRING (0 0, 1.5 1.5, 2 2), LINESTRING (3 3, 4.5 4.5, 5 5))",
   "MULTIPOINT (0.5 0.5, 1 1, 3.5 3.5, 4 4)")
)
val df = spark.createDataFrame(data).toDF(columns: _*)
// Parse both WKT strings, split the collection by the blade, then normalize.
val result = df
  .selectExpr("ST_Normalize(ST_Split(ST_GeomFromWKT(input), ST_GeomFromWKT(blade))) AS result")
  .collect()(0).get(0).asInstanceOf[Geometry]
println(result.toText())
{noformat}
I get the following output:
{noformat}
 **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize**
 **org.apache.spark.sql.sedona_sql.expressions.ST_Split**
 **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
Inside GeometrySerializer.serialize
Inside GeometrySerializer.serialize
Inside GeometrySerializer.serialize
Inside GeometrySerializer.deserialize
 **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
Inside GeometrySerializer.serialize
Inside GeometrySerializer.deserialize
Inside GeometrySerializer.serialize
Inside GeometrySerializer.deserialize
Inside GeometrySerializer.serialize
Inside UDT deserialize.
Inside GeometrySerializer.deserialize
MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, 3.5 
3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5)){noformat}
To explain what is happening:
 # ST_Normalize.eval is called.
 # ST_Normalize.eval calls ST_Split.eval.
 # ST_Split.eval first evaluates the ST_GeomFromWKT whose input is the 
GEOMETRYCOLLECTION WKT.
 # ST_GeomFromWKT parses the WKT string and generates a Geometry object.
 # The Geometry object is passed to GeometrySerializer.serialize. This is the 
first call to serialize.
 # The object is a GeometryCollection, and GeometrySerializer handles 
collections recursively, so two more calls to serialize follow.
 # The GeometryCollection is then immediately deserialized and returned to 
ST_Split.
 # The second ST_GeomFromWKT is evaluated (this one has the MULTIPOINT WKT).
 # ST_GeomFromWKT parses the WKT and then serializes the geometry.
 # That geometry is immediately deserialized and returned to ST_Split.
 # ST_Split performs its operation and then serializes the result.
 # That result is immediately deserialized and returned to ST_Normalize.
 # ST_Normalize normalizes the geometry and serializes it for the last time.
 # Finally, GeometryUDT.deserialize is called to handle the collect call, which 
in turn calls GeometrySerializer.deserialize.

In this one query, a geometry object is serialized and then immediately 
deserialized three times (steps 7, 10, and 12), purely to hand a value from one 
expression to the next. None of those round trips is needed.

 

I propose eliminating this redundancy through the following steps.
 * Create a trait called SerdeAware which has a single method called 
doNotSerializeOutput.
 * Add this trait to the InferredUnaryExpression and InferredBinaryExpression 
abstract classes.
 * Calling doNotSerializeOutput on one of these expression classes sets a 
serializeOutput flag to false.
 * The class's eval method reads that flag: if it is true the output is 
serialized, and if it is false the output is returned unserialized.
 * Finally, modify the buildExtractor method of the InferredTypes object to 
detect whether the input expression is SerdeAware and, if so, call 
doNotSerializeOutput before calling the input expression's eval method.
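
The flag mechanism in the steps above can be modelled outside Spark. The sketch below is a toy in Python, not Sedona code: `GeomExpr`, `extract`, `serialize` and `deserialize` are hypothetical stand-ins for the expression classes, buildExtractor, and GeometrySerializer, and it does not model the recursive serialization of collections. It only counts serde calls for a Normalize(Split(FromWKT, FromWKT)) tree with and without the flag.

```python
# Toy model of the proposed flag; all names here are illustrative assumptions.
counts = {"serialize": 0, "deserialize": 0}

def serialize(geom):
    counts["serialize"] += 1
    return ("blob", geom)

def deserialize(blob):
    counts["deserialize"] += 1
    return blob[1]

class GeomExpr:
    """Stand-in for an expression that produces a geometry."""
    def __init__(self, fn, *children):
        self.fn = fn
        self.children = children
        self.serialize_output = True  # the serializeOutput flag

    def do_not_serialize_output(self):
        self.serialize_output = False

    def eval(self, serde_aware):
        args = [extract(child, serde_aware) for child in self.children]
        out = self.fn(*args)
        return serialize(out) if self.serialize_output else out

def extract(child, serde_aware):
    """Plays the role of buildExtractor: fetch a child's value as a geometry."""
    if not isinstance(child, GeomExpr):
        return child  # plain literal, e.g. a WKT string
    if serde_aware:
        child.do_not_serialize_output()  # ask the child to skip serialization
        return child.eval(serde_aware)
    return deserialize(child.eval(serde_aware))  # current behaviour: round trip

def from_wkt(wkt):
    return {"wkt": wkt}

def split(geom, blade):
    return {"split": (geom, blade)}

def normalize(geom):
    return {"normalized": geom}

def run(serde_aware):
    counts.update(serialize=0, deserialize=0)
    query = GeomExpr(normalize,
                     GeomExpr(split, GeomExpr(from_wkt, "A"), GeomExpr(from_wkt, "B")))
    blob = query.eval(serde_aware)
    deserialize(blob)  # the final UDT deserialize done for collect()
    return dict(counts)

print(run(serde_aware=False))  # {'serialize': 4, 'deserialize': 4}
print(run(serde_aware=True))   # {'serialize': 1, 'deserialize': 1}
```

Only the root expression keeps its flag set, so the single remaining serialize call is the one that hands the result back to Spark.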

In the test implementation I created I was able to get the following output:
{noformat}
 **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize**
 **org.apache.spark.sql.sedona_sql.expressions.ST_Split**
 **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
 **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT**
Inside GeometrySerializer.serialize
Inside UDT deserialize.
Inside GeometrySerializer.deserialize
MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, 3.5 
3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5)){noformat}
You can see that only a single serialization is performed, at the very end of 
the computation.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[GitHub] [sedona] Kontinuation commented on pull request #745: [SEDONA-227] Python Serde Refactor

2023-01-06 Thread GitBox



Kontinuation commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373383865

   LGTM. The refactored implementation looks much more pythonic while having 
better performance. It is also a great move to remove the expensive `has_z` 
calls, as shapely currently does not support M dimension.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@sedona.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [sedona] douglasdennis commented on pull request #745: [SEDONA-227] Python Serde Refactor

2023-01-06 Thread GitBox



douglasdennis commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373758835

   @Kontinuation What's your take on dropping support for empty geometry inside 
of multi-geometries in the new serialization format? They don't have 
representation in any other form so I'm not sure that supporting them provides 
any value. 



[GitHub] [sedona] Kontinuation commented on pull request #745: [SEDONA-227] Python Serde Refactor

2023-01-06 Thread GitBox



Kontinuation commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373869602

   > @Kontinuation What's your take on dropping support for empty geometry 
inside of multi-geometries in the new serialization format? They don't have 
representation in any other form so I'm not sure that supporting them provides 
any value.
   
   I think it is OK to ignore empty geometries inside geometry collections 
since it has no impact on the topological properties of the geometries, though 
it may change some structural properties of the geometry collection such as 
indices of geometry elements (`ST_GeometryN`) and number of geometries in the 
geometry collection (`ST_NumGeometries`).
   
   If I understand it correctly, both WKT and WKB support empty geometries in 
geometry collections according to the OGC standard [Simple Feature Access - 
Part 1: Common Architecture](https://www.ogc.org/standards/sfa). The standard 
does not explicitly forbid empty geometries inside geometry collections, and 
these kinds of geometries could be represented according to the BNF of WKT:
   
   ```
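<multilinestring tagged text> ::= multilinestring <multilinestring text>
<multilinestring text> ::= <empty set> | <left paren> <linestring text> {<comma> <linestring text>}* <right paren>
<linestring text> ::= <empty set> | <left paren> <point> {<comma> <point>}* <right paren>
<empty set> ::= EMPTY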
   ```
   
   A MultiLineString containing 2 empty LineStrings could be represented as 
`MULTILINESTRING (EMPTY, EMPTY)`. Both JTS and Shapely parse it successfully, 
though they differ in what they produce:
   
   * JTS 1.19.0 parses it as a MultiLineString object containing 2 empty 
LineStrings.
   * Shapely 2.0 parses it as a MultiLineString object containing 2 empty 
LineStrings, though Shapely does not allow creating MultiLineString with empty 
components (`MultiLineString([LineString(), LineString()])` raises an 
exception).
   * Shapely 1.8 parses it as a MultiLineString containing no geometries.
   
   WKB can also represent `MULTILINESTRING (EMPTY, EMPTY)`, as 
`b'\x00\x00\x00\x00\x05\x00\x00\x00\x02\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00'`,
 though some implementations fail to parse it.
   
   * JTS 1.19.0 raised a `ParseException` when parsing it.
   * Shapely 2.0 and Shapely 1.8 parsed it without a problem. Shapely 2.0 
parsed it as a MultiLineString containing 2 empty LineStrings while Shapely 1.8 
parsed it as an empty MultiLineString.
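
   To make the byte layout above concrete, here is a minimal sketch of a decoder that uses only the Python standard library. It is a hand-written illustration, not the parser used by JTS, Shapely, or Sedona; `decode_multilinestring` is a hypothetical helper that handles plain 2D WKB only.

```python
import struct

WKB_LINESTRING = 2
WKB_MULTILINESTRING = 5

def _header(buf, pos):
    """Read the byte-order flag and geometry type starting at pos."""
    endian = "<" if buf[pos] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", buf, pos + 1)
    return endian, geom_type, pos + 5

def decode_multilinestring(buf):
    """Decode 2D MultiLineString WKB into a list of coordinate lists."""
    endian, geom_type, pos = _header(buf, 0)
    if geom_type != WKB_MULTILINESTRING:
        raise ValueError(f"expected type 5, got {geom_type}")
    (n_lines,) = struct.unpack_from(endian + "I", buf, pos)
    pos += 4
    lines = []
    for _ in range(n_lines):
        # Each member carries its own byte order and type header.
        endian, geom_type, pos = _header(buf, pos)
        if geom_type != WKB_LINESTRING:
            raise ValueError(f"expected type 2, got {geom_type}")
        (n_points,) = struct.unpack_from(endian + "I", buf, pos)
        pos += 4
        coords = []
        for _ in range(n_points):
            coords.append(struct.unpack_from(endian + "dd", buf, pos))
            pos += 16
        lines.append(coords)
    return lines

# The 27 bytes quoted above, split by field.
wkb = (
    b"\x00"              # byte order: big-endian
    b"\x00\x00\x00\x05"  # type 5: MultiLineString
    b"\x00\x00\x00\x02"  # 2 member geometries
    b"\x00"              # member 1: big-endian
    b"\x00\x00\x00\x02"  # type 2: LineString
    b"\x00\x00\x00\x00"  # 0 points
    b"\x00"              # member 2: big-endian
    b"\x00\x00\x00\x02"  # type 2: LineString
    b"\x00\x00\x00\x00"  # 0 points
)
print(decode_multilinestring(wkb))  # [[], []]
```

   The two empty lists are the two empty LineStrings: the encoding keeps them as separate members, so whether they survive parsing is up to the reader implementation.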
   



[GitHub] [sedona] douglasdennis commented on pull request #745: [SEDONA-227] Python Serde Refactor

2023-01-06 Thread GitBox



douglasdennis commented on PR #745:
URL: https://github.com/apache/sedona/pull/745#issuecomment-1373879622

   I stand corrected :) I should have been a little more thorough. I think 
empty geometries in collections should be supported then, since the standards 
expressly provide for them.



Re: JoinQueryRaw.SpatialJoinQueryFlat for polygon - linestring join?

2023-01-06 Thread Mark Broich

Thank you Jia for confirming that Sedona supports polygon-linestring joins.
I did set
'ConsiderBoundaryIntersection' to true. Am still looking for the mistake in
my code...
Regards, Mark


On Thu, Jan 5, 2023 at 6:44 PM Jia Yu  wrote:

> Hi Mark,
>
> Sedona supports polygon-linestring joins. Did you set
> 'ConsiderBoundaryIntersection' to true? See:
>
> https://sedona.apache.org/1.3.1-incubating/tutorial/core-python/#write-a-spatial-join-query
>
> This is the last parameter in Sedona Python
> JoinQueryRaw.SpatialJoinQueryFlat().
>
> Thanks,
> Jia
>
> -- Forwarded message -
> From: Mark Broich 
> Date: Thu, Jan 5, 2023 at 4:27 PM
> Subject: JoinQueryRaw.SpatialJoinQueryFlat for polygon - linestring join?
> To: 
> Cc: 
>
>
> Hi all,
>
> I am trying to use JoinQueryRaw.SpatialJoinQueryFlat() to join polygons and
> linestrings but the result is empty despite overlap in the polygons and
> linestrings.
>
> I am wondering if JoinQueryRaw.SpatialJoinQueryFlat can do the join I am
> after or if I need to do a RangeQuery.SpatialRangeQuery().
>
> Also, how do I get posting rights on the Apache Sedona community server
> ?
>
> Tnx for any pointers. Regards, Mark
>

Re: JoinQueryRaw.SpatialJoinQueryFlat for polygon - linestring join?

2023-01-06 Thread Jia Yu

Also, please try switching the left and right sides of the join and see whether
the result changes.

JoinQuery.SpatialJoinQueryFlat(RddA, RddB, considerBoundaryIntersection)
checks whether each geometry in RddA is CONTAINED BY each geometry in RddB.
With considerBoundaryIntersection set to true, geometries that only intersect
the boundary (and are not fully contained) also count as matches.

In your case, a line cannot contain a polygon, but a polygon can contain a
line. Make sure you get the order correct.
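
Why the argument order matters can be seen with a toy example in plain Python. This is not the Sedona API: `Box`, `Segment`, `contained_by` and `join_flat` are hypothetical names, a box stands in for a polygon, and the boundary-intersection flag is not modelled. It assumes the "left side is contained by right side" semantics described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle standing in for a polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

@dataclass(frozen=True)
class Segment:
    """Two-point line standing in for a linestring."""
    x1: float
    y1: float
    x2: float
    y2: float

def contained_by(a, b):
    """True if a lies entirely inside b.

    Only the segment-in-box case is modelled; every other pairing,
    including box-in-segment, returns False.
    """
    if isinstance(a, Segment) and isinstance(b, Box):
        # A box is convex, so checking both endpoints is sufficient.
        return all(
            b.xmin <= x <= b.xmax and b.ymin <= y <= b.ymax
            for x, y in ((a.x1, a.y1), (a.x2, a.y2))
        )
    return False

def join_flat(left, right):
    """Return (a, b) pairs where a from left is contained by b from right."""
    return [(a, b) for a in left for b in right if contained_by(a, b)]

polygons = [Box(0, 0, 10, 10)]
lines = [Segment(1, 1, 5, 5), Segment(20, 20, 30, 30)]

print(len(join_flat(lines, polygons)))  # 1
print(len(join_flat(polygons, lines)))  # 0
```

With the lines on the left the join finds the one segment inside the box; with the polygons on the left it returns nothing, which matches the empty result you are seeing.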

Thanks,
Jia



[GitHub] [sedona] jiayuasu merged pull request #746: [SEDONA-230] rdd.saveAsGeoJSON should generate feature properties with field names

2023-01-06 Thread GitBox



jiayuasu merged PR #746:
URL: https://github.com/apache/sedona/pull/746



[jira] [Commented] (SEDONA-231) Redundant Serde Removal

2023-01-06 Thread Adam Binford (Jira)



[ 
https://issues.apache.org/jira/browse/SEDONA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655634#comment-17655634
 ] 

Adam Binford commented on SEDONA-231:
-

This is an interesting idea! The only comment I might add is that, instead of 
using a method that sets a flag, you could create an "evalDeserialized" method 
on the SerdeAware trait that gets called instead of "eval" if the expression 
matches the trait.
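
A rough sketch of that alternative, written as a toy in Python rather than against the real Spark expression classes: `SerdeAware`, `GeomExpr`, `OpaqueExpr` and `extract` are hypothetical stand-ins, and `eval_deserialized` mirrors the suggested "evalDeserialized". The parent picks the method to call based on the child's type, so no mutable flag is needed.

```python
# Toy model of the evalDeserialized idea; all names are illustrative assumptions.
counts = {"serialize": 0, "deserialize": 0}

def serialize(geom):
    counts["serialize"] += 1
    return ("blob", geom)

def deserialize(blob):
    counts["deserialize"] += 1
    return blob[1]

class SerdeAware:
    """Mirrors the proposed trait: can return its result unserialized."""
    def eval_deserialized(self):
        raise NotImplementedError

class GeomExpr(SerdeAware):
    def __init__(self, fn, *children):
        self.fn = fn
        self.children = children

    def eval_deserialized(self):
        # Compute the geometry without touching the serializer.
        return self.fn(*[extract(child) for child in self.children])

    def eval(self):
        # The normal entry point still returns a serialized value.
        return serialize(self.eval_deserialized())

class OpaqueExpr:
    """An expression that is not SerdeAware and only offers eval()."""
    def __init__(self, geom):
        self.geom = geom

    def eval(self):
        return serialize(self.geom)

def extract(child):
    """Fetch a child's geometry, skipping serde when the child allows it."""
    if isinstance(child, SerdeAware):
        return child.eval_deserialized()
    return deserialize(child.eval())

def split(geom, blade):
    return ("split", geom, blade)

def normalize(geom):
    return ("normalized", geom)

query = GeomExpr(normalize, GeomExpr(split, OpaqueExpr("a"), OpaqueExpr("b")))
result = query.eval()
print(counts)  # {'serialize': 3, 'deserialize': 2}
```

The two non-aware leaves still pay for a round trip each, while the split result passes to normalize directly and is serialized once at the root.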


[jira] [Commented] (SEDONA-231) Redundant Serde Removal

2023-01-06 Thread Doug Dennis (Jira)



[ 
https://issues.apache.org/jira/browse/SEDONA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655651#comment-17655651
 ] 

Doug Dennis commented on SEDONA-231:


Oh, that's a much better idea. I'll try that out.

I've been able to extend this to the regular UnaryExpression and 
BinaryExpression as well.

