[ https://issues.apache.org/jira/browse/SEDONA-231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697693#comment-17697693 ]
Doug Dennis commented on SEDONA-231: ------------------------------------ [~jiayu] Oh, shoot, sorry. This got lost in the shuffle. I can push a draft PR for comment tonight. The only issue I hit was not knowing how to test it. Let me put that up, but if [~Kontinuation] has a better path forward then I'd be happy to close it. > Redundant Serde Removal > ----------------------- > > Key: SEDONA-231 > URL: https://issues.apache.org/jira/browse/SEDONA-231 > Project: Apache Sedona > Issue Type: Improvement > Reporter: Doug Dennis > Priority: Major > > Currently, Geometry objects are deserialized and reserialized during every > evaluation of a function on a row in Spark. This amounts to a great deal of > redundant serde during query execution. At times, objects are serialized just > to be immediately deserialized. > To demonstrate this in action, I placed print statements in the > GeometrySerializer serialize and deserialize methods, the GeometryUDT > serialize and deserialize methods, and in the eval methods of several > functions. When the following is executed: > > {noformat} > val columns = Seq("input", "blade") > val data = Seq( > ("GEOMETRYCOLLECTION ( LINESTRING (0 0, 1.5 1.5, 2 2), LINESTRING (3 3, 4.5 > 4.5, 5 5))", "MULTIPOINT (0.5 0.5, 1 1, 3.5 3.5, 4 4)") > ) > var df = spark.createDataFrame(data).toDF(columns:_*) > println( > df.selectExpr("ST_Normalize(ST_Split(ST_GeomFromWKT(input), > ST_GeomFromWKT(blade))) AS > result").collect()(0).get(0).asInstanceOf[Geometry].toText() > ){noformat} > I get the following output: > {noformat} > **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize** > **org.apache.spark.sql.sedona_sql.expressions.ST_Split** > **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT** > Inside GeometrySerializer.serialize > Inside GeometrySerializer.serialize > Inside GeometrySerializer.serialize > Inside GeometrySerializer.deserialize > **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT** > Inside GeometrySerializer.serialize > Inside GeometrySerializer.deserialize > Inside GeometrySerializer.serialize > Inside GeometrySerializer.deserialize > Inside GeometrySerializer.serialize > Inside UDT deserialize. > Inside GeometrySerializer.deserialize > MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, > 3.5 3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5)){noformat} > To explain what is happening: > # ST_Normalize.eval is called. > # ST_Normalize.eval calls ST_Split.eval. > # ST_Split.eval first calls the ST_GeomFromWKT that had the > GEOMETRYCOLLECTION wkt. > # ST_GeomFromWKT processes the wkt string and generates a Geometry object. > # The Geometry object is passed to GeometrySerializer.serialize. This is the > first call to serialize. > # This object is a GeometryCollection and the GeometrySerializer uses > recursion to handle them so you see two more additional calls to serialize. > # The GeometryCollection is then immediately deserialized and returned to > ST_Split. > # The second ST_GeomFromWKT is called (this one has a MULTIPOINT wkt). > # ST_GeomFromWKT processes the WKT and then serializes the geometry. > # That geometry is immediately deserialized and returned to ST_Split. > # ST_Split performs its operation and then serializes the geometry. > # That geometry is then immediately deserialized and returned to > ST_Normalize. > # ST_Normalize normalizes the geometry object and then serializes it for > good. > # Then the GeometryUDT.deserialize is called to handle the collect call > which of course calls GeometrySerializer.deserialize. > There are multiple instances here where geometry objects are serialized and > then immediately deserialized to be further operated on. That is obviously > pretty wasteful. > > I propose eliminating this redundancy through the following steps. > * Create a trait called SerdeAware which has a single method called > evalWithoutSerialization. > * This trait is then added to the InferredUnaryExpression, > InferredBinaryExpression, InferredTernaryExpression, UnaryGeometryExpression, > and BinaryGeometryExpression abstract classes. > * When a Sedona expression is evaluating its children expressions, it first > checks the child for the SerdeAware trait. If the trait is detected then the > parent expression calls the child's evalWithoutSerialization method. This > method returns an actual geometry object without the child having serialized > it. > In the test implementation I created I was able to get the following output: > {noformat} > **org.apache.spark.sql.sedona_sql.expressions.ST_Normalize** > **org.apache.spark.sql.sedona_sql.expressions.ST_Split** > **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT** > **org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT** > Inside GeometrySerializer.serialize > Inside UDT deserialize. > Inside GeometrySerializer.deserialize > MULTILINESTRING ((0 0, 0.5 0.5), (0.5 0.5, 1 1), (1 1, 1.5 1.5, 2 2), (3 3, > 3.5 3.5), (3.5 3.5, 4 4), (4 4, 4.5 4.5, 5 5)){noformat} > You can see that only a single serialization was called and only at the very > end of the computation. > > Edit: I updated the proposed method with Adam's suggestion. I also extended > the proposal to include the other expression types. -- This message was sent by Atlassian Jira (v8.20.10#820010)