douglasdennis opened a new pull request, #745:
URL: https://github.com/apache/sedona/pull/745

   
   ## Did you read the Contributor Guide?
   
   - Yes, I have read [Contributor 
Rules](https://sedona.apache.org/community/rule/) and [Contributor Development 
Guide](https://sedona.apache.org/community/develop/)
   
   ## Is this PR related to a JIRA ticket?
   
   - Yes, the URL of the associated JIRA ticket is 
https://issues.apache.org/jira/browse/SEDONA-227. The PR name follows the 
format `[SEDONA-XXX] my subject`.
   
   ## What changes were proposed in this PR?
   
   This PR refactors several of the Python serialization/deserialization functions to address the performance regression introduced by the change in serialization formats. Most of the functions were rewritten for speed, while trying to keep the code readable.
   
   A significant pain point is interacting with shapely geometry objects. Most shapely methods are slow because they involve repeated calls into native code. Once shapely 2.0 is adopted by Sedona and its users, it may be a good idea to revisit this code and try to use some of the libgeos connections that shapely exposes.
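
To illustrate the general cost pattern described above (many small calls versus one bulk call), here is a minimal sketch using only the standard-library `struct` module. It is not code from this PR: the function names and the flat little-endian `x, y` layout are hypothetical, chosen only to show that both approaches produce identical bytes while the second makes a single packing call.

```python
import struct
from itertools import chain


def pack_per_coordinate(coords):
    """Pack each (x, y) pair with its own struct.pack call (one call per point)."""
    return b"".join(struct.pack("<dd", x, y) for x, y in coords)


def pack_bulk(coords):
    """Flatten all pairs and pack them with a single struct.pack call."""
    flat = list(chain.from_iterable(coords))
    return struct.pack(f"<{len(flat)}d", *flat)


# Same coordinates as the "long line" geometry in the benchmark script.
coords = [(float(n), float(n)) for n in range(1000)]

per_point = pack_per_coordinate(coords)
bulk = pack_bulk(coords)

# Both strategies yield the same byte string: 16 bytes per (x, y) pair.
assert per_point == bulk
assert len(bulk) == 16 * len(coords)
```

Whether the bulk form is actually faster for a given geometry should be measured rather than assumed.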
   
   There are other issues in the current implementation of serde:
   1. Shapely does not respect an M coordinate yet, so it is not supported in Python at the moment.
   2. SRID is not supported in Python for a similar reason. Part of Sedona's Python library adds a "userData" field to shapely geometry classes at runtime. This could be used to store SRIDs and possibly M coordinates.
   3. This PR mostly avoids using the GeometryBuffer class. I don't believe it 
will be the fastest solution in the future. However, I think it's a nice class 
for specifically handling the GeometryCollection type.
   4. Should the serialization format allow for empty geometry within 
collections? They have no WKT representation, so I'm not sure that should be 
allowed.
   
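   Finally, here are the benchmark results comparing current master with this branch. Times are total seconds over all iterations of a trial. Factor is (Sedona - Shapely) / Shapely, so 0 means parity with shapely's WKB `dumps`/`loads` and a negative value means Sedona is faster: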
   ```
   Total time (seconds)
                                      | MASTER                                       | REFACTOR
   Trial                              | Shapely    Sedona      Factor                | Shapely    Sedona      Factor
   -----------------------------------+----------------------------------------------+---------------------------------------------
   short line serialize               | 3.1197412  7.6139485   1.440570551172642     | 1.6295438  1.624361    -0.00318052205776856
   long line serialize                | 4.3532153  53.1161724  11.201595542494763    | 4.106511   20.2585903  3.9332852876809534
   point serialize                    | 4.5569092  12.9271814  1.8368310257312128    | 4.1321635  3.6287594   -0.12182579416327549
   small polygon serialize            | 1.9922023  15.1673936  6.613380227499988     | 1.6930909  2.5668217   0.5160566393688608
   large polygon serialize            | 2.5921638  27.4987078  9.608398975404254     | 2.3090157  2.8944101   0.25352551738821005
   small multipoint serialize         | 0.0964883  0.8657994   7.97310243832672      | 0.0957348  0.5249629   4.483511742856307
   large multipoint serialize         | 0.2059924  18.6648998  89.60965258912465     | 0.2002637  13.9681053  68.74856301965858
   small multilinestring serialize    | 0.1003702  1.2611056   11.564542065274354    | 0.0984753  0.5051064   4.129269979375539
   large multilinestring serialize    | 0.1382676  14.4627404  103.59963433226584    | 0.1290179  6.4109233   48.69018485031922
   small multipolygon serialize       | 0.1074735  2.5492284   22.71959971527865     | 0.107037   1.1849261   10.070247671365976
   large multipolygon serialize       | 0.218638   33.7390888  153.31484371426743    | 0.2061548  16.8197829  80.58812164451179
   short line deserialize             | 2.6179912  1.8462964   -0.2947660022692208   | 2.3070263  1.8466384   -0.19955901673075854
   long line deserialize              | 3.1905663  46.6516656  13.62175087851959     | 3.0322167  3.0514442   0.006341070544199562
   point deserialize                  | 5.8758644  5.2898097   -0.0997393166527124   | 5.7208507  5.2560053   -0.08125459383164815
   small polygon deserialize          | 2.4844221  4.120936    0.6587100879516408    | 2.3755022  3.0382661   0.27899948903436084
   large polygon deserialize          | 2.0467411  20.173081   8.85619578362891      | 1.9710574  3.0417363   0.5432002639801358
   small multipoint deserialize       | 0.1267147  0.4915774   2.8794031000349603    | 0.1221836  0.4675109   2.8262982920784787
   large multipoint deserialize       | 0.2152692  10.9262919  49.75641057801116     | 0.2037456  11.1691675  53.81918382531942
   small multilinestring deserialize  | 0.1260666  0.4858702   2.8540755441964802    | 0.1310238  0.5045361   2.8507210140447765
   large multilinestring deserialize  | 0.1432103  5.63192     38.3262216474653      | 0.1417217  5.8515723   40.28917660457079
   small multipolygon deserialize     | 0.1392047  1.3398053   8.624713102359332     | 0.12827    1.1914238   8.288405706712403
   large multipolygon deserialize     | 0.2063552  19.0151752  91.14778789194554     | 0.2101143  15.8966164  74.65699431214344
   ```
   Here is the Python script I used to generate these results:
   ```
   from sedona.utils.geometry_serde import serialize, deserialize
   from shapely.geometry import (
       LineString, Point, Polygon, MultiPoint, MultiLineString, MultiPolygon,
   )
   from shapely.wkb import dumps, loads
   
   import time
   
   def run_serialize_trial(geom, number_iterations, name):
       print(f"{name} serialize trial:")
   
       start_time = time.perf_counter_ns()
       for _ in range(number_iterations):
           dumps(geom)
       shapely_time = time.perf_counter_ns() - start_time
   
       start_time = time.perf_counter_ns()
       for _ in range(number_iterations):
           serialize(geom)
       sedona_time = time.perf_counter_ns() - start_time
   
       print("\tTotal Time (seconds):")
       print(
           f"\t\tShapely: {shapely_time / 1e9}\n"
           f"\t\tSedona: {sedona_time / 1e9}\n"
           f"\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n"
       )
   
   def run_deserialize_trial(geom, number_iterations, name):
       print(f"{name} deserialize trial:")
   
       shapely_serialized_geom = dumps(geom)
       sedona_serialized_geom = serialize(geom)
   
       start_time = time.perf_counter_ns()
       for _ in range(number_iterations):
           loads(shapely_serialized_geom)
       shapely_time = time.perf_counter_ns() - start_time
   
       start_time = time.perf_counter_ns()
       for _ in range(number_iterations):
           deserialize(sedona_serialized_geom)
       sedona_time = time.perf_counter_ns() - start_time
   
       print("\tTotal Time (seconds):")
       print(
           f"\t\tShapely: {shapely_time / 1e9}\n"
           f"\t\tSedona: {sedona_time / 1e9}\n"
           f"\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n"
       )
   
   short_line_iterations = 200_000
   short_line = LineString([(10.0, 10.0), (20.0, 20.0)])
   
   long_line_iterations = 100_000
   long_line = LineString([(float(n), float(n)) for n in range(1000)])
   
   point_iterations = 500_000
   point = Point(12.3, 45.6)
   
   small_polygon_iterations = 200_000
   small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
   
   large_polygon_iterations = 100_000
   large_polygon = Polygon(
       [(0.0, float(n * 10)) for n in range(100)]
       + [(float(n * 10), 990.0) for n in range(100)]
       + [(990.0, float(n * 10)) for n in reversed(range(100))]
       + [(float(n * 10), 0.0) for n in reversed(range(100))]
   )
   
   small_multipoint_iterations = 10_000
   small_multipoint = MultiPoint([(n, n) for n in range(3)])
   
   large_multipoint_iterations = 10_000
   large_multipoint = MultiPoint([(n, n) for n in range(100)])
   
   small_multilinestring_iterations = 10_000
   small_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(3)])
   
   large_multilinestring_iterations = 5_000
   large_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(100)])
   
   small_multipolygon_iterations = 10_000
   small_multipolygon = MultiPolygon([small_polygon for _ in range(3)])
   
   large_multipolygon_iterations = 5_000
   large_multipolygon = MultiPolygon([small_polygon for _ in range(100)])
   
   run_serialize_trial(short_line, short_line_iterations, "short line")
   run_serialize_trial(long_line, long_line_iterations, "long line")
   run_serialize_trial(point, point_iterations, "point")
   run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
   run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
   run_serialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
   run_serialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
   run_serialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
   run_serialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
   run_serialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
   run_serialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")
   
   run_deserialize_trial(short_line, short_line_iterations, "short line")
   run_deserialize_trial(long_line, long_line_iterations, "long line")
   run_deserialize_trial(point, point_iterations, "point")
   run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
   run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon")
   run_deserialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
   run_deserialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
   run_deserialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
   run_deserialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
   run_deserialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
   run_deserialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")
   ```
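
As a worked example of how to read the numbers, the sketch below recomputes the "Factor" value the script prints, using the short line serialize times reported above. The helper names `overhead_factor` and `speedup` are hypothetical and not part of the script; `speedup` is an additional derived ratio (master Sedona time over refactor Sedona time) that the script does not print.

```python
def overhead_factor(sedona_seconds, shapely_seconds):
    """Relative overhead of Sedona over shapely WKB, same formula as the script's Factor."""
    return (sedona_seconds - shapely_seconds) / shapely_seconds


def speedup(master_seconds, refactor_seconds):
    """How many times faster the refactor is than master for the same trial."""
    return master_seconds / refactor_seconds


# Short line serialize trial, times in seconds taken from the results above.
master_factor = overhead_factor(sedona_seconds=7.6139485, shapely_seconds=3.1197412)
refactor_factor = overhead_factor(sedona_seconds=1.624361, shapely_seconds=1.6295438)

# Long line deserialize trial, Sedona times only (master vs. refactor).
long_line_speedup = speedup(master_seconds=46.6516656, refactor_seconds=3.0514442)

print(f"master factor:     {master_factor:.4f}")
print(f"refactor factor:   {refactor_factor:.4f}")
print(f"long line speedup: {long_line_speedup:.2f}x")
```

A factor near 0 means Sedona's serde costs about the same as shapely's WKB path for that trial; the master and refactor runs were timed separately, so small differences between them may be noise.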
   
   ## How was this patch tested?
   
   A parameterized set of unit tests was added that exercises the serialization/deserialization round trip of geometry objects between Python and Spark.
   
   ## Did this PR include necessary documentation updates?
   
   - No, this PR does not affect any public API so no need to change the docs.
   

