douglasdennis opened a new pull request, #745: URL: https://github.com/apache/sedona/pull/745
## Did you read the Contributor Guide?

- Yes, I have read [Contributor Rules](https://sedona.apache.org/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/community/develop/)

## Is this PR related to a JIRA ticket?

- Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-227. The PR name follows the format `[SEDONA-XXX] my subject`.

## What changes were proposed in this PR?

This PR refactors some of the Python serialization/deserialization functions to address the performance regression that followed the change in serialization formats. Most of the functions were refactored for speed, and I tried to keep the code readable at the same time.

A significant pain point is interacting with shapely geometry. Most of the shapely methods are too slow and involve repeated calls to native code. Once shapely 2.0 is adopted by Sedona and its users, it may be a good idea to revisit this code and try to use some of the libgeos connections that shapely exposes.

There are other issues in the current implementation of serde:

1. Shapely does not respect an M coordinate yet, so M is not supported in Python at the moment.
2. SRID is not supported in Python for a similar reason. Part of Sedona's Python library adds a "userData" field to shapely geometry classes at runtime. This could be used to store SRIDs and possibly M coordinates.
3. This PR mostly avoids using the GeometryBuffer class. I don't believe it will be the fastest solution in the future. However, I think it's a nice class for handling the GeometryCollection type specifically.
4. Should the serialization format allow empty geometries within collections? They have no WKT representation, so I'm not sure that should be allowed.

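To illustrate the "repeated calls" pain point described above, here is a minimal sketch that packs a coordinate list into bytes in two ways using only the standard-library `struct` module: one `struct.pack` call per coordinate pair versus a single call for the whole sequence. The function names are hypothetical and the byte layout (little-endian doubles, x then y) is an assumption made for illustration. This is not Sedona's serialization format and not necessarily how this PR is implemented.

```python
import struct


def pack_coords_per_pair(coords):
    """Pack (x, y) pairs with one struct.pack call per pair."""
    return b"".join(struct.pack("<dd", x, y) for x, y in coords)


def pack_coords_bulk(coords):
    """Pack all (x, y) pairs with a single struct.pack call."""
    flat = [value for pair in coords for value in pair]
    return struct.pack(f"<{len(flat)}d", *flat)


# Same shape as the "long line" benchmark input: 1000 (x, y) pairs.
coords = [(float(n), float(n)) for n in range(1000)]

per_pair = pack_coords_per_pair(coords)
bulk = pack_coords_bulk(coords)

# Both approaches produce identical bytes; only the number of calls differs.
print(per_pair == bulk)
print(len(bulk))
```

Both functions produce the same bytes, so the difference is only in how many Python-level calls are made per geometry. Whether the bulk variant is actually faster for a given input should be measured rather than assumed.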
Finally, here are performance results comparing current master with this branch:

```
############################################################################################
#                  MASTER                   #                   REFACTOR                   #
############################################################################################
short line serialize trial:                 # short line serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 3.1197412                  #         Shapely: 1.6295438
        Sedona: 7.6139485                   #         Sedona: 1.624361
        Factor: 1.440570551172642           #         Factor: -0.00318052205776856
                                            #
long line serialize trial:                  # long line serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 4.3532153                  #         Shapely: 4.106511
        Sedona: 53.1161724                  #         Sedona: 20.2585903
        Factor: 11.201595542494763          #         Factor: 3.9332852876809534
                                            #
point serialize trial:                      # point serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 4.5569092                  #         Shapely: 4.1321635
        Sedona: 12.9271814                  #         Sedona: 3.6287594
        Factor: 1.8368310257312128          #         Factor: -0.12182579416327549
                                            #
small polygon serialize trial:              # small polygon serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 1.9922023                  #         Shapely: 1.6930909
        Sedona: 15.1673936                  #         Sedona: 2.5668217
        Factor: 6.613380227499988           #         Factor: 0.5160566393688608
                                            #
large polygon serialize trial:              # large polygon serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 2.5921638                  #         Shapely: 2.3090157
        Sedona: 27.4987078                  #         Sedona: 2.8944101
        Factor: 9.608398975404254           #         Factor: 0.25352551738821005
                                            #
small multipoint serialize trial:           # small multipoint serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.0964883                  #         Shapely: 0.0957348
        Sedona: 0.8657994                   #         Sedona: 0.5249629
        Factor: 7.97310243832672            #         Factor: 4.483511742856307
                                            #
large multipoint serialize trial:           # large multipoint serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.2059924                  #         Shapely: 0.2002637
        Sedona: 18.6648998                  #         Sedona: 13.9681053
        Factor: 89.60965258912465           #         Factor: 68.74856301965858
                                            #
small multilinestring serialize trial:      # small multilinestring serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.1003702                  #         Shapely: 0.0984753
        Sedona: 1.2611056                   #         Sedona: 0.5051064
        Factor: 11.564542065274354          #         Factor: 4.129269979375539
                                            #
large multilinestring serialize trial:      # large multilinestring serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.1382676                  #         Shapely: 0.1290179
        Sedona: 14.4627404                  #         Sedona: 6.4109233
        Factor: 103.59963433226584          #         Factor: 48.69018485031922
                                            #
small multipolygon serialize trial:         # small multipolygon serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.1074735                  #         Shapely: 0.107037
        Sedona: 2.5492284                   #         Sedona: 1.1849261
        Factor: 22.71959971527865           #         Factor: 10.070247671365976
                                            #
large multipolygon serialize trial:         # large multipolygon serialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.218638                   #         Shapely: 0.2061548
        Sedona: 33.7390888                  #         Sedona: 16.8197829
        Factor: 153.31484371426743          #         Factor: 80.58812164451179
                                            #
short line deserialize trial:               # short line deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 2.6179912                  #         Shapely: 2.3070263
        Sedona: 1.8462964                   #         Sedona: 1.8466384
        Factor: -0.2947660022692208         #         Factor: -0.19955901673075854
                                            #
long line deserialize trial:                # long line deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 3.1905663                  #         Shapely: 3.0322167
        Sedona: 46.6516656                  #         Sedona: 3.0514442
        Factor: 13.62175087851959           #         Factor: 0.006341070544199562
                                            #
point deserialize trial:                    # point deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 5.8758644                  #         Shapely: 5.7208507
        Sedona: 5.2898097                   #         Sedona: 5.2560053
        Factor: -0.0997393166527124         #         Factor: -0.08125459383164815
                                            #
small polygon deserialize trial:            # small polygon deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 2.4844221                  #         Shapely: 2.3755022
        Sedona: 4.120936                    #         Sedona: 3.0382661
        Factor: 0.6587100879516408          #         Factor: 0.27899948903436084
                                            #
large polygon deserialize trial:            # large polygon deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 2.0467411                  #         Shapely: 1.9710574
        Sedona: 20.173081                   #         Sedona: 3.0417363
        Factor: 8.85619578362891            #         Factor: 0.5432002639801358
                                            #
small multipoint deserialize trial:         # small multipoint deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.1267147                  #         Shapely: 0.1221836
        Sedona: 0.4915774                   #         Sedona: 0.4675109
        Factor: 2.8794031000349603          #         Factor: 2.8262982920784787
                                            #
large multipoint deserialize trial:         # large multipoint deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.2152692                  #         Shapely: 0.2037456
        Sedona: 10.9262919                  #         Sedona: 11.1691675
        Factor: 49.75641057801116           #         Factor: 53.81918382531942
                                            #
small multilinestring deserialize trial:    # small multilinestring deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.1260666                  #         Shapely: 0.1310238
        Sedona: 0.4858702                   #         Sedona: 0.5045361
        Factor: 2.8540755441964802          #         Factor: 2.8507210140447765
                                            #
large multilinestring deserialize trial:    # large multilinestring deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.1432103                  #         Shapely: 0.1417217
        Sedona: 5.63192                     #         Sedona: 5.8515723
        Factor: 38.3262216474653            #         Factor: 40.28917660457079
                                            #
small multipolygon deserialize trial:       # small multipolygon deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.1392047                  #         Shapely: 0.12827
        Sedona: 1.3398053                   #         Sedona: 1.1914238
        Factor: 8.624713102359332           #         Factor: 8.288405706712403
                                            #
large multipolygon deserialize trial:       # large multipolygon deserialize trial:
    Total Time (seconds):                   #     Total Time (seconds):
        Shapely: 0.2063552                  #         Shapely: 0.2101143
        Sedona: 19.0151752                  #         Sedona: 15.8966164
        Factor: 91.14778789194554           #         Factor: 74.65699431214344
```

Here is the Python script I used to generate these results:

```
from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import LineString, Point, Polygon, MultiPoint, MultiLineString, MultiPolygon
from shapely.wkb import dumps, loads
import time


def run_serialize_trial(geom, number_iterations, name):
    print(f"{name} serialize trial:")

    # Baseline: shapely WKB serialization.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        dumps(geom)
    shapely_time = time.perf_counter_ns() - start_time

    # Sedona serialization of the same geometry.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        serialize(geom)
    sedona_time = time.perf_counter_ns() - start_time

    # Factor is Sedona's overhead relative to shapely (0 means equal time).
    print("\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")


def run_deserialize_trial(geom, number_iterations, name):
    print(f"{name} deserialize trial:")
    shapely_serialized_geom = dumps(geom)
    sedona_serialized_geom = serialize(geom)

    # Baseline: shapely WKB deserialization.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        loads(shapely_serialized_geom)
    shapely_time = time.perf_counter_ns() - start_time

    # Sedona deserialization of the same geometry.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        deserialize(sedona_serialized_geom)
    sedona_time = time.perf_counter_ns() - start_time

    print("\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")


short_line_iterations = 200_000
short_line = LineString([(10.0, 10.0), (20.0, 20.0)])

long_line_iterations = 100_000
long_line = LineString([(float(n), float(n)) for n in range(1000)])

point_iterations = 500_000
point = Point(12.3, 45.6)

small_polygon_iterations = 200_000
small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])

large_polygon_iterations = 100_000
large_polygon = Polygon(
    [(0.0, float(n * 10)) for n in range(100)]
    + [(float(n * 10), 990.0) for n in range(100)]
    + [(990.0, float(n * 10)) for n in reversed(range(100))]
    + [(float(n * 10), 0.0) for n in reversed(range(100))]
)

small_multipoint_iterations = 10_000
small_multipoint = MultiPoint([(n, n) for n in range(3)])

large_multipoint_iterations = 10_000
large_multipoint = MultiPoint([(n, n) for n in range(100)])

small_multilinestring_iterations = 10_000
small_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(3)])

large_multilinestring_iterations = 5_000
large_multilinestring = MultiLineString([[(10.0, 10.0), (20.0, 20.0)] for _ in range(100)])

small_multipolygon_iterations = 10_000
small_multipolygon = MultiPolygon([small_polygon for _ in range(3)])

large_multipolygon_iterations = 5_000
large_multipolygon = MultiPolygon([small_polygon for _ in range(100)])

run_serialize_trial(short_line, short_line_iterations, "short line")
run_serialize_trial(long_line, long_line_iterations, "long line")
run_serialize_trial(point, point_iterations, "point")
run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
run_serialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
run_serialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
run_serialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
run_serialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
run_serialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
run_serialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")

run_deserialize_trial(short_line, short_line_iterations, "short line")
run_deserialize_trial(long_line, long_line_iterations, "long line")
run_deserialize_trial(point, point_iterations, "point")
run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon")
run_deserialize_trial(small_multipoint, small_multipoint_iterations, "small multipoint")
run_deserialize_trial(large_multipoint, large_multipoint_iterations, "large multipoint")
run_deserialize_trial(small_multilinestring, small_multilinestring_iterations, "small multilinestring")
run_deserialize_trial(large_multilinestring, large_multilinestring_iterations, "large multilinestring")
run_deserialize_trial(small_multipolygon, small_multipolygon_iterations, "small multipolygon")
run_deserialize_trial(large_multipolygon, large_multipolygon_iterations, "large multipolygon")
```

## How was this patch tested?

A parameterized set of unit tests was added that tests the serialization/deserialization round trip of a geometry object between Python and Spark.

## Did this PR include necessary documentation updates?

- No, this PR does not affect any public API so no need to change the docs.
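
For anyone reading the benchmark numbers, the `Factor` value printed by the script is `(sedona_time - shapely_time) / shapely_time`, i.e. Sedona's extra time relative to shapely. A value of 0 means the two took the same time, 1.0 means Sedona took twice as long, and a negative value means Sedona was faster. The small sketch below reproduces that arithmetic for the short line serialize trial. The helper name `relative_overhead` is hypothetical and is not part of the benchmark script or of Sedona.

```python
def relative_overhead(shapely_seconds, sedona_seconds):
    """Return Sedona's extra time as a fraction of shapely's time."""
    return (sedona_seconds - shapely_seconds) / shapely_seconds


# Timings (in seconds) copied from the short line serialize trial above.
master = relative_overhead(3.1197412, 7.6139485)
refactor = relative_overhead(1.6295438, 1.624361)

print(f"master:   {master:.4f}")    # prints "master:   1.4406"
print(f"refactor: {refactor:.4f}")  # prints "refactor: -0.0032"
```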