Doug Dennis created SEDONA-227:
----------------------------------

Summary: Python SerDe Performance Degradation
Key: SEDONA-227
URL: https://issues.apache.org/jira/browse/SEDONA-227
Project: Apache Sedona
Issue Type: Bug
Reporter: Doug Dennis

With the new geometry serde in Sedona, there appears to be a fairly significant performance regression on the Python side. The PR's author acknowledged a regression in the PR, so this is expected; however, my trials show a regression that is sometimes far higher than the 2x noted in the PR.

For serialization, points and short linestrings show a slowdown factor of about 2 (as expected). Unfortunately, small polygons show a factor of about 7 to 8, while long linestrings and large polygons show a factor of 11 to 12. The factor reported here is (Sedona time - Shapely time) / Shapely time, so a factor of 2 means Sedona took about three times as long as Shapely.

The news isn't all bad, though. For me, short linestrings consistently deserialize faster (about 25-30% faster) and points deserialize at roughly the same rate as before. Long linestrings and large polygons show deserialization regressions roughly in line with their serialization results, while small polygons regress much less (a factor of about 0.6).

To test this, I'm strictly comparing the new Sedona serialize and deserialize functions against Shapely's WKB loads and dumps functions. Below you will find my most recent results (which have been fairly consistent) as well as the Python code I used to generate them. I'm very open to critiques of my approach to measuring performance, and I hope that some of this performance loss is due to my own error.

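As a reading aid for the Factor values in this issue, here is a minimal sketch of how a relative-increase factor relates to a plain "times as long" ratio. The helper names `relative_increase` and `times_as_long` are hypothetical and exist only for this illustration; the two timings are the short line serialize totals reported in this issue.

```python
def relative_increase(baseline: float, candidate: float) -> float:
    """Hypothetical helper: (candidate - baseline) / baseline.

    0.0 means no change, 1.0 means twice as long, negative means faster.
    """
    return (candidate - baseline) / baseline


def times_as_long(baseline: float, candidate: float) -> float:
    """Hypothetical helper: plain ratio, candidate / baseline."""
    return candidate / baseline


# Short line serialize totals (seconds) as reported in this issue.
shapely_seconds = 1.7364926
sedona_seconds = 5.4626863

factor = relative_increase(shapely_seconds, sedona_seconds)
ratio = times_as_long(shapely_seconds, sedona_seconds)

# The ratio is always the relative increase plus one.
print(f"relative increase: {factor:.3f}")
print(f"times as long:     {ratio:.3f}")
```

Under this reading, a reported factor near 2 corresponds to Sedona taking roughly three times as long as Shapely, and a negative factor means Sedona was faster.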
Serialization results:
{code:java}
short line serialize trial:
    Total Time (seconds):
        Shapely: 1.7364926
        Sedona: 5.4626863
        Factor: 2.145816054730092

    Average Time (nanoseconds):
        Shapely: 8682.463
        Sedona: 27313.4315
        Factor: 2.145816054730092

long line serialize trial:
    Total Time (seconds):
        Shapely: 4.0879395
        Sedona: 50.1508444
        Factor: 11.268000639441949

    Average Time (nanoseconds):
        Shapely: 40879.395
        Sedona: 501508.444
        Factor: 11.268000639441949

point serialize trial:
    Total Time (seconds):
        Shapely: 4.7864782
        Sedona: 13.0319586
        Factor: 1.7226612251153677

    Average Time (nanoseconds):
        Shapely: 9572.9564
        Sedona: 26063.9172
        Factor: 1.7226612251153677

small polygon serialize trial:
    Total Time (seconds):
        Shapely: 1.8339082
        Sedona: 14.9376628
        Factor: 7.145262014750793

    Average Time (nanoseconds):
        Shapely: 9169.541
        Sedona: 74688.314
        Factor: 7.145262014750793

large polygon serialize trial:
    Total Time (seconds):
        Shapely: 2.3705298
        Sedona: 30.4154897
        Factor: 11.830671734225826

    Average Time (nanoseconds):
        Shapely: 23705.298
        Sedona: 304154.897
        Factor: 11.830671734225826
{code}

Deserialization results:
{code:java}
short line deserialize trial:
    Total Time (seconds):
        Shapely: 2.5166469
        Sedona: 1.7909991
        Factor: -0.28833913887562057

    Average Time (nanoseconds):
        Shapely: 12583.2345
        Sedona: 8954.9955
        Factor: -0.28833913887562057

long line deserialize trial:
    Total Time (seconds):
        Shapely: 3.1818201
        Sedona: 45.1792348
        Factor: 13.199179519923204

    Average Time (nanoseconds):
        Shapely: 31818.201
        Sedona: 451792.348
        Factor: 13.199179519923204

point deserialize trial:
    Total Time (seconds):
        Shapely: 5.7874722
        Sedona: 5.3168965
        Factor: -0.08130936680784402

    Average Time (nanoseconds):
        Shapely: 11574.9444
        Sedona: 10633.793
        Factor: -0.08130936680784402

small polygon deserialize trial:
    Total Time (seconds):
        Shapely: 2.5079775
        Sedona: 4.0216245
        Factor: 0.6035329264317563

    Average Time (nanoseconds):
        Shapely: 12539.8875
        Sedona: 20108.1225
        Factor: 0.6035329264317563

large polygon deserialize trial:
    Total Time (seconds):
        Shapely: 1.9952702
        Sedona: 19.909025
        Factor: 8.978109731704508

    Average Time (nanoseconds):
        Shapely: 19952.702
        Sedona: 199090.25
        Factor: 8.978109731704508
{code}

Python code used to generate results:
{code:python}
import time

from sedona.utils.geometry_serde import serialize, deserialize
from shapely.geometry import LineString, Point, Polygon
from shapely.wkb import dumps, loads


def run_serialize_trial(geom, number_iterations, name):
    print(f"{name} serialize trial:")

    # Baseline: Shapely WKB serialization.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        dumps(geom)
    shapely_time = time.perf_counter_ns() - start_time

    # Candidate: Sedona's new geometry serde.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        serialize(geom)
    sedona_time = time.perf_counter_ns() - start_time

    # Factor is the relative increase over Shapely (0 means equal, negative means faster).
    print(f"\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
    print(f"\tAverage Time (nanoseconds):")
    print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")


def run_deserialize_trial(geom, number_iterations, name):
    print(f"{name} deserialize trial:")

    # Serialize once up front so that only deserialization is timed.
    shapely_serialized_geom = dumps(geom)
    sedona_serialized_geom = serialize(geom)

    # Baseline: Shapely WKB deserialization.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        loads(shapely_serialized_geom)
    shapely_time = time.perf_counter_ns() - start_time

    # Candidate: Sedona's new geometry serde.
    start_time = time.perf_counter_ns()
    for _ in range(number_iterations):
        deserialize(sedona_serialized_geom)
    sedona_time = time.perf_counter_ns() - start_time

    print(f"\tTotal Time (seconds):")
    print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
    print(f"\tAverage Time (nanoseconds):")
    print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")


# Test geometries and per-geometry iteration counts.
short_line_iterations = 200_000
short_line = LineString([(10.0, 10.0), (20.0, 20.0)])

long_line_iterations = 100_000
long_line = LineString([(float(n), float(n)) for n in range(1000)])

point_iterations = 500_000
point = Point(12.3, 45.6)

small_polygon_iterations = 200_000
small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])

large_polygon_iterations = 100_000
large_polygon = Polygon(
    [(0.0, float(n * 10)) for n in range(100)]
    + [(float(n * 10), 990.0) for n in range(100)]
    + [(990.0, float(n * 10)) for n in reversed(range(100))]
    + [(float(n * 10), 0.0) for n in reversed(range(100))]
)

run_serialize_trial(short_line, short_line_iterations, "short line")
run_serialize_trial(long_line, long_line_iterations, "long line")
run_serialize_trial(point, point_iterations, "point")
run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")

run_deserialize_trial(short_line, short_line_iterations, "short line")
run_deserialize_trial(long_line, long_line_iterations, "long line")
run_deserialize_trial(point, point_iterations, "point")
run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon")
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)