dbtsai commented on issue #26085: [SPARK-29434][Core] Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#issuecomment-544750761

@tgravescs The following results were run on my desktop. LZ4 serializes about 5x faster than ZSTD but produces about 1.6x larger data. I wonder whether we should trade larger data for faster serialization?

1. ZSTD
```scala
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Serialization                                           3340           3355          21          0.1       16700.1       1.0X
Deserialization                                          650            660          14          0.3        3248.6       5.1X

Compressed Serialized MapStatus sizes: 123 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
```

2. LZ4
```scala
Running benchmark: 200000 MapOutputs, 1000 blocks w/o broadcast
  Running case: Serialization
  Stopped after 3 iterations, 2109 ms
  Running case: Deserialization
  Stopped after 5 iterations, 2424 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Serialization                                            677            703          32          0.3        3383.6       1.0X
Deserialization                                          466            485          27          0.4        2331.1       1.5X

Compressed Serialized MapStatus sizes: 194 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
```

3. LZF
```scala
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Serialization                                           2199           2202           4          0.1       10994.6       1.0X
Deserialization                                          690            720          46          0.3        3450.6       3.2X

Compressed Serialized MapStatus sizes: 182 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
```
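
For a quick side-by-side view, the ratios between codecs can be recomputed from the best times and compressed sizes reported in the three runs. This is only a rough sketch in Python: the `RESULTS` layout and the `relative_to` helper are illustrative names, not part of the Spark benchmark code, and the figures are simply copied from the output in this comment.

```python
# Best-case serialization/deserialization times (ms) and compressed sizes (MB),
# copied from the three benchmark runs reported in this comment.
RESULTS = {
    "ZSTD": {"ser_ms": 3340, "deser_ms": 650, "size_mb": 123},
    "LZ4": {"ser_ms": 677, "deser_ms": 466, "size_mb": 194},
    "LZF": {"ser_ms": 2199, "deser_ms": 690, "size_mb": 182},
}


def relative_to(baseline, codec, results=RESULTS):
    """Hypothetical helper: speedup and size growth of `codec` versus `baseline`."""
    base, other = results[baseline], results[codec]
    return {
        # Values above 1.0 mean `codec` is faster than `baseline`.
        "ser_speedup": base["ser_ms"] / other["ser_ms"],
        "deser_speedup": base["deser_ms"] / other["deser_ms"],
        # Values above 1.0 mean `codec` produces larger output than `baseline`.
        "size_ratio": other["size_mb"] / base["size_mb"],
    }


if __name__ == "__main__":
    for codec in ("LZ4", "LZF"):
        r = relative_to("ZSTD", codec)
        print(
            f"{codec} vs ZSTD: serialization speedup {r['ser_speedup']:.1f}x, "
            f"deserialization speedup {r['deser_speedup']:.1f}x, "
            f"size ratio {r['size_ratio']:.1f}x"
        )
```

With these numbers, LZ4 comes out at roughly a 4.9x serialization speedup and 1.6x the compressed size of ZSTD, while LZF is slightly slower than ZSTD at deserialization.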