Here are some timings showing the effect of caching the last Binary->String
conversion. Mean query times drop by roughly 35% to 60%, and for queries 2 to 6
the variation in timings falls by more than an order of magnitude, due to the
reduction in garbage.
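
To make the idea concrete, here is a minimal sketch of a "cache the last
conversion" decoder. It is not the actual patch: the class name
LastValueStringCache and its methods are hypothetical, and it works on plain
byte arrays rather than Parquet's Binary type. It assumes consecutive rows
often carry the same value, which is when the cache pays off.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not the actual Spark or Parquet code: remembers the
// most recently decoded byte sequence and reuses its String when the same
// bytes arrive again.
final class LastValueStringCache {
    private byte[] lastBytes;   // copy of the bytes decoded most recently
    private String lastString;  // the String produced from lastBytes
    private int decodeCount;    // number of real decodes, for illustration only

    String decode(byte[] bytes) {
        // Cache hit: same bytes as the previous call, so no new String
        // (and no new backing array) is allocated.
        if (lastBytes != null && Arrays.equals(lastBytes, bytes)) {
            return lastString;
        }
        // Cache miss: decode once and remember both the bytes and the result.
        lastBytes = bytes.clone();
        lastString = new String(bytes, StandardCharsets.UTF_8);
        decodeCount++;
        return lastString;
    }

    int decodeCount() {
        return decodeCount;
    }
}
```

Calling decode() twice in a row with equal bytes returns the same String
instance and performs only one real decode. The trade-off is one extra array
comparison per value, so the gain depends on how often values repeat.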

The workload is a set of sample queries that select various columns, apply
some filtering and then aggregate.

Spark 1.2.0
Query 1 mean time 8353.3 millis, std deviation 480.91511147441025 millis
Query 2 mean time 8677.6 millis, std deviation 3193.345518417949 millis
Query 3 mean time 11302.5 millis, std deviation 2989.9406998950476 millis
Query 4 mean time 10537.0 millis, std deviation 5166.024024549462 millis
Query 5 mean time 9559.9 millis, std deviation 4141.487667493409 millis
Query 6 mean time 12638.1 millis, std deviation 3639.4505522430477 millis


Spark 1.2.0 - cache last Binary->String conversion
Query 1 mean time 5118.9 millis, std deviation 549.6670608448152 millis
Query 2 mean time 3761.3 millis, std deviation 202.57785883183013 millis
Query 3 mean time 7358.8 millis, std deviation 242.58918176850162 millis
Query 4 mean time 4173.5 millis, std deviation 179.802515122688 millis
Query 5 mean time 3857.0 millis, std deviation 140.71957930579526 millis
Query 6 mean time 7512.0 millis, std deviation 198.32633040858022 millis
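
For anyone who wants to reproduce this kind of summary, the mean and standard
deviation of a set of run times can be computed roughly as follows. This is
only a sketch: the class name TimingStats is made up, and it assumes the
population form of the standard deviation (dividing by n), which may not be
what produced the figures above.

```java
// Hypothetical helper for summarising per-run query timings in milliseconds.
final class TimingStats {
    // Arithmetic mean of the run times.
    static double mean(double[] millis) {
        double sum = 0.0;
        for (double m : millis) {
            sum += m;
        }
        return sum / millis.length;
    }

    // Population standard deviation (divides by n, not n - 1); this choice
    // is an assumption, the original figures may use the sample form.
    static double stdDev(double[] millis) {
        double mu = mean(millis);
        double squaredDiffs = 0.0;
        for (double m : millis) {
            double d = m - mu;
            squaredDiffs += d * d;
        }
        return Math.sqrt(squaredDiffs / millis.length);
    }
}
```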



