Hi

Could it be due to GC? I have read this can happen when a program starts with a small heap. What are your -Xms and -Xmx values?

Print GC stats with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
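
For the Thrift server you can set these in spark-defaults.conf, e.g. something like the sketch below (heap sizes are placeholders; note that Spark expects the heap to be set via spark.driver.memory / spark.executor.memory rather than inside the extraJavaOptions):

    # spark-defaults.conf (sketch, adjust sizes to your hardware)
    spark.driver.memory              4g
    spark.executor.memory            4g
    spark.driver.extraJavaOptions    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps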

Guillaume
Hello Spark users and developers!

I am using HDFS + Spark SQL + Hive schema + Parquet as the storage format. I have a lot of Parquet files, one file per day, each sized to fit one HDFS block. The strange thing is that the first Spark SQL query is very slow.

To reproduce the situation I use only one core: the first query takes 97 s, and every subsequent query takes only 13 s. Of course I query different data each time, but it has the same structure and size. The situation can be reproduced after restarting the Thrift server.

Here is the Parquet read information from a worker node:

Slow one:
Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1560251 records from 30 columns in 11686 ms: 133.51454 rec/ms, 4005.4363 cell/ms

Fast one:
Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 1 columns in 1373 ms: 1142.6796 rec/ms, 1142.6796 cell/ms

As you can see, the second read is roughly 8.5x faster than the first (1142.7 vs. 133.5 rec/ms). Most of the query time is spent working with the Parquet files.

This problem is really annoying, because most of my Spark jobs contain just one SQL query plus some data processing, so to speed them up I put a special warmup query in front of every job.
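
For reference, the warmup is nothing fancy, just a cheap query against the same table, e.g. something like this through beeline (the table name is made up, and I assume the Thrift server is on the default port 10000):

    beeline -u jdbc:hive2://localhost:10000 -e "SELECT count(*) FROM events"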

My assumption is that HotSpot JIT optimizations kick in during the first read. Do you have any idea how to confirm or solve this performance problem?
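
One check I can think of (not sure it is valid): start the Thrift server once with the JVM in pure interpreter mode via -Xint. With the JIT disabled there is no compilation warmup, so if the first and the following queries then take roughly the same time, that would point at HotSpot:

    # temporary, for a single test run only: disable the JIT compiler
    spark.driver.extraJavaOptions    -Xint
    spark.executor.extraJavaOptions  -Xint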

Thanks for advice!

P.S. I get billions of HotSpot compilation events with -XX:+PrintCompilation but cannot figure out which ones are important and which are not.
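
One thing that might help narrow it down (a rough filter; the log file name stands for whatever captures the server's stdout) is to grep for the Parquet reader classes:

    # -XX:+PrintCompilation writes to stdout, so capture and filter it
    grep -i parquet thriftserver-stdout.log | less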

