Hi
Could it be due to GC? I read it may happen if your program starts with
a small heap. What are your -Xms and -Xmx values?
Print GC stats with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
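For a thrift server setup those flags would have to reach the server and executor JVMs; a minimal sketch, assuming Spark's standard spark.executor.extraJavaOptions setting and placeholder memory sizes (tune for your cluster):

```shell
# Sketch: enable GC logging on the thrift server driver and its executors.
# Memory sizes are placeholders, not values from this thread.
./sbin/start-thriftserver.sh \
  --driver-memory 4g \
  --driver-java-options "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.executor.memory=4g \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```

The GC output then shows up in the driver log and in each executor's stdout, so you can check whether the slow first query coincides with heap resizing or long collections.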
Guillaume
Hello spark users and developers!
I am using HDFS + Spark SQL + Hive schema + Parquet as the storage
format. I have a lot of Parquet files - one file fits one HDFS block
for one day. The strange thing is that the first Spark SQL query is
very slow. To reproduce the situation I use only one core: the first
query takes 97 sec, and every following query only 13 sec. Sure, I
query different data, but it has the same structure and size. The
situation can be reproduced after restarting the thrift server.
Here is information about Parquet file reads from a worker node:
Slow one:
Oct 10, 2014 2:26:53 PM INFO:
parquet.hadoop.InternalParquetRecordReader: Assembled and processed
1560251 records from 30 columns in 11686 ms: 133.51454 rec/ms,
4005.4363 cell/ms
Fast one:
Oct 10, 2014 2:31:30 PM INFO:
parquet.hadoop.InternalParquetRecordReader: Assembled and processed
1568899 records from 1 columns in 1373 ms: 1142.6796 rec/ms, 1142.6796
cell/ms
As you can see, the second read is about 10x faster than the first.
Most of the query time is spent working with the Parquet file.
This problem is really annoying, because most of my Spark jobs contain
just one SQL query plus data processing, so to speed them up I put a
special warm-up query in front of every job.
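A minimal sketch of such a warm-up, assuming access through beeline and a hypothetical table named events partitioned by day (the table, column, and connection details are placeholders, not from this thread): read a small slice with the same structure before the real query, so the Parquet reader code paths are already exercised.

```shell
# Hypothetical warm-up run before the real job. Table name, partition
# column, and JDBC URL are placeholders -- substitute your own.
# Selecting all columns from a small slice exercises the same Parquet
# reader code that the main query will hit.
beeline -u jdbc:hive2://localhost:10000 -e \
  "SELECT * FROM events WHERE day = '2014-10-09' LIMIT 1000"
```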
My assumption is that HotSpot JIT compilation happens during the first
read. Do you have any idea how to confirm or solve this performance
problem?
Thanks for advice!
p.s. I see a huge number of HotSpot compilations
with -XX:+PrintCompilation but cannot figure out which ones are
important and which are not.
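One way to cut down the -XX:+PrintCompilation flood is to filter it for the Parquet reader classes. A sketch with a fabricated sample log (the line layout roughly follows HotSpot's output, but these entries are illustrative, not taken from the real log):

```shell
# Filter HotSpot compilation events for parquet classes.
# The sample log below is fabricated for illustration only.
cat > /tmp/compilation.log <<'EOF'
    923  245       3       java.lang.String::hashCode (55 bytes)
   1187  301       4       parquet.hadoop.InternalParquetRecordReader::nextKeyValue (88 bytes)
   1190  302       3       parquet.column.impl.ColumnReaderImpl::readValue (40 bytes)
EOF
grep -i 'parquet' /tmp/compilation.log
```

If the parquet methods only reach the higher compilation tiers well into the first query, that would support the JIT warm-up explanation.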
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org