Hey Sean and Spark users! Thanks for the reply. I tried -Xcomp just now; startup time was a few minutes (as expected), but the first query was as slow as before:

Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 30 columns in 12897 ms: 121.64837 rec/ms, 3649.451 cell/ms
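(A side note for readers: the rec/ms and cell/ms figures in these log lines are simple derived rates, with cell/ms apparently being rec/ms times the column count. A quick check of the numbers:)

```shell
# Sanity check of the metrics in the log line above (pure arithmetic):
#   rec/ms  = records / milliseconds
#   cell/ms = rec/ms * number of columns
awk 'BEGIN {
  recs = 1568899; ms = 12897; cols = 30
  printf "%.5f rec/ms, %.3f cell/ms\n", recs / ms, (recs / ms) * cols
}'
# prints: 121.64837 rec/ms, 3649.451 cell/ms
```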
The next one was fast:

Oct 10, 2014 3:05:03 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 1 columns in 1757 ms: 892.94196 rec/ms, 892.94196 cell/ms

I don't think caching or anything similar is involved, because CPU load on the worker is 100% and jstack shows the worker reading from the Parquet file. Any ideas? Thanks!

On Fri, Oct 10, 2014 at 2:55 PM, Sean Owen <so...@cloudera.com> wrote:
> You could try setting "-Xcomp" for executors to force JIT compilation
> upfront. I don't know if it's a good idea overall, but it might show
> whether the upfront compilation really helps. I doubt it.
>
> However, isn't this almost surely due to caching somewhere, in Spark SQL
> or HDFS? I really doubt HotSpot makes a difference compared to these
> much larger factors.
>
> On Fri, Oct 10, 2014 at 8:49 AM, Alexey Romanchuk
> <alexey.romanc...@gmail.com> wrote:
> > Hello Spark users and developers!
> >
> > I am using HDFS + Spark SQL + Hive schema + Parquet as the storage
> > format. I have a lot of Parquet files; one file fits one HDFS block and
> > holds one day of data. The strange thing is that the first Spark SQL
> > query is very slow.
> >
> > To reproduce the situation I use only one core: the first query takes
> > 97 seconds, and every subsequent one only 13 seconds. I do query
> > different data, but it has the same structure and size. The situation
> > can be reproduced after restarting the Thrift server.
> >
> > Here is information about Parquet file reads from a worker node:
> >
> > Slow one:
> > Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
> > Assembled and processed 1560251 records from 30 columns in 11686 ms:
> > 133.51454 rec/ms, 4005.4363 cell/ms
> >
> > Fast one:
> > Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
> > Assembled and processed 1568899 records from 1 columns in 1373 ms:
> > 1142.6796 rec/ms, 1142.6796 cell/ms
> >
> > As you can see, the second read is 10x faster than the first.
> > Most of the query time is spent working with the Parquet file.
> >
> > This problem is really annoying, because most of my Spark jobs consist
> > of just one SQL query plus data processing, and to speed them up I put
> > a special warmup query in front of every job.
> >
> > My assumption is that HotSpot optimizations are applied during the
> > first read. Do you have any idea how to confirm or solve this
> > performance problem?
> >
> > Thanks for any advice!
> >
> > P.S. I see billions of HotSpot compilations with -XX:+PrintCompilation,
> > but I cannot figure out which ones are important and which are not.
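(For readers who want to reproduce Sean's -Xcomp experiment: executor JVM flags can be passed through the `spark.executor.extraJavaOptions` configuration key. A minimal sketch; the jar and class names below are placeholders, not from the thread:)

```shell
# Sketch: force upfront JIT compilation on executors with -Xcomp.
# com.example.MyJob and my-job.jar are hypothetical placeholders.
spark-submit \
  --class com.example.MyJob \
  --conf "spark.executor.extraJavaOptions=-Xcomp" \
  my-job.jar

# The same mechanism carries -XX:+PrintCompilation if you want to watch
# which methods HotSpot compiles during the first Parquet read:
#   --conf "spark.executor.extraJavaOptions=-XX:+PrintCompilation"
```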