Hello, I am using SparkSQL to query some Hive tables. Most of the time, when I create a DataFrame using sqlContext.sql("select * from table") command, DataFrame creation is less than 0.5 second. But I have this one table with which it takes almost 12 seconds!
scala> val start = scala.compat.Platform.currentTime; val logs = sqlContext.sql("select * from temp.log"); val execution = scala.compat.Platform.currentTime - start 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from temp.log 15/09/04 12:07:02 INFO ParseDriver: Parse Completed start: Long = 1441336022731 logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int, log_time: string, tag: string, dt: string, test_id: int] execution: Long = *11567* This table has 3.6 B rows, and 2 partitions (on dt and test_id columns). I have created DataFrames on even larger tables and do not see such delay. So my questions are: - What can impact DataFrame creation time? - Is it related to the table partitions? Thanks much your help! Isabelle