Here's my code: my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2") my_data.collect()
raw.site_activity_data is a Hive external table atop daily-partitioned .gzip data. When I execute the command I start seeing many of these pop up in the logs (below is a small subset) 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 718 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 562 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 261 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 542 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 272 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 785 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 748 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 559 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 543 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 607 15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 695 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 336 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 449 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 509 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 567 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 544 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 418 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 568 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 716 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 0 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 265 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 235 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 227 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 551 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 256 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 0 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 271 15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 728 Then after that the Spark job starts executing 328,785 tasks. Why doesn't Spark SQL just look at one input path? On Mon, Oct 5, 2015 at 5:35 PM, Michael Armbrust <mich...@databricks.com> wrote: > It does do a take. Run explain to make sure that is the case. Why do you > think its reading the whole table? > > On Mon, Oct 5, 2015 at 1:53 PM, YaoPau <jonrgr...@gmail.com> wrote: > >> I'm using SqlCtx connected to Hive in CDH 5.4.4. When I run "SELECT * >> FROM >> my_db.my_tbl LIMIT 5", it scans the entire table like Hive would instead >> of >> doing a .take(5) on it and returning results immediately. >> >> Is there a way to get Spark SQL to use .take(5) instead of the Hive logic >> of >> scanning the full table when running a SELECT? >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-SELECT-LIMIT-scans-the-entire-Hive-table-tp24938.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >