Here's my code:

my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2")
my_data.collect()

raw.site_activity_data is a Hive external table on top of daily-partitioned,
gzip-compressed data.  When I execute the command, I start seeing many lines
like these in the logs (a small subset is shown below):

15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 718
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 562
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 261
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 542
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 272
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 785
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 748
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 559
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 543
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 607
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 695
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 336
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 449
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 509
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 567
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 544
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 418
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 568
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 716
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 0
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 265
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 235
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 227
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 551
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 256
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 0
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 271
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 728
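
One way to sanity-check the task count against these log lines is to add up the
per-partition path counts. The sketch below is only an illustration: the helper
name total_input_paths is hypothetical, and the sample holds just the first
three lines shown above. Since gzip files are not splittable, each file usually
becomes at least one task, so the full-log total should land near the number of
tasks Spark reports.

```python
import re

# Matches the count at the end of each FileInputFormat log line.
PATHS_RE = re.compile(r"Total input paths to process : (\d+)")


def total_input_paths(log_text):
    """Sum the input-path counts found in a chunk of driver log text."""
    return sum(int(m.group(1)) for m in PATHS_RE.finditer(log_text))


# First three lines of the log excerpt above, used as a small sample.
sample = """\
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 718
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 562
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 261
"""

print(total_input_paths(sample))  # prints 1541
```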

After that, the Spark job starts executing 328,785 tasks.  Why doesn't
Spark SQL just look at one input path?
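
For reference, a minimal sketch of how the plan could be inspected and the scan
narrowed before pulling rows. It assumes a SQLContext/HiveContext like the
sqlCtx used above; the helper name preview and the partition column dt are
hypothetical and not taken from the actual table. Restricting the query to one
partition may reduce how many input paths get listed, but that depends on the
Spark version and on how the table is partitioned.

```python
def preview(sql_ctx, table, n=2, partition_filter=None):
    """Print the query plan, then fetch at most n rows from the table.

    partition_filter is an optional SQL predicate on the table's partition
    column (the column name is an assumption made by the caller).
    """
    query = "SELECT * FROM {0}".format(table)
    if partition_filter:
        query += " WHERE {0}".format(partition_filter)
    query += " LIMIT {0}".format(n)

    df = sql_ctx.sql(query)
    df.explain(True)   # extended plan: shows whether a Limit operator is present
    return df.take(n)  # take() returns the first n rows as a list


# Example call (requires a live context; dt is a hypothetical partition column):
# rows = preview(sqlCtx, "raw.site_activity_data", n=2,
#                partition_filter="dt = '2015-11-05'")
```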

On Mon, Oct 5, 2015 at 5:35 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> It does do a take.  Run explain to make sure that is the case.  Why do you
> think it's reading the whole table?
>
> On Mon, Oct 5, 2015 at 1:53 PM, YaoPau <jonrgr...@gmail.com> wrote:
>
>> I'm using SqlCtx connected to Hive in CDH 5.4.4.  When I run
>> "SELECT * FROM my_db.my_tbl LIMIT 5", it scans the entire table like Hive
>> would instead of doing a .take(5) on it and returning results immediately.
>>
>> Is there a way to get Spark SQL to use .take(5) instead of the Hive logic
>> of scanning the full table when running a SELECT?