Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-10-05 Thread YaoPau
I'm using SqlCtx connected to Hive in CDH 5.4.4.  When I run "SELECT * FROM
my_db.my_tbl LIMIT 5", it scans the entire table like Hive would instead of
doing a .take(5) on it and returning results immediately.

Is there a way to get Spark SQL to use .take(5) instead of the Hive logic of
scanning the full table when running a SELECT?
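
To make the difference being asked about concrete, here is a minimal pure-Python sketch. It is a simplified model and not Spark's actual implementation; the names `take`, `full_scan_limit` and the toy `partitions` list are hypothetical. It only contrasts reading partitions lazily until n rows are found with reading everything and then truncating.

```python
def take(partitions, n):
    """Return the first n rows, reading partitions lazily one at a time."""
    rows, partitions_read = [], 0
    for part in partitions:
        if len(rows) >= n:
            break
        partitions_read += 1
        for row in part():  # calling part() simulates reading that partition
            rows.append(row)
            if len(rows) == n:
                break
    return rows, partitions_read


def full_scan_limit(partitions, n):
    """Read every partition, then keep only the first n rows."""
    rows = [row for part in partitions for row in part()]
    return rows[:n], len(partitions)


# Hypothetical table: 4 partitions of 3 rows each.
partitions = [lambda i=i: [(i, j) for j in range(3)] for i in range(4)]
taken, read_by_take = take(partitions, 5)
scanned, read_by_scan = full_scan_limit(partitions, 5)
```

Both calls return the same five rows, but the lazy version touches two partitions while the full scan touches all four.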



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-SELECT-LIMIT-scans-the-entire-Hive-table-tp24938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-10-05 Thread Michael Armbrust
It does do a take.  Run explain to make sure that is the case.  Why do you
think it's reading the whole table?
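
Checking the plan could look something like the sketch below. `preview` is a hypothetical helper name, and `sql_ctx` is assumed to be an already-created HiveContext such as the SqlCtx mentioned in the question. It relies only on `sql()`, `DataFrame.explain()` and `DataFrame.take()`. Whether `take` avoids the full scan on a particular Spark version is not confirmed in this thread.

```python
def preview(sql_ctx, table, n):
    """Print the query plan for a LIMIT query, then fetch n rows with take()."""
    df = sql_ctx.sql("SELECT * FROM {0} LIMIT {1}".format(table, n))
    df.explain(True)  # extended=True prints the logical and physical plans
    return df.take(n)
```

For example, `preview(sqlCtx, "my_db.my_tbl", 5)` would print the plan, where a Limit operator should be visible, and return up to five rows.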

On Mon, Oct 5, 2015 at 1:53 PM, YaoPau wrote:

> I'm using SqlCtx connected to Hive in CDH 5.4.4.  When I run "SELECT * FROM
> my_db.my_tbl LIMIT 5", it scans the entire table like Hive would instead of
> doing a .take(5) on it and returning results immediately.
>
> Is there a way to get Spark SQL to use .take(5) instead of the Hive logic of
> scanning the full table when running a SELECT?


Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-11-05 Thread Jon Gregg
Here's my code:

my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2")
my_data.collect()

raw.site_activity_data is a Hive external table atop daily-partitioned
.gzip data.  When I execute the command, many lines like these show up
in the logs (below is a small subset):

15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 718
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 562
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 261
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 542
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 272
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 785
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 748
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 559
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 543
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 607
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 695
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 336
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 449
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 509
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 567
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 544
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 418
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 568
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 716
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 0
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 265
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 235
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 227
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 551
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 256
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 0
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 271
15/11/05 17:56:46 INFO FileInputFormat: Total input paths to process : 728

After that, the Spark job starts executing 328,785 tasks.  Why doesn't
Spark SQL just look at one input path?
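
One way to sanity-check the task count against the logs is to add up the listed input paths. The sketch below assumes, without confirmation from this thread, that each log line corresponds to one partition directory and that each non-splittable .gzip file becomes one task. `total_input_paths` and `sample` are hypothetical names, and the sample holds only the first three log lines shown above.

```python
import re

# Matches the FileInputFormat lines quoted in the message above.
PATTERN = re.compile(r"Total input paths to process : (\d+)")


def total_input_paths(log_text):
    """Sum the input-path counts reported across all matching log lines."""
    return sum(int(m.group(1)) for m in PATTERN.finditer(log_text))


sample = """\
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 718
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 562
15/11/05 17:56:45 INFO FileInputFormat: Total input paths to process : 261
"""
print(total_input_paths(sample))  # prints 1541
```

Running the same function over the full driver log would show whether the total is close to the 328,785 tasks.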

On Mon, Oct 5, 2015 at 5:35 PM, Michael Armbrust wrote:

> It does do a take.  Run explain to make sure that is the case.  Why do you
> think it's reading the whole table?
>
> On Mon, Oct 5, 2015 at 1:53 PM, YaoPau wrote:
>
>> I'm using SqlCtx connected to Hive in CDH 5.4.4.  When I run "SELECT * FROM
>> my_db.my_tbl LIMIT 5", it scans the entire table like Hive would instead of
>> doing a .take(5) on it and returning results immediately.
>>
>> Is there a way to get Spark SQL to use .take(5) instead of the Hive logic of
>> scanning the full table when running a SELECT?