First of all, select * is not a useful query to benchmark. Very rarely would a
user require all 3.6 million records for visual analysis.

Second, collect() forces movement of all data from the executors to the driver.
Instead, write the result out to another table or to HDFS.
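
A minimal spark-shell sketch of that, assuming a Hive table called my_table
and an output path hdfs:///tmp/my_table_out (both are placeholder names, not
from your setup). sqlContext here is the context that spark-shell creates
for you; adjust the name if you built your own HiveContext:

```scala
// my_table and the output locations are placeholder names (assumptions).
val df = sqlContext.sql("SELECT * FROM my_table")

// Look at a handful of rows on the driver instead of collecting everything.
df.show(20)

// Write the full result straight from the executors to HDFS as Parquet.
df.write.mode("overwrite").parquet("hdfs:///tmp/my_table_out")

// Or write it into another Hive table instead:
// df.write.mode("overwrite").saveAsTable("my_table_copy")
```

This way only the 20 rows printed by show() ever reach the driver.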

Also, Spark is more beneficial when you have subsequent
queries/transformations on the same dataset. You cache the table, and then
subsequent operations will be faster.
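
As a rough spark-shell illustration, again assuming a placeholder table
my_table with a placeholder column some_column (neither is from your
setup):

```scala
// my_table and some_column are placeholder names (assumptions).
// cacheTable only marks the table for caching; nothing is read yet.
sqlContext.cacheTable("my_table")

// The first action scans the table and fills the in-memory columnar cache.
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()

// Later queries on the same table can be served from the cache.
sqlContext.sql(
  "SELECT some_column, COUNT(*) AS n FROM my_table GROUP BY some_column"
).show()

// Release the memory once you are done with the table.
sqlContext.uncacheTable("my_table")
```

The first query pays the full scan cost, so the benefit only shows up from
the second query onwards, and only if the table fits in executor memory.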

Regards
Sab

On Wed, Nov 25, 2015 at 12:30 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
wrote:

> Hi,
> I am using Hive 1.1.0 and Spark 1.5.1 and creating hive context in
> spark-shell.
>
> Now, I am experiencing reversed performance by Spark-Sql over Hive.
> By default Hive gives result back in 27 seconds for plain select * query
> on 1 GB dataset containing 3623203 records, while spark-sql gives back in 2
> mins on collect operation.
>
> Cluster Config:
> Hive : 6 Node : 16 GB Memory, 4 cores each
> Spark : 4 Nodes : 16 GB Memory, 4 cores each
>
> My dataset has 45 partitions and spark-sql creates 82 jobs.
>
> I have tried all memory and garbage collection optimizations suggested on
> official website but failed to get better performance and its worth to
> mention that sometimes I get OOM error when I allocate executor memory less
> than 10G.
>
> Can somebody tell whats actually going on ?
>
>
>


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*