Hi Mahender,

Please make sure that you enable broadcast joins for your dimension
tables. You should see surprising gains of around 12x.
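
To illustrate why broadcasting a small dimension table helps, here is a minimal plain-Python sketch of a broadcast (map-side) hash join. It is not Spark code: the function name, the toy rows and the partition layout are all hypothetical stand-ins for tables A and B from the thread. In Spark itself the corresponding knobs are the `broadcast` function in `pyspark.sql.functions` and the `spark.sql.autoBroadcastJoinThreshold` setting.

```python
from collections import defaultdict


def broadcast_hash_join(fact_partitions, dim_rows, fact_key, dim_key):
    """Inner-join every fact partition against a small, fully replicated table."""
    # Build the hash table once. On a cluster, this lookup structure is what
    # would be shipped ("broadcast") to every executor.
    lookup = defaultdict(list)
    for row in dim_rows:
        lookup[row[dim_key]].append(row)

    joined = []
    for partition in fact_partitions:  # each partition is handled independently
        for fact in partition:
            # Probe the local copy; the large table is never shuffled.
            for dim in lookup.get(fact[fact_key], ()):
                joined.append({**fact, **dim})
    return joined


# Hypothetical toy data: A is the large table, B the small one.
A_partitions = [
    [{"PK": 1}, {"PK": 2}],
    [{"PK": 3}, {"PK": 4}],
]
B = [{"FK": 2}, {"FK": 4}, {"FK": 5}]

result = broadcast_hash_join(A_partitions, B, "PK", "FK")
```

Only the small table is copied around; each partition of the large table is joined where it already sits, which is where the gain over a shuffle-based join would come from.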

Overall, I think the issue is that SPARK cannot figure out whether to
scan all the columns in a table or just the ones that are actually
used.
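
A rough plain-Python sketch of that hypothesis, and of what the manual projection in Query 2 buys, is below. It only illustrates the idea and says nothing about how Spark's planner actually behaves; the `scan` helper and the four-column CSV are hypothetical stand-ins for the 533-column table A.

```python
import csv
import io


def scan(csv_text, columns=None):
    """Yield one dict per row, keeping only `columns` when they are given."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield row if columns is None else {c: row[c] for c in columns}


# Hypothetical 4-column stand-in for the wide table A.
A_CSV = "PK,c1,c2,c3\n1,x,y,z\n2,x,y,z\n3,x,y,z\n"

full = list(scan(A_CSV))                    # every column carried downstream
pruned = list(scan(A_CSV, columns=["PK"]))  # only the join key carried downstream

cells_full = sum(len(r) for r in full)
cells_pruned = sum(len(r) for r in pruned)
```

With a row-oriented text format such as CSV, every byte still has to be read, so the projection only reduces what is carried into the join. A columnar format such as ORC can additionally skip the unread columns on disk.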

When you start using HIVE with ORC and TEZ (*) you will see some amazing
results that leave SPARK way, way behind. So you pretty much need to have
your data in memory to match the performance claims of SPARK, and the
advantage you get in that case comes not from SPARK's algorithms but
simply from fast I/O from RAM. The advantage of SPARK is that it makes
analytics, querying, and streaming frameworks accessible together.


If you follow the optimisations mentioned in the link below, you hardly
have any reason to use SPARK SQL:
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
imagine being able to do all of that without machines that require huge
amounts of RAM; in short, you achieve those performance gains using the
low-cost commodity systems around which HADOOP was designed.

I think that Hortonworks is giving stiff competition here :)

Regards,
Gourav Sengupta

On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
mahender.bigd...@outlook.com> wrote:

> +1,
>
> We even see performance degradation when comparing Spark SQL with Hive.
> We have a table of 260 columns, which we queried in both Hive and Spark.
> In Hive it takes 66 sec for 1 GB of data, whereas in Spark it takes
> 4 mins.
> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>
> Could you print out the sql execution plan? My guess is about broadcast
> join.
>
>
>
> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here,
> and is there a way we can optimize the queries in SPARK without the obvious
> hack in Query2?
>
>
> -----------------------
> ENVIRONMENT:
> -----------------------
>
> > Table A has 533 columns x 24 million rows and Table B has 2 columns x 3
> million rows. Each table is a single gzipped CSV file.
> > Both tables A and B are external tables in AWS S3, created in HIVE and
> accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
> allowMaximumResource allocation and node types are c3.4xlarge).
>
> --------------
> QUERY1:
> --------------
> select A.PK, B.FK
> from A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
>
>
> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>
>
> --------------
> QUERY 2:
> --------------
>
> select A.PK, B.FK
> from (select PK from A) A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4.5 mins in SPARK
>
>
>
> Regards,
> Gourav Sengupta
>
>
>
>
>
