Hi,

Your statement
"I have a system with 64 GB RAM and SSD and its performance on local
cluster SPARK is way better"

Is this a host with 64 GB of RAM, with your data stored on local solid
state disks? Can you kindly provide the parameters you pass to
spark-submit:

${SPARK_HOME}/bin/spark-submit \
  --master local[?] \
  --driver-memory ?G \
  --num-executors 1 \
  --executor-memory ?G

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 16 June 2016 at 11:40, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> We have a dimension table with a few hundred columns, from which we need
> only a few columns to join with the main fact table, which has a few
> million rows. I do not know how one-off this case sounds, but since I
> have been working in data warehousing it seems like a fairly common use
> case.
>
> Spark in local mode will be way faster compared to Spark running on
> Hadoop. I have a system with 64 GB RAM and SSD, and its performance on
> Spark in local mode is way better.
>
> Did your join include the same number of columns and rows for the
> dimension table?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Jun 16, 2016 at 9:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Sounds like this is a one-off case.
>>
>> Do you have any other use case where Hive on MR outperforms Spark?
>>
>> I did some tests on a 1 billion row table, getting the selectivity of a
>> column using Hive on MR, Hive on the Spark engine, and Spark running in
>> local mode (to keep it simple).
>>
>> Hive 2, Spark 1.6.1
>>
>> Results:
>>
>> Hive with map-reduce --> 18 minutes
>> Hive on Spark engine --> 6 minutes
>> Spark                --> 2 minutes
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 16 June 2016 at 08:43, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> I agree here.
>>>
>>> However, it always depends on your use case!
>>>
>>> Best regards
>>>
>>> On 16 Jun 2016, at 04:58, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi Mahender,
>>>
>>> Please ensure that for dimension tables you are enabling the broadcast
>>> method. You should be able to see surprising gains of around 12x.
>>>
>>> Overall, I think that Spark cannot figure out whether to scan all the
>>> columns in a table or just the ones which are being used, and this
>>> causes the issue.
>>>
>>> When you start using Hive with ORC and Tez you will see some amazing
>>> results, which leave Spark way behind. So pretty much you need to have
>>> your data in memory to match the performance claims of Spark, and the
>>> advantage in that case comes not from Spark's algorithms but simply
>>> from fast I/O out of RAM. The advantage of Spark is that it brings
>>> analytics, querying, and streaming frameworks together in an
>>> accessible way.
>>>
>>> If you follow the optimisations mentioned in the link below, you
>>> hardly have any reason for using Spark SQL:
>>> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ .
And
>>> imagine being able to do all of that without machines that require
>>> huge RAM; in short, you achieve those performance gains using the
>>> commodity low-cost systems around which Hadoop was designed.
>>>
>>> I think that Hortonworks is giving some stiff competition here :)
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <mahender.bigd...@outlook.com> wrote:
>>>
>>>> +1.
>>>>
>>>> We even see performance degradation when comparing Spark SQL with
>>>> Hive. We have a table of 260 columns and have executed the same query
>>>> in Hive and Spark: in Hive it takes 66 seconds for 1 GB of data,
>>>> whereas in Spark it takes 4 minutes.
>>>>
>>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>>>
>>>> Could you print out the SQL execution plan? My guess is that it is
>>>> about the broadcast join.
>>>>
>>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Query 1 is almost 25x faster in Hive than in Spark. What is happening
>>>> here, and is there a way we can optimise the queries in Spark without
>>>> the obvious hack in Query 2?
>>>>
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>>
>>>> > Table A has 533 columns x 24 million rows and Table B has 2 columns
>>>> x 3 million rows. Both are single gzipped CSV files.
>>>> > Both tables A and B are external tables in AWS S3, created in Hive
>>>> and accessed through Spark using HiveContext.
>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started with
>>>> maximizeResourceAllocation and c3.4xlarge nodes).
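[Editorial note: as an illustration of the "broadcast method" recommended earlier in this thread, here is a minimal Spark 1.6 Scala sketch. It assumes a HiveContext bound to sqlContext; the table names ("fact", "dim") and column names are hypothetical, not from this thread.]

```scala
// Minimal sketch, assuming a Spark 1.6 HiveContext bound to sqlContext.
// Table names ("fact", "dim") and column names are hypothetical.
import org.apache.spark.sql.functions.broadcast

val fact = sqlContext.table("fact")   // large fact table, millions of rows
val dim  = sqlContext.table("dim")    // wide dimension table, few hundred columns

// Keep only the dimension columns the query needs, then hint Spark to
// broadcast the pruned table so the join is map-side instead of shuffled.
val dimSlim = broadcast(dim.select("dim_key", "dim_name"))
val joined  = fact.join(dimSlim, fact("fk") === dimSlim("dim_key"))
```

Spark will also broadcast automatically when it estimates the table to be smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), but the explicit broadcast() hint avoids relying on table statistics.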
>>>>
>>>> --------------
>>>> QUERY 1:
>>>> --------------
>>>>
>>>> select A.PK, B.FK
>>>> from A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4 minutes in Hive and 1.1 hours in Spark.
>>>>
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>>
>>>> select A.PK, B.FK
>>>> from (select PK from A) A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4.5 minutes in Spark.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
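[Editorial note: following up on the suggestion in this thread to print the SQL execution plan, here is a hedged Scala sketch for Query 1. It assumes a Spark 1.6 HiveContext bound to sqlContext, with tables A and B registered as described in the ENVIRONMENT section.]

```scala
// Sketch, assuming a Spark 1.6 HiveContext bound to sqlContext and the
// tables A and B registered as in the ENVIRONMENT section above.
// explain(true) prints the parsed, analysed, optimised and physical
// plans, so you can check whether all 533 columns of A are being
// scanned and whether the join is a BroadcastHashJoin or a
// shuffle-based join.
val q1 = sqlContext.sql("""
  select A.PK, B.FK
  from A
  left outer join B on (A.PK = B.FK)
  where B.FK is not null
""")
q1.explain(true)
```

If the plan shows a full scan of A, manually projecting only the needed columns first (as in Query 2) is the workaround; a broadcast hint on the smaller table B is another thing worth checking in the plan.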