Hi,

We have a dimension table with a few hundred columns, from which we need
only a few columns to join with the main fact table, which has a few
million rows. I am not sure how one-off this case sounds, but having
worked in data warehousing for a while, it strikes me as a fairly common
use case.
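
For illustration, the kind of join I mean (table and column names here
are hypothetical): project only the handful of dimension columns that
are actually needed, so the few-hundred-column table is never carried
through the join in full.

-- select only the required dimension columns before joining
select f.sale_id, d.region, d.segment
from fact_sales f
join (select dim_key, region, segment from dim_customer) d
  on f.customer_key = d.dim_key;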

Spark in local mode will be way faster than Spark running on Hadoop. I
have a system with 64 GB of RAM and an SSD, and Spark's performance on it
in local mode is far better.

Did your join include the same number of columns and rows for the dimension
table?


Regards,
Gourav Sengupta

On Thu, Jun 16, 2016 at 9:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Sounds like this is a one-off case.
>
> Do you have any other use case where Hive on MR outperforms Spark?
>
> I did some tests on a 1-billion-row table, getting the selectivity of a
> column using Hive on MR, Hive on Spark engine, and Spark running in local
> mode (to keep it simple)
>
>
> Hive 2, Spark 1.6.1
>
> Results:
>
> Hive with map-reduce --> 18  minutes
> Hive on Spark engine -->  6 minutes
> Spark                -->  2 minutes
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 June 2016 at 08:43, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I agree here.
>>
>> However, it always depends on your use case!
>>
>> Best regards
>>
>> On 16 Jun 2016, at 04:58, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>> Hi Mahender,
>>
>> please ensure that for dimension tables you are enabling the broadcast
>> join. You should be able to see surprising gains, around 12x.
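>>
>> A minimal sketch of what I mean in Spark SQL (the threshold value is
>> illustrative and in bytes, and the table name is hypothetical; the
>> broadcast table must fit in memory on each executor):
>>
>> -- raise the auto-broadcast threshold so the dimension table qualifies
>> SET spark.sql.autoBroadcastJoinThreshold=104857600;
>> -- for Hive tables, fresh size statistics help Spark pick the broadcast plan
>> ANALYZE TABLE dim_customer COMPUTE STATISTICS noscan;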
>>
>> Overall I think that SPARK cannot figure out whether to scan all the
>> columns in a table or just the ones which are actually used, and that is
>> what is causing this issue.
>>
>> When you start using HIVE with ORC and TEZ you will see some amazing
>> results, leaving SPARK way behind. So you pretty much need to have your
>> data in memory to match the performance claims of SPARK, and the
>> advantage you get in that case comes not from SPARK's algorithms but
>> simply from fast I/O out of RAM. The advantage of SPARK is that it
>> brings analytics, querying, and streaming frameworks together in an
>> accessible way.
>>
>>
>> If you follow the optimisations mentioned in this link, you have hardly
>> any reason to use SPARK SQL:
>> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
>> imagine being able to do all of that without machines that require huge
>> amounts of RAM; in short, you achieve those performance gains on the
>> low-cost commodity systems around which HADOOP was designed.
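>>
>> As a rough sketch of that kind of setup in HIVE (table names are
>> hypothetical; ORC storage, the Tez engine and vectorisation are among
>> the optimisations the link describes):
>>
>> SET hive.execution.engine=tez;
>> SET hive.vectorized.execution.enabled=true;
>> -- rewrite the raw CSV table into columnar ORC storage
>> CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales_csv;
>> -- gather table statistics for the cost-based optimiser
>> ANALYZE TABLE sales_orc COMPUTE STATISTICS;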
>>
>> I think that Hortonworks is providing stiff competition here :)
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
>> mahender.bigd...@outlook.com> wrote:
>>
>>> +1.
>>>
>>> We also see performance degradation when comparing Spark SQL with Hive.
>>> We have a table of 260 columns and have executed the same query in Hive
>>> and Spark. In Hive it takes 66 seconds for 1 GB of data, whereas in
>>> Spark it takes 4 minutes.
>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>>
>>> Could you print out the SQL execution plan? My guess is that this is
>>> about the broadcast join.
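>>>
>>> For instance, something like this against Query1 below (a sketch; in
>>> the printed plan, look for BroadcastHashJoin versus SortMergeJoin):
>>>
>>> EXPLAIN EXTENDED
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;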
>>>
>>>
>>>
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here, and is there a way we can optimize the queries in SPARK without
>>> the obvious hack in Query2?
>>>
>>>
>>> -----------------------
>>> ENVIRONMENT:
>>> -----------------------
>>>
>>> > Table A has 533 columns x 24 million rows and Table B has 2 columns
>>> x 3 million rows. Both tables are single gzipped CSV files.
>>> > Both Table A and B are external tables in AWS S3, created in HIVE
>>> and accessed through SPARK using HiveContext.
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> maximizeResourceAllocation and node type c3.4xlarge).
>>>
>>> --------------
>>> QUERY1:
>>> --------------
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --------------
>>> QUERY 2:
>>> --------------
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
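>>>
>>> For what it is worth, one untested sketch that keeps Query1 as written:
>>> B is only 2 columns x 3 million rows, so raising the auto-broadcast
>>> threshold (a value in bytes) may let SPARK broadcast B instead of
>>> shuffling the 533-column table A.
>>>
>>> SET spark.sql.autoBroadcastJoinThreshold=268435456;
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;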
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>>
>>
>
