Hi Stephen,

How can a single gzipped CSV file be partitioned and who partitions tables
based on Primary Key in Hive?  If you read the environments section you
will be able to see that all the required details are mentioned.

As far as I understand that Hive does work 25x faster (in these particular
cases) and around 100x faster (when we are using TEZ) when compared to
SPARK.

It will be interesting to see if Ted includes these findings while they are
benchmarking SPARK. This is a very typical and a general used case.


Regards,
Gourav

On Thu, Jun 9, 2016 at 5:11 PM, Stephen Boesch <java...@gmail.com> wrote:

> ooc are the tables partitioned on a.pk and b.fk?  Hive might be using
> copartitioning in that case: it is one of hive's strengths.
>
> 2016-06-09 7:28 GMT-07:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
>> Hi Mich,
>>
>> does not Hive use map-reduce? I thought it to be so. And since I am
>> running the queries in EMR 4.6 therefore HIVE is not using TEZ.
>>
>>
>> Regards,
>> Gourav
>>
>> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> are you using map-reduce with Hive?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 9 June 2016 at 15:14, Gourav Sengupta <gourav.sengu...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>>> here and is there a way we can optimize the queries in SPARK without the
>>>> obvious hack in Query2.
>>>>
>>>>
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>>
>>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>>> million rows. Both the files are single gzipped csv file.
>>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>>> accessed through SPARK using HiveContext
>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>>
>>>> --------------
>>>> QUERY1:
>>>> --------------
>>>> select A.PK, B.FK
>>>> from A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>>
>>>>
>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>>
>>>>
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>>
>>>> select A.PK, B.FK
>>>> from (select PK from A) A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4.5 mins in SPARK
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Reply via email to