Re: Issue joining 21 HUGE Hive tables

Mich Talebzadeh Thu, 24 Mar 2016 00:38:52 -0700

Posting a typical query that you are using will help to clarify the issue.

Also you may use TEMPORARY TABLEs to keep the intermediate stage results.


On the face of it you can time every query itself to find out the longest
components etc

select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS
StartTime;

CREATE TEMPORARY TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
TotalSales
--FROM smallsales s, times t, channels c
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
AND   s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
;
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS
FirstQuery;

SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
from tmp
ORDER BY MONTH, CHANNEL LIMIT 5
;

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 24 March 2016 at 06:36, Jörn Franke <jornfra...@gmail.com> wrote:

> Joining so many external tables is always an issue with any component.
> Your problem is not Hive specific; but your data model seems to be messed
> up. First of all you should have them in an appropriate format, such as ORC
> or parquet and the tables should not be external. Then you should use the
> right data types for columns, eg an int instead of a varchar if you have
> just numbers in a column. After that check if you can prejoin and store the
> data in one big flat table and do queries on that.
>
> Then you should look at the min / max indexes , bloom filters, statistics,
> partitions etc.
>
> Maybe you can post more details about data model and queries.
>
> On 24 Mar 2016, at 02:49, Sanka, Himabindu <himabindu_sa...@optum.com>
> wrote:
>
> Hi Team,
>
>
>
> I need some inputs from you. I have a requirement for my project where I
> have to join 21 hive external tables.
>
>
>
> Out of which 6 tables are HUGE  having 500 million records of data. Other
> 15 tables are smaller ones around 100 to 1000 records each.
>
>
>
> When I am doing inner joins/ left outer joins its taking hours to run the
> query.
>
>
>
> Please let me know some optimization techniques or any other eco system
> components that performs better than HIVE.
>
>
>
>
>
> *Regards,*
>
> Hima
>
>
>
>
>
>
> This e-mail, including attachments, may include confidential and/or
> proprietary information, and may be used only by the person or entity
> to which it is addressed. If the reader of this e-mail is not the intended
> recipient or his or her authorized agent, the reader is hereby notified
> that any dissemination, distribution or copying of this e-mail is
> prohibited. If you have received this e-mail in error, please notify the
> sender by replying to this message and delete this e-mail immediately.
>
>

Re: Issue joining 21 HUGE Hive tables

Reply via email to