Hi,

 

 

Your point on 

 

“Basically, modern technologies such as Hive, but also relational databases,
suggest prejoining tables and working on big flat tables. The reason is that
they are distributed systems and you should avoid transferring a lot of data
between nodes for each query.”

 

Can you please clarify whether you are suggesting, as an example, merging the
fact table with the dimension tables in Hive? In other words, further
denormalising it?
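For what it is worth, the kind of prejoin being asked about could be sketched in HiveQL roughly as follows; the table and column names here are hypothetical, not taken from the thread:

```sql
-- Hypothetical star schema: a sales fact table plus customer and product
-- dimensions. Materialise one big flat table so later queries avoid the join.
CREATE TABLE sales_flat
STORED AS ORC
AS
SELECT s.sale_id,
       s.sale_date,
       s.amount,
       c.customer_name,
       c.customer_country,
       p.product_name,
       p.product_category
FROM   sales s
JOIN   customers c ON s.customer_id = c.customer_id
JOIN   products  p ON s.product_id  = p.product_id;
```

The trade-off is storage and maintenance: the flat table duplicates dimension attributes on every row and must be rebuilt (or incrementally reloaded) when the dimensions change.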

 

Thanks

 

 

Dr Mich Talebzadeh

 

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the book "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.

Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4.

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only; if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free;
accordingly, neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 

-----Original Message-----
From: Jörn Franke [mailto:jornfra...@gmail.com] 
Sent: 18 January 2016 08:37
To: user@hive.apache.org
Subject: Re: optimize joins in hive 1.2.1

 

Do you have some data model?

 

Basically, modern technologies such as Hive, but also relational databases,
suggest prejoining tables and working on big flat tables. The reason is that
they are distributed systems and you should avoid transferring a lot of data
between nodes for each query.

Hence, Hive works with the ORC format (or Parquet) to support storage
indexes and bloom filters. The idea here is that these techniques allow you
to skip reading a lot of data blocks.

1. The first optimization is using the right format.
2. The second is compression. For example, Snappy allows fast decompression,
so it is more suitable for frequently accessed data.
3. The third is partitions and buckets.
4. The fourth is evaluating the different types of joins.
5. The fifth can be in-memory technologies, such as the Ignite HDFS cache or
LLAP.
6. The sixth is the execution engine. Tez currently supports the most
optimizations in Hive.
7. The seventh is the data model: use int or similar numeric values for join
keys, and try to use the right data type for your needs.
8. The eighth can be hardware improvements.
9. The ninth can be using the right data structure. If your queries are more
graph-type queries, then consider using a graph database, such as TitanDB.
10. The tenth is OS/network-level optimizations, for example using jumbo
frames on your network.
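As a rough illustration of the format, compression, partitioning and bucketing points above, a table definition might look like the following; the table and column names are made up for the example:

```sql
-- ORC with Snappy compression, partitioned by date, bucketed on the join key,
-- with a bloom filter on a frequently filtered column so ORC can skip stripes.
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',
  'orc.bloom.filter.columns' = 'customer_id',
  'orc.create.index' = 'true'
);
```

Partition and bucket counts are workload-dependent; 32 buckets here is only a placeholder to show the syntax.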

Not all necessarily in this order, and there is much more to optimize.
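For the join-type and execution-engine points, the relevant session settings in Hive 1.2 would be along these lines; the values shown are starting points to experiment with, not recommendations:

```sql
-- Use Tez rather than MapReduce as the execution engine.
SET hive.execution.engine = tez;
-- Convert joins against small dimension tables into map-side (broadcast) joins.
SET hive.auto.convert.join = true;
SET hive.auto.convert.join.noconditionaltask.size = 10000000;
-- Vectorized execution processes rows in batches (ORC benefits most).
SET hive.vectorized.execution.enabled = true;
-- Allow bucket map joins when both tables are bucketed on the join key.
SET hive.optimize.bucketmapjoin = true;
-- Cost-based optimizer can pick a better join order (needs table statistics).
SET hive.cbo.enable = true;
```

Measuring each change separately with EXPLAIN and a representative query is usually more informative than enabling everything at once.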

 

But as always, this depends on your data structure. Sometimes joins can make
sense. Most of the time you have to experiment.

 

> On 18 Jan 2016, at 09:07, Divya Gehlot <divya.htco...@gmail.com> wrote:

> 

> Hi,

> Need tips/guidance to optimize (improve the performance of) joins over
billions of rows in Hive.

> 

> Any help would be appreciated.

> 

> 

> Thanks,

> Divya 
