Hi,
Regarding your point: "Basically modern technologies, such as Hive, but also relational databases, suggest prejoining tables and working on big flat tables. The reason is that they are distributed systems and you should avoid transferring a lot of data between nodes for each query."

Can you please clarify whether you are suggesting merging the fact table with the dimension tables in Hive, as an example? In other words, further denormalising it?

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7, and co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4.

Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; neither Peridale Technology Ltd, its subsidiaries, nor their employees accept any responsibility.

-----Original Message-----
From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: 18 January 2016 08:37
To: user@hive.apache.org
Subject: Re: optimize joins in hive 1.2.1

Do you have some data model?
Basically, modern technologies such as Hive, but also relational databases, suggest prejoining tables and working on big flat tables. The reason is that they are distributed systems and you should avoid transferring a lot of data between nodes for each query. Hence, Hive works with the ORC format (or Parquet) to support storage indexes and bloom filters. The idea here is that these techniques allow you to skip reading a lot of data blocks.

1. The first optimization is using the right format (ORC or Parquet).
2. The second optimization is to use compression. For example, snappy allows fast decompression, so it is more suitable for current (frequently accessed) data.
3. The third optimizations are partitions and buckets.
4. The fourth optimization is evaluating the different types of joins.
5. The fifth optimizations can be in-memory technologies, such as the Ignite HDFS cache or LLAP.
6. The sixth optimization is the execution engine. Tez currently supports the most optimizations in Hive.
7. The seventh optimization is the data model: use int or similar numeric values for join keys. Try to use the right data type for your needs.
8. The eighth optimization can be hardware improvements.
9. The ninth optimization can be using the right data structure. If your queries are more graph-type queries, then consider using a graph database, such as TitanDB.
10. The tenth type of optimizations are OS/network-level optimizations, for example using jumbo frames for your network.

These are not necessarily all in this order, and there is much more to optimize. But as always, this depends on your data structure. Sometimes joins can make sense. Most of the time you have to experiment.

> On 18 Jan 2016, at 09:07, Divya Gehlot <divya.htco...@gmail.com> wrote:
>
> Hi,
> Need tips/guidance to optimize (increase performance of) billion-row joins in Hive.
>
> Any help would be appreciated.
>
> Thanks,
> Divya
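As a sketch of the first three optimizations above (format, compression, partitions and buckets), a fact table could be declared as ORC with snappy compression, bloom filters on the join keys, date partitioning, and bucketing on the join key. The table and column names here are hypothetical, chosen only for illustration:

```sql
-- Hypothetical fact table illustrating ORC + snappy compression,
-- storage indexes, bloom filters, partitioning, and bucketing.
CREATE TABLE sales_fact (
  customer_id BIGINT,        -- numeric join key (see also optimization 7)
  product_id  BIGINT,
  amount      DECIMAL(18,2)
)
PARTITIONED BY (sale_date STRING)           -- enables partition pruning
CLUSTERED BY (customer_id) INTO 64 BUCKETS  -- bucketed on the join key
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',
  'orc.create.index' = 'true',
  'orc.bloom.filter.columns' = 'customer_id,product_id'
);
```

With this layout, a query filtering on sale_date only reads the matching partitions, and the ORC storage indexes and bloom filters let Hive skip stripes that cannot contain the requested keys.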
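For the fourth optimization (evaluating join types), one common pattern is to let Hive convert a join against a small dimension table into a map-side (broadcast) join so the large fact table is never shuffled, or to use a sort-merge-bucket join when both tables are bucketed and sorted on the join key. A sketch, again with hypothetical table names and a threshold value you would tune for your cluster:

```sql
-- Allow automatic conversion to a map-side join when one side is small;
-- the size threshold (in bytes) is an example value, tune it per cluster.
SET hive.auto.convert.join = true;
SET hive.auto.convert.join.noconditionaltask.size = 256000000;

-- If both tables are bucketed and sorted on the join key,
-- a sort-merge-bucket (SMB) join can be enabled instead:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT f.customer_id, SUM(f.amount) AS total_amount
FROM sales_fact f
JOIN customer_dim d ON f.customer_id = d.customer_id
WHERE f.sale_date = '2016-01-18'
GROUP BY f.customer_id;
```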
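For the sixth optimization (the execution engine), switching to Tez and enabling vectorization and the cost-based optimizer is a few session settings; note that the CBO only helps if table and column statistics have been gathered. A sketch, reusing the hypothetical sales_fact table:

```sql
-- Run on Tez with vectorized execution and the cost-based optimizer.
SET hive.execution.engine = tez;
SET hive.vectorized.execution.enabled = true;
SET hive.cbo.enable = true;
SET hive.compute.query.using.stats = true;

-- The cost-based optimizer needs statistics to pick good join plans:
ANALYZE TABLE sales_fact PARTITION (sale_date = '2016-01-18')
  COMPUTE STATISTICS FOR COLUMNS;
```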