Hi, I am also using Spark on the Hive Metastore. The performance is much better, especially for larger datasets. I have the feeling that performance is better if I load the data into DataFrames and do the join there instead of doing the join directly in Spark SQL, but I can't explain why yet.
Anybody have experience with that?

Br,
Dennis

Sent from my iPhone

> On 15.03.2020 at 06:04, Manjunath Shetty H <manjunathshe...@live.com> wrote:
>
> Hi All,
>
> We have 10 tables in a data warehouse (HDFS/Hive) written in ORC format. We
> are serving a use case on top of that by joining 4-5 tables using Hive as of
> now. But it is not as fast as we want it to be, so we are thinking of using
> Spark for this use case.
>
> Any suggestions on this? Is it a good idea to use Spark for this use case?
> Can we get better performance by using Spark?
>
> Any pointers would be helpful.
>
> Notes:
> Data is partitioned by date (yyyyMMdd) as an integer.
> The query will fetch data for the last 7 days from some tables while joining
> with other tables.
>
> Approach we thought of for now:
> Create a DataFrame for each table and partition by the same column for all
> tables (let's say Country as the partition column)
> Register all tables as temporary tables
> Run the SQL query with joins
>
> The problem we are seeing with this approach is that, even though we already
> partitioned by Country, it still does hashPartitioning + shuffle during the
> join. All the table joins contain the `Country` column plus some extra
> columns depending on the table.
>
> Is there any way to avoid these shuffles and improve performance?
>
> Thanks and regards
> Manjunath
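[Editor's note: one common way to avoid the hash-partitioning exchange on the join key is to bucket the tables on that key when they are written, so that a later sort-merge join can reuse the existing layout. Below is a minimal sketch under assumptions: the table names (`warehouse.orders`, `warehouse.customers`, and their `_bucketed` variants) and the bucket count are hypothetical, and whether the exchange actually disappears should be verified with `.explain()` on the specific Spark version in use.]

```scala
import org.apache.spark.sql.SparkSession

object BucketedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-join-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Write each source table bucketed (and sorted) by the join column.
    // The bucket count must match on both sides for the shuffle-free join.
    // Table names here are placeholders, not from the original thread.
    spark.table("warehouse.orders")
      .write
      .bucketBy(64, "Country")
      .sortBy("Country")
      .saveAsTable("warehouse.orders_bucketed")

    spark.table("warehouse.customers")
      .write
      .bucketBy(64, "Country")
      .sortBy("Country")
      .saveAsTable("warehouse.customers_bucketed")

    // With matching bucketing on both sides, the join on Country should not
    // need a hash-partitioning exchange; confirm in the physical plan.
    val joined = spark.table("warehouse.orders_bucketed")
      .join(spark.table("warehouse.customers_bucketed"), Seq("Country"))

    joined.explain()
  }
}
```

The trade-off is an up-front rewrite of the tables; for small dimension tables a broadcast join (`broadcast(df)` from `org.apache.spark.sql.functions`) may avoid the shuffle with less effort.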