Hi,

I am also using Spark on the Hive metastore. The performance is much better,
especially for larger datasets. I have the feeling that performance is better
if I load the data into DataFrames and do the join there instead of doing the
join directly in Spark SQL, but I can't explain why yet.
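
For illustration, this is roughly what I mean by the two variants. A minimal
sketch, assuming a Hive-enabled SparkSession; the table names `orders` and
`customers` and the join column `customer_id` are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-comparison")
  .enableHiveSupport()
  .getOrCreate()

// Variant 1: load the metastore tables into DataFrames, join via the API
val orders    = spark.table("orders")
val customers = spark.table("customers")
val apiJoin   = orders.join(customers, Seq("customer_id"))

// Variant 2: the same join expressed directly in Spark SQL
val sqlJoin = spark.sql(
  """SELECT *
    |FROM orders o
    |JOIN customers c ON o.customer_id = c.customer_id""".stripMargin)

// Both go through the same Catalyst optimizer, so in theory the physical
// plans should match; comparing them is the first thing I would check.
apiJoin.explain()
sqlJoin.explain()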

Does anybody have experience with that?

Br,

Dennis

Sent from my iPhone

> On 15.03.2020 at 06:04, Manjunath Shetty H <manjunathshe...@live.com> wrote:
> 
> 
> Hi All,
> 
> We have 10 tables in a data warehouse (HDFS/Hive) written in ORC format. We
> currently serve a use case on top of that by joining 4-5 of those tables
> using Hive, but it is not as fast as we want it to be, so we are thinking of
> using Spark for this use case.
> 
> Any suggestions on this? Is it a good idea to use Spark for this use case?
> Can we get better performance by using Spark?
> 
> Any pointers would be helpful.
> 
> Notes: 
> Data is partitioned by date (yyyyMMdd) as an integer.
> The query will fetch data for the last 7 days from some tables while joining
> with other tables.
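> 
> Since the partition column is an integer yyyyMMdd, a literal filter on it
> should let Spark prune to just the last 7 days' partitions. A minimal
> sketch, assuming a SparkSession `spark`; the table name `fact_table` and
> partition column `ds` are placeholders:
> 
> import java.time.LocalDate
> import java.time.format.DateTimeFormatter
> import org.apache.spark.sql.functions.col
> 
> // Compute the yyyyMMdd integer for 7 days ago
> val fmt    = DateTimeFormatter.ofPattern("yyyyMMdd")
> val cutoff = LocalDate.now().minusDays(7).format(fmt).toInt
> 
> // Filtering on the partition column prunes partitions at scan time
> val recent = spark.table("fact_table").filter(col("ds") >= cutoff)
> recent.explain()  // look for PartitionFilters: [... ds >= ...] in the plan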
> 
> The approach we have thought of so far (sketched below):
> Create a DataFrame for each table and partition all of them by the same
> column (let's say Country as the partition column)
> Register all tables as temporary tables
> Run the SQL query with the joins
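> 
> A minimal sketch of that approach (the table names t1/t2 and the extra join
> column other_id are placeholders; assumes a Hive-enabled SparkSession):
> 
> import org.apache.spark.sql.functions.col
> 
> // Repartition both tables on the common Country column up front
> val t1 = spark.table("t1").repartition(col("Country"))
> val t2 = spark.table("t2").repartition(col("Country"))
> 
> // Register as temporary views and join in SQL
> t1.createOrReplaceTempView("t1_v")
> t2.createOrReplaceTempView("t2_v")
> 
> val joined = spark.sql(
>   """SELECT *
>     |FROM t1_v a
>     |JOIN t2_v b ON a.Country = b.Country AND a.other_id = b.other_id
>     |""".stripMargin)
> 
> joined.explain()  // the plan still shows Exchange hashpartitioning(...)
> 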
> But the problem we are seeing with this approach is that even though we have
> already partitioned by Country, Spark still does hash partitioning + a
> shuffle during the join. All the table joins are on the `Country` column
> plus some extra columns that depend on the table.
> 
> Is there any way to avoid these shuffles and improve performance?
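> 
> One guess is that the shuffle happens because the sort-merge join wants both
> sides partitioned on the full set of join keys, and repartitioning on
> Country alone does not satisfy that. Would broadcasting the smaller tables,
> or pre-bucketing each table on its full join keys, help here? A rough sketch
> of both ideas (table names are placeholders):
> 
> import org.apache.spark.sql.functions.broadcast
> 
> // Idea 1: broadcast the smaller tables so the join needs no shuffle at all
> // (only viable if they fit in executor memory)
> val joined = spark.table("big_table")
>   .join(broadcast(spark.table("small_table")), Seq("Country", "other_id"))
> 
> // Idea 2: write each table bucketed (and sorted) on its full join keys, so
> // repeated joins can reuse the bucketing instead of shuffling every time.
> // Note: the extra join columns differ per table, so the bucketing keys
> // would differ per table as well.
> spark.table("t1")
>   .write
>   .bucketBy(64, "Country", "other_id")
>   .sortBy("Country", "other_id")
>   .format("orc")
>   .saveAsTable("t1_bucketed")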
> 
> 
> Thanks and regards 
> Manjunath
