Hi All,

I have a fairly complex HiveQL data processing job that I am trying to convert
to Spark SQL to improve performance. Here is what it does:

Select around 100 columns, including aggregates,
from a FACT_TABLE,
joined to the summary of the same FACT_TABLE,
and joined to 2 smaller DIMENSION tables.

The job currently takes around an hour to complete.

This is what I have tried so far; a code sketch of these steps follows the list.

1. Use hiveContext to query the DIMENSION tables, store them as DataFrames,
and register each with "registerTempTable".

2. Use hiveContext to query the summary of the FACT_TABLE, store it as a
DataFrame, and register it with "registerTempTable".

3. Join the temp tables from the above 2 steps to produce the final result
set as another DataFrame.

4. Save the DataFrame from step 3 back to Hive in overwrite mode
(insert-overwrite semantics) using "saveAsTable".

Below are my questions. Any response would be much appreciated. Thanks.

A. Is there a better approach?
B. Is breaking down the big Hive query into multiple steps with multiple
DataFrames expected to give better performance?
C. Is there an opportunity to intermix RDDs with Spark SQL in this case?
D. Can caching a DataFrame improve performance? (See the snippet after this
list for what I mean.)
E. Are there other suggestions to improve performance?
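
Regarding question D, by "caching" I mean something like the following
(a minimal sketch; factSummary is the summary DataFrame from step 2 of the
sketch above):

// Persist the summary DataFrame in memory so the join in step 3
// does not recompute the aggregation over the fact table.
factSummary.cache()
factSummary.registerTempTable("fact_summary_tmp")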

Thank you for your time.
