Has anybody tried converting HiveQL queries to SparkSQL? If so, would you please share your experience, pros & cons? Thank you.
On Thu, Jul 30, 2015 at 10:37 AM, Bigdata techguy <bigdatatech...@gmail.com> wrote:
> Thanks Jorn for the response and for the pointer questions on Hive
> query optimization.
>
> I believe I have done the applicable things to improve Hive query
> performance, including but not limited to: running on TEZ, using
> partitioning and bucketing, using EXPLAIN to make sure partition pruning
> is happening, using compression, using the best data types for join
> columns, denormalizing, etc. I am using Hive version 0.13.
>
> The idea behind this POC is to find the strengths of SparkSQL over HiveQL
> and identify the use cases where SparkSQL can perform better than HiveQL,
> beyond the iterative use cases. In general, what are the scenarios where
> SparkSQL is the better fit?
>
> I am pretty sure someone has tried this before and compared
> performance. Any responses would be much appreciated. Thank you.
>
>
> On Wed, Jul 29, 2015 at 1:57 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> What Hive version are you using? Do you run it on TEZ? Are you using
>> the ORC format? Do you use compression? Snappy? Do you use Bloom
>> filters? Do you insert the data sorted on the right columns? Do you use
>> partitioning? Did you increase the replication factor for often-used
>> tables or partitions? Do you use bucketing? Is your data model
>> appropriate (join columns as int, numeric data types where appropriate,
>> dates as int...)? Did you calculate statistics? Did you use indexes
>> (compressed, ORC format)? Do you provide mapjoin hints? Did you do any
>> other Hive optimization? Did you use EXPLAIN to verify that only the
>> selected partitions, indexes, and Bloom filters are actually used?
>> Did you verify that no other application has taken resources? What is
>> the CPU level on the namenode and hiveserver2? If it is high, then you
>> need more memory there!
>>
>> The first rule is to get Hive right before you think about in-memory.
>> Caching will only help for iterative workloads.
>> You may also think about
>> denormalizing the model even more to avoid joins as much as possible.
>>
>> Bigdata techguy <bigdatatech...@gmail.com> wrote on Wed, Jul 29, 2015,
>> at 18:49:
>>
>>> Hi All,
>>>
>>> I have a fairly complex HiveQL data processing job which I am trying
>>> to convert to SparkSQL to improve performance. Below is what it does:
>>>
>>> Select around 100 columns, including aggregates,
>>> from a FACT_TABLE
>>> joined to a summary of the same FACT_TABLE
>>> joined to 2 smaller DIMENSION tables.
>>>
>>> The job currently takes around an hour to complete.
>>>
>>> This is what I have tried so far:
>>>
>>> 1. Use hiveContext to query the DIMENSION tables, store them as
>>> DataFrames, and "registerTempTable".
>>>
>>> 2. Use hiveContext to query the summary of FACT_TABLE, store it as a
>>> DataFrame, and "registerTempTable".
>>>
>>> 3. Join the temp tables from the 2 steps above to produce the final
>>> record set as another DataFrame.
>>>
>>> 4. Save the DataFrame from step 3 to Hive, overwriting the target
>>> table, using "saveAsTable".
>>>
>>> Below are my questions. Any response would be much appreciated. Thanks.
>>>
>>> A. Is there a better approach?
>>> B. Can breaking the big Hive query into multiple steps with multiple
>>> DataFrames be expected to give better performance?
>>> C. Is there an opportunity to intermix RDDs with SparkSQL in this case?
>>> D. Can caching a DataFrame improve performance?
>>> E. Are there other suggestions to improve performance?
>>>
>>> Thank you for your time.
>>>
>>>
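[Editor's note: the four-step pipeline described in the quoted message can be sketched in PySpark against the Spark 1.x API the thread refers to (hiveContext, registerTempTable, saveAsTable). This is a minimal sketch, not the poster's actual job: it assumes a running Spark cluster with Hive support, and all table and column names (fact_sales, dim_product, dim_store, etc.) are hypothetical.]

```python
# Sketch of the multi-step DataFrame pipeline from the thread above.
# Assumes a Spark 1.4-era cluster with Hive support; every table and
# column name here is a hypothetical placeholder.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hiveql-to-sparksql-poc")
hiveContext = HiveContext(sc)

# Step 1: load the small DIMENSION tables and register them as temp tables.
dim_product = hiveContext.sql("SELECT * FROM dim_product")
dim_store = hiveContext.sql("SELECT * FROM dim_store")
dim_product.registerTempTable("dim_product_tmp")
dim_store.registerTempTable("dim_store_tmp")

# Step 2: build the summary of the fact table and register it too.
# Caching it may help (question D), since it is reused in the final join.
fact_summary = hiveContext.sql("""
    SELECT product_id, store_id, SUM(amount) AS total_amount
    FROM fact_sales
    GROUP BY product_id, store_id
""")
fact_summary.cache()
fact_summary.registerTempTable("fact_summary_tmp")

# Step 3: join the fact table to its summary and the dimension temp tables
# to produce the final record set as another DataFrame.
result = hiveContext.sql("""
    SELECT f.*, s.total_amount, p.product_name, st.store_name
    FROM fact_sales f
    JOIN fact_summary_tmp s ON f.product_id = s.product_id
                           AND f.store_id = s.store_id
    JOIN dim_product_tmp p ON f.product_id = p.product_id
    JOIN dim_store_tmp st ON f.store_id = st.store_id
""")

# Step 4: overwrite the target Hive table with the result.
result.write.mode("overwrite").saveAsTable("fact_sales_enriched")
```

One design note on question B: breaking the query apart only pays off if an intermediate result is reused (as the cached summary is here); otherwise Spark's optimizer handles a single large SQL statement just as well.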