Try:

- filtering the data down as early as possible in the job, dropping columns you don't need (a sketch of this, of caching, and of the skew check follows the list)
- processing fewer partitions of the Hive tables at a time
- caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are read repeatedly
- using the Spark UI to identify the bottlenecked resource
- removing features or columns from the output data until the job runs, then adding them back one at a time
- creating a static dataset small enough to work with, then editing the query and retesting repeatedly until you cut the execution time by a significant fraction
- using the Spark UI or the spark shell to check for skew and make sure partitions are evenly distributed
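A minimal sketch of the first, third, and last suggestions, assuming Spark 1.x with a HiveContext in the spark shell (so `sc` already exists); the table names, column names, and partition filter are made up for illustration:

import org.apache.spark.sql.hive.HiveContext

// In the spark shell, sc is already defined; in a standalone job you
// would create a SparkContext first.
val sqlContext = new HiveContext(sc)

// Prune columns and filter partitions as early as possible so the joins
// shuffle less data (fact_table, dim_table, and ds are hypothetical names).
val facts = sqlContext.sql(
  "SELECT id, dim_key, amount FROM fact_table WHERE ds = '2016-07-18'")

// Cache the small dimension table that the joins touch repeatedly.
val dims = sqlContext.sql("SELECT dim_key, dim_name FROM dim_table")
dims.cache()

val joined = facts.join(dims, "dim_key")

// Rough skew check: count rows per partition; a handful of huge
// partitions next to many tiny ones means skewed join keys.
val sizes = joined.rdd
  .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
  .collect()
sizes.sortBy(-_._2).take(10).foreach { case (i, n) =>
  println(s"partition $i: $n rows")
}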
> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
>
> Thanks a lot for your reply.
>
> In effect, we tried to run the SQL with Kettle, Hive, and Spark Hive (via HiveContext) respectively; the job seems frozen and never finishes.
>
> From the 6 tables we need to read different columns in different tables for specific information, then do some simple calculation before output.
> Join operations are used most in the SQL.
>
> Best wishes!
>
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le <giaosu...@gmail.com> wrote:
>
> Hi,
> What about the network bandwidth between Hive and Spark?
> Did it run in Hive before you moved it to Spark?
> Because it's complex, you can use something like the EXPLAIN command to show what is going on.
>
>
>> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
>>
>> The SQL logic in the program is very complex, so I will not describe the detailed code here.
>>
>>
>> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
>>
>> Hi All,
>>
>> Here we have one application that needs to extract different columns from 6 Hive tables and then do some easy calculation; there are around 100,000 rows in each table.
>> Finally it needs to output another table or file (with a consistent set of columns).
>>
>> However, after many days of trying, the Spark Hive job is unthinkably slow - sometimes almost frozen. There are 5 nodes in the Spark cluster.
>>
>> Could anyone offer some help? Any idea or clue would be good.
>>
>> Thanks in advance~
>>
>> Zhiliang
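To make the EXPLAIN suggestion above concrete: a minimal sketch, again assuming a Spark 1.x HiveContext in the spark shell; the two-table query here is a placeholder standing in for the real 6-table join:

// The DataFrame API prints the parsed, analyzed, optimized, and physical
// plans without running the query (table_a, table_b, k, v are placeholders).
val df = sqlContext.sql(
  "SELECT a.k, b.v FROM table_a a JOIN table_b b ON a.k = b.k")
df.explain(true)

// The SQL-side equivalent returns the plan as rows of text.
sqlContext.sql(
  "EXPLAIN EXTENDED SELECT a.k, b.v FROM table_a a JOIN table_b b ON a.k = b.k")
  .collect().foreach(r => println(r.getString(0)))

The physical plan will show which join strategy Spark picked; with 100,000-row tables you would hope to see broadcast joins rather than full shuffles on every join.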