Hi everyone I have tried to to achieve hierarchical based (index mode) top n creation using spark query. it taken more time when i execute following query
Select SUM(`adventurepersoncontacts`.`contactid`) AS `adventurepersoncontacts_contactid` , `adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname` FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts JOIN ( SELECT `F_0`.`fullname_0_0` AS `fullname_0_0`, ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0` FROM( SELECT `adventurepersoncontacts`.`fullname` AS `fullname_0_0`, SUM(adventurepersoncontacts.contactid) AS `Measure_0` FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts GROUP BY `adventurepersoncontacts`.`fullname` )`F_0` ) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` ) AND (`T_0`.`R_N_0` <= 5)) GROUP BY `adventurepersoncontacts`.`fullname In mentioned query, I have set row index in every group according to the aggregation type. Row_number calculation according to the aggregation like Row_Number() Over( order by sum( column name) order by DESC) not directly supported in spark . So i have using sub query like SELECT `F_0`.`fullname_0_0` AS `fullname_0_0`, ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0` FROM( SELECT `adventurepersoncontacts`.`fullname` AS `fullname_0_0`, SUM(adventurepersoncontacts.contactid) AS `Measure_0` FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts GROUP BY `adventurepersoncontacts`.`fullname` )`F_0` ) `T_0` The execution time was getting slowed when using following way. In case i removed main group by( remove aggregation) method , it getting very fast execution. Refer following query. I mentioned this query only for slow execution when i include main group by Select `adventurepersoncontacts`.`contactid` AS `adventurepersoncontacts_contactid` , `adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname` FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts JOIN ( SELECT `F_0`.`fullname_0_0` AS `fullname_0_0`, ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC) AS `R_N_0` FROM( SELECT `adventurepersoncontacts`.`fullname` AS `fullname_0_0`, SUM(adventurepersoncontacts.contactid) AS `Measure_0` FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts GROUP BY `adventurepersoncontacts`.`fullname` )`F_0` ) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` ) AND (`T_0`.`R_N_0` <= 5)) I have found other way for getting this hierarchical query. I have created a temp table ( The table refers inner sub query(refer previous query) ) "CREATE external table IF NOT EXISTS temp_table AS SELECT SUM(adventurepersoncontacts.contactid) as contactid, adventurepersoncontacts.fullname as fullname FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts GROUP BY `adventurepersoncontacts`.`fullname`"; "SELECT SUM(`adventurepersoncontacts`.`contactid`) AS `adventurepersoncontacts_contactid` ,`adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname` FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts JOIN (SELECT `fullname` AS `fullname_0_0`,ROW_NUMBER() over( order by `contactid` desc) AS `R_N_0` FROM default.temp_table) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` ) AND (`T_0`.`R_N_0` <= 2)) GROUP BY `adventurepersoncontacts`.`fullname`"; This is other way. But this execution time was delayed when using create table using select statement Please help any other optimized way to achieve my requirement. Regards, Ravi