I am doing a pivot transformation on an input dataset ==================== Following input schema ===================== |-- c_salutation: string (nullable = true) |-- c_preferred_cust_flag: string (nullable = true) |-- integer_type_col: integer (nullable = false) |-- long_type_col: long (nullable = false) |-- string_type_col: string (nullable = true) |-- decimal_type_col: decimal(38,0) (nullable = true)
My pivot column is c_preferred_cust_flag , pivot values is "Y","N","R" and group by column is c_salutation I am using the api * pivot(String pivotColumn,* * java.util.List<Object> values) * *on RelationalGroupedDataset* ================================ My aggregation functions after this pivot is =================================== count(`string_type_col`) ,sum(`string_type_col`) ,sum(`integer_type_col`) ,avg(`integer_type_col`) ,sum(`long_type_col`) ,avg(`long_type_col`) ,avg(`decimal_type_col`) =========================================== My output dataset schema after the groupby.pivot.agg() is ================================================ |-- c_salutation: string (nullable = true) |-- Y_count(`string_type_col`): long (nullable = true) |-- Y_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true) |-- Y_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true) |-- Y_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true) |-- Y_sum(`long_type_col`): long (nullable = true) |-- Y_avg(`long_type_col`): double (nullable = true) |-- Y_avg(`decimal_type_col`): decimal(38,4) (nullable = true) |-- N_count(`string_type_col`): long (nullable = true) |-- N_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true) |-- N_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true) |-- N_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true) |-- N_sum(`long_type_col`): long (nullable = true) |-- N_avg(`long_type_col`): double (nullable = true) |-- N_avg(`decimal_type_col`): decimal(38,4) (nullable = true) |-- R_count(`string_type_col`): long (nullable = true) |-- R_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true) |-- R_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true) |-- R_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true) |-- R_sum(`long_type_col`): long (nullable = true) |-- R_avg(`long_type_col`): double (nullable = true) |-- R_avg(`decimal_type_col`): decimal(38,4) (nullable = true) ====================== My requirement is: ====================== to rename the system generated column names such as Y_count(`string_type_col`), N_avg(`decimal_type_col`) etc to a user defined name based on a mapping. I I need to be able to do this programatically given a mapping of the form: (pivotvalue + aggregationfunction) --> (requiredcolumnname) =================== My question is : =================== Can i rely on the order of the output columns generated? The order looks to confirm to this pattern PivotValue1-aggregationfunction1 PivotValue1-aggregationfunction2 .... PivotValue1-aggregationfunctionN PivotValue2-aggregationfunction1 PivotValue2-aggregationfunction2 .. Is this order standard across spark versions 2+ . ? Is this subject to change or not reliable from a user point of view. ? If not reliable , is there another way by which I can logically/programatically identify that a column such as R_sum(CAST(`integer_type_col` AS BIGINT)) corresponds to the input pivot value "R" and aggregation function of sum(`integer_type_col`) Thanks Manohar