I am doing a pivot transformation on an input dataset
Following input schema
=
|-- c_salutation: string (nullable = true)
|-- c_preferred_cust_flag: string (nullable = true)
|-- integer_type_col: integer (nullable = false)
|-- long_type_col: long (nullable = false)
|-- string_type_col: string (nullable = true)
|-- decimal_type_col: decimal(38,0) (nullable = true)
My pivot column is c_preferred_cust_flag , pivot values is "Y","N","R"
and group by column is c_salutation
I am using the api * pivot(String pivotColumn,*
* java.util.List values) *
*on RelationalGroupedDataset*
My aggregation functions after this pivot is
===
count(`string_type_col`) ,sum(`string_type_col`) ,sum(`integer_type_col`)
,avg(`integer_type_col`)
,sum(`long_type_col`) ,avg(`long_type_col`) ,avg(`decimal_type_col`)
===
My output dataset schema after the groupby.pivot.agg()
is
|-- c_salutation: string (nullable = true)
|-- Y_count(`string_type_col`): long (nullable = true)
|-- Y_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
|-- Y_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
|-- Y_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
|-- Y_sum(`long_type_col`): long (nullable = true)
|-- Y_avg(`long_type_col`): double (nullable = true)
|-- Y_avg(`decimal_type_col`): decimal(38,4) (nullable = true)
|-- N_count(`string_type_col`): long (nullable = true)
|-- N_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
|-- N_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
|-- N_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
|-- N_sum(`long_type_col`): long (nullable = true)
|-- N_avg(`long_type_col`): double (nullable = true)
|-- N_avg(`decimal_type_col`): decimal(38,4) (nullable = true)
|-- R_count(`string_type_col`): long (nullable = true)
|-- R_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
|-- R_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
|-- R_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
|-- R_sum(`long_type_col`): long (nullable = true)
|-- R_avg(`long_type_col`): double (nullable = true)
|-- R_avg(`decimal_type_col`): decimal(38,4) (nullable = true)
==
My requirement is:
==
to rename the system generated column names such as
Y_count(`string_type_col`), N_avg(`decimal_type_col`)
etc to a user defined name based on a mapping. I
I need to be able to do this programatically given a mapping of the form:
(pivotvalue + aggregationfunction) --> (requiredcolumnname)
===
My question is :
===
Can i rely on the order of the output columns generated?
The order looks to confirm to this pattern
PivotValue1-aggregationfunction1
PivotValue1-aggregationfunction2
PivotValue1-aggregationfunctionN
PivotValue2-aggregationfunction1
PivotValue2-aggregationfunction2
..
Is this order standard across spark versions 2+ . ?
Is this subject to change or not reliable from a user point of view. ?
If not reliable , is there another way by which I can
logically/programatically
identify that a column such as R_sum(CAST(`integer_type_col` AS
BIGINT))
corresponds to the input pivot value "R" and aggregation function of
sum(`integer_type_col`)
Thanks
Manohar