Hi, what is the most efficient way to split a column into multiple columns and find out how many new columns were created?
Here is the current RDD:

-----------------
ID   STATE
-----------------
1    TX, NY, FL
2    CA, OH
-----------------

This is the preferred output:

-------------------------
ID   STATE_1   STATE_2   STATE_3
-------------------------
1    TX        NY        FL
2    CA        OH
-------------------------

together with a separate list of the new columns: STATE_1, STATE_2, STATE_3

It looks like the following output is feasible using a ReduceBy operator:

-------------------------
ID   STATE_1   STATE_2   STATE_3   NEW_COLUMNS
-------------------------
1    TX        NY        FL        STATE_1, STATE_2, STATE_3
2    CA        OH                  STATE_1, STATE_2
-------------------------

Then, in the reduce step, the distinct new columns can be calculated.

Is it possible to get the first output, where the new column names are saved somewhere next to the RDD? Or is it required to use the second approach?

Thanks in advance,
Richard
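For illustration, the second approach could look roughly like the sketch below. It is plain Python rather than Spark code, so it runs without a cluster. The function names `split_states` and `widest` are hypothetical, and the sample rows are copied from the tables above. It assumes the new column names are purely positional (STATE_1, STATE_2, ...), so the longest per-row list already contains every distinct name.

```python
from functools import reduce

# Sample rows mirroring the question: (ID, comma-separated STATE string).
rows = [(1, "TX, NY, FL"), (2, "CA, OH")]

def split_states(row, prefix="STATE"):
    """Map step: split the STATE string and record the column names this row produces."""
    row_id, states = row
    values = [s.strip() for s in states.split(",") if s.strip()]
    columns = [f"{prefix}_{i}" for i in range(1, len(values) + 1)]
    return row_id, values, columns

def widest(a, b):
    """Reduce step: keep the longer of two column-name lists.

    Assumption: names are positional, so the longest list is a superset of the others.
    """
    return a if len(a) >= len(b) else b

# In PySpark the same functions could be passed to rdd.map(...) and .reduce(...).
mapped = [split_states(r) for r in rows]
new_columns = reduce(widest, [cols for _, _, cols in mapped])

# Pad shorter rows with None so every row has the same number of columns.
table = []
for row_id, values, _ in mapped:
    padding = [None] * (len(new_columns) - len(values))
    table.append((row_id, *values, *padding))

print(new_columns)
print(table)
```

The column list is computed in one pass over the mapped rows and kept as a separate value next to the padded table, which is close to the first (preferred) output. Whether this is the most efficient option on a real RDD is not something the sketch can show.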