Hi,

What is the most efficient way to split one column into several columns
and find out how many columns were created?

Here is the current RDD:
-----------------
ID   STATE
-----------------
1       TX, NY, FL
2       CA, OH
-----------------

This is the preferred output:
-------------------------
ID    STATE_1     STATE_2      STATE_3
-------------------------
1     TX          NY           FL
2     CA          OH
-------------------------

Along with a separate list of the new columns: STATE_1, STATE_2, STATE_3
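
To make the preferred output concrete, here is a minimal sketch in plain
Python. Ordinary lists stand in for the RDD; my assumption is that the same
functions could be passed to the RDD's map and reduce, but I have not
checked how that performs on a cluster. The names split_states, max_cols,
new_columns, and wide_rows are only illustrative.

```python
from functools import reduce

# Sample rows mirroring the table above: (ID, comma-separated STATE string).
rows = [(1, "TX, NY, FL"), (2, "CA, OH")]

def split_states(row):
    """Split the STATE string into a list of trimmed values."""
    row_id, states = row
    return row_id, [s.strip() for s in states.split(",")]

# Map step: split every row.
split_rows = list(map(split_states, rows))

# Reduce step: the widest row determines how many columns are created.
max_cols = reduce(max, (len(states) for _, states in split_rows))

# The separate list of new column names.
new_columns = ["STATE_%d" % i for i in range(1, max_cols + 1)]

# Pad shorter rows with None so that every row has the same width.
wide_rows = [
    tuple([row_id] + states + [None] * (max_cols - len(states)))
    for row_id, states in split_rows
]
```

Here the column count comes from a single reduce over the row widths, and
the column names are kept in a plain list outside the data.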


It looks like the following output is feasible using a ReduceBy operator:
-------------------------
ID    STATE_1     STATE_2      STATE_3       NEW_COLUMNS
-------------------------
1     TX          NY           FL            STATE_1, STATE_2, STATE_3
2     CA          OH                         STATE_1, STATE_2
-------------------------

Then, in the reduce step, the distinct new columns can be calculated.
Is it possible to get the preferred output (the second table above), with
the new column names saved somewhere alongside the RDD?
Or is it required to use the second approach (the one with the extra
NEW_COLUMNS column)?
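
For reference, this is roughly how I picture the second approach, again as
a plain-Python sketch with lists standing in for the RDD. The helper names
with_new_columns and merge_names are hypothetical, and I am assuming the
merge function could be used as the reduce function on the RDD.

```python
from functools import reduce

# Sample rows mirroring the table above: (ID, comma-separated STATE string).
rows = [(1, "TX, NY, FL"), (2, "CA, OH")]

def with_new_columns(row):
    """Return (ID, state values, names of the columns this row fills)."""
    row_id, states = row
    values = [s.strip() for s in states.split(",")]
    names = ["STATE_%d" % i for i in range(1, len(values) + 1)]
    return row_id, values, names

# Map step: every row carries its own NEW_COLUMNS list.
tagged = [with_new_columns(r) for r in rows]

def merge_names(a, b):
    """Combine two name lists, keeping each name once, in column order."""
    return sorted(set(a) | set(b), key=lambda n: int(n.rsplit("_", 1)[1]))

# Reduce step: the distinct new columns across all rows.
distinct_columns = reduce(merge_names, (names for _, _, names in tagged))
```

The drawback I see is that every row repeats the column names, which is
why I would prefer to keep them outside the RDD if that is possible.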

Thanks in advance,
Richard
