The most efficient way to determine the number of columns would be to do a take(1) and split in the driver.
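
A rough sketch of the idea follows, assuming the rows are (ID, STATE) pairs as in your example. It uses a plain Python list in place of the RDD so it runs without a cluster, and the function names (split_states, count_columns_first_row, count_columns_all_rows, widen) are made up for illustration. One caveat: take(1) only looks at the first row, so if rows can have different numbers of states, as in your sample, a maximum over all rows is probably the safer count.

```python
# Plain-Python stand-in for the RDD from the question: (ID, STATE) pairs.
rows = [(1, "TX, NY, FL"), (2, "CA, OH")]


def split_states(state):
    """Split a comma-separated STATE value into trimmed parts."""
    return [part.strip() for part in state.split(",")]


def count_columns_first_row(rows):
    """take(1)-style count: inspect only the first row, on the driver."""
    _, first_state = rows[0]  # with PySpark: rdd.take(1)[0]
    return len(split_states(first_state))


def count_columns_all_rows(rows):
    """Count for rows of differing length: maximum over every row."""
    # with PySpark: rdd.map(lambda r: len(split_states(r[1]))).max()
    return max(len(split_states(state)) for _, state in rows)


def widen(rows, n_cols):
    """Pad every row to n_cols values so all rows share one schema."""
    header = ["ID"] + [f"STATE_{i}" for i in range(1, n_cols + 1)]
    body = []
    for row_id, state in rows:
        parts = split_states(state)
        # Missing trailing states are filled with None.
        body.append([row_id] + parts + [None] * (n_cols - len(parts)))
    return header, body


n_cols = count_columns_all_rows(rows)
header, body = widen(rows, n_cols)
print(header)
for line in body:
    print(line)
```

Once the count is known on the driver, the list of new column names (STATE_1 .. STATE_n) follows directly from it, so it does not need to be carried along in every row.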

Regards
Sab

On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com> wrote:

> Hi,
>
> what is the most efficient way to split columns and know how many columns
> are created.
>
> Here is the current RDD
> -----------------
> ID   STATE
> -----------------
> 1    TX, NY, FL
> 2    CA, OH
> -----------------
>
> This is the preferred output:
> -------------------------
> ID   STATE_1   STATE_2   STATE_3
> -------------------------
> 1    TX        NY        FL
> 2    CA        OH
> -------------------------
>
> With a separated with the new columns STATE_1, STATE_2, STATE_3
>
>
> It looks like the following output is feasible using a ReduceBy operator
> -------------------------
> ID   STATE_1   STATE_2   STATE_3   NEW_COLUMNS
> -------------------------
> 1    TX        NY        FL        STATE_1, STATE_2, STATE_3
> 2    CA        OH                  STATE_1, STATE_2
> -------------------------
>
> Then in the reduce step, the distinct new columns can be calculated.
> Is it possible to get the second output where next to the RDD the
> new_columns are saved somewhere?
> Or is the required to use the second approach?
>
> thanks in advance,
> Richard