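edit: Mistake in the second code example. The filter should be a map that
extracts the number of states from each row before reducing:

val numColumns = separatedInputStrings.map { case (id, (stateList, numStates)) => numStates }.reduce(math.max(_, _))
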
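To make the whole flow concrete, here is a minimal sketch that splits the states, finds the widest row, and pads every row to that width. It uses plain Scala collections in place of the RDD so it runs without a Spark context; that substitution is an assumption on my part, and on an RDD[(Int, String)] the same steps would be mapValues/map followed by reduce. The names SplitStates, newColumns and rows are only illustrative.

```scala
// Plain Scala collections stand in for the RDD here (assumption); the
// map and reduce steps have direct equivalents on RDD[(Int, String)].
object SplitStates {
  // Sample rows taken from the question: (ID, comma-separated states).
  val input: Seq[(Int, String)] = Seq((1, "TX, NY, FL"), (2, "CA, OH"))

  // Split each STATE value and keep the list together with its length.
  val separated: Seq[(Int, (Array[String], Int))] = input.map { case (id, states) =>
    val stateList = states.split(",").map(_.trim)
    (id, (stateList, stateList.length))
  }

  // The widest row decides how many STATE_n columns the table needs.
  val numColumns: Int =
    separated.map { case (_, (_, n)) => n }.reduce((a, b) => math.max(a, b))

  // Column names STATE_1 .. STATE_n, kept next to the data.
  val newColumns: Seq[String] = (1 to numColumns).map(i => s"STATE_$i")

  // Pad shorter rows with empty strings so every row has numColumns cells.
  val rows: Seq[(Int, Seq[String])] = separated.map { case (id, (stateList, _)) =>
    (id, stateList.toSeq.padTo(numColumns, ""))
  }

  def main(args: Array[String]): Unit = {
    println(("ID" +: newColumns).mkString("\t"))
    rows.foreach { case (id, cells) => println((id.toString +: cells).mkString("\t")) }
  }
}
```

Since numColumns comes from a reduce over all rows, a later row with more states than the first one is still picked up, which is the case the take(1) approach misses.
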
On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <daniel.imber...@gmail.com> wrote:

> Hi Richard,
>
> If I understand the question correctly it sounds like you could probably
> do this using mapValues (I'm assuming that you want two pieces of
> information out of all rows, the states as individual items, and the number
> of states in the row)
>
> // input: RDD[(Int, String)], each value looks like "TX,NV,WY"
> val separatedInputStrings = input.mapValues { inputString =>
>   val stringList = inputString.split(",")
>   (stringList, stringList.size)
> }
>
> If you then wanted to find out how many state columns you should have in
> your table you could use a normal reduce (with a filter beforehand to
> reduce how much data you are shuffling)
>
> val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)
>
> I hope this helps!
>
> On Tue, Jan 19, 2016 at 8:05 AM Richard Siebeling <rsiebel...@gmail.com>
> wrote:
>
>> that's true and that's the way we're doing it now, but then we're only
>> using the first row to determine the number of split columns.
>> It could be that in the second (or last) row there are 10 new columns and
>> we'd like to know that too.
>>
>> Probably a reduceBy operator can be used to do that, but I'm hoping that
>> there is a better or another way,
>>
>> thanks,
>> Richard
>>
>> On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> The most efficient way to determine the number of columns would be to
>>> do a take(1) and split in the driver.
>>>
>>> Regards
>>> Sab
>>> On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> what is the most efficient way to split columns and know how many
>>>> columns are created?
>>>>
>>>> Here is the current RDD
>>>> -----------------
>>>> ID   STATE
>>>> -----------------
>>>> 1    TX, NY, FL
>>>> 2    CA, OH
>>>> -----------------
>>>>
>>>> This is the preferred output:
>>>> ---------------------------------
>>>> ID   STATE_1   STATE_2   STATE_3
>>>> ---------------------------------
>>>> 1    TX        NY        FL
>>>> 2    CA        OH
>>>> ---------------------------------
>>>>
>>>> With a separate list of the new columns STATE_1, STATE_2, STATE_3
>>>>
>>>> It looks like the following output is feasible using a ReduceBy operator
>>>> --------------------------------------------------------------
>>>> ID   STATE_1   STATE_2   STATE_3   NEW_COLUMNS
>>>> --------------------------------------------------------------
>>>> 1    TX        NY        FL        STATE_1, STATE_2, STATE_3
>>>> 2    CA        OH                  STATE_1, STATE_2
>>>> --------------------------------------------------------------
>>>>
>>>> Then in the reduce step, the distinct new columns can be calculated.
>>>> Is it possible to get the second output where next to the RDD the
>>>> new_columns are saved somewhere?
>>>> Or is it required to use the second approach?
>>>>
>>>> thanks in advance,
>>>> Richard