Edit 2: the filter should be a map:

val numColumns = separatedInputStrings.map { case (id, (stateList, numStates)) => numStates }.reduce(math.max)

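Putting the pieces together, a minimal sketch of the split / count / pad steps might look like the following. It uses plain Scala collections so it runs without a cluster; on an RDD[(Int, String)] the first step could be written with mapValues as suggested in the thread. The names input, separated, header and rows are illustrative, and padding short rows with empty strings is an assumption about how the STATE_n table would be filled in.

```scala
// Plain-Scala stand-in for the RDD[(Int, String)] discussed in the thread.
val input: Seq[(Int, String)] = Seq(
  (1, "TX, NY, FL"),
  (2, "CA, OH")
)

// Step 1: split each STATE string into trimmed tokens and keep the token count.
val separated: Seq[(Int, (Array[String], Int))] = input.map { case (id, states) =>
  val stateList = states.split(",").map(_.trim)
  (id, (stateList, stateList.length))
}

// Step 2: the widest row decides how many STATE_n columns the table needs.
val numColumns: Int =
  separated.map { case (_, (_, numStates)) => numStates }.reduce(_ max _)

// Step 3 (assumption): pad shorter rows with empty strings so that every row
// has exactly numColumns state fields after the ID.
val header: Seq[String] = "ID" +: (1 to numColumns).map(i => s"STATE_$i")
val rows: Seq[Seq[String]] = separated.map { case (id, (stateList, _)) =>
  id.toString +: stateList.toSeq.padTo(numColumns, "")
}
```

Because the maximum is taken over every row, a later row with more states than the first one still widens the table, which is the case take(1) would miss.
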
On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman <daniel.imber...@gmail.com> wrote:

> edit: Mistake in the second code example
>
> val numColumns = separatedInputStrings.filter{ case(id, (stateList, numStates)) => numStates}.reduce(math.max)
>
> On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <daniel.imber...@gmail.com> wrote:
>
>> Hi Richard,
>>
>> If I understand the question correctly, it sounds like you could probably
>> do this using mapValues. (I'm assuming that you want two pieces of
>> information out of all rows: the states as individual items, and the
>> number of states in the row.)
>>
>> // input: RDD[(Int, String)], e.g. (1, "TX,NV,WY")
>> val separatedInputStrings = input.mapValues { inputString =>
>>   val stringList = inputString.split(",")
>>   (stringList, stringList.size)
>> }
>>
>> If you then wanted to find out how many state columns you should have in
>> your table, you could use a normal reduce (with a filter beforehand to
>> reduce how much data you are shuffling):
>>
>> val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)
>>
>> I hope this helps!
>>
>> On Tue, Jan 19, 2016 at 8:05 AM Richard Siebeling <rsiebel...@gmail.com> wrote:
>>
>>> That's true, and that's the way we're doing it now, but then we're only
>>> using the first row to determine the number of split columns.
>>> It could be that in the second (or last) row there are 10 new columns,
>>> and we'd like to know that too.
>>>
>>> Probably a reduceBy operator can be used to do that, but I'm hoping that
>>> there is a better or another way.
>>>
>>> thanks,
>>> Richard
>>>
>>> On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>>
>>>> The most efficient way to determine the number of columns would be to
>>>> do a take(1) and split in the driver.
>>>>
>>>> Regards
>>>> Sab
>>>>
>>>> On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What is the most efficient way to split columns and know how many
>>>>> columns are created?
>>>>>
>>>>> Here is the current RDD:
>>>>> -----------------
>>>>> ID  STATE
>>>>> -----------------
>>>>> 1   TX, NY, FL
>>>>> 2   CA, OH
>>>>> -----------------
>>>>>
>>>>> This is the preferred output:
>>>>> ---------------------------------
>>>>> ID  STATE_1  STATE_2  STATE_3
>>>>> ---------------------------------
>>>>> 1   TX       NY       FL
>>>>> 2   CA       OH
>>>>> ---------------------------------
>>>>>
>>>>> With the new columns STATE_1, STATE_2, STATE_3 recorded separately.
>>>>>
>>>>> It looks like the following output is feasible using a reduceBy
>>>>> operator:
>>>>> --------------------------------------------------------------
>>>>> ID  STATE_1  STATE_2  STATE_3  NEW_COLUMNS
>>>>> --------------------------------------------------------------
>>>>> 1   TX       NY       FL       STATE_1, STATE_2, STATE_3
>>>>> 2   CA       OH                STATE_1, STATE_2
>>>>> --------------------------------------------------------------
>>>>>
>>>>> Then in the reduce step, the distinct new columns can be calculated.
>>>>> Is it possible to get the second output where next to the RDD the
>>>>> new_columns are saved somewhere?
>>>>> Or is it required to use the second approach?
>>>>>
>>>>> thanks in advance,
>>>>> Richard