Re: Split columns in RDD

Daniel Imberman Tue, 19 Jan 2016 08:20:58 -0800

edit: Mistake in the second code example

val numColumns = separatedInputStrings
    .map { case (id, (stateList, numStates)) => numStates }
    .reduce(math.max)
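
For reference, here is a rough end-to-end sketch of the idea: split each row, take the maximum count across all rows, then pad every row to that width. It is only an illustration under some assumptions. Plain Scala collections stand in for the RDD so that it runs without a cluster, and the names SplitColumns, separate, numColumns and pad are made up for this example. On an RDD[(Int, String)] the same map/reduce steps should apply, but reduce fails on an empty collection, so guard for that if the input can be empty.

```scala
object SplitColumns {
  // Split each comma-separated value; keep the pieces together with their count.
  def separate(rows: Seq[(Int, String)]): Seq[(Int, (Array[String], Int))] =
    rows.map { case (id, states) =>
      val stateList = states.split(",").map(_.trim)
      (id, (stateList, stateList.length))
    }

  // The widest row decides how many STATE_n columns are needed.
  // Note: reduce throws on an empty collection.
  def numColumns(separated: Seq[(Int, (Array[String], Int))]): Int =
    separated
      .map { case (_, (_, numStates)) => numStates }
      .reduce((a, b) => math.max(a, b))

  // Pad shorter rows with empty strings so every row has `width` columns.
  def pad(separated: Seq[(Int, (Array[String], Int))], width: Int): Seq[(Int, Seq[String])] =
    separated.map { case (id, (stateList, _)) =>
      (id, stateList.toSeq.padTo(width, ""))
    }

  def main(args: Array[String]): Unit = {
    // Sample rows taken from the original question.
    val input = Seq((1, "TX, NY, FL"), (2, "CA, OH"))
    val separated = separate(input)
    val width = numColumns(separated)

    val header = "ID" +: (1 to width).map(i => s"STATE_$i")
    println(header.mkString("\t"))
    pad(separated, width).foreach { case (id, cols) =>
      println((id.toString +: cols).mkString("\t"))
    }
  }
}
```

Unlike take(1), this looks at every row, so a later row with more states than the first is still counted.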



On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> Hi Richard,
>
> If I understand the question correctly it sounds like you could probably
> do this using mapValues (I'm assuming that you want two pieces of
> information out of all rows, the states as individual items, and the number
> of states in the row)
>
>
> val separatedInputStrings = input.mapValues { inputString =>
>     // input: RDD[(Int, String)]; inputString is e.g. "TX,NV,WY"
>     val stringList = inputString.split(",")
>     (stringList, stringList.size)
> }
>
> If you then wanted to find out how many state columns you should have in
> your table you could use a normal reduce (with a filter beforehand to
> reduce how much data you are shuffling)
>
> val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)
>
> I hope this helps!
>
>
>
> On Tue, Jan 19, 2016 at 8:05 AM Richard Siebeling <rsiebel...@gmail.com>
> wrote:
>
>> that's true and that's the way we're doing it now, but then we're only
>> using the first row to determine the number of split columns.
>> It could be that in the second (or last) row there are 10 new columns and
>> we'd like to know that too.
>>
>> Probably a reduceby operator can be used to do that, but I'm hoping that
>> there is a better or another way,
>>
>> thanks,
>> Richard
>>
>> On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> The most efficient way to determine the number of columns would be to do a
>>> take(1) and split in the driver.
>>>
>>> Regards
>>> Sab
>>> On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> what is the most efficient way to split columns and know how many
>>>> columns are created?
>>>>
>>>> Here is the current RDD
>>>> -----------------
>>>> ID   STATE
>>>> -----------------
>>>> 1       TX, NY, FL
>>>> 2       CA, OH
>>>> -----------------
>>>>
>>>> This is the preferred output:
>>>> -------------------------
>>>> ID    STATE_1     STATE_2      STATE_3
>>>> -------------------------
>>>> 1     TX              NY              FL
>>>> 2     CA              OH
>>>> -------------------------
>>>>
>>>> With the states separated into the new columns STATE_1, STATE_2, STATE_3
>>>>
>>>>
>>>> It looks like the following output is feasible using a ReduceBy operator
>>>> -------------------------
>>>> ID    STATE_1     STATE_2      STATE_3       NEW_COLUMNS
>>>> -------------------------
>>>> 1     TX                NY               FL            STATE_1, STATE_2, STATE_3
>>>> 2     CA                OH                             STATE_1, STATE_2
>>>> -------------------------
>>>>
>>>> Then in the reduce step, the distinct new columns can be calculated.
>>>> Is it possible to get the second output where next to the RDD the
>>>> new_columns are saved somewhere?
>>>> Or is it required to use the second approach?
>>>>
>>>> thanks in advance,
>>>> Richard
>>>>
>>>>
>>
