Re: Split columns in RDD

Daniel Imberman Tue, 19 Jan 2016 09:36:18 -0800

edit 2: filter should be map

val numColumns = separatedInputStrings.map{ case(id, (stateList,
numStates)) => numStates}.reduce(math.max)
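
For anyone reading this later, here is a minimal sketch of the whole flow (split, find the widest row, pad every row to that width). It uses a plain Scala Seq as a stand-in for the RDD so that it runs without a cluster. On an RDD, the same functions would be passed to mapValues, map and reduce as described in this thread. The names SplitColumns, splitStates and padRow are hypothetical, and trimming whitespace and padding with empty strings are assumptions based on the sample data in the original question.

```scala
// Hypothetical helper names; only the split / map / reduce steps come from the thread.
object SplitColumns {
  // Split a comma-separated state string and also return how many states it holds.
  // Trimming is an assumption: the sample data has a space after each comma.
  def splitStates(states: String): (Array[String], Int) = {
    val stateList = states.split(",").map(_.trim)
    (stateList, stateList.length)
  }

  // Pad a row with empty strings so every row has numColumns state columns.
  def padRow(stateList: Array[String], numColumns: Int): Array[String] =
    stateList.padTo(numColumns, "")

  def main(args: Array[String]): Unit = {
    // Stand-in for the RDD[(Int, String)] from the original question.
    val input = Seq((1, "TX, NY, FL"), (2, "CA, OH"))

    // On an RDD this would be: input.mapValues(splitStates)
    val separated = input.map { case (id, states) => (id, splitStates(states)) }

    // Same map as in the corrected example, followed by a max-reduce over all rows.
    val numColumns = separated
      .map { case (_, (_, numStates)) => numStates }
      .reduce(_ max _)

    // Build the STATE_1 .. STATE_n header and the padded rows.
    val header = "ID" +: (1 to numColumns).map(i => s"STATE_$i")
    val rows = separated.map { case (id, (stateList, _)) =>
      id.toString +: padRow(stateList, numColumns).toSeq
    }

    println(header.mkString("\t"))
    rows.foreach(r => println(r.mkString("\t")))
  }
}
```

Because numColumns comes from a reduce over every row, a row further down with more states than the first one is still accounted for, at the cost of one extra pass over the data.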


On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> edit: Mistake in the second code example
>
> val numColumns = separatedInputStrings.filter{ case(id, (stateList,
> numStates)) => numStates}.reduce(math.max)
>
>
> On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> Hi Richard,
>>
>> If I understand the question correctly, it sounds like you could probably
>> do this using mapValues. (I'm assuming that you want two pieces of
>> information out of each row: the states as individual items, and the
>> number of states in the row.)
>>
>>
>> // input: RDD[(Int, String)], e.g. (1, "TX,NV,WY")
>> val separatedInputStrings = input.mapValues { inputString =>
>>     val stringList = inputString.split(",")
>>     (stringList, stringList.size)
>> }
>>
>> If you then wanted to find out how many state columns you should have in
>> your table you could use a normal reduce (with a filter beforehand to
>> reduce how much data you are shuffling)
>>
>> val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)
>>
>> I hope this helps!
>>
>>
>>
>> On Tue, Jan 19, 2016 at 8:05 AM Richard Siebeling <rsiebel...@gmail.com>
>> wrote:
>>
>>> That's true, and that's the way we're doing it now, but then we're only
>>> using the first row to determine the number of split columns.
>>> It could be that in the second (or last) row there are 10 new columns
>>> and we'd like to know that too.
>>>
>>> Probably a reduceby operator can be used to do that, but I'm hoping that
>>> there is a better or another way,
>>>
>>> thanks,
>>> Richard
>>>
>>> On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
>>>> The most efficient way to determine the number of columns would be to
>>>> do a take(1) and split in the driver.
>>>>
>>>> Regards
>>>> Sab
>>>> On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> what is the most efficient way to split columns and know how many
>>>>> columns are created?
>>>>>
>>>>> Here is the current RDD
>>>>> -----------------
>>>>> ID   STATE
>>>>> -----------------
>>>>> 1       TX, NY, FL
>>>>> 2       CA, OH
>>>>> -----------------
>>>>>
>>>>> This is the preferred output:
>>>>> -------------------------
>>>>> ID    STATE_1     STATE_2      STATE_3
>>>>> -------------------------
>>>>> 1     TX              NY              FL
>>>>> 2     CA              OH
>>>>> -------------------------
>>>>>
>>>>> With a separate list of the new columns STATE_1, STATE_2, STATE_3
>>>>>
>>>>>
>>>>> It looks like the following output is feasible using a ReduceBy
>>>>> operator
>>>>> -------------------------
>>>>> ID    STATE_1     STATE_2      STATE_3       NEW_COLUMNS
>>>>> -------------------------
>>>>> 1     TX          NY           FL            STATE_1, STATE_2, STATE_3
>>>>> 2     CA          OH                         STATE_1, STATE_2
>>>>> -------------------------
>>>>>
>>>>> Then in the reduce step, the distinct new columns can be calculated.
>>>>> Is it possible to get the second output where next to the RDD the
>>>>> new_columns are saved somewhere?
>>>>> Or is it required to use the second approach?
>>>>>
>>>>> thanks in advance,
>>>>> Richard
>>>>>
>>>>>
>>>
