Hi Ted

Here is my use case.

I have a prediction algorithm where I need to update some records in order
to predict the target.

For example, given an equation Y = mX + c, I need to change the value of
Xi for some records and calculate sum(Yi); if the predicted value is not
close to the target value, the process is repeated.

In each iteration a different set of values is updated, but the result is
checked only after summing up the values.
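To make the loop concrete, here is a minimal sketch in plain Python (outside Spark, so it stays self-contained). The function names and the particular update rule for Xi are illustrative assumptions, not part of the actual algorithm; the point is the shape of the iteration: update some Xi, recompute sum(Yi), compare against the target, repeat.

```python
def predict(records, m, c):
    """Compute Yi = m * Xi + c for every record and return sum(Yi)."""
    return sum(m * x + c for x in records)

def iterate_until_close(records, m, c, target, tolerance, max_iters=100):
    """Repeatedly update the Xi values until sum(Yi) is close to target."""
    records = list(records)
    for _ in range(max_iters):
        total = predict(records, m, c)
        if abs(total - target) <= tolerance:
            return records, total
        # Update the Xi of some records. Here every Xi is nudged toward
        # the target by the same amount -- an assumed update rule chosen
        # purely for illustration.
        step = (target - total) / (m * len(records))
        records = [x + step for x in records]
    return records, predict(records, m, c)
```

For instance, with m = 2, c = 1, records [1, 2, 3] and target 20, the initial sum(Yi) is 15, and one update step brings it to the target. In Spark this loop would produce a new RDD or DataFrame per iteration rather than mutating one in place.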

On Sat, 7 May 2016, 8:58 a.m. Ted Yu, <yuzhih...@gmail.com> wrote:

> Using RDDs requires some 'low level' optimization techniques.
> While using dataframes / Spark SQL allows you to leverage existing code.
>
> If you can share some more of your use case, that would help other people
> provide suggestions.
>
> Thanks
>
> On May 6, 2016, at 6:57 PM, HARSH TAKKAR <takkarha...@gmail.com> wrote:
>
> Hi Ted
>
> I am aware that RDDs are immutable, but in my use case I need to update
> the same data set after each iteration.
>
> These are the options I was exploring:
>
> 1. Generating a new RDD in each iteration (this might use a lot of memory).
>
> 2. Using Hive tables and updating the same table after each iteration.
>
> Please suggest which of the methods listed above would be good to use,
> or whether there is a better way to accomplish this.
>
> On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yuzhih...@gmail.com> wrote:
>
>> Please see the doc at the beginning of RDD class:
>>
>>  * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
>> Represents an immutable,
>>  * partitioned collection of elements that can be operated on in
>> parallel. This class contains the
>>  * basic operations available on all RDDs, such as `map`, `filter`, and
>> `persist`. In addition,
>>
>> On Fri, May 6, 2016 at 5:25 AM, HARSH TAKKAR <takkarha...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> Is there a way I can modify an RDD in a for-each loop?
>>>
>>> Basically, I have a use case in which I need to perform multiple
>>> iterations over the data and modify a few values in each iteration.
>>>
>>>
>>> Please help.
>>>
>>
>>
