You say you use reduceByKey, but are you really collecting all of the
tuples for a vehicle into one collection, which is effectively what
groupByKey does anyway? If so, then yes, a vehicle with a huge amount
of data could make that fail.

Otherwise, perhaps you are simply not increasing executor memory from
the default.
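
For reference, a minimal sketch of raising that setting via SparkConf;
the 4g figure is only illustrative, and you can equivalently pass
--executor-memory to spark-submit:

    import org.apache.spark.SparkConf

    // Illustrative value only; size this to your cluster and data.
    val conf = new SparkConf()
      .setAppName("vehicle-tracks")  // hypothetical app name
      .set("spark.executor.memory", "4g")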

Maybe you can consider using something like vehicle and *day* as the
key. That means processing each day of data separately, but if that is
acceptable for you, it could drastically cut down the data associated
with a single key.
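
A minimal sketch of that keying, assuming a hypothetical Ping record
and an RDD[Ping] named pings, with timestamps in epoch milliseconds:

    import java.util.concurrent.TimeUnit

    // Hypothetical record type; adjust to your actual schema.
    case class Ping(vehicleId: String, timestamp: Long,
                    lat: Double, lon: Double)

    // Key by (vehicle, day) so that no single key carries more than
    // one day of one vehicle's data.
    val byVehicleDay = pings.map { p =>
      val day = TimeUnit.MILLISECONDS.toDays(p.timestamp)
      ((p.vehicleId, day), p)
    }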

Spark Streaming has windowing functions, and there is a sliding-window
function over an entire RDD, but I am not sure there is support for a
'window by key' anywhere. With the changes above, you can perhaps get
your direct approach of collecting events per key working.
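
For instance, building on the (vehicle, day) keying above, that direct
approach might look like the following. It still materializes one
key's events in memory at a time, which is exactly why shrinking each
key's scope matters:

    import org.apache.spark.SparkContext._  // pair-RDD implicits

    // Collect each key's pings, sort them by time, and pair each ping
    // with its predecessor, i.e. the per-key equivalent of LAG(1).
    val segments = byVehicleDay
      .groupByKey()
      .flatMapValues { ps =>
        ps.toSeq.sortBy(_.timestamp)
          .sliding(2)
          .collect { case Seq(prev, curr) => (prev, curr) }
      }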

Otherwise I think you have to roll your own to some extent, creating
overlapping buckets of data, which means mapping each record to
several copies of itself. That might still be quite feasible,
depending on how large a lag you have in mind.
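
A rough sketch of that, reusing the Ping/pings names from above and
assuming hour-sized buckets (pick the bucket size larger than the
biggest gap you care about). Each ping is copied into its own bucket
and the next one, so consecutive pings that straddle a bucket boundary
still meet in one bucket:

    import java.util.concurrent.TimeUnit

    val bucketMillis = TimeUnit.HOURS.toMillis(1)  // assumed size

    // Copy each ping into its own bucket and the following one.
    val overlapped = pings.flatMap { p =>
      val bucket = p.timestamp / bucketMillis
      Seq(((p.vehicleId, bucket), p),
          ((p.vehicleId, bucket + 1), p))
    }
    // Pairs that fall wholly inside one bucket will be produced twice
    // downstream, so deduplicate after pairing.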

PS for the interested, this is what LAG is:
http://www.oracle-base.com/articles/misc/lag-lead-analytic-functions.php#lag

On Wed, Oct 15, 2014 at 1:37 AM, Manas Kar <manasdebashis...@gmail.com> wrote:
> Hi,
>  I have an RDD containing Vehicle Number, timestamp, Position.
>  I want to get the "lag" function equivalent to my RDD to be able to create
> track segment of each Vehicle.
>
> Any help?
>
> PS: I have tried reduceByKey and then splitting the List of position in
> tuples. For me it runs out of memory every time because of the volume of
> data.
>
> ...Manas
>
> For some reason I have never got any reply to my emails to the user group. I
> am hoping to break that trend this time. :)
