Look at mapPartitions. Whereas map turns one value V1 into one value
V2, mapPartitions lets you turn one entire Iterator[V1] into one whole
Iterator[V2]. The function that does so can perform some
initialization at its start, then process all of the values, and
clean up at its end. This is how you mimic a Mapper, really.
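
For example, here is a rough Scala sketch of that pattern applied to the
DB-enrichment case you describe below. jdbcUrl and lookupAndEnrich are
placeholders for your own connection string and lookup logic, not anything
Spark provides:

  import java.sql.{Connection, DriverManager}
  import org.apache.spark.rdd.RDD

  def enrichWithDb(input: RDD[String], jdbcUrl: String): RDD[String] =
    input.mapPartitions { records =>
      // "init": open one connection per partition, not per record
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        // process: enrich every record in the partition
        // (toList forces the lookups to run before close() below;
        //  wrap the iterator instead if a partition is too big for that)
        records.map(r => lookupAndEnrich(conn, r)).toList.iterator
      } finally {
        conn.close()  // "close": clean up once per partition
      }
    }

  // placeholder for whatever per-record query / enrichment you need
  def lookupAndEnrich(conn: Connection, record: String): String = record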

The most literal translation of Hadoop MapReduce I can think of is:

Mapper: mapPartitions to turn many (K1,V1) into many (K2,V2)
(shuffle) groupByKey to turn that into (K2,Iterator[V2])
Reducer: mapPartitions to turn many (K2,Iterator[V2]) into many (K3,V3)

It's not necessarily optimal to do it this way -- especially the
groupByKey bit. You have more expressive power here and need not fit
it into this paradigm. But yes, you can get the same effect as in
MapReduce, mostly from mapPartitions.
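
To make the three steps concrete, here is a minimal spark-shell style
word-count sketch of that literal translation, assuming an existing
SparkContext named sc and your own input path. (Note that in Spark 1.x
groupByKey actually yields (K2, Iterable[V2]).)

  // assumes a SparkContext named sc and a real input path
  val lines = sc.textFile("/path/to/input")

  // "Mapper": mapPartitions turning lines into (word, 1) pairs
  val mapped = lines.mapPartitions { iter =>
    iter.flatMap(_.split("\\s+")).map(word => (word, 1))
  }

  // the shuffle: (word, Iterable(1, 1, ...))
  val grouped = mapped.groupByKey()

  // "Reducer": mapPartitions turning (word, counts) into (word, sum)
  val counts = grouped.mapPartitions { iter =>
    iter.map { case (word, ones) => (word, ones.sum) }
  }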

On Sat, Jul 26, 2014 at 8:52 AM, Yosi Botzer <yosi.bot...@gmail.com> wrote:
> Thank you, but that doesn't answer my general question.
>
> I might need to enrich my records using different datasources (or DB's)
>
> So the general use case I need to support is to have some kind of Function
> that has init() logic for creating a connection to a DB, query the DB for each
> record and enrich my input record with stuff from the DB, and use some kind
> of close() logic to close the connection.
>
> I have implemented this kind of use case using Map/Reduce and I want to know
> how I can do it with Spark.
