Re: Spark Function setup and cleanup

2014-07-26 Thread Sean Owen
Look at mapPartitions. Whereas map turns one value V1 into one value
V2, mapPartitions lets you turn one entire Iterator[V1] into one whole
Iterator[V2]. The function that does so can perform some
initialization at its start, then process all of the values, and
clean up at its end. This is how you mimic a Mapper, really.
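
For example, a minimal Scala sketch of that pattern could look like the one
below. Nothing here is a specific Spark API beyond mapPartitions itself;
createConnection() and enrich() are hypothetical stand-ins for your own
setup and per-record logic.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: open one connection per partition, enrich every record with it,
// and close the connection once the whole partition has been processed.
def withConnection[A, B: ClassTag](records: RDD[A])(
    createConnection: () => java.sql.Connection,
    enrich: (A, java.sql.Connection) => B): RDD[B] =
  records.mapPartitions { iter =>
    val conn = createConnection()                       // "setup()": once per partition
    val out = iter.map(record => enrich(record, conn)).toList
    // toList forces the lazy iterator so every record is processed before
    // the connection is closed; it does buffer the partition in memory.
    conn.close()                                        // "cleanup()": once per partition
    out.iterator
  }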

The most literal translation of Hadoop MapReduce I can think of is:

Mapper: mapPartitions to turn many (K1,V1) into many (K2,V2)
Shuffle: groupByKey to turn that into (K2,Iterator[V2])
Reducer: mapPartitions to turn many (K2,Iterator[V2]) into many (K3,V3)

It's not necessarily optimal to do it this way -- especially the
groupByKey bit. You have more expressive power here and need not fit
it into this paradigm. But yes, you can get the same effect as in
MapReduce, mostly from mapPartitions.
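
To make the shape above concrete, here is a small word-count example written
as that literal Mapper / shuffle / Reducer translation. It's for illustration
only, not the idiomatic way to count words in Spark, and it assumes nothing
beyond the standard RDD API.

import org.apache.spark.SparkContext._   // pair-RDD functions such as groupByKey (Spark 1.x)
import org.apache.spark.rdd.RDD

// Word count written in the literal Mapper / shuffle / Reducer shape.
def wordCount(lines: RDD[String]): RDD[(String, Int)] = {
  // "Mapper": one partition's Iterator[String] -> Iterator[(String, Int)]
  val mapped = lines.mapPartitions(iter =>
    iter.flatMap(line => line.split("\\s+").map(word => (word, 1))))

  // shuffle: groupByKey gathers every value emitted for a word
  val grouped = mapped.groupByKey()

  // "Reducer": collapse each word's values into a single count
  grouped.mapPartitions(iter =>
    iter.map { case (word, ones) => (word, ones.sum) })
}

In practice reduceByKey does the same job more efficiently, which is part of
why the groupByKey route above is not optimal.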

On Sat, Jul 26, 2014 at 8:52 AM, Yosi Botzer  wrote:
> Thank you, but that doesn't answer my general question.
>
> I might need to enrich my records using different data sources (or DBs).
>
> So the general use case I need to support is to have some kind of Function
> that has init() logic for creating a connection to the DB, queries the DB for
> each record and enriches my input record with stuff from the DB, and uses
> some kind of close() logic to close the connection.
>
> I have implemented this kind of use case using Map/Reduce and I want to know
> how I can do it with Spark.


Re: Spark Function setup and cleanup

2014-07-26 Thread Yosi Botzer
Thank you, but that doesn't answer my general question.

I might need to enrich my records using different data sources (or DBs).

So the general use case I need to support is to have some kind of Function
that has init() logic for creating a connection to the DB, queries the DB for
each record and enriches my input record with stuff from the DB, and uses
some kind of close() logic to close the connection.

I have implemented this kind of use case using Map/Reduce and I want to
know how I can do it with Spark.

Thanks


On Fri, Jul 25, 2014 at 6:24 AM, Yanbo Liang  wrote:

> You can refer to this topic:
> http://www.mapr.com/developercentral/code/loading-hbase-tables-spark
>
>
> 2014-07-24 22:32 GMT+08:00 Yosi Botzer :
>
> In my case I want to reach HBase. For every record with a userId I want to
>> get some extra information about the user and add it to the result record for
>> further processing.
>>
>>
>> On Thu, Jul 24, 2014 at 9:11 AM, Yanbo Liang 
>> wrote:
>>
>>> If you want to connect to a DB in your program, you can use JdbcRDD (
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
>>> )
>>>
>>>
>>> 2014-07-24 18:32 GMT+08:00 Yosi Botzer :
>>>
>>> Hi,

 I am using the Java API of Spark.

 I wanted to know if there is a way to run some code in a manner that is
 like the setup() and cleanup() methods of Hadoop Map/Reduce.

 The reason I need it is because I want to read something from the DB
 according to each record I scan in my Function, and I would like to open
 the DB connection only once (and close it only once).

 Thanks

>>>
>>>
>>
>


Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
You can refer to this topic:
http://www.mapr.com/developercentral/code/loading-hbase-tables-spark
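
That article covers loading HBase tables from Spark. For the per-record userId
lookup described below, a rough sketch of the same setup/cleanup idea with
mapPartitions might look like the following; the table name, column family and
qualifier are invented, and this uses the old HTable client API.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical enrichment: look up each userId in an HBase "users" table,
// opening one HTable per partition and closing it when the partition is done.
def enrichFromHBase(records: RDD[(String, String)]): RDD[(String, String, String)] =
  records.mapPartitions { iter =>
    // HBaseConfiguration.create() picks up hbase-site.xml from the classpath
    val table = new HTable(HBaseConfiguration.create(), "users")   // setup, once per partition
    val out = iter.map { case (userId, payload) =>
      val result = table.get(new Get(Bytes.toBytes(userId)))
      val extra = Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email")))
      (userId, payload, extra)
    }.toList          // force evaluation before the table is closed
    table.close()     // cleanup, once per partition
    out.iterator
  }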


2014-07-24 22:32 GMT+08:00 Yosi Botzer :

> In my case I want to reach HBase. For every record with a userId I want to
> get some extra information about the user and add it to the result record for
> further processing.
>
>
> On Thu, Jul 24, 2014 at 9:11 AM, Yanbo Liang  wrote:
>
>> If you want to connect to a DB in your program, you can use JdbcRDD (
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
>> )
>>
>>
>> 2014-07-24 18:32 GMT+08:00 Yosi Botzer :
>>
>> Hi,
>>>
>>> I am using the Java API of Spark.
>>>
>>> I wanted to know if there is a way to run some code in a manner that is
>>> like the setup() and cleanup() methods of Hadoop Map/Reduce.
>>>
>>> The reason I need it is because I want to read something from the DB
>>> according to each record I scan in my Function, and I would like to open
>>> the DB connection only once (and close it only once).
>>>
>>> Thanks
>>>
>>
>>
>


Re: Spark Function setup and cleanup

2014-07-24 Thread Yosi Botzer
In my case I want to reach HBase. For every record with a userId I want to
get some extra information about the user and add it to the result record for
further processing.


On Thu, Jul 24, 2014 at 9:11 AM, Yanbo Liang  wrote:

> If you want to connect to a DB in your program, you can use JdbcRDD (
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
> )
>
>
> 2014-07-24 18:32 GMT+08:00 Yosi Botzer :
>
> Hi,
>>
>> I am using the Java API of Spark.
>>
>> I wanted to know if there is a way to run some code in a manner that is
>> like the setup() and cleanup() methods of Hadoop Map/Reduce.
>>
>> The reason I need it is because I want to read something from the DB
>> according to each record I scan in my Function, and I would like to open
>> the DB connection only once (and close it only once).
>>
>> Thanks
>>
>
>


Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
If you want to connect to a DB in your program, you can use JdbcRDD (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
)
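
For reference, a minimal sketch of constructing a JdbcRDD is below; the JDBC
URL, credentials, table and key range are made up. Note that the SQL must
contain two '?' placeholders, which Spark fills with each partition's lower
and upper bound.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Hypothetical example: load (id, name) pairs from a "users" table over JDBC.
def loadUsers(sc: SparkContext): JdbcRDD[(Long, String)] =
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "secret"),
    "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
    1L,          // lowerBound of the key range
    1000000L,    // upperBound of the key range
    10,          // number of partitions
    (rs: ResultSet) => (rs.getLong("id"), rs.getString("name")))

The resulting RDD could then be joined against your records on userId, as an
alternative to opening a connection per record.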


2014-07-24 18:32 GMT+08:00 Yosi Botzer :

> Hi,
>
> I am using the Java API of Spark.
>
> I wanted to know if there is a way to run some code in a manner that is
> like the setup() and cleanup() methods of Hadoop Map/Reduce.
>
> The reason I need it is because I want to read something from the DB
> according to each record I scan in my Function, and I would like to open
> the DB connection only once (and close it only once).
>
> Thanks
>