Hi Ted,

My branch used Spark 2.1.0 and I just updated to 2.1.1.

As discussed with Aviem, I should be able to create the pull request later 
today.

Regards
JB

On 05/03/2017 02:50 AM, Ted Yu wrote:
Spark 2.1.1 has been released.

Consider using the new release in this work.

Thanks

On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Cool for the PR merge, I will rebase my branch on it.

Thanks !
Regards
JB


On 03/29/2017 01:58 PM, Amit Sela wrote:

@Ted definitely makes sense.
@JB I'm merging https://github.com/apache/beam/pull/2354 soon so any
deprecated Spark API issues should be resolved.

On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yuzhih...@gmail.com> wrote:

This is what I did over HBASE-16179:

-        f.call((asJavaIterator(it), conn)).iterator()
+        // the return type is different in spark 1.x & 2.x, we handle both cases
+        f.call(asJavaIterator(it), conn) match {
+          // spark 1.x
+          case iterable: Iterable[R] => iterable.iterator()
+          // spark 2.x
+          case iterator: Iterator[R] => iterator
+        }
       )

FYI

On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <amitsel...@gmail.com> wrote:

Just tried to replace dependencies and see what happens:

Most required changes are about the runner using deprecated Spark APIs, and after fixing them the only real issue is with the Java API for Pair/FlatMapFunction, which changed its return value to Iterator (in 1.6 it's Iterable).

So I'm not sure that a profile that simply sets the dependency on 1.6.3/2.1.0 is feasible.
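To be concrete, this is roughly the user-facing change I mean, for FlatMapFunction (just a sketch with made-up variable names, not the runner's actual code):

    import java.util.Arrays;
    import java.util.Iterator;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // Spark 1.6: call(...) returns an Iterable
    FlatMapFunction<String, String> words16 = new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String line) {
        return Arrays.asList(line.split(" "));
      }
    };

    // Spark 2.x: the same interface now returns an Iterator, so
    // implementations written against 1.6 no longer compile
    FlatMapFunction<String, String> words2x = new FlatMapFunction<String, String>() {
      @Override
      public Iterator<String> call(String line) {
        return Arrays.asList(line.split(" ")).iterator();
      }
    };

Each snippet of course only compiles against its own Spark version, which is exactly the problem for a single source tree switched by a profile.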

On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <kobi.sal...@gmail.com>
wrote:

So, if everything is in place in Spark 2.X and we use provided dependencies for Spark in Beam, then theoretically you can run the same code in 2.X without any need for a branch?

2017-03-23 9:47 GMT+02:00 Amit Sela <amitsel...@gmail.com>:

If StreamingContext is valid and we don't have to use SparkSession, and Accumulators are valid as well and we don't need AccumulatorV2, I don't see a reason this shouldn't work (which means there are still tons of reasons this could break, but I can't think of them off the top of my head right now).
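For reference, the two accumulator APIs side by side (a rough sketch against Spark 2.x, where the old API is deprecated but still present; it assumes the runner only needs simple long counters, and the names are just illustrative):

    import org.apache.spark.Accumulator;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.util.LongAccumulator;

    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setMaster("local[2]").setAppName("accumulator-check"));

    // Spark 1.x style accumulator (deprecated in 2.x but still functional)
    Accumulator<Integer> oldCounter = jsc.accumulator(0);
    oldCounter.add(1);

    // Spark 2.x style, built on AccumulatorV2
    LongAccumulator newCounter = jsc.sc().longAccumulator("elements");
    newCounter.add(1L);

So code that sticks to the old-style accumulators should keep compiling on 2.x, just with deprecation warnings.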

@JB simply add a profile for the Spark dependencies and run the tests - you'll have a very definitive answer ;-).
If this passes, try on a cluster running Spark 2 as well.

Let me know if I can assist.

On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi guys,

Ismaël summarized well what I have in mind.

I'm a bit late on the PoC around that (I started a branch already).
I will move forward over the week end.

Regards
JB

On 03/22/2017 11:42 PM, Ismaël Mejía wrote:

Amit, I suppose JB is talking about the RDD based version, so no need to worry about SparkSession or different incompatible APIs.

Remember the idea we are discussing is to have in master both the spark 1 and spark 2 runners using the RDD based translation. At the same time we can have a feature branch to evolve the DataSet based translator (this one will replace the RDD based translator for spark 2 once it is mature).

The advantages have already been discussed, as well as the possible issues, so I think we have to see now if JB's idea is feasible and how hard it would be to live with this while the DataSet version evolves.


I think what we are trying to avoid is to have a long-living branch for a spark 2 runner based on RDD, because the maintenance burden would be even worse. We would have to fight not only with the double merge of fixes (in case the profile idea does not work), but also with the continued evolution of Beam, and we would end up in the long-living branch mess that other runners have dealt with (e.g. the Apex runner):

https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E


What do you think about this, Amit? Would you be ok to go with it if JB's profile idea proves to help with the maintenance issues?

Ismaël



On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:

The hbase-spark module doesn't use SparkSession, so the situation there is simpler :-)

On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsel...@gmail.com> wrote:


I'm still wondering how we'll do this - it's not just different implementations of the same Class, but completely different concepts, such as using SparkSession in Spark 2 instead of SparkContext/StreamingContext in Spark 1.
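To illustrate the difference I mean (a rough sketch against Spark 2.x, not what the runner actually does; names are made up):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Spark 1.x entry points the RDD-based translation is written against
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("beam-on-spark");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(1));

    // Spark 2.x entry point for the Dataset/DataFrame API
    // (SparkSession only exists in 2.x, so this block doesn't compile on 1.x)
    SparkSession session = SparkSession.builder()
        .master("local[2]")
        .appName("beam-on-spark")
        .getOrCreate();
    JavaSparkContext fromSession = JavaSparkContext.fromSparkContext(session.sparkContext());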

On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yuzhih...@gmail.com> wrote:


I have done some work over in HBASE-16179 where compatibility modules are created to isolate changes in the Spark 2.x API so that code in the hbase-spark module can be reused.
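The pattern is roughly a small compatibility interface implemented once per Spark version and selected at build time (a hypothetical sketch of the idea, not the actual HBASE-16179 code; all names here are made up and each class would live in its own module/file):

    import java.util.Iterator;

    // shared module: callers only compile against this interface
    public interface SparkIteratorCompat {
      <R> Iterator<R> toIterator(Object flatMapResult);
    }

    // spark2 compatibility module: FlatMapFunction.call already returns an Iterator
    public class Spark2IteratorCompat implements SparkIteratorCompat {
      @Override
      @SuppressWarnings("unchecked")
      public <R> Iterator<R> toIterator(Object flatMapResult) {
        return (Iterator<R>) flatMapResult;
      }
    }

    // spark1 compatibility module: call returns an Iterable, so adapt it
    public class Spark1IteratorCompat implements SparkIteratorCompat {
      @Override
      @SuppressWarnings("unchecked")
      public <R> Iterator<R> toIterator(Object flatMapResult) {
        return ((Iterable<R>) flatMapResult).iterator();
      }
    }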

FYI



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com










