Any updates on upgrading to Spark 2.x?

I tried replacing the dependency and hit a compile error from
implementing a Scala trait:
org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract
and does not override abstract method
org$apache$spark$Partition$$super$equals(java.lang.Object) in
org.apache.spark.Partition

(The spark side change was introduced in
https://github.com/apache/spark/pull/12157.)

Does anyone have ideas about this compile error?
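
A note on where the missing method comes from: org$apache$spark$Partition$$super$equals
is the super-accessor scalac generates because Spark 2's Partition trait (as of the PR
above) defines a concrete equals that calls super.equals, and a Java class implementing
the trait must supply that accessor itself. Since '$' is a legal character in Java
identifiers, one possible workaround is to implement it by name. The sketch below uses a
local stand-in interface, not the real org.apache.spark.Partition, and whether this
satisfies the real trait is an assumption, not something verified here:

```java
// PartitionLike is a local stub mimicking the shape Spark 2's Partition
// trait presents to Java: an abstract synthetic super-accessor that Java
// implementers must provide. NOT the real org.apache.spark.Partition.
interface PartitionLike {
    int index();
    boolean org$apache$spark$Partition$$super$equals(Object other);
}

public class SourcePartitionSketch implements PartitionLike {
    private final int index;

    SourcePartitionSketch(int index) {
        this.index = index;
    }

    @Override
    public int index() {
        return index;
    }

    // '$' is legal in Java identifiers, so the synthetic method can be
    // implemented by name, delegating to Object.equals as the trait's
    // concrete equals expects.
    @Override
    public boolean org$apache$spark$Partition$$super$equals(Object other) {
        return super.equals(other);
    }

    public static void main(String[] args) {
        SourcePartitionSketch p = new SourcePartitionSketch(0);
        System.out.println(p.org$apache$spark$Partition$$super$equals(p));
    }
}
```

The other obvious route is porting the partition class to Scala, so the compiler mixes
in the trait's concrete members itself.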


On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Ted,
>
> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>
> As discussed with Aviem, I should be able to create the pull request later
> today.
>
> Regards
> JB
>
>
> On 05/03/2017 02:50 AM, Ted Yu wrote:
>
>> Spark 2.1.1 has been released.
>>
>> Consider using the new release in this work.
>>
>> Thanks
>>
>> On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Cool for the PR merge, I will rebase my branch on it.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 03/29/2017 01:58 PM, Amit Sela wrote:
>>>
>>>> @Ted definitely makes sense.
>>>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon so any
>>>> deprecated Spark API issues should be resolved.
>>>>
>>>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> This is what I did over HBASE-16179:
>>>>>
>>>>> -        f.call((asJavaIterator(it), conn)).iterator()
>>>>> +        // the return type is different in spark 1.x & 2.x, we handle both cases
>>>>> +        f.call(asJavaIterator(it), conn) match {
>>>>> +          // spark 1.x
>>>>> +          case iterable: Iterable[R] => iterable.iterator()
>>>>> +          // spark 2.x
>>>>> +          case iterator: Iterator[R] => iterator
>>>>> +        }
>>>>>          )
>>>>>
>>>>> FYI
>>>>>
>>>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>
>>>>>> Just tried to replace dependencies and see what happens:
>>>>>>
>>>>>> Most required changes are about the runner using deprecated Spark APIs,
>>>>>> and after fixing them the only real issue is with the Java API for
>>>>>> Pair/FlatMapFunction, which changed its return value to Iterator (in 1.6
>>>>>> it's Iterable).
>>>>>>
>>>>>> So I'm not sure that a profile that simply sets the dependency on
>>>>>> 1.6.3/2.1.0 is feasible.
>>>>>>
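
For reference, the signature change in question looks like this. The two interfaces
below are local stand-ins sketching the shapes of the 1.6 and 2.x FlatMapFunction,
not the real org.apache.spark.api.java.function classes:

```java
import java.util.Arrays;
import java.util.Iterator;

// Sketch of the 1.6 -> 2.x change: FlatMapFunction#call returns Iterable<R>
// in Spark 1.6 but Iterator<R> in Spark 2.x. Local stand-ins only.
public class FlatMapSignatureSketch {

    interface FlatMapFunction16<T, R> {
        Iterable<R> call(T t) throws Exception; // Spark 1.6 shape
    }

    interface FlatMapFunction2x<T, R> {
        Iterator<R> call(T t) throws Exception; // Spark 2.x shape
    }

    public static void main(String[] args) throws Exception {
        // Same splitting logic under both signatures; adapting a 1.6-style
        // body to the 2.x signature is a single .iterator() call at the end.
        FlatMapFunction16<String, String> words16 = s -> Arrays.asList(s.split(" "));
        FlatMapFunction2x<String, String> words2x = s -> words16.call(s).iterator();
        System.out.println(words2x.call("hello spark").next());
    }
}
```

This is why a single source tree compiled against both versions is awkward: the method's
return type differs, so one of the two signatures always fails to compile.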
>>>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <kobi.sal...@gmail.com> wrote:
>>>>>>
>>>>>>> So, if everything is in place in Spark 2.x and we use provided
>>>>>>> dependencies for Spark in Beam, then theoretically you can run the same
>>>>>>> code on 2.x without any need for a branch?
>>>>>>>
>>>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
>>>>>>>
>>>>>>>> If StreamingContext is valid and we don't have to use SparkSession,
>>>>>>>> and Accumulators are valid as well and we don't need AccumulatorV2, I
>>>>>>>> don't see a reason this shouldn't work (which means there are still
>>>>>>>> tons of reasons this could break, but I can't think of them off the
>>>>>>>> top of my head right now).
>>>>>>>>
>>>>>>>> @JB simply add a profile for the Spark dependencies and run the tests -
>>>>>>>> you'll have a very definitive answer ;-) .
>>>>>>>> If this passes, try on a cluster running Spark 2 as well.
>>>>>>>>
>>>>>>>> Let me know if I can assist.
>>>>>>>>
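
A minimal sketch of what such a profile could look like in the runner's pom.xml. The
profile ids, the `spark.version` property name, and the versions are illustrative
assumptions, not Beam's actual build:

```xml
<!-- Illustrative sketch only: ids and property names are assumptions. -->
<profiles>
  <profile>
    <id>spark-1</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark-2</id>
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
  </profile>
</profiles>
```

Selecting the profile (e.g. `mvn clean verify -Pspark-2`) only swaps the dependency
version; it helps only if the sources compile against both APIs, which is exactly the
Iterable/Iterator problem discussed above.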
>>>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>>>
>>>>>>>>> I'm a bit late on the PoC around that (I started a branch already).
>>>>>>>>> I will move forward over the weekend.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>>>
>>>>>>>>>> Amit, I suppose JB is talking about the RDD based version, so no
>>>>>>>>>> need to worry about SparkSession or different incompatible APIs.
>>>>>>>>>>
>>>>>>>>>> Remember the idea we are discussing is to have in master both the
>>>>>>>>>> spark 1 and spark 2 runners using the RDD based translation. At the
>>>>>>>>>> same time we can have a feature branch to evolve the DataSet based
>>>>>>>>>> translator (this one will replace the RDD based translator for
>>>>>>>>>> spark 2 once it is mature).
>>>>>>>>>>
>>>>>>>>>> The advantages have been already discussed, as well as the possible
>>>>>>>>>> issues, so I think we have to see now if JB's idea is feasible and
>>>>>>>>>> how hard it would be to live with this while the DataSet version
>>>>>>>>>> evolves.
>>>>>>>>>>
>>>>>>>>>> I think what we are trying to avoid is to have a long-living branch
>>>>>>>>>> for a spark 2 runner based on RDD, because the maintenance burden
>>>>>>>>>> would be even worse. We would have to fight not only with the double
>>>>>>>>>> merge of fixes (in case the profile idea does not work), but also
>>>>>>>>>> with the continued evolution of Beam, and we would end up in the
>>>>>>>>>> long-living branch mess that other runners have dealt with (e.g. the
>>>>>>>>>> Apex runner):
>>>>>>>>>>
>>>>>>>>>> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>>>>>>>>>>
>>>>>>>>>> What do you think about this, Amit? Would you be ok to go with it if
>>>>>>>>>> JB's profile idea proves to help with the maintenance issues?
>>>>>>>>>>
>>>>>>>>>> Ismaël
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>> hbase-spark module doesn't use SparkSession, so the situation there
>>>>>>>>>>> is simpler :-)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>>>>>> I'm still wondering how we'll do this - it's not just different
>>>>>>>>>>>> implementations of the same class, but completely different
>>>>>>>>>>>> concepts, such as using SparkSession in Spark 2 instead of
>>>>>>>>>>>> SparkContext/StreamingContext in Spark 1.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>> I have done some work over in HBASE-16179 where compatibility
>>>>>>>>>>>>> modules are created to isolate changes in the Spark 2.x API so
>>>>>>>>>>>>> that code in the hbase-spark module can be reused.
>>>>>>>>>>>>>
>>>>>>>>>>>>> FYI
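
The pattern Ted describes can be sketched as a tiny version-agnostic interface with one
implementation per Spark version; all names below are illustrative, not the actual
HBASE-16179 module layout:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the compatibility-module pattern: version-specific code hides
// behind a small shared interface, and each Spark version gets its own
// implementation module chosen at build time. Names are illustrative.
public class CompatSketch {

    /** Shared abstraction both modules implement. */
    interface SparkCompat {
        <R> Iterator<R> toIterator(Object flatMapResult);
    }

    /** "spark1-compat" module: the 1.6 API hands back an Iterable. */
    static class Spark1Compat implements SparkCompat {
        @SuppressWarnings("unchecked")
        public <R> Iterator<R> toIterator(Object r) {
            return ((Iterable<R>) r).iterator();
        }
    }

    /** "spark2-compat" module: the 2.x API already hands back an Iterator. */
    static class Spark2Compat implements SparkCompat {
        @SuppressWarnings("unchecked")
        public <R> Iterator<R> toIterator(Object r) {
            return (Iterator<R>) r;
        }
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b");
        // Calling code stays version-agnostic either way.
        System.out.println(new Spark1Compat().toIterator(data).next());
        System.out.println(new Spark2Compat().toIterator(data.iterator()).next());
    }
}
```

The shared code depends only on SparkCompat, so it can be reused unchanged across both
Spark versions.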
>>>>>>>>> --
>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>> jbono...@apache.org
>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>> Talend - http://www.talend.com
>>
