Re: Beam spark 2.x runner status

2017-08-21 Thread Holden Karau
I'd love to take a look at the PR when it comes in (<3 BEAM + SPARK :)).

On Mon, Aug 21, 2017 at 11:33 AM, Jean-Baptiste Onofré wrote:
> Hi
>
> I wrote a new runner supporting Spark 2.1.x. I changed the code for that.
>
> I'm still on vacation this week. I will send an update when back.

Re: Beam spark 2.x runner status

2017-08-21 Thread Jean-Baptiste Onofré
Hi,

I wrote a new runner supporting Spark 2.1.x. I changed the code for that.

I'm still on vacation this week. I will send an update when back.

Regards
JB

On Aug 21, 2017, at 09:01, Pei HE wrote:
> Any updates for upgrading to Spark 2.x?
>
> I tried to replace the dependency

Re: Beam spark 2.x runner status

2017-08-21 Thread Pei HE
Any updates for upgrading to Spark 2.x?

I tried to replace the dependency and found a compile error from implementing a Scala trait:

org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract and does not override abstract method

Re: Beam spark 2.x runner status

2017-03-29 Thread Jean-Baptiste Onofré
Cool for the PR merge, I will rebase my branch on it. Thanks!

Regards
JB

On 03/29/2017 01:58 PM, Amit Sela wrote:
> @Ted definitely makes sense.
> @JB I'm merging https://github.com/apache/beam/pull/2354 soon so any deprecated Spark API issues should be resolved.
>
> On Wed, Mar 29, 2017 at 2:46 PM

Re: Beam spark 2.x runner status

2017-03-29 Thread Amit Sela
@Ted definitely makes sense.
@JB I'm merging https://github.com/apache/beam/pull/2354 soon so any deprecated Spark API issues should be resolved.

On Wed, Mar 29, 2017 at 2:46 PM Ted Yu wrote:
> This is what I did over HBASE-16179:
>
> -f.call((asJavaIterator(it),

Re: Beam spark 2.x runner status

2017-03-29 Thread Ted Yu
This is what I did over HBASE-16179:

-f.call((asJavaIterator(it), conn)).iterator()
+// the return type is different in spark 1.x & 2.x, we handle both cases
+f.call(asJavaIterator(it), conn) match {
+  // spark 1.x
+  case iterable: Iterable[R] =>
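
The snippet above is Scala from the hbase-spark module; the same runtime-dispatch idea in Java (a hypothetical FlatMapCompat helper, not the actual HBASE-16179 code) would look roughly like:

import java.util.Iterator;

public final class FlatMapCompat {
  // The user function's result is typed as Object because its static type
  // differs across Spark versions: Iterable<R> on 1.x, Iterator<R> on 2.x.
  @SuppressWarnings("unchecked")
  public static <R> Iterator<R> toIterator(Object flatMapResult) {
    if (flatMapResult instanceof Iterable) {
      return ((Iterable<R>) flatMapResult).iterator(); // Spark 1.x path
    }
    return (Iterator<R>) flatMapResult;                 // Spark 2.x path
  }
}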

Re: Beam spark 2.x runner status

2017-03-29 Thread Amit Sela
I just tried to replace the dependencies to see what happens: most of the required changes are about the runner using deprecated Spark APIs, and after fixing them the only real issue is with the Java API for Pair/FlatMapFunction, which changed its return type to Iterator (in 1.6 it's Iterable). So I'm not sure
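
For reference, the change shows up directly in the FlatMapFunction signature: in Spark 1.6 call() returns Iterable<R>, in 2.x it returns Iterator<R>. A minimal sketch of the 2.x form (an illustrative tokenizer, not runner code):

import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.function.FlatMapFunction;

// Spark 1.6: Iterable<String> call(String line)
// Spark 2.x: Iterator<String> call(String line)
public class Tokenize implements FlatMapFunction<String, String> {
  @Override
  public Iterator<String> call(String line) {
    // On 1.6 the same logic would return the List itself instead.
    return Arrays.asList(line.split(" ")).iterator();
  }
}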

Re: Beam spark 2.x runner status

2017-03-23 Thread Amit Sela
If StreamingContext is valid and we don't have to use SparkSession, and Accumulators are valid as well and we don't need AccumulatorV2, I don't see a reason this shouldn't work (which means there are still tons of reasons this could break, but I can't think of them off the top of my head right

Re: Beam spark 2.x runner status

2017-03-23 Thread Kobi Salant
Hi,

We use SparkContext & StreamingContext extensively in the Spark runner to create the DStreams & RDDs, so we will need to work on migrating from the 1.x terms to the 2.x terms (we may find other incompatibilities during the work).

Regards
Kobi

2017-03-23 6:55 GMT+02:00

Re: Beam spark 2.x runner status

2017-03-22 Thread Jean-Baptiste Onofré
Hi guys,

Ismaël summarized well what I have in mind. I'm a bit late on the PoC around that (I started a branch already). I will move forward over the weekend.

Regards
JB

On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
> Amit, I suppose JB is talking about the RDD-based version, so no need to worry

Re: Beam spark 2.x runner status

2017-03-22 Thread Ted Yu
The hbase-spark module doesn't use SparkSession, so the situation there is simpler :-)

On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela wrote:
> I'm still wondering how we'll do this - it's not just different
> implementations of the same class, but completely different concepts such

Re: Beam spark 2.x runner status

2017-03-21 Thread Ted Yu
I have done some work over in HBASE-16179, where compatibility modules are created to isolate changes in the Spark 2.x API so that the code in the hbase-spark module can be reused. FYI

Re: Beam spark 2.x runner status

2017-03-16 Thread Jean-Baptiste Onofré
Hi guys,

Yes, I started to experiment with the profiles a bit, and Amit and I plan to discuss that over the weekend. Give me some time to move forward a bit and I will get back to you with more details.

Regards
JB

On 03/16/2017 05:15 PM, amarouni wrote:
> Yeah, maintaining 2 RDD branches

Re: Beam spark 2.x runner status

2017-03-16 Thread amarouni
Yeah, maintaining 2 RDD branches (master + a 2.x branch) is doable but will add more maintenance and merge work. The Maven profiles solution is worth investigating, with the Spark 1.6 RDD runner as the default profile and an additional Spark 2.x profile. As JBO mentioned CarbonData, I had a quick look and it looks
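
A rough sketch of what such a profile layout could look like in the runner's pom.xml (profile ids and version numbers are illustrative):

<profiles>
  <!-- Default profile: the current Spark 1.6 RDD-based runner. -->
  <profile>
    <id>spark-1.6</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
  </profile>
  <!-- Opt-in profile: build against Spark 2.x instead. -->
  <profile>
    <id>spark-2</id>
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
  </profile>
</profiles>

Selecting the 2.x build would then be a matter of mvn clean install -Pspark-2 (activating a profile on the command line deactivates the activeByDefault one).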

Re: Beam spark 2.x runner status

2017-03-16 Thread Cody Innowhere
I'm personally in favor of maintaining one single branch, e.g., spark-runner, which supports both Spark 1.6 & 2.1. Since there's currently no DataFrame support in the Spark 1.x runner, there should be no conflicts if we put the two versions of Spark into one runner. I'm also +1 for adding adapters in the

Re: Beam spark 2.x runner status

2017-03-15 Thread Jean-Baptiste Onofré
Hi guys,

Sorry, due to the time zone shift, I answer a bit late ;) I think we can have the same runner deal with the two major Spark versions by introducing some adapters. For instance, in CarbonData, we created some adapters to work with Spark 1.5, Spark 1.6 and Spark 2.1. The dependencies
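
As a sketch of the adapter idea (hypothetical names, not CarbonData's actual classes): each supported Spark version gets its own module implementing a small interface, and the runner codes only against that interface:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// The runner depends only on this interface; a spark1 and a spark2
// module each provide an implementation.
public interface SparkContextAdapter {
  JavaSparkContext create(SparkConf conf);
}

// In the Spark 1.x module:
class Spark1ContextAdapter implements SparkContextAdapter {
  @Override
  public JavaSparkContext create(SparkConf conf) {
    return new JavaSparkContext(conf);
  }
}

A Spark 2.x module could instead build the context from SparkSession.builder().config(conf).getOrCreate(), so the version-specific entry points never leak into the shared runner code.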

Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
I answered inline to Abbass' comment, but I think he hit something - how about we have a branch with those adaptations? Same RDD implementation, but depending on the latest 2.x version, with the minimal changes required. I'd be happy to do that, or guide anyone who wants to (I did most of it on my

Re: Beam spark 2.x runner status

2017-03-15 Thread amarouni
+1 for Spark runners based on the different APIs (RDD/Dataset) and keeping the Spark version as a deployment dependency. The RDD API is stable & mature enough, so it makes sense to have it on master; the Dataset API still has some work to do, and from our own experience it just reached a comparable RDD

Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
So you're suggesting we copy-paste the current runner and adapt whatever is necessary so it runs with Spark 2? This also means any bug fix / improvement would have to be maintained in two runners, and I wouldn't want to do that. I don't like to think in terms of Spark 1/2 but in terms of

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
> However, I do feel that we should use the Dataset API, starting with batch
> support first. WDYT?

Well, this is exactly the current status quo, and it will take us some time to have something for Spark 2 as complete as what we have with the Spark 1 runner. The other proposal has two

Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
I feel that, as we're getting closer to supporting streaming with the Spark 1 runner, and with Structured Streaming advancing in Spark 2, we could start work on a Spark 2 runner in a separate branch. However, I do feel that we should use the Dataset API, starting with batch support first. WDYT?

On

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
> So you propose to have the Spark 2 branch be a clone of the current one with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> still using the RDD API?

Yes, this is exactly what I have in mind.

> I think that having another Spark runner is great if it has value,
>

Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
So you propose to have the Spark 2 branch be a clone of the current one with adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while still using the RDD API? I think that having another Spark runner is great if it has value; otherwise, let's just bump the version. My idea of
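
To make the two renames concrete, a minimal Spark 2.x sketch (illustrative only, not the runner's actual code):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

public class Spark2Setup {
  public static void main(String[] args) {
    // Context -> Session: SparkSession is the 2.x entry point, but a
    // JavaSparkContext can still be obtained from it for RDD-based code.
    SparkSession session = SparkSession.builder()
        .appName("beam-spark2-sketch")
        .master("local[2]")
        .getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

    // Accumulator -> AccumulatorV2: LongAccumulator is a built-in V2 impl.
    LongAccumulator counter =
        session.sparkContext().longAccumulator("elements");
    counter.add(1L);

    session.stop();
  }
}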

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
BIG +1, JB. If we can just bump the version number with minor changes, staying as close as possible to the current Spark 1 implementation, we can go faster and offer in principle the exact same support, but for version 2. I know that the advanced streaming stuff based on the Dataset API won't be

Re: Beam spark 2.x runner status

2017-03-15 Thread Jean-Baptiste Onofré
Hi Amit,

What do you think of the following:
- in the meantime, while you reintroduce the Spark 2 branch, what about "extending" the version in the current Spark runner? Still using RDD/DStream, I think we can support Spark 2.x even if we don't yet leverage the newly provided features.

Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
Hi Cody,

I will re-introduce this branch soon as part of the work on BEAM-913. For now, and from previous experience with the mentioned branch, the batch implementation should be straightforward. The only issue is with streaming support - in the current