Re: code freeze and branch cut for Apache Spark 2.4

Erik Erlandson Tue, 31 Jul 2018 13:18:57 -0700

I don't have comprehensive knowledge of the Project Hydrogen PRs; however,
I've perused them, and they make substantial modifications to Spark's core
DAG scheduler code.


What I'm wondering is: how high is the confidence level that the
"traditional" code paths are still stable? Put another way, is it even
possible to "turn off" or "opt out" of this experimental feature? This
analogy isn't perfect, but for example the k8s back-end is a major body of
code that has a very small impact on any *core* code paths, so if
you opt out of it, it is well understood that you aren't running any
experimental code.

Looking at the Project Hydrogen code, I'm less sure the same is true.
However, maybe there is a clear way to show how it is true.


On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> No reasonable amount of time is likely going to be sufficient to fully vet
> the code as a PR. I'm not entirely happy with the design and code as they
> currently are (and I'm still trying to find the time to more publicly
> express my thoughts and concerns), but I'm fine with them going into 2.4
> much as they are as long as they go in with proper stability annotations
> and are understood not to be cast-in-stone final implementations, but
> rather as a way to get people using them and generating the feedback that
> is necessary to get us to something more like a final design and
> implementation.
>
> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>>
>> Barrier mode seems like a high-impact feature on Spark's core code: is
>> one additional week enough time to properly vet this feature?
>>
>> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
>> joseph.tor...@databricks.com> wrote:
>>
>>> Full continuous processing aggregation support ran into unanticipated
>>> scalability and scheduling problems. We’re planning to overcome those by
>>> using some of the barrier execution machinery, but since barrier execution
>>> itself is still in progress the full support isn’t going to make it into
>>> 2.4.
>>>
>>> Jose
>>>
>>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <tomasz.gaw...@outlook.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> What is the status of Continuous Processing + Aggregations? As far as I
>>>> remember, Jose Torres said it should be easy to perform aggregations
>>>> if coalesce(1) works. IIRC it's already merged to master.
>>>>
>>>> Is this work in progress? If yes, it would be great to have full
>>>> aggregation/join support in Spark 2.4 in CP.
>>>>
>>>> Best regards,
>>>>
>>>> Tomek
>>>>
>>>>
>>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>>> > This one is important to us:
>>>> > https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner
>>>> > range optimization), but I think it could be useful to others too.
>>>> >
>>>> > It is finished and ready to be merged (it was ready at least a month
>>>> > ago).
>>>> >
>>>> > Do you think you could consider including it in 2.4?
>>>> >
>>>> > Petar
>>>> >
>>>> >
>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>> >
>>>> >> I went through the open JIRA tickets and here is a list that we
>>>> should consider for Spark 2.4:
>>>> >>
>>>> >> High Priority:
>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>> >> This one is critical to the Spark ecosystem for deep learning. Only a
>>>> >> little work remains, and I think we should have it in Spark 2.4.
>>>> >>
>>>> >> Medium Priority:
>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>> >> We've already added a lot of built-in functions in this release, but
>>>> there are a few useful higher-order functions in progress, like
>>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>>> Spark 2.4.
>>>> >>
>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>> >>
>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>> >> This one has been open for years (thanks for your patience, Michael!)
>>>> >> and is also close to finishing. Great to have it in 2.4.
>>>> >>
>>>> >> SPARK-24882: data source v2 API improvement
>>>> >> This is to improve the data source v2 API based on what we learned
>>>> >> during this release. From the migration of existing sources and the
>>>> >> design of new features, we found some problems in the API and want to
>>>> >> address them. I believe this should be the last significant API change
>>>> >> to data source v2, so great to have in Spark 2.4. I'll send a discuss
>>>> >> email about it later.
>>>> >>
>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>> >> This is a very important feature for data source v2, and is
>>>> >> currently being discussed on the dev list.
>>>> >>
>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>> Great to have in 2.4.
>>>> >>
>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
>>>> >> This is a long-standing correctness bug; great to have the fix in 2.4.
>>>> >>
>>>> >> There are some other important features like adaptive execution,
>>>> >> streaming SQL, etc., that are not in the list, since I think we are not
>>>> >> able to finish them before 2.4.
>>>> >>
>>>> >> Feel free to add more things if you think they are important to
>>>> Spark 2.4 by replying to this email.
>>>> >>
>>>> >> Thanks,
>>>> >> Wenchen
>>>> >>
>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen <sro...@apache.org>
>>>> wrote:
>>>> >>
>>>> >>   In theory releases happen on a time-based cadence, so it's pretty
>>>> much wrap up what's ready by the code freeze and ship it. In practice, the
>>>> cadence slips frequently, and it's very much a negotiation about what
>>>> features should push the
>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>> approach here that works OK.
>>>> >>
>>>> >>   Certainly speak up if you think there's something that really
>>>> needs to get into 2.4. This is that discuss thread.
>>>> >>
>>>> >>   (BTW I updated the page you mention just yesterday, to reflect the
>>>> plan suggested in this thread.)
>>>> >>
>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>> <tgraves...@yahoo.com.invalid> wrote:
>>>> >>
>>>> >>   Shouldn't this be a discuss thread?
>>>> >>
>>>> >>   I'm also happy to see more release managers and agree the time is
>>>> getting close, but we should see what features are in progress and see how
>>>> close things are and propose a date based on that. Cutting a branch too
>>>> >>   soon just creates more work for committers to push to more branches.
>>>> >>
>>>> >>    http://spark.apache.org/versioning-policy.html mentions the
>>>> code freeze and release branch cut in mid-August.
>>>> >>
>>>> >>   Tom
>>>> >
>>>> >
>>>>
>>>>
>>
