Re: code freeze and branch cut for Apache Spark 2.4

Xiangrui Meng Wed, 01 Aug 2018 11:05:34 -0700

Sorry for the late response on the Hydrogen discussions! I was traveling last week.


On Tue, Jul 31, 2018 at 1:20 PM Reynold Xin <[email protected]> wrote:

> I actually totally agree that we should make sure it has no impact
> on existing code if the feature is not used.
>
>
> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson <[email protected]>
> wrote:
>
>> I don't have a comprehensive knowledge of the project hydrogen PRs,
>> however I've perused them, and they make substantial modifications to
>> Spark's core DAG scheduler code.
>>
>> What I'm wondering is: how high is the confidence level that the
>> "traditional" code paths are still stable? Put another way, is it even
>> possible to "turn off" or "opt out" of this experimental feature? This
>> analogy isn't perfect, but for example the k8s back-end is a major body of
>> code, but it has a very small impact on any *core* code paths, and so if
>> you opt out of it, it is well understood that you aren't running any
>> experimental code.
>>
>> Looking at the project hydrogen code, I'm less sure the same is true.
>> However, maybe there is a clear way to show how it is true.
>>
>>
Totally agree that the barrier execution mode must not change any existing
behaviors if barriers are not used. Most code added to DAGScheduler and
TaskSetManager only applies to the barrier mode and we paid special
attention to the rest during review. That being said, I won't say the risk
is zero. We will do comprehensive QA after feature freeze and it would be
great if more community members can help.

Btw, I don't think a feature flag would help reduce the risk. This is a
brand new feature, not an alternative to an existing one. So turning it off
is basically "do not call barrier()".


>
>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra <[email protected]>
>> wrote:
>>
>>> No reasonable amount of time is likely going to be sufficient to fully
>>> vet the code as a PR. I'm not entirely happy with the design and code as
>>> they currently are (and I'm still trying to find the time to more publicly
>>> express my thoughts and concerns), but I'm fine with them going into 2.4
>>> much as they are as long as they go in with proper stability annotations
>>> and are understood not to be cast-in-stone final implementations, but
>>> rather as a way to get people using them and generating the feedback that
>>> is necessary to get us to something more like a final design and
>>> implementation.
>>>
>>>
All barrier execution mode features will be marked experimental in 2.4. As
you mentioned, the goal is to get some usage and collect feedback so we
have a robust stable version in 3.0. Mark, it would be great if you could
provide input and help shape the final design. Your time would be greatly
appreciated!


> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson <[email protected]>
>>> wrote:
>>>
>>>>
>>>> Barrier mode seems like a high impact feature on Spark's core code: is
>>>> one additional week enough time to properly vet this feature?
>>>>
>>>> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
>>>> [email protected]> wrote:
>>>>
>>>>> Full continuous processing aggregation support ran into unanticipated
>>>>> scalability and scheduling problems. We’re planning to overcome those by
>>>>> using some of the barrier execution machinery, but since barrier execution
>>>>> itself is still in progress the full support isn’t going to make it into
>>>>> 2.4.
>>>>>
>>>>> Jose
>>>>>
>>>>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> what is the status of Continuous Processing + Aggregations? As far as I
>>>>>> remember, Jose Torres said it should be easy to perform aggregations if
>>>>>> coalesce(1) works. IIRC it's already merged to master.
>>>>>>
>>>>>> Is this work in progress? If yes, it would be great to have full
>>>>>> aggregation/join support in Spark 2.4 in CP.
>>>>>>
>>>>>> Pozdrawiam / Best regards,
>>>>>>
>>>>>> Tomek
>>>>>>
>>>>>>
>>>>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>>>>> > This one is important to us:
>>>>>> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>>>>>> inner range optimization) but I think it could be useful to others too.
>>>>>> >
>>>>>> > It is finished and is ready to be merged (was ready a month ago at
>>>>>> least).
>>>>>> >
>>>>>> > Do you think you could consider including it in 2.4?
>>>>>> >
>>>>>> > Petar
>>>>>> >
>>>>>> >
>>>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>>>> >
>>>>>> >> I went through the open JIRA tickets and here is a list that we
>>>>>> should consider for Spark 2.4:
>>>>>> >>
>>>>>> >> High Priority:
>>>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>>>>> has only a few work items remaining, and I think we should have it in
>>>>>> Spark 2.4.
>>>>>> >>
>>>>>> >> Middle Priority:
>>>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>>>> >> We've already added a lot of built-in functions in this release,
>>>>>> but there are a few useful higher-order functions in progress, like
>>>>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>>>>> Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>>>> >> This one has been open for years (thanks for your patience Michael!),
>>>>>> and is also close to finishing. Great to have it in 2.4.
>>>>>> >>
>>>>>> >> SPARK-24882: data source v2 API improvement
>>>>>> >> This is to improve the data source v2 API based on what we learned
>>>>>> during this release. From the migration of existing sources and design of
>>>>>> new features, we found some problems in the API and want to address them.
>>>>>> I believe this should be
>>>>>> >> the last significant API change to data source v2, so great to
>>>>>> have in Spark 2.4. I'll send a discuss email about it later.
>>>>>> >>
>>>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>>>> >> This is a very important feature for data source v2, and is
>>>>>> currently being discussed in the dev list.
>>>>>> >>
>>>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>>>> Great to have in 2.4.
>>>>>> >>
>>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>>>>> answers
>>>>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>>>>> >>
>>>>>> >> There are some other important features like the adaptive
>>>>>> execution, streaming SQL, etc., not in the list, since I think we are not
>>>>>> able to finish them before 2.4.
>>>>>> >>
>>>>>> >> Feel free to add more things if you think they are important to
>>>>>> Spark 2.4 by replying to this email.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Wenchen
>>>>>> >>
>>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen <[email protected]>
>>>>>> wrote:
>>>>>> >>
>>>>>> >>   In theory releases happen on a time-based cadence, so it's
>>>>>> pretty much wrap up what's ready by the code freeze and ship it. In
>>>>>> practice, the cadence slips frequently, and it's very much a negotiation
>>>>>> about what features should push the
>>>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>>>> approach here that works OK.
>>>>>> >>
>>>>>> >>   Certainly speak up if you think there's something that really
>>>>>> needs to get into 2.4. This is that discuss thread.
>>>>>> >>
>>>>>> >>   (BTW I updated the page you mention just yesterday, to reflect
>>>>>> the plan suggested in this thread.)
>>>>>> >>
>>>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>>>> <[email protected]> wrote:
>>>>>> >>
>>>>>> >>   Shouldn't this be a discuss thread?
>>>>>> >>
>>>>>> >>   I'm also happy to see more release managers and agree the time
>>>>>> is getting close, but we should see what features are in progress and see
>>>>>> how close things are and propose a date based on that.  Cutting a branch
>>>>>> too soon just creates
>>>>>> >>   more work for committers to push to more branches.
>>>>>> >>
>>>>>> >>    http://spark.apache.org/versioning-policy.html mentioned the
>>>>>> code freeze and release branch cut in mid-August.
>>>>>> >>
>>>>>> >>   Tom
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>
>> --

Xiangrui Meng

Software Engineer

Databricks Inc. <http://databricks.com/>
