Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-05 Thread Hyukjin Kwon
Awesome Shane.

On Wed, Feb 5, 2020 at 7:29 AM, Xiao Li wrote:

> Thank you, Shane!
>
> Xiao
>
> On Tue, Feb 4, 2020 at 2:16 PM Dongjoon Hyun 
> wrote:
>
>> Thank you, Shane! :D
>>
>> Bests,
>> Dongjoon
>>
>> On Tue, Feb 4, 2020 at 13:28 shane knapp ☠  wrote:
>>
>>> all the 3.0 builds have been created and are currently churning away!
>>>
>>> (the failed builds were due to a silly bug in the build scripts sneaking
>>> its way back in, but that's resolved now)
>>>
>>> shane
>>>
>>> On Sat, Feb 1, 2020 at 6:16 PM Reynold Xin  wrote:
>>>
>>>> Note that branch-3.0 was cut. Please focus on testing, polish, and
>>>> let's get the release out!
>>>>
>>>>
>>>> On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin 
>>>> wrote:
>>>>
>>>>> Just a reminder - code freeze is coming this Fri!
>>>>>
>>>>> There can always be exceptions, but those should be exceptions and
>>>>> discussed on a case by case basis rather than becoming the norm.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Jan 31 sounds good to me.
>>>>>>
>>>>>> Just curious, do we allow exceptions to the code freeze? One thing
>>>>>> that came to mind is that a feature could have multiple subtasks, where
>>>>>> some have been merged and others are still in review. In this case, do
>>>>>> we allow the remaining subtasks a few more days to get reviewed and
>>>>>> merged later?
>>>>>>
>>>>>> Happy Holiday!
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro <
>>>>>> linguin@gmail.com> wrote:
>>>>>>
>>>>>>> Looks nice, happy holiday, all!
>>>>>>>
>>>>>>> Bests,
>>>>>>> Takeshi
>>>>>>>
>>>>>>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 for January 31st.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Jan 31 is pretty reasonable. Happy Holidays!
>>>>>>>>>
>>>>>>>>> Xiao
>>>>>>>>>
>>>>>>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yep, always happens. Is earlier realistic, like Jan 15? It's all
>>>>>>>>>> arbitrary, but this has indeed been in progress for a while, and
>>>>>>>>>> there's a downside to not releasing it: the gap to 3.0 grows larger.
>>>>>>>>>> On my end I don't know of anything that's holding up a release;
>>>>>>>>>> is it basically DSv2?
>>>>>>>>>>
>>>>>>>>>> BTW these are the items still targeted to 3.0.0, some of which
>>>>>>>>>> may not have been legitimately tagged. It may be worth reviewing 
>>>>>>>>>> what's
>>>>>>>>>> still open and necessary, and what should be untargeted.
>>>>>>>>>>
>>>>>>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>>>>>>> SPARK-29345 Add an API that allows a user to define and observe
>>>>>>>>>> arbitrary metrics on streaming queries
>>>>>>>>>> SPARK-29348 Add observable metrics
>>>>>>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>>>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2
>>>>>>>>>> test

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-04 Thread Xiao Li
Thank you, Shane!

Xiao

On Tue, Feb 4, 2020 at 2:16 PM Dongjoon Hyun 
wrote:

> Thank you, Shane! :D
>
> Bests,
> Dongjoon
>
> On Tue, Feb 4, 2020 at 13:28 shane knapp ☠  wrote:
>
>> all the 3.0 builds have been created and are currently churning away!
>>
>> (the failed builds were due to a silly bug in the build scripts sneaking
>> its way back in, but that's resolved now)
>>
>> shane
>>
>> On Sat, Feb 1, 2020 at 6:16 PM Reynold Xin  wrote:
>>
>>> Note that branch-3.0 was cut. Please focus on testing, polish, and let's
>>> get the release out!
>>>
>>>
>>> On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin 
>>> wrote:
>>>
>>>> Just a reminder - code freeze is coming this Fri!
>>>>
>>>> There can always be exceptions, but those should be exceptions and
>>>> discussed on a case by case basis rather than becoming the norm.
>>>>
>>>>
>>>>
>>>> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Jan 31 sounds good to me.
>>>>>
>>>>> Just curious, do we allow exceptions to the code freeze? One thing that
>>>>> came to mind is that a feature could have multiple subtasks, where some
>>>>> have been merged and others are still in review. In this case, do we
>>>>> allow the remaining subtasks a few more days to get reviewed and
>>>>> merged later?
>>>>>
>>>>> Happy Holiday!
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro <
>>>>> linguin@gmail.com> wrote:
>>>>>
>>>>>> Looks nice, happy holiday, all!
>>>>>>
>>>>>> Bests,
>>>>>> Takeshi
>>>>>>
>>>>>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 for January 31st.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Jan 31 is pretty reasonable. Happy Holidays!
>>>>>>>>
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen  wrote:
>>>>>>>>
>>>>>>>>> Yep, always happens. Is earlier realistic, like Jan 15? It's all
>>>>>>>>> arbitrary, but this has indeed been in progress for a while, and
>>>>>>>>> there's a downside to not releasing it: the gap to 3.0 grows larger.
>>>>>>>>> On my end I don't know of anything that's holding up a release; is
>>>>>>>>> it basically DSv2?
>>>>>>>>>
>>>>>>>>> BTW these are the items still targeted to 3.0.0, some of which may
>>>>>>>>> not have been legitimately tagged. It may be worth reviewing what's 
>>>>>>>>> still
>>>>>>>>> open and necessary, and what should be untargeted.
>>>>>>>>>
>>>>>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>>>>>> SPARK-29345 Add an API that allows a user to define and observe
>>>>>>>>> arbitrary metrics on streaming queries
>>>>>>>>> SPARK-29348 Add observable metrics
>>>>>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2
>>>>>>>>> test
>>>>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>>>>> SPARK-28684 Hive module support JDK 11

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-04 Thread Dongjoon Hyun
Thank you, Shane! :D

Bests,
Dongjoon

On Tue, Feb 4, 2020 at 13:28 shane knapp ☠  wrote:

> all the 3.0 builds have been created and are currently churning away!
>
> (the failed builds were due to a silly bug in the build scripts sneaking
> its way back in, but that's resolved now)
>
> shane
>
> On Sat, Feb 1, 2020 at 6:16 PM Reynold Xin  wrote:
>
>> Note that branch-3.0 was cut. Please focus on testing, polish, and let's
>> get the release out!
>>
>>
>> On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin  wrote:
>>
>>> Just a reminder - code freeze is coming this Fri!
>>>
>>> There can always be exceptions, but those should be exceptions and
>>> discussed on a case by case basis rather than becoming the norm.
>>>
>>>
>>>
>>> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Jan 31 sounds good to me.
>>>>
>>>> Just curious, do we allow exceptions to the code freeze? One thing that
>>>> came to mind is that a feature could have multiple subtasks, where some
>>>> have been merged and others are still in review. In this case, do we
>>>> allow the remaining subtasks a few more days to get reviewed and
>>>> merged later?
>>>>
>>>> Happy Holiday!
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro 
>>>> wrote:
>>>>
>>>>> Looks nice, happy holiday, all!
>>>>>
>>>>> Bests,
>>>>> Takeshi
>>>>>
>>>>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>>> +1 for January 31st.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li 
>>>>>> wrote:
>>>>>>
>>>>>>> Jan 31 is pretty reasonable. Happy Holidays!
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen  wrote:
>>>>>>>
>>>>>>>> Yep, always happens. Is earlier realistic, like Jan 15? It's all
>>>>>>>> arbitrary, but this has indeed been in progress for a while, and
>>>>>>>> there's a downside to not releasing it: the gap to 3.0 grows larger.
>>>>>>>> On my end I don't know of anything that's holding up a release; is
>>>>>>>> it basically DSv2?
>>>>>>>>
>>>>>>>> BTW these are the items still targeted to 3.0.0, some of which may
>>>>>>>> not have been legitimately tagged. It may be worth reviewing what's 
>>>>>>>> still
>>>>>>>> open and necessary, and what should be untargeted.
>>>>>>>>
>>>>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>>>>> SPARK-29345 Add an API that allows a user to define and observe
>>>>>>>> arbitrary metrics on streaming queries
>>>>>>>> SPARK-29348 Add observable metrics
>>>>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2
>>>>>>>> test
>>>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>>>> SPARK-28684 Hive module support JDK 11
>>>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>>>>>>> after some operations
>>>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>>>>> SPARK-28301 fix the behavior of table name resolution with
>>>>>>>> multi-catalog
>>>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>>>>>> relation table properly

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-01 Thread Reynold Xin
Note that branch-3.0 was cut. Please focus on testing, polish, and let's get 
the release out!

On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin < r...@databricks.com > wrote:

> 
> Just a reminder - code freeze is coming this Fri!
> 
> 
> 
> There can always be exceptions, but those should be exceptions and
> discussed on a case by case basis rather than becoming the norm.
> 
> 
> 
> 
> 
> 
> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim < kabhwan.opensou...@gmail.com > wrote:
> 
>> Jan 31 sounds good to me.
>> 
>> 
>> Just curious, do we allow exceptions to the code freeze? One thing that
>> came to mind is that a feature could have multiple subtasks, where some
>> have been merged and others are still in review. In this case, do we
>> allow the remaining subtasks a few more days to get reviewed and merged
>> later?
>> 
>> 
>> Happy Holiday!
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
>> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro < linguin@gmail.com > wrote:
>> 
>> 
>>> Looks nice, happy holiday, all!
>>> 
>>> 
>>> Bests,
>>> Takeshi
>>> 
>>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
>>> 
>>> 
>>>> +1 for January 31st.
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li < lix...@databricks.com > wrote:
>>>> 
>>>> 
>>>>> Jan 31 is pretty reasonable. Happy Holidays! 
>>>>> 
>>>>> 
>>>>> Xiao
>>>>> 
>>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen < sro...@gmail.com > wrote:
>>>>> 
>>>>> 
>>>>>> Yep, always happens. Is earlier realistic, like Jan 15? It's all
>>>>>> arbitrary, but this has indeed been in progress for a while, and there's
>>>>>> a downside to not releasing it: the gap to 3.0 grows larger.
>>>>>> On my end I don't know of anything that's holding up a release; is it
>>>>>> basically DSv2?
>>>>>> 
>>>>>> BTW these are the items still targeted to 3.0.0, some of which may not
>>>>>> have been legitimately tagged. It may be worth reviewing what's still 
>>>>>> open
>>>>>> and necessary, and what should be untargeted.
>>>>>> 
>>>>>> 
>>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>>> SPARK-29345 Add an API that allows a user to define and observe arbitrary
>>>>>> metrics on streaming queries
>>>>>> SPARK-29348 Add observable metrics
>>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2 test
>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>> SPARK-28684 Hive module support JDK 11
>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after
>>>>>> some operations
>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>>>> relation table properly
>>>>>> SPARK-27986 Support Aggregate Expressions with filter
>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable 
>>>>>> smoother
>>>>>> upgrade
>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>>>>> of joined tables > 12

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-01-29 Thread Reynold Xin
Just a reminder - code freeze is coming this Fri!

There can always be exceptions, but those should be exceptions and discussed on 
a case by case basis rather than becoming the norm.

On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim < kabhwan.opensou...@gmail.com > 
wrote:

> 
> Jan 31 sounds good to me.
> 
> 
> Just curious, do we allow exceptions to the code freeze? One thing that
> came to mind is that a feature could have multiple subtasks, where some
> have been merged and others are still in review. In this case, do we allow
> the remaining subtasks a few more days to get reviewed and merged later?
> 
> 
> Happy Holiday!
> 
> 
> Thanks,
> Jungtaek Lim (HeartSaVioR)
> 
> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro < linguin@gmail.com > wrote:
> 
> 
>> Looks nice, happy holiday, all!
>> 
>> 
>> Bests,
>> Takeshi
>> 
>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
>> 
>> 
>>> +1 for January 31st.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li < lix...@databricks.com > wrote:
>>> 
>>> 
>>>> Jan 31 is pretty reasonable. Happy Holidays! 
>>>> 
>>>> 
>>>> Xiao
>>>> 
>>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen < sro...@gmail.com > wrote:
>>>> 
>>>> 
>>>>> Yep, always happens. Is earlier realistic, like Jan 15? It's all
>>>>> arbitrary, but this has indeed been in progress for a while, and there's
>>>>> a downside to not releasing it: the gap to 3.0 grows larger.
>>>>> On my end I don't know of anything that's holding up a release; is it
>>>>> basically DSv2?
>>>>> 
>>>>> BTW these are the items still targeted to 3.0.0, some of which may not
>>>>> have been legitimately tagged. It may be worth reviewing what's still open
>>>>> and necessary, and what should be untargeted.
>>>>> 
>>>>> 
>>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>>> SPARK-29345 Add an API that allows a user to define and observe arbitrary
>>>>> metrics on streaming queries
>>>>> SPARK-29348 Add observable metrics
>>>>> SPARK-29429 Support Prometheus monitoring natively
>>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2 test
>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>> SPARK-28588 Build a SQL reference doc
>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>> SPARK-28684 Hive module support JDK 11
>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after
>>>>> some operations
>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>>> relation table properly
>>>>> SPARK-27986 Support Aggregate Expressions with filter
>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>>> SPARK-27780 Shuffle server & client should be versioned to enable smoother
>>>>> upgrade
>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of
>>>>> joined tables > 12
>>>>> SPARK-27471 Reorganize public v2 catalog API
>>>>> SPARK-27520 Introduce a global config system to replace
>>>>> hadoopConfiguration
>>>>> SPARK-24625 put all the backward compatible behavior change configs under
>>>>> spark.sql.legacy.*
>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>>> SPARK-25383 Image data source supports sample pushdown
>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>>>>> default

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-24 Thread Jungtaek Lim
Jan 31 sounds good to me.

Just curious, do we allow exceptions to the code freeze? One thing that
came to mind is that a feature could have multiple subtasks, where some
have been merged and others are still in review. In this case, do we allow
the remaining subtasks a few more days to get reviewed and merged later?

Happy Holiday!

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro 
wrote:

> Looks nice, happy holiday, all!
>
> Bests,
> Takeshi
>
> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun 
> wrote:
>
>> +1 for January 31st.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li  wrote:
>>
>>> Jan 31 is pretty reasonable. Happy Holidays!
>>>
>>> Xiao
>>>
>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen  wrote:
>>>
>>>> Yep, always happens. Is earlier realistic, like Jan 15? It's all
>>>> arbitrary, but this has indeed been in progress for a while, and there's
>>>> a downside to not releasing it: the gap to 3.0 grows larger.
>>>> On my end I don't know of anything that's holding up a release; is it
>>>> basically DSv2?
>>>>
>>>> BTW these are the items still targeted to 3.0.0, some of which may not
>>>> have been legitimately tagged. It may be worth reviewing what's still open
>>>> and necessary, and what should be untargeted.
>>>>
>>>> SPARK-29768 nondeterministic expression fails column pruning
>>>> SPARK-29345 Add an API that allows a user to define and observe
>>>> arbitrary metrics on streaming queries
>>>> SPARK-29348 Add observable metrics
>>>> SPARK-29429 Support Prometheus monitoring natively
>>>> SPARK-29577 Implement p-value simulation and unit tests for chi2 test
>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>> SPARK-28588 Build a SQL reference doc
>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>> SPARK-28684 Hive module support JDK 11
>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after
>>>> some operations
>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>> relation table properly
>>>> SPARK-27986 Support Aggregate Expressions with filter
>>>> SPARK-28024 Incorrect numeric values when out of range
>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>>> smoother upgrade
>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>>> of joined tables > 12
>>>> SPARK-27471 Reorganize public v2 catalog API
>>>> SPARK-27520 Introduce a global config system to replace
>>>> hadoopConfiguration
>>>> SPARK-24625 put all the backward compatible behavior change configs
>>>> under spark.sql.legacy.*
>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>> SPARK-25383 Image data source supports sample pushdown
>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>>>> default
>>>> SPARK-27296 Efficient User Defined Aggregators
>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>>> cause driver pods to hang
>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>> SPARK-24942 Improve cluster resource management with jobs containing
>>>> barrier stage
>>>> SPARK-25914 Separate projection from grouping and aggregate in logical
>>>> Aggregate
>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>> SPARK-26425 Add more constraint checks in file streaming source to
>>>> avoid checkpoint corruption
>>>> SPARK-25843 Redesign rangeBetween API
>>>> SPARK-25841 Redesign window function rangeBetween API

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-24 Thread Takeshi Yamamuro
>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> SPARK-25186 Stabilize Data Source V2 API
>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>> execution mode
>>> SPARK-7768 Make user-defined type (UDT) API public
>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>> Spec
>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>> Spark
>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>> list of structures
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>
>>>
>>> On Mon, Dec 23, 2019 at 5:48 PM Reynold Xin  wrote:
>>>
>>>> We've pushed out 3.0 multiple times. The latest release window
>>>> documented on the website
>>>> <http://spark.apache.org/versioning-policy.html> says we'd code freeze
>>>> and cut branch-3.0 early Dec. It looks like we are suffering a bit from the
>>>> tragedy of the commons, that nobody is pushing for getting the release out.
>>>> I understand the natural tendency for each individual is to finish or
>>>> extend the feature/bug that the person has been working on. At some point
>>>> we need to say "this is it" and get the release out. I'm happy to help
>>>> drive this process.
>>>>
>>>> To be realistic, I don't think we should just code freeze *today*.
>>>> Although we have updated the website, contributors have all been operating
>>>> under the assumption that all active developments are still going on. I
>>>> propose we *cut the branch on **Jan 31**, and code freeze and switch
>>>> over to bug squashing mode, and try to get the 3.0 official release out in
>>>> Q1*. That is, by default no new features can go into the branch
>>>> starting Jan 31.
>>>>
>>>> What do you think?
>>>>
>>>> And happy holidays everybody.
>>>>
>>>>
>>>>
>>>>
>>
>>
>

-- 
---
Takeshi Yamamuro


Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-24 Thread Dongjoon Hyun
>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>> SPARK-19842 Informational Referential Integrity Constraints Support in
>> Spark
>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list
>> of structures
>> SPARK-22386 Data Source V2 improvements
>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>
>>
>> On Mon, Dec 23, 2019 at 5:48 PM Reynold Xin  wrote:
>>
>>> We've pushed out 3.0 multiple times. The latest release window
>>> documented on the website
>>> <http://spark.apache.org/versioning-policy.html> says we'd code freeze
>>> and cut branch-3.0 early Dec. It looks like we are suffering a bit from the
>>> tragedy of the commons, that nobody is pushing for getting the release out.
>>> I understand the natural tendency for each individual is to finish or
>>> extend the feature/bug that the person has been working on. At some point
>>> we need to say "this is it" and get the release out. I'm happy to help
>>> drive this process.
>>>
>>> To be realistic, I don't think we should just code freeze *today*.
>>> Although we have updated the website, contributors have all been operating
>>> under the assumption that all active developments are still going on. I
>>> propose we *cut the branch on **Jan 31**, and code freeze and switch
>>> over to bug squashing mode, and try to get the 3.0 official release out in
>>> Q1*. That is, by default no new features can go into the branch
>>> starting Jan 31.
>>>
>>> What do you think?
>>>
>>> And happy holidays everybody.
>>>
>>>
>>>
>>>
>
>


Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-24 Thread Xiao Li
Jan 31 is pretty reasonable. Happy Holidays!

Xiao

On Tue, Dec 24, 2019 at 5:52 AM Sean Owen  wrote:

> Yep, always happens. Is earlier realistic, like Jan 15? It's all arbitrary,
> but this has indeed been in progress for a while, and there's a downside to
> not releasing it: the gap to 3.0 grows larger.
> On my end I don't know of anything that's holding up a release; is it
> basically DSv2?
>
> BTW these are the items still targeted to 3.0.0, some of which may not
> have been legitimately tagged. It may be worth reviewing what's still open
> and necessary, and what should be untargeted.
>
> SPARK-29768 nondeterministic expression fails column pruning
> SPARK-29345 Add an API that allows a user to define and observe arbitrary
> metrics on streaming queries
> SPARK-29348 Add observable metrics
> SPARK-29429 Support Prometheus monitoring natively
> SPARK-29577 Implement p-value simulation and unit tests for chi2 test
> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
> SPARK-28588 Build a SQL reference doc
> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
> SPARK-28684 Hive module support JDK 11
> SPARK-28548 explain() shows wrong result for persisted DataFrames after
> some operations
> SPARK-28264 Revisiting Python / pandas UDF
> SPARK-28301 fix the behavior of table name resolution with multi-catalog
> SPARK-28155 do not leak SaveMode to file source v2
> SPARK-28103 Cannot infer filters from union table with empty local
> relation table properly
> SPARK-27986 Support Aggregate Expressions with filter
> SPARK-28024 Incorrect numeric values when out of range
> SPARK-27936 Support local dependency uploading from --py-files
> SPARK-27780 Shuffle server & client should be versioned to enable smoother
> upgrade
> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of
> joined tables > 12
> SPARK-27471 Reorganize public v2 catalog API
> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
> SPARK-24625 put all the backward compatible behavior change configs under
> spark.sql.legacy.*
> SPARK-24941 Add RDDBarrier.coalesce() function
> SPARK-25017 Add test suite for ContextBarrierState
> SPARK-25083 remove the type erasure hack in data source scan
> SPARK-25383 Image data source supports sample pushdown
> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
> default
> SPARK-27296 Efficient User Defined Aggregators
> SPARK-25128 multiple simultaneous job submissions against k8s backend
> cause driver pods to hang
> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
> SPARK-21559 Remove Mesos fine-grained mode
> SPARK-24942 Improve cluster resource management with jobs containing
> barrier stage
> SPARK-25914 Separate projection from grouping and aggregate in logical
> Aggregate
> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
> SPARK-26221 Improve Spark SQL instrumentation and metrics
> SPARK-26425 Add more constraint checks in file streaming source to avoid
> checkpoint corruption
> SPARK-25843 Redesign rangeBetween API
> SPARK-25841 Redesign window function rangeBetween API
> SPARK-25752 Add trait to easily whitelist logical operators that produce
> named output from CleanupAliases
> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
> aggregate
> SPARK-25531 new write APIs for data source v2
> SPARK-25547 Pluggable jdbc connection factory
> SPARK-20845 Support specification of column names in INSERT INTO
> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
> SPARK-25074 Implement maxNumConcurrentTasks() in
> MesosFineGrainedSchedulerBackend
> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-25186 Stabilize Data Source V2 API
> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
> execution mode
> SPARK-7768 Make user-defined type (UDT) API public
> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
> SPARK-15694 Implement ScriptTransformation in sql/core
> SPARK-18134 SQL: MapType in Group BY and Joins not working
> SPARK-19842 Informational Referential Integrity Constraints Support in
> Spark
> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list
> of structures
> SPARK-22386 Data Source V2 improvements
> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>
>
> On Mon, Dec 23, 2019 at 5:48 PM Reynold Xin  wrote:
>
>> We've pushed out 3.0 multiple times. The latest release window documented
>> on the website <http://spark.apache.org/versioning-policy.html> says we'd
>> code freeze and cut branch-3.0 early Dec.

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-24 Thread Sean Owen
Yep, always happens. Is earlier realistic, like Jan 15? It's all arbitrary,
but this has indeed been in progress for a while, and there's a downside to
not releasing it: the gap to 3.0 grows larger.
On my end I don't know of anything that's holding up a release; is it
basically DSv2?

BTW these are the items still targeted to 3.0.0, some of which may not have
been legitimately tagged. It may be worth reviewing what's still open and
necessary, and what should be untargeted.

SPARK-29768 nondeterministic expression fails column pruning
SPARK-29345 Add an API that allows a user to define and observe arbitrary
metrics on streaming queries (see the sketch after this list)
SPARK-29348 Add observable metrics
SPARK-29429 Support Prometheus monitoring natively
SPARK-29577 Implement p-value simulation and unit tests for chi2 test
SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
SPARK-28588 Build a SQL reference doc
SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
SPARK-28684 Hive module support JDK 11
SPARK-28548 explain() shows wrong result for persisted DataFrames after
some operations
SPARK-28264 Revisiting Python / pandas UDF
SPARK-28301 fix the behavior of table name resolution with multi-catalog
SPARK-28155 do not leak SaveMode to file source v2
SPARK-28103 Cannot infer filters from union table with empty local relation
table properly
SPARK-27986 Support Aggregate Expressions with filter
SPARK-28024 Incorrect numeric values when out of range
SPARK-27936 Support local dependency uploading from --py-files
SPARK-27780 Shuffle server & client should be versioned to enable smoother
upgrade
SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of
joined tables > 12
SPARK-27471 Reorganize public v2 catalog API
SPARK-27520 Introduce a global config system to replace hadoopConfiguration
SPARK-24625 put all the backward compatible behavior change configs under
spark.sql.legacy.*
SPARK-24941 Add RDDBarrier.coalesce() function
SPARK-25017 Add test suite for ContextBarrierState
SPARK-25083 remove the type erasure hack in data source scan
SPARK-25383 Image data source supports sample pushdown
SPARK-27272 Enable blacklisting of node/executor on fetch failures by
default
SPARK-27296 Efficient User Defined Aggregators
SPARK-25128 multiple simultaneous job submissions against k8s backend cause
driver pods to hang
SPARK-26664 Make DecimalType's minimum adjusted scale configurable
SPARK-21559 Remove Mesos fine-grained mode
SPARK-24942 Improve cluster resource management with jobs containing
barrier stage
SPARK-25914 Separate projection from grouping and aggregate in logical
Aggregate
SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
SPARK-26221 Improve Spark SQL instrumentation and metrics
SPARK-26425 Add more constraint checks in file streaming source to avoid
checkpoint corruption
SPARK-25843 Redesign rangeBetween API
SPARK-25841 Redesign window function rangeBetween API
SPARK-25752 Add trait to easily whitelist logical operators that produce
named output from CleanupAliases
SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
aggregate
SPARK-25531 new write APIs for data source v2
SPARK-25547 Pluggable jdbc connection factory
SPARK-20845 Support specification of column names in INSERT INTO
SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
SPARK-25074 Implement maxNumConcurrentTasks() in
MesosFineGrainedSchedulerBackend
SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-25186 Stabilize Data Source V2 API
SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
execution mode
SPARK-7768 Make user-defined type (UDT) API public
SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
SPARK-15694 Implement ScriptTransformation in sql/core
SPARK-18134 SQL: MapType in Group BY and Joins not working
SPARK-19842 Informational Referential Integrity Constraints Support in Spark
SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list
of structures
SPARK-22386 Data Source V2 improvements
SPARK-24723 Discuss necessary info and access in barrier mode + YARN
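
As a side note on SPARK-29345/SPARK-29348 above: a hedged sketch of the
observable-metrics API (Dataset.observe) as it eventually took shape. The
Scala API is what was targeted for 3.0; the PySpark method shown below may
require a later release, and the rate/noop formats are used only as toy
endpoints:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import count, lit

    spark = SparkSession.builder.getOrCreate()
    stream_df = spark.readStream.format("rate").load()  # toy streaming source

    # Attach a named, user-defined aggregate metric to the query plan.
    observed = stream_df.observe("input_stats", count(lit(1)).alias("rows"))

    query = observed.writeStream.format("noop").start()
    # After a micro-batch completes, the metric surfaces in progress events:
    # query.lastProgress["observedMetrics"]["input_stats"]
    query.stop()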


On Mon, Dec 23, 2019 at 5:48 PM Reynold Xin  wrote:

> We've pushed out 3.0 multiple times. The latest release window documented
> on the website <http://spark.apache.org/versioning-policy.html> says we'd
> code freeze and cut branch-3.0 early Dec. It looks like we are suffering a
> bit from the tragedy of the commons, that nobody is pushing for getting the
> release out. I understand the natural tendency for each individual is to
> finish or extend the feature/bug that the person has been working on. At
> some point we need to say "this is it" and get the release out. I'm happy
> to help drive this process.

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Hyukjin Kwon
Sounds fine.
I am trying to get pandas UDF redesign done (SPARK-28264
<https://issues.apache.org/jira/browse/SPARK-28264>) on time. Hope I can
make it.
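
For context, a minimal sketch of the redesigned pandas UDF style that
SPARK-28264 targets (Python type hints instead of an explicit PandasUDFType),
assuming the API shape that eventually shipped in Spark 3.0 and an active
SparkSession named spark:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("long")
    def plus_one(s: pd.Series) -> pd.Series:
        # The UDF variant (scalar here) is inferred from the type hints,
        # so no PandasUDFType argument is needed.
        return s + 1

    spark.range(3).select(plus_one("id")).show()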

On Tue, Dec 24, 2019 at 4:17 PM, Wenchen Fan wrote:

> Sounds good!
>
> On Tue, Dec 24, 2019 at 7:48 AM Reynold Xin  wrote:
>
>> We've pushed out 3.0 multiple times. The latest release window documented
>> on the website <http://spark.apache.org/versioning-policy.html> says
>> we'd code freeze and cut branch-3.0 early Dec. It looks like we are
>> suffering a bit from the tragedy of the commons, that nobody is pushing for
>> getting the release out. I understand the natural tendency for each
>> individual is to finish or extend the feature/bug that the person has been
>> working on. At some point we need to say "this is it" and get the release
>> out. I'm happy to help drive this process.
>>
>> To be realistic, I don't think we should just code freeze *today*.
>> Although we have updated the website, contributors have all been operating
>> under the assumption that all active developments are still going on. I
>> propose we *cut the branch on **Jan 31**, and code freeze and switch
>> over to bug squashing mode, and try to get the 3.0 official release out in
>> Q1*. That is, by default no new features can go into the branch starting Jan
>> 31.
>>
>> What do you think?
>>
>> And happy holidays everybody.
>>
>>
>>
>>


Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Wenchen Fan
Sounds good!

On Tue, Dec 24, 2019 at 7:48 AM Reynold Xin  wrote:

> We've pushed out 3.0 multiple times. The latest release window documented
> on the website <http://spark.apache.org/versioning-policy.html> says we'd
> code freeze and cut branch-3.0 early Dec. It looks like we are suffering a
> bit from the tragedy of the commons, that nobody is pushing for getting the
> release out. I understand the natural tendency for each individual is to
> finish or extend the feature/bug that the person has been working on. At
> some point we need to say "this is it" and get the release out. I'm happy
> to help drive this process.
>
> To be realistic, I don't think we should just code freeze *today*.
> Although we have updated the website, contributors have all been operating
> under the assumption that all active developments are still going on. I
> propose we *cut the branch on **Jan 31**, and code freeze and switch over
> to bug squashing mode, and try to get the 3.0 official release out in Q1*.
> That is, by default no new features can go into the branch starting Jan 31
> .
>
> What do you think?
>
> And happy holidays everybody.
>
>
>
>


Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Reynold Xin
We've pushed out 3.0 multiple times. The latest release window documented on 
the website ( http://spark.apache.org/versioning-policy.html ) says we'd code 
freeze and cut branch-3.0 early Dec. It looks like we are suffering a bit from 
the tragedy of the commons, that nobody is pushing for getting the release out. 
I understand the natural tendency for each individual is to finish or extend 
the feature/bug that the person has been working on. At some point we need to 
say "this is it" and get the release out. I'm happy to help drive this process.

To be realistic, I don't think we should just code freeze *today*. Although
we have updated the website, contributors have all been operating under the
assumption that all active developments are still going on. I propose we *cut
the branch on Jan 31, and code freeze and switch over to bug squashing
mode, and try to get the 3.0 official release out in Q1*. That is, by default
no new features can go into the branch starting Jan 31.

What do you think?

And happy holidays everybody.

Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Hyukjin Kwon
Oops, one more - https://github.com/apache/spark/pull/6. I just read
this thread.

On Thu, Sep 6, 2018 at 12:12 PM, Sean Owen wrote:

> (I slipped https://github.com/apache/spark/pull/22340 in for Scala 2.12.
> Maybe it really is the last one. In any event, yes go ahead with a 2.4 RC)
>
> On Wed, Sep 5, 2018 at 8:14 PM Wenchen Fan  wrote:
>
>> The repartition correctness bug fix is merged. The Scala 2.12 PRs
>> mentioned in this thread are all merged. The Kryo upgrade is done.
>>
>> I'm going to cut the branch 2.4 since all the major blockers are now
>> resolved.
>>
>> Thanks,
>> Wenchen
>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Sean Owen
(I slipped https://github.com/apache/spark/pull/22340 in for Scala 2.12.
Maybe it really is the last one. In any event, yes go ahead with a 2.4 RC)

On Wed, Sep 5, 2018 at 8:14 PM Wenchen Fan  wrote:

> The repartition correctness bug fix is merged. The Scala 2.12 PRs
> mentioned in this thread are all merged. The Kryo upgrade is done.
>
> I'm going to cut the branch 2.4 since all the major blockers are now
> resolved.
>
> Thanks,
> Wenchen
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Wenchen Fan
The repartition correctness bug fix is merged. The Scala 2.12 PRs mentioned
in this thread are all merged. The Kryo upgrade is done.

I'm going to cut the branch 2.4 since all the major blockers are now
resolved.

Thanks,
Wenchen

On Sun, Sep 2, 2018 at 12:07 AM sadhen  wrote:

> https://github.com/apache/spark/pull/22308
>
> https://github.com/apache/spark/pull/22310
>
>
> These two might be the last fixes for Scala 2.12 :)
>
>
> Please review.
>
>  Original message
> *From:* Sean Owen
> *To:* antonkulaga
> *Cc:* dev
> *Sent:* Friday, August 31, 2018, 05:00
> *Subject:* Re: code freeze and branch cut for Apache Spark 2.4
>
> I know it's famous last words, but we really might be down to the last
> fix: https://github.com/apache/spark/pull/22264 More a question of making
> tests happy at this point I think than fundamental problems. My goal is to
> make sure we can release a usable, but beta-quality, 2.12 release of Spark
> in 2.4.
>
> On Thu, Aug 30, 2018 at 3:56 PM antonkulaga  wrote:
>
>> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
>> coming
>> up and we don't need to block Spark 2.4 on this.
>>
>> I think it can be better to wait a bit for Scala 2.12 support in 2.4 than
>> to suffer many months until a Spark 2.5 with 2.12 support is released.
>> Scala 2.12 is not only about Spark but also about a lot of Scala libraries
>> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
>> Scala 2.12, then people will not be able to use them in their Zeppelin,
>> Jupyter, and other notebooks together with Spark.
>>
>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-09-01 Thread sadhen
https://github.com/apache/spark/pull/22308
https://github.com/apache/spark/pull/22310


These two might be the last fixes for Scala 2.12 :)


Please review.


Original message
From: Sean Owen sro...@apache.org
To: antonkulaga antonkul...@gmail.com
Cc: dev...@spark.apache.org
Sent: Friday, August 31, 2018, 05:00
Subject: Re: code freeze and branch cut for Apache Spark 2.4


I know it's famous last words, but we really might be down to the last fix:
https://github.com/apache/spark/pull/22264 More a question of making tests
happy at this point I think than fundamental problems. My goal is to make sure
we can release a usable, but beta-quality, 2.12 release of Spark in 2.4.


On Thu, Aug 30, 2018 at 3:56 PM antonkulaga antonkul...@gmail.com wrote:

There are a few PRs to fix Scala 2.12 issues. I think they will keep coming
 up and we don't need to block Spark 2.4 on this.
 
 I think it can be better to wait a bit for Scala 2.12 support in 2.4 than to
 suffer many months until a Spark 2.5 with 2.12 support is released. Scala
 2.12 is not only about Spark but also about a lot of Scala libraries that
 have stopped supporting Scala 2.11; if Spark 2.4 does not support Scala
 2.12, then people will not be able to use them in their Zeppelin, Jupyter,
 and other notebooks together with Spark.

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread shane knapp
+1 on beta support for scala 2.12

On Thu, Aug 30, 2018 at 2:33 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> +1, that would be great Sean; also, you put a lot of effort in there, so
> it would make sense to wait a bit.
>
> Stavros
>
> On Fri, Aug 31, 2018 at 12:00 AM, Sean Owen  wrote:
>
>> I know it's famous last words, but we really might be down to the last
>> fix: https://github.com/apache/spark/pull/22264 More a question of
>> making tests happy at this point I think than fundamental problems. My goal
>> is to make sure we can release a usable, but beta-quality, 2.12 release of
>> Spark in 2.4.
>>
>> On Thu, Aug 30, 2018 at 3:56 PM antonkulaga 
>> wrote:
>>
>>> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
>>> coming
>>> up and we don't need to block Spark 2.4 on this.
>>>
>>> I think it can be better to wait a bit for Scala 2.12 support in 2.4
>>> than to suffer many months until a Spark 2.5 with 2.12 support is
>>> released. Scala 2.12 is not only about Spark but also about a lot of
>>> Scala libraries that have stopped supporting Scala 2.11; if Spark 2.4
>>> does not support Scala 2.12, then people will not be able to use them in
>>> their Zeppelin, Jupyter, and other notebooks together with Spark.
>>>
>>>
>
>
>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Stavros Kontopoulos
+1, that would be great Sean; also, you put a lot of effort in there, so it
would make sense to wait a bit.

Stavros

On Fri, Aug 31, 2018 at 12:00 AM, Sean Owen  wrote:

> I know it's famous last words, but we really might be down to the last
> fix: https://github.com/apache/spark/pull/22264 More a question of making
> tests happy at this point I think than fundamental problems. My goal is to
> make sure we can release a usable, but beta-quality, 2.12 release of Spark
> in 2.4.
>
> On Thu, Aug 30, 2018 at 3:56 PM antonkulaga  wrote:
>
>> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
>> coming
>> up and we don't need to block Spark 2.4 on this.
>>
>> I think it can be better to wait a bit for Scala 2.12 support in 2.4 than
>> to suffer many months until a Spark 2.5 with 2.12 support is released.
>> Scala 2.12 is not only about Spark but also about a lot of Scala libraries
>> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
>> Scala 2.12, then people will not be able to use them in their Zeppelin,
>> Jupyter, and other notebooks together with Spark.
>>
>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Sean Owen
I know it's famous last words, but we really might be down to the last fix:
https://github.com/apache/spark/pull/22264 More a question of making tests
happy at this point I think than fundamental problems. My goal is to make
sure we can release a usable, but beta-quality, 2.12 release of Spark in
2.4.

On Thu, Aug 30, 2018 at 3:56 PM antonkulaga  wrote:

> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
> coming
> up and we don't need to block Spark 2.4 on this.
>
> I think it can be better to wait a bit for Scala 2.12 support in 2.4 than
> to suffer many months until a Spark 2.5 with 2.12 support is released.
> Scala 2.12 is not only about Spark but also about a lot of Scala libraries
> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
> Scala 2.12, then people will not be able to use them in their Zeppelin,
> Jupyter, and other notebooks together with Spark.
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Reynold Xin
Let's see how they go. At some point we do need to cut the release. That
argument can be made on every feature, and different people place different
value / importance on different features, so we could just end up never
making a release.



On Thu, Aug 30, 2018 at 1:56 PM antonkulaga  wrote:

> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
> coming
> up and we don't need to block Spark 2.4 on this.
>
> I think it can be better to wait a bit for Scala 2.12 support in 2.4 than
> to suffer many months until a Spark 2.5 with 2.12 support is released.
> Scala 2.12 is not only about Spark but also about a lot of Scala libraries
> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
> Scala 2.12, then people will not be able to use them in their Zeppelin,
> Jupyter, and other notebooks together with Spark.
>
>
>
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-29 Thread Wenchen Fan
A few updates on this thread:

We still have a blocking issue, the repartition correctness bug:
https://github.com/apache/spark/pull/22112
It's close to merging.
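
For readers unfamiliar with the bug, a hedged illustration of the pattern
that makes SPARK-23243 a correctness issue (this shows only the sensitive
shape, not the failure itself): repartition() places rows round-robin, so a
row's destination partition depends on the order rows arrive from the
upstream shuffle. If that stage is recomputed after a fetch failure and emits
rows in a different order, retried tasks can place rows differently than the
original attempt, duplicating some records and dropping others.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    pairs = sc.parallelize(range(100000), 100).map(lambda x: (x % 17, x))
    grouped = pairs.groupByKey()       # shuffle output order is not guaranteed
    resized = grouped.repartition(10)  # round-robin placement is order-sensitive
    print(resized.count())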

There are a few PRs to fix Scala 2.12 issues. I think they will keep coming
up and we don't need to block Spark 2.4 on this.

All other features/issues mentioned in this thread are either finished or
retargeted to the next release, so hopefully we can cut the branch this week.

Thanks to everyone for your contributions! Please reply to this email if
you think something should be done before Spark 2.4.

Thanks,
Wenchen

On Tue, Aug 14, 2018 at 12:57 AM Xingbo Jiang  wrote:

> I'm working on the fix of SPARK-23243
>  and should be able to
> push another commit in 1~2 days. More detailed discussion can go to the PR.
> Thanks for pushing this issue forward! I really appreciate everyone's
> efforts in submitting PRs and getting actively involved in the discussions!
>
> 2018-08-13 22:50 GMT+08:00 Tom Graves :
>
>> I agree with Imran, we need to fix SPARK-23243
>>  and any correctness
>> issues for that matter.
>>
>> Tom
>>
>> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid
>>  wrote:
>>
>>
>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>
>> SPARK-23243 : 
>> Shuffle+Repartition
>> on an RDD could lead to incorrect answers
>> It turns out to be a very complicated issue, there is no consensus about
>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>> long-standing issue, not a regression.
>>
>>
>> This is a really serious data loss bug. Yes, it's very complex, but we
>> absolutely have to fix this; I really think it should be in 2.4.
>> Has work on it stopped?
>>
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-13 Thread Xingbo Jiang
I'm working on the fix of SPARK-23243
 and should be able to push
another commit in 1~2 days. More detailed discussion can go to the PR.
Thanks for pushing this issue forward! I really appreciate everyone's
efforts in submitting PRs and getting actively involved in the discussions!

2018-08-13 22:50 GMT+08:00 Tom Graves :

> I agree with Imran, we need to fix SPARK-23243
>  and any correctness
> issues for that matter.
>
> Tom
>
> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid
>  wrote:
>
>
> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>
> SPARK-23243 : 
> Shuffle+Repartition
> on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about
> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
> long-standing issue, not a regression.
>
>
> This is a really serious data loss bug. Yes, it's very complex, but we
> absolutely have to fix this; I really think it should be in 2.4.
> Has work on it stopped?
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-13 Thread Tom Graves
I agree with Imran, we need to fix SPARK-23243 and any correctness issues
for that matter.

Tom

On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid  wrote:

On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:

SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
It turns out to be a very complicated issue; there is no consensus about
what the right fix is yet. Likely to miss Spark 2.4 because it's a
long-standing issue, not a regression.

This is a really serious data loss bug. Yes, it's very complex, but we
absolutely have to fix this; I really think it should be in 2.4. Has work
on it stopped?

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-11 Thread Petar Zečević


Hi, I made some changes to SPARK-24020 
(https://github.com/apache/spark/pull/21109) and implemented spill-over to 
disk. I believe there are no objections to the implementation left and that 
this can now be merged.

Please take a look.

Thanks,

Petar Zečević


Wenchen Fan wrote:

> Some updates for the JIRA tickets that we want to resolve before Spark 2.4.
>
> green: merged
> orange: in progress
> red: likely to miss
>
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> The core functionality is finished, but we still need to add Python API.
> Tracked by SPARK-24822 (see the barrier sketch after this quote)
>
> SPARK-23899: Built-in SQL Function Improvement
> I think it's ready to go. Although there are still some functions working in 
> progress, the common ones are all merged.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> It's close, just one last piece. Tracked by SPARK-25029
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> Being reviewed.
>
> SPARK-24882: data source v2 API improvement
> PR is out, being reviewed.
>
> SPARK-24252: Add catalog support in Data Source V2
> Being reviewed.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> It's close, just one last piece: the decimal type support
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about what 
> is the right fix yet. Likely to miss it in Spark 2.4 because it's a 
> long-standing issue, not a regression.
>
> SPARK-24598: Datatype overflow conditions gives incorrect result
> We decided to keep the current behavior in Spark 2.4 and add some
> documentation (already done). We will re-consider this change in Spark 3.0.
>
> SPARK-24020: Sort-merge join inner range optimization
> There are some discussions about the design, I don't think we can get to a 
> consensus within Spark 2.4.
>
> SPARK-24296: replicating large blocks over 2GB
> Being reviewed.
>
> SPARK-23874: upgrade to Apache Arrow 0.10.0
> Apache Arrow 0.10.0 has some critical bug fixes and is being voted, we should 
> wait a few days.
>
> According to the status, I think we should wait a few more days. Any 
> objections?
>
> Thanks,
> Wenchen
>
> On Tue, Aug 7, 2018 at 3:39 AM Sean Owen  wrote:
>
>  ... and we still have a few snags with Scala 2.12 support at 
> https://issues.apache.org/jira/browse/SPARK-25029 
>
>  There is some hope of resolving it on the order of a week, so for the 
> moment, seems worth holding 2.4 for.
>
>  On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:
>
>  Hi All,
>
>  I'd like to request a few days extension to the code freeze to complete the 
> upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several 
> key improvements and bug fixes.  The RC vote just passed this morning and code
>  changes are complete in https://github.com/apache/spark/pull/21939. We just 
> need some time for the release artifacts to be available. Thoughts?
>
>  Thanks,
>  Bryan
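
Since the quoted status update leads with barrier execution mode
(SPARK-24374, see above), here is a hedged sketch of the RDD.barrier() API
as it shipped in Spark 2.4; treat the exact import path as an assumption to
verify against your PySpark version:

    from pyspark import SparkContext
    from pyspark.taskcontext import BarrierTaskContext

    sc = SparkContext.getOrCreate()  # needs >= 4 free slots, e.g. local[4]

    def train_partition(rows):
        # All tasks in a barrier stage launch together and can synchronize
        # mid-stage, which gang-scheduled ML frameworks need.
        ctx = BarrierTaskContext.get()
        ctx.barrier()  # wait until every task in the stage reaches this point
        yield (ctx.partitionId(), sum(rows))

    rdd = sc.parallelize(range(8), 4)
    print(rdd.barrier().mapPartitions(train_partition).collect())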





Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
> I also think it's a good idea to test against newer Python versions. But I
> don't know how difficult it is and whether or not it's feasible to resolve
> that between branch cut and RC cut.

unless someone pops in to this thread and tells me w/o a doubt that all
spark branches will happily pass against 3.5, it will not happen until
after the 2.4 cut.  :)

however, from my (limited) testing, it does look like that's the case.
still not gonna pull the trigger on it until after the cut.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Li Jin
I agree with Bryan. If it's acceptable to have another job to test with
Python 3.5 and pyarrow 0.10.0, I am leaning towards upgrading Arrow.

Arrow 0.10.0 has tons of bug fixes and improvements since 0.8.0, including
important memory leak fixes such as
https://issues.apache.org/jira/browse/ARROW-1973. I think releasing with
0.10.0 will improve the overall experience of Arrow-related features quite a
bit.

I also think it's a good idea to test against newer Python versions. But I
don't know how difficult it is and whether or not it's feasible to resolve
that between branch cut and RC cut.

On Fri, Aug 10, 2018 at 5:44 PM, shane knapp  wrote:

> see:  https://github.com/apache/spark/pull/21939#issuecomment-412154343
>
> yes, i can set up a build.  have some Qs in the PR about building the
> spark package before running the python tests.
>
> On Fri, Aug 10, 2018 at 10:41 AM, Bryan Cutler  wrote:
>
>> I agree that we should hold off on the Arrow upgrade if it requires major
>> changes to our testing. I did have another thought that maybe we could just
>> add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
>> current testing the same? I'm not sure how doable that is right now and
>> don't want to make a ton of extra work, so no objections from me to hold
>> off on things for now.
>>
>> On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:
>>
>>> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan 
>>> wrote:
>>>
 It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
 it to Spark 3.0, so that we have more time to test. Any objections?

>>>
>>> none here.
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
python 3.5/pyarrow 0.10.0 build:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.6-python-3.5-arrow-0.10.0-ubuntu-testing/

On Fri, Aug 10, 2018 at 10:44 AM, shane knapp  wrote:

> see:  https://github.com/apache/spark/pull/21939#issuecomment-412154343
>
> yes, i can set up a build.  have some Qs in the PR about building the
> spark package before running the python tests.
>
> On Fri, Aug 10, 2018 at 10:41 AM, Bryan Cutler  wrote:
>
>> I agree that we should hold off on the Arrow upgrade if it requires major
>> changes to our testing. I did have another thought that maybe we could just
>> add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
>> current testing the same? I'm not sure how doable that is right now and
>> don't want to make a ton of extra work, so no objections from me to hold
>> off on things for now.
>>
>> On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:
>>
>>> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan 
>>> wrote:
>>>
 It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
 it to Spark 3.0, so that we have more time to test. Any objections?

>>>
>>> none here.
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
see:  https://github.com/apache/spark/pull/21939#issuecomment-412154343

yes, i can set up a build.  have some Qs in the PR about building the spark
package before running the python tests.

On Fri, Aug 10, 2018 at 10:41 AM, Bryan Cutler  wrote:

> I agree that we should hold off on the Arrow upgrade if it requires major
> changes to our testing. I did have another thought that maybe we could just
> add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
> current testing the same? I'm not sure how doable that is right now and
> don't want to make a ton of extra work, so no objections from me to hold
> off on things for now.
>
> On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:
>
>> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:
>>
>>> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
>>> it to Spark 3.0, so that we have more time to test. Any objections?
>>>
>>
>> none here.
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Bryan Cutler
I agree that we should hold off on the Arrow upgrade if it requires major
changes to our testing. I did have another thought that maybe we could just
add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
current testing the same? I'm not sure how doable that is right now and
don't want to make a ton of extra work, so no objections from me to hold
off on things for now.

On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:

> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:
>
>> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
>> it to Spark 3.0, so that we have more time to test. Any objections?
>>
>
> none here.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:

> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it
> to Spark 3.0, so that we have more time to test. Any objections?
>

none here.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Wenchen Fan
It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it
to Spark 3.0, so that we have more time to test. Any objections?

On Fri, Aug 10, 2018 at 11:53 PM shane knapp  wrote:

> quick update from my end:
>
> SPARK-24433 (SparkR/k8s) depends on SPARK-25087 (move builds to ubuntu)
>
> SPARK-23874 (arrow -> 0.10.0) now depends on SPARK-25079 (python 3.5
> upgrade)
>
> both SPARK-25087 and SPARK-25079 are in progress and i'm very very
> hesitant to do these upgrades before the code freeze/branch cut.  i've done
> a TON of testing, but even as of yesterday afternoon, i'm still uncovering
> bugs and things that need fixing both on the infrastructure side and spark
> itself.
>
> h/t sean owen for helping out on SPARK-24950
>
> On Wed, Aug 8, 2018 at 10:51 AM, Mark Hamstra 
> wrote:
>
>> I'm inclined to agree. Just saying that it is not a regression doesn't
>> really cut it when it is a now-known data correctness issue. We need
>> something a lot more than nothing before releasing 2.4.0. At a bare
>> minimum, that has to be much more complete and publicly highlighted
>> documentation of the issue so that users are less likely to stumble into
>> this unaware; but really we need to fix at least the most common cases of
>> this bug. Backports to maintenance branches are also probably in order.
>>
>> On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid 
>> wrote:
>>
>>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>>>
>>>> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>: 
>>>> Shuffle+Repartition
>>>> on an RDD could lead to incorrect answers
>>>> It turns out to be a very complicated issue, there is no consensus
>>>> about what is the right fix yet. Likely to miss it in Spark 2.4 because
>>>> it's a long-standing issue, not a regression.
>>>>
>>>
>>> This is a really serious data loss bug.  Yes it's very complex, but we
>>> absolutely have to fix this; I really think it should be in 2.4.
>>> Has work on it stopped?
>>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
quick update from my end:

SPARK-24433 (SparkR/k8s) depends on SPARK-25087 (move builds to ubuntu)

SPARK-23874 (arrow -> 0.10.0) now depends on SPARK-25079 (python 3.5
upgrade)

both SPARK-25087 and SPARK-25079 are in progress and i'm very very hesitant
to do these upgrades before the code freeze/branch cut.  i've done a TON of
testing, but even as of yesterday afternoon, i'm still uncovering bugs and
things that need fixing both on the infrastructure side and spark itself.

h/t sean owen for helping out on SPARK-24950

On Wed, Aug 8, 2018 at 10:51 AM, Mark Hamstra 
wrote:

> I'm inclined to agree. Just saying that it is not a regression doesn't
> really cut it when it is a now-known data correctness issue. We need
> something a lot more than nothing before releasing 2.4.0. At a bare
> minimum, that has to be much more complete and publicly highlighted
> documentation of the issue so that users are less likely to stumble into
> this unaware; but really we need to fix at least the most common cases of
> this bug. Backports to maintenance branches are also probably in order.
>
> On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid 
> wrote:
>
>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>>
>>> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>: 
>>> Shuffle+Repartition
>>> on an RDD could lead to incorrect answers
>>> It turns out to be a very complicated issue, there is no consensus about
>>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>>> long-standing issue, not a regression.
>>>
>>
>> This is a really serious data loss bug.  Yes it's very complex, but we
>> absolutely have to fix this; I really think it should be in 2.4.
>> Has work on it stopped?
>>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Mark Hamstra
I'm inclined to agree. Just saying that it is not a regression doesn't
really cut it when it is a now-known data correctness issue. We need
something a lot more than nothing before releasing 2.4.0. At a bare
minimum, that has to be much more complete and publicly highlighted
documentation of the issue so that users are less likely to stumble into
this unaware; but really we need to fix at least the most common cases of
this bug. Backports to maintenance branches are also probably in order.

On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid 
wrote:

> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>
>> SPARK-23243 : 
>> Shuffle+Repartition
>> on an RDD could lead to incorrect answers
>> It turns out to be a very complicated issue, there is no consensus about
>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>> long-standing issue, not a regression.
>>
>
> This is a really serious data loss bug.  Yes it's very complex, but we
> absolutely have to fix this; I really think it should be in 2.4.
> Has work on it stopped?
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Imran Rashid
On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>
> SPARK-23243 : 
> Shuffle+Repartition
> on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about
> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
> long-standing issue, not a regression.
>

This is a really serious data loss bug.  Yes it's very complex, but we
absolutely have to fix this; I really think it should be in 2.4.
Has work on it stopped?


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread John Zhuge
+1 on SPARK-25004. We have found it quite useful to diagnose PySpark OOM.

On Tue, Aug 7, 2018 at 1:21 PM Holden Karau  wrote:

> I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon);
> solving some of the persistent Python memory issues we've had for years
> would be really amazing to get in.
>
> On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves 
> wrote:
>
>> I would like to get clarification on our avro compatibility story before
>> the release.  anyone interested please look at -
>> https://issues.apache.org/jira/browse/SPARK-24924 . I probably should
>> have filed a separate jira and can if we don't resolve via discussion there.
>>
>> Tom
>>
>> On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp <
>> skn...@berkeley.edu> wrote:
>>
>>
>> According to the status, I think we should wait a few more days. Any
>> objections?
>>
>>
>> none here.
>>
>> i'm also pretty certain that waiting until after the code freeze to start
>> testing the GHPRB on ubuntu is the wisest course of action for us.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


-- 
John


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Holden Karau
I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon);
solving some of the persistent Python memory issues we've had for years
would be really amazing to get in.
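
For anyone tracking SPARK-25004: the proposal adds a dedicated memory cap for
the Python worker processes, accounted against the executor's container
request. A minimal sketch; the config name comes from the PR and is
hypothetical until it merges:

    from pyspark.sql import SparkSession

    # spark.executor.pyspark.memory (name from the SPARK-25004 PR, not
    # yet merged) caps the Python workers separately, so a memory-hungry
    # UDF fails cleanly instead of the whole container being killed by
    # YARN or Kubernetes.
    spark = (
        SparkSession.builder
        .config("spark.executor.memory", "4g")
        .config("spark.executor.pyspark.memory", "2g")
        .getOrCreate()
    )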

On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves 
wrote:

> I would like to get clarification on our avro compatibility story before
> the release.  anyone interested please look at -
> https://issues.apache.org/jira/browse/SPARK-24924 . I probably should
> have filed a separate jira and can if we don't resolve via discussion there.
>
> Tom
>
> On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp <
> skn...@berkeley.edu> wrote:
>
>
> According to the status, I think we should wait a few more days. Any
> objections?
>
>
> none here.
>
> i'm also pretty certain that waiting until after the code freeze to start
> testing the GHPRB on ubuntu is the wisest course of action for us.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Tom Graves
I would like to get clarification on our Avro compatibility story before the
release. Anyone interested, please look at
https://issues.apache.org/jira/browse/SPARK-24924. I probably should have
filed a separate jira and can if we don't resolve it via discussion there.
Tom 
On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp  wrote:
 
 
According to the status, I think we should wait a few more days. Any objections?


none here.
i'm also pretty certain that waiting until after the code freeze to start 
testing the GHPRB on ubuntu is the wisest course of action for us.
shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
  

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread shane knapp
>
> According to the status, I think we should wait a few more days. Any
> objections?
>
none here.

i'm also pretty certain that waiting until after the code freeze to start
testing the GHPRB on ubuntu is the wisest course of action for us.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Wenchen Fan
Some updates for the JIRA tickets that we want to resolve before Spark 2.4.

green: merged
orange: in progress
red: likely to miss

SPARK-24374 <https://issues.apache.org/jira/browse/SPARK-24374>: Support
Barrier Execution Mode in Apache Spark
The core functionality is finished, but we still need to add the Python API.
Tracked by SPARK-24822 <https://issues.apache.org/jira/browse/SPARK-24822>
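
A minimal sketch of what barrier mode looks like from PySpark once the Python
API lands; the names below (RDD.barrier(), BarrierTaskContext) mirror the
Scala API and should be read as illustrative until SPARK-24822 is merged:

    from pyspark import BarrierTaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()
    sc = spark.sparkContext

    def train_partition(it):
        # Tasks in a barrier stage are launched together, and barrier()
        # blocks until every task in the stage reaches this point --
        # the gang scheduling that distributed deep-learning frameworks
        # need.
        ctx = BarrierTaskContext.get()
        ctx.barrier()
        yield sum(1 for _ in it)

    rdd = sc.parallelize(range(1000), 4)
    print(rdd.barrier().mapPartitions(train_partition).collect())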

SPARK-23899 <https://issues.apache.org/jira/browse/SPARK-23899>: Built-in
SQL Function Improvement
I think it's ready to go. Although there are still some functions in
progress, the common ones are all merged.

SPARK-14220 <https://issues.apache.org/jira/browse/SPARK-14220>: Build and
test Spark against Scala 2.12
It's close, just one last piece. Tracked by SPARK-25029
<https://issues.apache.org/jira/browse/SPARK-25029>

SPARK-4502 <https://issues.apache.org/jira/browse/SPARK-4502>: Spark SQL
reads unnecessary nested fields from Parquet
Being reviewed.
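
To make the problem concrete: selecting a single leaf of a nested struct
should not force a scan of the whole struct. A minimal sketch; the flag
spark.sql.optimizer.nestedSchemaPruning.enabled is the one the patch
introduces and is an assumption here:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame([
        Row(name=Row(first="Ada", last="Lovelace"), city="London"),
    ]).write.mode("overwrite").parquet("/tmp/people")

    # Without the fix, selecting name.first still reads the entire
    # `name` struct from Parquet; with pruning, the ReadSchema in the
    # physical plan shrinks to just the `first` leaf.
    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
    spark.read.parquet("/tmp/people").select("name.first").explain()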

SPARK-24882 <https://issues.apache.org/jira/browse/SPARK-24882>: data
source v2 API improvement
PR is out, being reviewed.

SPARK-24252 <https://issues.apache.org/jira/browse/SPARK-24252>: Add
catalog support in Data Source V2
Being reviewed.

SPARK-24768 <https://issues.apache.org/jira/browse/SPARK-24768>: Have a
built-in AVRO data source implementation
It's close, just one last piece: the decimal type support
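
For context, the intended user-facing shape once it merges; the format name
and package coordinates are assumptions (the avro module ships separately
from the core jar):

    from pyspark.sql import SparkSession

    # Launch with the avro module on the classpath, e.g.
    #   spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 app.py
    spark = SparkSession.builder.getOrCreate()

    df = spark.read.format("avro").load("/data/events.avro")
    df.printSchema()

    # Writing is symmetric; decimal columns are the piece still in review.
    df.write.format("avro").save("/data/events_copy.avro")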

SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>:
Shuffle+Repartition
on an RDD could lead to incorrect answers
It turns out to be a very complicated issue; there is no consensus yet about
what the right fix is. It will likely miss Spark 2.4 because it's a
long-standing issue, not a regression.
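
Roughly, the failure mode: repartition() deals rows to output partitions
round-robin in the order its shuffled input arrives, and that order is
nondeterministic. If some tasks are retried after a fetch failure while
others are not, the retried tasks can deal rows out differently than the
originals did, duplicating some rows and dropping others. A sketch of the
vulnerable shape, not a deterministic reproduction:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Round-robin assignment here depends on the (nondeterministic)
    # arrival order of the shuffled input in each task.
    data = sc.parallelize(range(1000000), 100).repartition(50)

    # If an executor is lost at this point, only the failed repartition
    # tasks are retried; a retried task may see its input in a different
    # order, so its assignment no longer lines up with what the
    # surviving tasks produced.
    print(data.distinct().count())  # can differ from 1000000 after such a retry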

SPARK-24598 <https://issues.apache.org/jira/browse/SPARK-24598>: Datatype
overflow conditions gives incorrect result
We decided to keep the current behavior in Spark 2.4 and add some
documentation (already done). We will reconsider this change in Spark 3.0.
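
Concretely, arithmetic today follows Java's two's-complement semantics and
wraps silently; the behavior being documented looks like this (the output
value is assumed from the wrapping semantics):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # INT.MaxValue + 1 wraps around to INT.MinValue (-2147483648)
    # instead of raising an error or widening the result type.
    spark.sql(
        "SELECT CAST(2147483647 AS INT) + CAST(1 AS INT) AS wrapped"
    ).show()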

SPARK-24020 <https://issues.apache.org/jira/browse/SPARK-24020>: Sort-merge
join inner range optimization
There are some discussions about the design; I don't think we can get to a
consensus within Spark 2.4.
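
For readers unfamiliar with the proposal: it targets sort-merge joins that
carry a range predicate on top of the equi-join keys, where today the range
condition is evaluated against every equal-key row pair. A sketch of the
query shape it would speed up (table and column names hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.parquet("/data/events")    # device_id, ts, payload
    windows = spark.read.parquet("/data/windows")  # device_id, start_ts, end_ts

    # Equi-join on device_id plus a range condition on ts; the proposed
    # optimization advances through the sorted ts values within each key
    # group instead of filtering every matching pair.
    events.join(
        windows,
        (events.device_id == windows.device_id)
        & (events.ts >= windows.start_ts)
        & (events.ts <= windows.end_ts),
    ).explain()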

SPARK-24296 <https://issues.apache.org/jira/browse/SPARK-24296>: replicating
large blocks over 2GB
Being reviewed.

SPARK-23874 <https://issues.apache.org/jira/browse/SPARK-23874>: upgrade to
Apache Arrow 0.10.0
Apache Arrow 0.10.0 has some critical bug fixes and is being voted on; we
should wait a few days.
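
The Arrow-dependent surface the upgrade exercises is mainly pandas UDFs and
the vectorized pandas conversion; a minimal sketch against the 2.3-era API:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(v):
        # v is a pandas.Series backed by an Arrow batch, not a single row.
        return v + 1

    df = spark.range(0, 10).selectExpr("CAST(id AS DOUBLE) AS x")
    df.select(plus_one("x")).show()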


According to the status, I think we should wait a few more days. Any
objections?

Thanks,
Wenchen


On Tue, Aug 7, 2018 at 3:39 AM Sean Owen  wrote:

> ... and we still have a few snags with Scala 2.12 support at
> https://issues.apache.org/jira/browse/SPARK-25029
>
> There is some hope of resolving it on the order of a week, so for the
> moment, seems worth holding 2.4 for.
>
> On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:
>
>> Hi All,
>>
>> I'd like to request a few days extension to the code freeze to complete
>> the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes
>> several key improvements and bug fixes.  The RC vote just passed this
>> morning and code changes are complete in
>> https://github.com/apache/spark/pull/21939. We just need some time for
>> the release artifacts to be available. Thoughts?
>>
>> Thanks,
>> Bryan
>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Sean Owen
... and we still have a few snags with Scala 2.12 support at
https://issues.apache.org/jira/browse/SPARK-25029

There is some hope of resolving it on the order of a week, so for the
moment, seems worth holding 2.4 for.

On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:

> Hi All,
>
> I'd like to request a few days extension to the code freeze to complete
> the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes
> several key improvements and bug fixes.  The RC vote just passed this
> morning and code changes are complete in
> https://github.com/apache/spark/pull/21939. We just need some time for
> the release artifacts to be available. Thoughts?
>
> Thanks,
> Bryan
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Bryan Cutler
Hi All,

I'd like to request a few days extension to the code freeze to complete the
upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several
key improvements and bug fixes.  The RC vote just passed this morning and
code changes are complete in https://github.com/apache/spark/pull/21939. We
just need some time for the release artifacts to be available. Thoughts?

Thanks,
Bryan

On Wed, Aug 1, 2018, 5:34 PM shane knapp  wrote:

> ++ssuchter (who kindly set up the initial k8s builds while i hammered on
> the backend)
>
> while i'm pretty confident (read: 99%) that the pull request builds will
> work on the new ubuntu workers:
>
> 1) i'd like to do more stress testing of other spark builds (in progress)
> 2) i'd like to reimage more centos workers before moving the PRB due to
> potential executor starvation, and my lead sysadmin is out until next monday
> 3) we will need to get rid of the ubuntu-specific k8s builds and merge
> that functionality into the existing PRB job.  after that:  testing and
> babysitting
>
> regarding (1):  if these damn builds didn't take 4+ hours, it would be
> going a lot quicker.  ;)
> regarding (2):  adding two more ubuntu workers would make me comfortable
> WRT number of available executors, and i will guarantee that can happen by
> EOD on the 7th.
> regarding (3):  this should take about a day, and realistically the
> earliest we can get this started is the 8th.  i haven't even had a chance
> to start looking at this stuff yet, either.
>
> if we push release by a week, i think i can get things sorted w/o
> impacting the release schedule.  there will still be a bunch of stuff to
> clean up from the old centos builds (specifically docs, packaging and
> release), but i'll leave the existing and working infrastructure in place
> for now.
>
> shane
>
> On Wed, Aug 1, 2018 at 4:39 PM, Erik Erlandson 
> wrote:
>
>> The PR for SparkR support on the kube back-end is completed, but waiting
>> for Shane to make some tweaks to the CI machinery for full testing support.
>> If the code freeze is being delayed, this PR could be merged as well.
>>
>> On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:
>>
>>> FYI 6 mo is coming up soon since the last release. We will cut the
>>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>>
>>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread shane knapp
++ssuchter (who kindly set up the initial k8s builds while i hammered on
the backend)

while i'm pretty confident (read: 99%) that the pull request builds will
work on the new ubuntu workers:

1) i'd like to do more stress testing of other spark builds (in progress)
2) i'd like to reimage more centos workers before moving the PRB due to
potential executor starvation, and my lead sysadmin is out until next monday
3) we will need to get rid of the ubuntu-specific k8s builds and merge that
functionality into the existing PRB job.  after that:  testing and
babysitting

regarding (1):  if these damn builds didn't take 4+ hours, it would be
going a lot quicker.  ;)
regarding (2):  adding two more ubuntu workers would make me comfortable
WRT number of available executors, and i will guarantee that can happen by
EOD on the 7th.
regarding (3):  this should take about a day, and realistically the
earliest we can get this started is the 8th.  i haven't even had a chance
to start looking at this stuff yet, either.

if we push release by a week, i think i can get things sorted w/o impacting
the release schedule.  there will still be a bunch of stuff to clean up
from the old centos builds (specifically docs, packaging and release), but
i'll leave the existing and working infrastructure in place for now.

shane

On Wed, Aug 1, 2018 at 4:39 PM, Erik Erlandson  wrote:

> The PR for SparkR support on the kube back-end is completed, but waiting
> for Shane to make some tweaks to the CI machinery for full testing support.
> If the code freeze is being delayed, this PR could be merged as well.
>
> On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:
>
>> FYI 6 mo is coming up soon since the last release. We will cut the branch
>> and code freeze on Aug 1st in order to get 2.4 out on time.
>>
>>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
The PR for SparkR support on the kube back-end is completed, but waiting
for Shane to make some tweaks to the CI machinery for full testing support.
If the code freeze is being delayed, this PR could be merged as well.

On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:

> FYI 6 mo is coming up soon since the last release. We will cut the branch
> and code freeze on Aug 1st in order to get 2.4 out on time.
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
>>>>> No reasonable amount of time is likely going to be sufficient to fully
>>>>> vet the code as a PR. I'm not entirely happy with the design and code as
>>>>> they currently are (and I'm still trying to find the time to more publicly
>>>>> express my thoughts and concerns), but I'm fine with them going into 2.4
>>>>> much as they are as long as they go in with proper stability annotations
>>>>> and are understood not to be cast-in-stone final implementations, but
>>>>> rather as a way to get people using them and generating the feedback that
>>>>> is necessary to get us to something more like a final design and
>>>>> implementation.
>>>>>
>>>>> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Barrier mode seems like a high impact feature on Spark's core code:
>>>>>> is one additional week enough time to properly vet this feature?
>>>>>>
>>>>>> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
>>>>>> joseph.tor...@databricks.com> wrote:
>>>>>>
>>>>>>> Full continuous processing aggregation support ran into
>>>>>>> unanticipated scalability and scheduling problems. We’re planning to
>>>>>>> overcome those by using some of the barrier execution machinery, but 
>>>>>>> since
>>>>>>> barrier execution itself is still in progress the full support isn’t 
>>>>>>> going
>>>>>>> to make it into 2.4.
>>>>>>>
>>>>>>> Jose
>>>>>>>
>>>>>>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
>>>>>>> tomasz.gaw...@outlook.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> what is the status of Continuous Processing + Aggregations? As far
>>>>>>>> as I
>>>>>>>> remember, Jose Torres said it should  be easy to perform
>>>>>>>> aggregations if
>>>>>>>> coalesce(1) work. IIRC it's already merged to master.
>>>>>>>>
>>>>>>>> Is this work in progress? If yes, it would be great to have full
>>>>>>>> aggregation/join support in Spark 2.4 in CP.
>>>>>>>>
>>>>>>>> Pozdrawiam / Best regards,
>>>>>>>>
>>>>>>>> Tomek
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>>>>>>> > This one is important to us: https://issues.apache.org/jira
>>>>>>>> /browse/SPARK-24020 (Sort-merge join inner range optimization) but
>>>>>>>> I think it could be useful to others too.
>>>>>>>> >
>>>>>>>> > It is finished and is ready to be merged (was ready a month ago
>>>>>>>> at least).
>>>>>>>> >
>>>>>>>> > Do you think you could consider including it in 2.4?
>>>>>>>> >
>>>>>>>> > Petar
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>>>>>> >
>>>>>>>> >> I went through the open JIRA tickets and here is a list that we
>>>>>>>> should consider for Spark 2.4:
>>>>>>>> >>
>>>>>>>> >> High Priority:
>>>>>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>>>>>> >> This one is critical to the Spark ecosystem for deep learning.
>>>>>>>> It only has a few remaining works and I think we should have it in 
>>>>>>>> Spark
>>>>>>>> 2.4.
>>>>>>>> >>
>>>>>>>> >> Middle Priority:
>>>>>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>>>>>> >> We've already added a lot of built-in functions in this release,
>>>>>>>> but there are a few useful higher-order functions in progress, like
>>>>>>>> `array_except`, `transform`, etc. It would be great if we can get them 
>>>>>>>>

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Imran Rashid
>>>>>>> what is the status of Continuous Processing + Aggregations? As far
>>>>>>> as I
>>>>>>> remember, Jose Torres said it should  be easy to perform
>>>>>>> aggregations if
>>>>>>> coalesce(1) work. IIRC it's already merged to master.
>>>>>>>
>>>>>>> Is this work in progress? If yes, it would be great to have full
>>>>>>> aggregation/join support in Spark 2.4 in CP.
>>>>>>>
>>>>>>> Pozdrawiam / Best regards,
>>>>>>>
>>>>>>> Tomek
>>>>>>>
>>>>>>>
>>>>>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>>>>>> > This one is important to us: https://issues.apache.org/jira
>>>>>>> /browse/SPARK-24020 (Sort-merge join inner range optimization) but
>>>>>>> I think it could be useful to others too.
>>>>>>> >
>>>>>>> > It is finished and is ready to be merged (was ready a month ago at
>>>>>>> least).
>>>>>>> >
>>>>>>> > Do you think you could consider including it in 2.4?
>>>>>>> >
>>>>>>> > Petar
>>>>>>> >
>>>>>>> >
>>>>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>>>>> >
>>>>>>> >> I went through the open JIRA tickets and here is a list that we
>>>>>>> should consider for Spark 2.4:
>>>>>>> >>
>>>>>>> >> High Priority:
>>>>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>>>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>>>>>> only has a few remaining works and I think we should have it in Spark 
>>>>>>> 2.4.
>>>>>>> >>
>>>>>>> >> Middle Priority:
>>>>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>>>>> >> We've already added a lot of built-in functions in this release,
>>>>>>> but there are a few useful higher-order functions in progress, like
>>>>>>> `array_except`, `transform`, etc. It would be great if we can get them 
>>>>>>> in
>>>>>>> Spark 2.4.
>>>>>>> >>
>>>>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>>>>> >>
>>>>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>>>>> >> This one is there for years (thanks for your patience Michael!),
>>>>>>> and is also close to finishing. Great to have it in 2.4.
>>>>>>> >>
>>>>>>> >> SPARK-24882: data source v2 API improvement
>>>>>>> >> This is to improve the data source v2 API based on what we
>>>>>>> learned during this release. From the migration of existing sources and
>>>>>>> design of new features, we found some problems in the API and want to
>>>>>>> address them. I believe this should be
>>>>>>> >> the last significant API change to data source v2, so great to
>>>>>>> have in Spark 2.4. I'll send a discuss email about it later.
>>>>>>> >>
>>>>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>>>>> >> This is a very important feature for data source v2, and is
>>>>>>> currently being discussed in the dev list.
>>>>>>> >>
>>>>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>>>>> Great to have in 2.4.
>>>>>>> >>
>>>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to
>>>>>>> incorrect answers
>>>>>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>>>>>> >>
>>>>>>> >> There are some other important features like the adaptive
>>>>>>> execution, streaming SQL, etc., not in the list, since I think we are 
>>>>>>> not
>>>>>>> able to finish them before 2.4.
>>>>>>> >>
>>>>>>> >> Feel free to add more things if you think they are important to
>>>>>>> Spark 2.4 by replying to this email.
>>>>>>> >>
>>>>>>> >> Thanks,
>>>>>>> >> Wenchen
>>>>>>> >>
>>>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen 
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >>   In theory releases happen on a time-based cadence, so it's
>>>>>>> pretty much wrap up what's ready by the code freeze and ship it. In
>>>>>>> practice, the cadence slips frequently, and it's very much a negotiation
>>>>>>> about what features should push the
>>>>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>>>>> approach here that works OK.
>>>>>>> >>
>>>>>>> >>   Certainly speak up if you think there's something that really
>>>>>>> needs to get into 2.4. This is that discuss thread.
>>>>>>> >>
>>>>>>> >>   (BTW I updated the page you mention just yesterday, to reflect
>>>>>>> the plan suggested in this thread.)
>>>>>>> >>
>>>>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>>>>>  wrote:
>>>>>>> >>
>>>>>>> >>   Shouldn't this be a discuss thread?
>>>>>>> >>
>>>>>>> >>   I'm also happy to see more release managers and agree the time
>>>>>>> is getting close, but we should see what features are in progress and 
>>>>>>> see
>>>>>>> how close things are and propose a date based on that.  Cutting a 
>>>>>>> branch to
>>>>>>> soon just creates
>>>>>>> >>   more work for committers to push to more branches.
>>>>>>> >>
>>>>>>> >>http://spark.apache.org/versioning-policy.html mentioned the
>>>>>>> code freeze and release branch cut mid-august.
>>>>>>> >>
>>>>>>> >>   Tom
>>>>>>> >
>>>>>>> > 
>>>>>>> -
>>>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>
>>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xingbo Jiang
>>>>>> This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner range optimization)
>>>>>> but I think it could be useful to others too.
>>>>>> >
>>>>>> > It is finished and is ready to be merged (was ready a month ago at
>>>>>> least).
>>>>>> >
>>>>>> > Do you think you could consider including it in 2.4?
>>>>>> >
>>>>>> > Petar
>>>>>> >
>>>>>> >
>>>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>>>> >
>>>>>> >> I went through the open JIRA tickets and here is a list that we
>>>>>> should consider for Spark 2.4:
>>>>>> >>
>>>>>> >> High Priority:
>>>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>>>>> only has a few remaining works and I think we should have it in Spark 
>>>>>> 2.4.
>>>>>> >>
>>>>>> >> Middle Priority:
>>>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>>>> >> We've already added a lot of built-in functions in this release,
>>>>>> but there are a few useful higher-order functions in progress, like
>>>>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>>>>> Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>>>> >> This one is there for years (thanks for your patience Michael!),
>>>>>> and is also close to finishing. Great to have it in 2.4.
>>>>>> >>
>>>>>> >> SPARK-24882: data source v2 API improvement
>>>>>> >> This is to improve the data source v2 API based on what we learned
>>>>>> during this release. From the migration of existing sources and design of
>>>>>> new features, we found some problems in the API and want to address 
>>>>>> them. I
>>>>>> believe this should be
>>>>>> >> the last significant API change to data source v2, so great to
>>>>>> have in Spark 2.4. I'll send a discuss email about it later.
>>>>>> >>
>>>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>>>> >> This is a very important feature for data source v2, and is
>>>>>> currently being discussed in the dev list.
>>>>>> >>
>>>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>>>> Great to have in 2.4.
>>>>>> >>
>>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>>>>> answers
>>>>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>>>>> >>
>>>>>> >> There are some other important features like the adaptive
>>>>>> execution, streaming SQL, etc., not in the list, since I think we are not
>>>>>> able to finish them before 2.4.
>>>>>> >>
>>>>>> >> Feel free to add more things if you think they are important to
>>>>>> Spark 2.4 by replying to this email.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Wenchen
>>>>>> >>
>>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen 
>>>>>> wrote:
>>>>>> >>
>>>>>> >>   In theory releases happen on a time-based cadence, so it's
>>>>>> pretty much wrap up what's ready by the code freeze and ship it. In
>>>>>> practice, the cadence slips frequently, and it's very much a negotiation
>>>>>> about what features should push the
>>>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>>>> approach here that works OK.
>>>>>> >>
>>>>>> >>   Certainly speak up if you think there's something that really
>>>>>> needs to get into 2.4. This is that discuss thread.
>>>>>> >>
>>>>>> >>   (BTW I updated the page you mention just yesterday, to reflect
>>>>>> the plan suggested in this thread.)
>>>>>> >>
>>>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>>>>  wrote:
>>>>>> >>
>>>>>> >>   Shouldn't this be a discuss thread?
>>>>>> >>
>>>>>> >>   I'm also happy to see more release managers and agree the time
>>>>>> is getting close, but we should see what features are in progress and see
>>>>>> how close things are and propose a date based on that.  Cutting a branch 
>>>>>> to
>>>>>> soon just creates
>>>>>> >>   more work for committers to push to more branches.
>>>>>> >>
>>>>>> >>http://spark.apache.org/versioning-policy.html mentioned the
>>>>>> code freeze and release branch cut mid-august.
>>>>>> >>
>>>>>> >>   Tom
>>>>>> >
>>>>>> > 
>>>>>> -
>>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>> >
>>>>>>
>>>>>>
>>>>
>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xiangrui Meng
>>>>>> This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>>>>>> inner range optimization) but I think it could be useful to others too.
>>>>>> >
>>>>>> > It is finished and is ready to be merged (was ready a month ago at
>>>>>> least).
>>>>>> >
>>>>>> > Do you think you could consider including it in 2.4?
>>>>>> >
>>>>>> > Petar
>>>>>> >
>>>>>> >
>>>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>>>> >
>>>>>> >> I went through the open JIRA tickets and here is a list that we
>>>>>> should consider for Spark 2.4:
>>>>>> >>
>>>>>> >> High Priority:
>>>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>>>>> only has a few remaining works and I think we should have it in Spark 
>>>>>> 2.4.
>>>>>> >>
>>>>>> >> Middle Priority:
>>>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>>>> >> We've already added a lot of built-in functions in this release,
>>>>>> but there are a few useful higher-order functions in progress, like
>>>>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>>>>> Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>>>> >> This one is there for years (thanks for your patience Michael!),
>>>>>> and is also close to finishing. Great to have it in 2.4.
>>>>>> >>
>>>>>> >> SPARK-24882: data source v2 API improvement
>>>>>> >> This is to improve the data source v2 API based on what we learned
>>>>>> during this release. From the migration of existing sources and design of
>>>>>> new features, we found some problems in the API and want to address 
>>>>>> them. I
>>>>>> believe this should be
>>>>>> >> the last significant API change to data source v2, so great to
>>>>>> have in Spark 2.4. I'll send a discuss email about it later.
>>>>>> >>
>>>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>>>> >> This is a very important feature for data source v2, and is
>>>>>> currently being discussed in the dev list.
>>>>>> >>
>>>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>>>> Great to have in 2.4.
>>>>>> >>
>>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>>>>> answers
>>>>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>>>>> >>
>>>>>> >> There are some other important features like the adaptive
>>>>>> execution, streaming SQL, etc., not in the list, since I think we are not
>>>>>> able to finish them before 2.4.
>>>>>> >>
>>>>>> >> Feel free to add more things if you think they are important to
>>>>>> Spark 2.4 by replying to this email.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Wenchen
>>>>>> >>
>>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen 
>>>>>> wrote:
>>>>>> >>
>>>>>> >>   In theory releases happen on a time-based cadence, so it's
>>>>>> pretty much wrap up what's ready by the code freeze and ship it. In
>>>>>> practice, the cadence slips frequently, and it's very much a negotiation
>>>>>> about what features should push the
>>>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>>>> approach here that works OK.
>>>>>> >>
>>>>>> >>   Certainly speak up if you think there's something that really
>>>>>> needs to get into 2.4. This is that discuss thread.
>>>>>> >>
>>>>>> >>   (BTW I updated the page you mention just yesterday, to reflect
>>>>>> the plan suggested in this thread.)
>>>>>> >>
>>>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>>>>  wrote:
>>>>>> >>
>>>>>> >>   Shouldn't this be a discuss thread?
>>>>>> >>
>>>>>> >>   I'm also happy to see more release managers and agree the time
>>>>>> is getting close, but we should see what features are in progress and see
>>>>>> how close things are and propose a date based on that.  Cutting a branch 
>>>>>> to
>>>>>> soon just creates
>>>>>> >>   more work for committers to push to more branches.
>>>>>> >>
>>>>>> >>http://spark.apache.org/versioning-policy.html mentioned the
>>>>>> code freeze and release branch cut mid-august.
>>>>>> >>
>>>>>> >>   Tom
>>>>>> >
>>>>>> >
>>>>>> -
>>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>> >
>>>>>>
>>>>>>
>>>>
>> --

Xiangrui Meng

Software Engineer

Databricks Inc. <http://databricks.com/>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Imran Rashid
I'd like to add SPARK-24296, replicating large blocks over 2GB.  It's been
up for review for a while, and would end the 2GB block limit (well ...
subject to a couple of caveats on SPARK-6235).

On Mon, Jul 30, 2018 at 9:01 PM, Wenchen Fan  wrote:

> I went through the open JIRA tickets and here is a list that we should
> consider for Spark 2.4:
>
> *High Priority*:
> SPARK-24374 <https://issues.apache.org/jira/browse/SPARK-24374>: Support
> Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has
> a few remaining works and I think we should have it in Spark 2.4.
>
> *Middle Priority*:
> SPARK-23899 <https://issues.apache.org/jira/browse/SPARK-23899>: Built-in
> SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there
> are a few useful higher-order functions in progress, like `array_except`,
> `transform`, etc. It would be great if we can get them in Spark 2.4.
>
> SPARK-14220 <https://issues.apache.org/jira/browse/SPARK-14220>: Build
> and test Spark against Scala 2.12
> Very close to finishing, great to have it in Spark 2.4.
>
> SPARK-4502 <https://issues.apache.org/jira/browse/SPARK-4502>: Spark SQL
> reads unnecessary nested fields from Parquet
> This one is there for years (thanks for your patience Michael!), and is
> also close to finishing. Great to have it in 2.4.
>
> SPARK-24882 <https://issues.apache.org/jira/browse/SPARK-24882>: data
> source v2 API improvement
> This is to improve the data source v2 API based on what we learned during
> this release. From the migration of existing sources and design of new
> features, we found some problems in the API and want to address them. I
> believe this should be the last significant API change to data source
> v2, so great to have in Spark 2.4. I'll send a discuss email about it later.
>
> SPARK-24252 <https://issues.apache.org/jira/browse/SPARK-24252>: Add
> catalog support in Data Source V2
> This is a very important feature for data source v2, and is currently
> being discussed in the dev list.
>
> SPARK-24768 <https://issues.apache.org/jira/browse/SPARK-24768>: Have a
> built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to
> have in 2.4.
>
> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>:
> Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug, great to have in 2.4.
>
> There are some other important features like the adaptive execution,
> streaming SQL, etc., not in the list, since I think we are not able to
> finish them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4
> by replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>
>> In theory releases happen on a time-based cadence, so it's pretty much
>> wrap up what's ready by the code freeze and ship it. In practice, the
>> cadence slips frequently, and it's very much a negotiation about what
>> features should push the code freeze out a few weeks every time. So, kind
>> of a hybrid approach here that works OK.
>>
>> Certainly speak up if you think there's something that really needs to
>> get into 2.4. This is that discuss thread.
>>
>> (BTW I updated the page you mention just yesterday, to reflect the plan
>> suggested in this thread.)
>>
>> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
>> wrote:
>>
>>> Shouldn't this be a discuss thread?
>>>
>>> I'm also happy to see more release managers and agree the time is
>>> getting close, but we should see what features are in progress and see how
>>> close things are and propose a date based on that.  Cutting a branch to
>>> soon just creates more work for committers to push to more branches.
>>>
>>>  http://spark.apache.org/versioning-policy.html mentioned the code
>>> freeze and release branch cut mid-august.
>>>
>>>
>>> Tom
>>>
>>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Reynold Xin
>>>>> We've already added a lot of built-in functions in this release, but
>>>>> there are a few useful higher-order functions in progress, like
>>>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>>>> Spark 2.4.
>>>>> >>
>>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>>> >>
>>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>>> >> This one is there for years (thanks for your patience Michael!),
>>>>> and is also close to finishing. Great to have it in 2.4.
>>>>> >>
>>>>> >> SPARK-24882: data source v2 API improvement
>>>>> >> This is to improve the data source v2 API based on what we learned
>>>>> during this release. From the migration of existing sources and design of
>>>>> new features, we found some problems in the API and want to address them. 
>>>>> I
>>>>> believe this should be
>>>>> >> the last significant API change to data source v2, so great to have
>>>>> in Spark 2.4. I'll send a discuss email about it later.
>>>>> >>
>>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>>> >> This is a very important feature for data source v2, and is
>>>>> currently being discussed in the dev list.
>>>>> >>
>>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>>> Great to have in 2.4.
>>>>> >>
>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>>>> answers
>>>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>>>> >>
>>>>> >> There are some other important features like the adaptive
>>>>> execution, streaming SQL, etc., not in the list, since I think we are not
>>>>> able to finish them before 2.4.
>>>>> >>
>>>>> >> Feel free to add more things if you think they are important to
>>>>> Spark 2.4 by replying to this email.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Wenchen
>>>>> >>
>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen 
>>>>> wrote:
>>>>> >>
>>>>> >>   In theory releases happen on a time-based cadence, so it's pretty
>>>>> much wrap up what's ready by the code freeze and ship it. In practice, the
>>>>> cadence slips frequently, and it's very much a negotiation about what
>>>>> features should push the
>>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>>> approach here that works OK.
>>>>> >>
>>>>> >>   Certainly speak up if you think there's something that really
>>>>> needs to get into 2.4. This is that discuss thread.
>>>>> >>
>>>>> >>   (BTW I updated the page you mention just yesterday, to reflect
>>>>> the plan suggested in this thread.)
>>>>> >>
>>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>>>  wrote:
>>>>> >>
>>>>> >>   Shouldn't this be a discuss thread?
>>>>> >>
>>>>> >>   I'm also happy to see more release managers and agree the time is
>>>>> getting close, but we should see what features are in progress and see how
>>>>> close things are and propose a date based on that.  Cutting a branch to
>>>>> soon just creates
>>>>> >>   more work for committers to push to more branches.
>>>>> >>
>>>>> >>http://spark.apache.org/versioning-policy.html mentioned the
>>>>> code freeze and release branch cut mid-august.
>>>>> >>
>>>>> >>   Tom
>>>>> >
>>>>> > -
>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >
>>>>>
>>>>>
>>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
>>>> and is also close to finishing. Great to have it in 2.4.
>>>> >>
>>>> >> SPARK-24882: data source v2 API improvement
>>>> >> This is to improve the data source v2 API based on what we learned
>>>> during this release. From the migration of existing sources and design of
>>>> new features, we found some problems in the API and want to address them. I
>>>> believe this should be
>>>> >> the last significant API change to data source v2, so great to have
>>>> in Spark 2.4. I'll send a discuss email about it later.
>>>> >>
>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>> >> This is a very important feature for data source v2, and is
>>>> currently being discussed in the dev list.
>>>> >>
>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>> Great to have in 2.4.
>>>> >>
>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>>> answers
>>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>>> >>
>>>> >> There are some other important features like the adaptive execution,
>>>> streaming SQL, etc., not in the list, since I think we are not able to
>>>> finish them before 2.4.
>>>> >>
>>>> >> Feel free to add more things if you think they are important to
>>>> Spark 2.4 by replying to this email.
>>>> >>
>>>> >> Thanks,
>>>> >> Wenchen
>>>> >>
>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen 
>>>> wrote:
>>>> >>
>>>> >>   In theory releases happen on a time-based cadence, so it's pretty
>>>> much wrap up what's ready by the code freeze and ship it. In practice, the
>>>> cadence slips frequently, and it's very much a negotiation about what
>>>> features should push the
>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>> approach here that works OK.
>>>> >>
>>>> >>   Certainly speak up if you think there's something that really
>>>> needs to get into 2.4. This is that discuss thread.
>>>> >>
>>>> >>   (BTW I updated the page you mention just yesterday, to reflect the
>>>> plan suggested in this thread.)
>>>> >>
>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>>  wrote:
>>>> >>
>>>> >>   Shouldn't this be a discuss thread?
>>>> >>
>>>> >>   I'm also happy to see more release managers and agree the time is
>>>> getting close, but we should see what features are in progress and see how
>>>> close things are and propose a date based on that.  Cutting a branch to
>>>> soon just creates
>>>> >>   more work for committers to push to more branches.
>>>> >>
>>>> >>http://spark.apache.org/versioning-policy.html mentioned the
>>>> code freeze and release branch cut mid-august.
>>>> >>
>>>> >>   Tom
>>>> >
>>>> > -
>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> >
>>>>
>>>>
>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet
the code as a PR. I'm not entirely happy with the design and code as they
currently are (and I'm still trying to find the time to more publicly
express my thoughts and concerns), but I'm fine with them going into 2.4
much as they are as long as they go in with proper stability annotations
and are understood not to be cast-in-stone final implementations, but
rather as a way to get people using them and generating the feedback that
is necessary to get us to something more like a final design and
implementation.

On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson  wrote:

>
> Barrier mode seems like a high impact feature on Spark's core code: is one
> additional week enough time to properly vet this feature?
>
> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> Full continuous processing aggregation support ran into unanticipated
>> scalability and scheduling problems. We’re planning to overcome those by
>> using some of the barrier execution machinery, but since barrier execution
>> itself is still in progress the full support isn’t going to make it into
>> 2.4.
>>
>> Jose
>>
>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda 
>> wrote:
>>
>>> Hi,
>>>
>>> what is the status of Continuous Processing + Aggregations? As far as I
>>> remember, Jose Torres said it should  be easy to perform aggregations if
>>> coalesce(1) work. IIRC it's already merged to master.
>>>
>>> Is this work in progress? If yes, it would be great to have full
>>> aggregation/join support in Spark 2.4 in CP.
>>>
>>> Pozdrawiam / Best regards,
>>>
>>> Tomek
>>>
>>>
>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>> > This one is important to us:
>>> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>>> inner range optimization) but I think it could be useful to others too.
>>> >
>>> > It is finished and is ready to be merged (was ready a month ago at
>>> least).
>>> >
>>> > Do you think you could consider including it in 2.4?
>>> >
>>> > Petar
>>> >
>>> >
>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>> >
>>> >> I went through the open JIRA tickets and here is a list that we
>>> should consider for Spark 2.4:
>>> >>
>>> >> High Priority:
>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>> only has a few remaining works and I think we should have it in Spark 2.4.
>>> >>
>>> >> Middle Priority:
>>> >> SPARK-23899: Built-in SQL Function Improvement
>>> >> We've already added a lot of built-in functions in this release, but
>>> there are a few useful higher-order functions in progress, like
>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>> Spark 2.4.
>>> >>
>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>> >> Very close to finishing, great to have it in Spark 2.4.
>>> >>
>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>> >> This one is there for years (thanks for your patience Michael!), and
>>> is also close to finishing. Great to have it in 2.4.
>>> >>
>>> >> SPARK-24882: data source v2 API improvement
>>> >> This is to improve the data source v2 API based on what we learned
>>> during this release. From the migration of existing sources and design of
>>> new features, we found some problems in the API and want to address them. I
>>> believe this should be
>>> >> the last significant API change to data source v2, so great to have
>>> in Spark 2.4. I'll send a discuss email about it later.
>>> >>
>>> >> SPARK-24252: Add catalog support in Data Source V2
>>> >> This is a very important feature for data source v2, and is currently
>>> being discussed in the dev list.
>>> >>
>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>> >> Most of it is done, but date/timestamp support is still missing.
>>> Great to have in 2.4.
>>> >>
>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>> answers
>>> >> This is a long-standing correctness bug, great to have in 2.4.

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
Barrier mode seems like a high impact feature on Spark's core code: is one
additional week enough time to properly vet this feature?

On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres  wrote:

> Full continuous processing aggregation support ran into unanticipated
> scalability and scheduling problems. We’re planning to overcome those by
> using some of the barrier execution machinery, but since barrier execution
> itself is still in progress the full support isn’t going to make it into
> 2.4.
>
> Jose

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Joseph Torres
Full continuous processing aggregation support ran into unanticipated
scalability and scheduling problems. We’re planning to overcome those by
using some of the barrier execution machinery, but since barrier execution
itself is still in progress the full support isn’t going to make it into
2.4.

Jose

On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda 
wrote:

> Hi,
>
> what is the status of Continuous Processing + Aggregations? As far as I
> remember, Jose Torres said it should be easy to perform aggregations if
> coalesce(1) works. IIRC it's already merged to master.
>
> Is this work in progress? If yes, it would be great to have full
> aggregation/join support in Spark 2.4 in CP.
>
> Pozdrawiam / Best regards,
>
> Tomek


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Stavros Kontopoulos
I have a PR out for SPARK-14540 (Support Scala 2.12 closures and Java 8
lambdas in ClosureCleaner).
This should allow us to add support for Scala 2.12; I think we can resolve
this long-standing issue with 2.4.

Best,
Stavros

On Tue, Jul 31, 2018 at 4:07 PM, Tomasz Gawęda 
wrote:

> Hi,
>
> what is the status of Continuous Processing + Aggregations? As far as I
> remember, Jose Torres said it should be easy to perform aggregations if
> coalesce(1) works. IIRC it's already merged to master.
>
> Is this work in progress? If yes, it would be great to have full
> aggregation/join support in Spark 2.4 in CP.
>
> Pozdrawiam / Best regards,
>
> Tomek


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Tomasz Gawęda
Hi,

what is the status of Continuous Processing + Aggregations? As far as I 
remember, Jose Torres said it should be easy to perform aggregations if
coalesce(1) works. IIRC it's already merged to master.

Is this work in progress? If yes, it would be great to have full 
aggregation/join support in Spark 2.4 in CP.

Pozdrawiam / Best regards,

Tomek


On 2018-07-31 10:43, Petar Zečević wrote:
> This one is important to us: 
> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner 
> range optimization) but I think it could be useful to others too.
>
> It is finished and is ready to be merged (was ready a month ago at least).
>
> Do you think you could consider including it in 2.4?
>
> Petar



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Petar Zečević


This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 
(Sort-merge join inner range optimization) but I think it could be useful to 
others too. 

It is finished and is ready to be merged (was ready a month ago at least).

Do you think you could consider including it in 2.4?

Petar
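
For readers unfamiliar with the ticket, here is a minimal sketch of the query
shape SPARK-24020 targets, assuming a local spark-shell style session; the
table and column names below are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: device events joined to sensor readings that fall
    // within an hour of each event.
    val events   = Seq((1, 1000L), (2, 5000L)).toDF("deviceId", "ts")
    val readings = Seq((1, 1500L), (1, 99000L), (2, 5100L)).toDF("deviceId", "rts")

    // An equality key plus a range condition on a secondary sorted column is
    // exactly the pattern a sort-merge join inner range optimization can exploit.
    events.join(readings,
      events("deviceId") === readings("deviceId") &&
        $"rts" >= $"ts" - 3600 && $"rts" <= $"ts" + 3600
    ).show()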





Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Marco Gaido
Hi Wenchen,

I think it would be great to also consider
 - SPARK-24598 <https://issues.apache.org/jira/browse/SPARK-24598>:
   Datatype overflow conditions gives incorrect result

as it is a correctness bug. What do you think?

Thanks,
Marco
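
For context, a minimal sketch of the class of problem SPARK-24598 describes,
in a hypothetical spark-shell session; Spark's integer arithmetic follows Java
two's-complement semantics here, so the sum silently wraps rather than erroring
or widening:

    // Both literals parse as IntegerType, so the result stays an int and
    // silently overflows past Int.MaxValue.
    spark.sql("SELECT 2147483647 + 1 AS wrapped").show()
    // +-----------+
    // |    wrapped|
    // +-----------+
    // |-2147483648|
    // +-----------+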



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Wenchen Fan
I went through the open JIRA tickets and here is a list that we should
consider for Spark 2.4:

*High Priority*:
SPARK-24374 <https://issues.apache.org/jira/browse/SPARK-24374>: Support
Barrier Execution Mode in Apache Spark
This one is critical to the Spark ecosystem for deep learning. It only has
a few remaining work items and I think we should have it in Spark 2.4.

*Middle Priority*:
SPARK-23899 <https://issues.apache.org/jira/browse/SPARK-23899>: Built-in
SQL Function Improvement
We've already added a lot of built-in functions in this release, but there
are a few useful higher-order functions in progress, like `array_except`,
`transform`, etc. It would be great if we can get them in Spark 2.4.

SPARK-14220 <https://issues.apache.org/jira/browse/SPARK-14220>: Build and
test Spark against Scala 2.12
Very close to finishing, great to have it in Spark 2.4.

SPARK-4502 <https://issues.apache.org/jira/browse/SPARK-4502>: Spark SQL
reads unnecessary nested fields from Parquet
This one has been there for years (thanks for your patience Michael!), and is
also close to finishing. Great to have it in 2.4.

SPARK-24882 <https://issues.apache.org/jira/browse/SPARK-24882>: data
source v2 API improvement
This is to improve the data source v2 API based on what we learned during
this release. From the migration of existing sources and design of new
features, we found some problems in the API and want to address them. I
believe this should be the last significant API change to data source
v2, so great to have in Spark 2.4. I'll send a discuss email about it later.

SPARK-24252 <https://issues.apache.org/jira/browse/SPARK-24252>: Add
catalog support in Data Source V2
This is a very important feature for data source v2, and is currently being
discussed in the dev list.

SPARK-24768 <https://issues.apache.org/jira/browse/SPARK-24768>: Have a
built-in AVRO data source implementation
Most of it is done, but date/timestamp support is still missing. Great to
have in 2.4.

SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>:
Shuffle+Repartition on an RDD could lead to incorrect answers
This is a long-standing correctness bug, great to have in 2.4.

There are some other important features like the adaptive execution,
streaming SQL, etc., not in the list, since I don't think we will be able to
finish them before 2.4.

Feel free to add more things if you think they are important to Spark 2.4
by replying to this email.

Thanks,
Wenchen
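
As a quick illustration of the higher-order functions mentioned under
SPARK-23899, a minimal sketch assuming a spark-shell session in which the two
functions are available (they eventually shipped in 2.4):

    // transform applies a lambda to every element of an array.
    spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one").show()
    // plus_one = [2, 3, 4]

    // array_except keeps the elements of the first array that are absent
    // from the second.
    spark.sql("SELECT array_except(array(1, 2, 3), array(2, 4)) AS diff").show()
    // diff = [1, 3]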

>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Sean Owen
In theory releases happen on a time-based cadence, so it's pretty much wrap
up what's ready by the code freeze and ship it. In practice, the cadence
slips frequently, and it's very much a negotiation about what features
should push the code freeze out a few weeks every time. So, kind of a
hybrid approach here that works OK.

Certainly speak up if you think there's something that really needs to get
into 2.4. This is that discuss thread.

(BTW I updated the page you mention just yesterday, to reflect the plan
suggested in this thread.)



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Tom Graves
Shouldn't this be a discuss thread?

I'm also happy to see more release managers and agree the time is getting
close, but we should see what features are in progress and see how close things
are and propose a date based on that. Cutting a branch too soon just creates
more work for committers to push to more branches.

http://spark.apache.org/versioning-policy.html mentioned the code freeze and
release branch cut mid-August.

Tom

On Friday, July 6, 2018, 11:47:35 AM CDT, Reynold Xin wrote:

> FYI 6 mo is coming up soon since the last release. We will cut the branch
> and code freeze on Aug 1st in order to get 2.4 out on time.

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Holden Karau
I’m excited to have more folks rotate through release manager :)

On Sun, Jul 29, 2018 at 3:57 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> +1. That would be great!
>
> Thanks,
> Stavros
--
Twitter: https://twitter.com/holdenkarau


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Stavros Kontopoulos
+1. That would be great!

Thanks,
Stavros

On Sun, Jul 29, 2018 at 5:05 PM, Wenchen Fan  wrote:

> If no one objects, how about we make the code freeze one week later (Aug
> 8th)?
>
> BTW I'd like to volunteer to serve as the release manager for Spark 2.4.
> I'm familiar with most of the major features targeted for the 2.4 release.
> I also have a lot of free time during this release timeframe and should be
> able to figure out problems that may appear during the release.
>
> Thanks,
> Wenchen


--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Wenchen Fan
If no one objects, how about we make the code freeze one week later (Aug
8th)?

BTW I'd like to volunteer to serve as the release manager for Spark 2.4.
I'm familiar with most of the major features targeted for the 2.4 release.
I also have a lot of free time during this release timeframe and should be
able to figure out problems that may appear during the release.

Thanks,
Wenchen

On Fri, Jul 27, 2018 at 11:27 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Extending the code freeze date would be great for me too. I am working on a PR
> for supporting Scala 2.12; I am close but need some more time.
> We could get it into 2.4.
>
> Stavros


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-27 Thread Stavros Kontopoulos
Extending the code freeze date would be great for me too. I am working on a PR
for supporting Scala 2.12; I am close but need some more time.
We could get it into 2.4.

Stavros

On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan  wrote:

> This seems fine to me.
>
> BTW Ryan Blue and I are working on some data source v2 stuff and hopefully
> we can get more things done with one more week.
>
> Thanks,
> Wenchen


--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p: +30 6977967274
e: stavros.kontopou...@lightbend.com


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-27 Thread Wenchen Fan
This seems fine to me.

BTW Ryan Blue and I are working on some data source v2 stuff and hopefully
we can get more things done with one more week.

Thanks,
Wenchen

On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang  wrote:

> Xiangrui and I are leading an effort to implement a highly desirable
> feature, Barrier Execution Mode.
> https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new
> scheduling model to Apache Spark so users can properly embed distributed DL
> training as a Spark stage to simplify the distributed training workflow.
> The prototype has been demoed in the Spark Summit keynote. This new feature
> got a very positive feedback from the whole community. The design doc and
> pull requests got more comments than we initially anticipated. We want to
> finish this feature in the upcoming release, Spark 2.4. Would it be
> possible to have an extension of code freeze for a week?
>
> Thanks,
>
> Xingbo
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-25 Thread Xingbo Jiang
Xiangrui and I are leading an effort to implement a highly desirable
feature, Barrier Execution Mode.
https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new
scheduling model to Apache Spark so users can properly embed distributed DL
training as a Spark stage to simplify the distributed training workflow.
The prototype has been demoed in the Spark Summit keynote. This new feature
got a very positive feedback from the whole community. The design doc and
pull requests got more comments than we initially anticipated. We want to
finish this feature in the upcoming release, Spark 2.4. Would it be
possible to have an extension of code freeze for a week?

Thanks,

Xingbo

2018-07-07 0:47 GMT+08:00 Reynold Xin :

> FYI 6 mo is coming up soon since the last release. We will cut the branch
> and code freeze on Aug 1st in order to get 2.4 out on time.
>
>
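
For readers following along, a minimal sketch of the API shape proposed in
SPARK-24374, assuming a spark-shell session where `sc` is the SparkContext;
`RDD.barrier()` and `BarrierTaskContext` are the names the feature eventually
shipped with in 2.4:

    import org.apache.spark.BarrierTaskContext

    // All tasks of a barrier stage are launched together and can rendezvous,
    // e.g. before handing their partitions to a distributed DL framework.
    val doubled = sc.parallelize(1 to 100, 4).barrier().mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      ctx.barrier()   // blocks until every task in the stage reaches this point
      iter.map(_ * 2)
    }.collect()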


code freeze and branch cut for Apache Spark 2.4

2018-07-06 Thread Reynold Xin
FYI 6 mo is coming up soon since the last release. We will cut the branch
and code freeze on Aug 1st in order to get 2.4 out on time.


Spark 2.2 Code-freeze - 3/20

2017-03-15 Thread Michael Armbrust
Hey Everyone,

Just a quick announcement that I'm planning to cut the branch for Spark 2.2
this coming Monday (3/20).  Please try and get things merged before then
and also please begin retargeting of any issues that you don't think will
make the release.

Michael


Re: Code freeze?

2016-04-18 Thread Sean Owen
FWIW, here's what I do to look at JIRA's answer to this:

1) Go download http://almworks.com/jiraclient/overview.html
2) Set up a query for "target = 2.0.0 and status = Open, In Progress, Reopened"
3) Set up sub-queries for bugs vs non-bugs, and for critical, blocker and other

Right now there are 172 issues open for 2.0.0. 40 are bugs, 4 of which
are critical and 1 of which is a blocker. 9 non-bugs are blockers, 5
critical.

JIRA info is inevitably noisy, but now is a good time to make this
info meaningful so we have some shared reference about the short-term
plan.

What I suggest we do now is ...

a) un-target anything that wasn't targeted to 2.0.0 by a committer
b) committers un-target or re-target anything they know isn't that
important for 2.0.0 (thanks jkbradley)
c) focus on bugs > features, high priority > low priority this week
d) see where we are next week, repeat

I suggest we simply have "no blockers" as an exit criterion, with a
strong pref for "no critical bugs either".

It's a major release, so taking a little extra time to get it all done
comfortably is both possible and unusually important. A couple weeks
indeed might be realistic for an RC, but it really depends on burndown
more than anything.

On Mon, Apr 18, 2016 at 8:23 AM, Pete Robbins <robbin...@gmail.com> wrote:
> Is there a list of Jiras to be considered for 2.0? I would really like to
> get https://issues.apache.org/jira/browse/SPARK-13745 in so that Big Endian
> platforms are not broken.
>
> Cheers,




Re: Code freeze?

2016-04-18 Thread Pete Robbins
Is there a list of Jiras to be considered for 2.0? I would really like to
get https://issues.apache.org/jira/browse/SPARK-13745 in so that Big Endian
platforms are not broken.

Cheers,

On Wed, 13 Apr 2016 at 08:51 Reynold Xin <r...@databricks.com> wrote:

> I think the main things are API things that we need to get right.
>
> - Implement essential DDLs
> https://issues.apache.org/jira/browse/SPARK-14118  this blocks the next
> one
>
> - Merge HiveContext and SQLContext and create SparkSession
> https://issues.apache.org/jira/browse/SPARK-13485
>
> - Separate out local linear algebra as a standalone module without Spark
> dependency https://issues.apache.org/jira/browse/SPARK-13944
>
> - Run Spark without assembly jars (mostly done?)
>
>
> Probably realistic to have it in ~ 2 weeks.
>
>
>
> On Wed, Apr 13, 2016 at 12:45 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> I've heard several people refer to a code freeze for 2.0. Unless I missed
>> it, nobody has discussed a particular date for this:
>> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>>
>> I'd like to start with a review of JIRAs before anyone decides a freeze
>> is appropriate. There are hundreds of issues, some blockers, still targeted
>> for 2.0. Probably best for everyone to review and retarget non essentials
>> and then see where we are at?
>>
>
>


Re: Reminder about Spark 1.5.0 code freeze deadline of Aug 1st

2015-07-29 Thread Sean Owen
Right now, 603 issues have been resolved for 1.5.0. 424 are still
targeted for 1.5.0, of which 33 are marked Blocker and 60 Critical.
This count is not supposed to be 0 at this point, but must
conceptually get to 0 at the time of 1.5.0's release. Most will simply
be un-targeted or pushed down the road.

If the plan is to begin meaningful testing on Aug 1 (great) and
release by Aug 15, this seems to be far too large. Yes, it just means
some prioritization has to happen. Target Version and Priority still
seem like the right tools to communicate this.

Let me put up a straw-man: untarget any JIRA targeted to 1.5.0 that
isn't Blocker or Critical on Aug 1. (JIRAs can be explicitly
retargeted in the following week.) This still leaves 93 issues, which
seems unrealistic to address in 2 weeks.

What are additional or alternative steps to handle this?
- Untarget a lot of the remaining 93?
- Push out 1.5 by X weeks to address more items?
- Argue there's another way to manage this?




Reminder about Spark 1.5.0 code freeze deadline of Aug 1st

2015-07-28 Thread Reynold Xin
Hey All,

Just a friendly reminder that Aug 1st is the feature freeze for Spark
1.5, meaning major outstanding changes will need to land this week.

After Aug 1st we'll package a release for testing and then go into the
normal triage process where bugs are prioritized and some smaller
features are allowed on a case by case basis (if they are very low risk/
additive/feature flagged/etc).

As always, I'll invite the community to help participate in code
review of patches this week, since review bandwidth is the
single biggest determinant of how many features will get in. Please
also keep in mind that most active committers are working overtime
(nights/weekends) during this period and will try their best to help
usher in as many patches as possible, along with their own code.

As a reminder, release window dates are always maintained on the wiki
and are updated after each release according to our 3 month release
cadence:

https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage

Thanks - and happy coding!

- Reynold