Re: Thoughts on Spark 3 release, or a preview release

Reynold Xin Thu, 12 Sep 2019 17:32:44 -0700

+1! Long due for a preview release.

On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau < hol...@pigscanfly.ca > wrote:


> 
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to make calls around this.
> 
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzhuge@apache.org> wrote:
> 
> 
>> +1  Like the idea as a user and a DSv2 contributor.
>> 
>> 
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabhwan@gmail.com> wrote:
>> 
>> 
>>> +1 (as a contributor) from me for a preview release of Spark 3, as it
>>> would help people test the new features. When to cut the preview release
>>> is an open question, since major work should ideally be done before that.
>>> If we intend to keep introducing new features before the official release,
>>> the preview can be cut regardless of this; but if we intend to give people
>>> the opportunity to test earlier, the major work should ideally land first.
>>> 
>>> 
>>> As one of the contributors in the Structured Streaming area, I'd like to
>>> add some items for Spark 3.0, both "must be done" and "better to have".
>>> For "better to have", I picked some new-feature items which committers
>>> reviewed for a couple of rounds and then dropped without a soft-reject
>>> (no valid reason to stop). For Spark 2.4 users, the only added feature for
>>> Structured Streaming is the Kafka delegation token (assuming we count the
>>> revised Kafka consumer pool as an improvement). I hope we can provide some
>>> gifts for Structured Streaming users in the Spark 3.0 envelope.
>>> 
>>> 
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>>> output
>>> 
>>> It's a correctness issue reported by multiple users, first reported in
>>> Nov. 2018. There's a way to reproduce it consistently, and a patch to fix
>>> it was submitted in Jan. 2019.
>>> 
>>> 
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>>> start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>> 
>>> 
>>> There are some more new feature/improvement items in SS, but given we're
>>> talking about ramping down, the above list might be a realistic one.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jgp@jgp.net> wrote:
>>> 
>>> 
>>>> As a user/non committer, +1
>>>> 
>>>> 
>>>> I love the idea of an early 3.0.0 so we can test current dev against it. I
>>>> know the final 3.x will probably need another round of testing when it
>>>> comes out, but less for sure... I know I could check out and compile, but
>>>> having a “packaged” pre-release is great if it does not take too much of
>>>> the team's time...
>>>> 
>>>> jg
>>>> 
>>>> 
>>>> 
>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> +1 from me too but I would like to know what other people think too.
>>>>> 
>>>>> 
>>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>>> Thank you, Sean.
>>>>>> 
>>>>>> 
>>>>>> I'm also +1 for the following three.
>>>>>> 
>>>>>> 
>>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>>>>> 2. Apache Spark 3.0.0-preview in 2019
>>>>>> 3. Apache Spark 3.0.0 in early 2020
>>>>>> 
>>>>>> 
>>>>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps it
>>>>>> a lot.
>>>>>> 
>>>>>> 
>>>>>> After this discussion, can we have some timeline for `Spark 3.0 Release
>>>>>> Window` in our versioning-policy page?
>>>>>> 
>>>>>> 
>>>>>> - https://spark.apache.org/versioning-policy.html
>>>>>> 
>>>>>> 
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heuermh@gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility problems
>>>>>>> resolved, e.g.
>>>>>>> 
>>>>>>> 
>>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>>>>> 
>>>>>>> 
>>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x. As far as
>>>>>>> I know, Parquet has not cut a release based on this new version.
>>>>>>> 
>>>>>>> 
>>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>>>>>> 
>>>>>>> 
>>>>>>> https://github.com/apache/spark/pull/24851
>>>>>>> https://github.com/apache/spark/pull/24297
>>>>>>> 
>>>>>>> 
>>>>>>>    michael
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <srowen@apache.org> wrote:
>>>>>>>> 
>>>>>>>> I'm curious what current feelings are about ramping down towards a
>>>>>>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>>>>>>> though in the past we had informally tossed around "back end of 2019".
>>>>>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>>>>>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>>>>>>> due.
>>>>>>>> 
>>>>>>>> What are the few major items that must get done for Spark 3, in your
>>>>>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>>>>>>> should feel free to update with things that aren't really needed for
>>>>>>>> Spark 3; I already triaged some).
>>>>>>>> 
>>>>>>>> For me, it's:
>>>>>>>> - DSv2?
>>>>>>>> - Finishing touches on the Hive, JDK 11 update
>>>>>>>> 
>>>>>>>> What about considering a preview release earlier, as happened for
>>>>>>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>>>>>>> even happen ... about now?
>>>>>>>> 
>>>>>>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>>>>>>> guess is quite early 2020, from here.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog
>>>>>>>> uses
>>>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>>>> SPARK-28684 Hive module support JDK 11
>>>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>>>>>>> after some operations
>>>>>>>> SPARK-28372 Document Spark WEB UI
>>>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>>>>>> relation table properly
>>>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>>>>>>> smoother upgrade
>>>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>>>>>>> of joined tables > 12
>>>>>>>> SPARK-27471 Reorganize public v2 catalog API
>>>>>>>> SPARK-27520 Introduce a global config system to replace
>>>>>>>> hadoopConfiguration
>>>>>>>> SPARK-24625 put all the backward compatible behavior change configs
>>>>>>>> under spark.sql.legacy.*
>>>>>>>> SPARK-24640 size(null) returns null
>>>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>>>>>> SPARK-25383 Image data source supports sample pushdown
>>>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>>>>>>>> default
>>>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>>>>>>>> efficiency problem
>>>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>>>>>>> cause driver pods to hang
>>>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>>>>>> SPARK-24942 Improve cluster resource management with jobs containing
>>>>>>>> barrier stage
>>>>>>>> SPARK-25914 Separate projection from grouping and aggregate in logical
>>>>>>>> Aggregate
>>>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>>>>>> SPARK-26425 Add more constraint checks in file streaming source to
>>>>>>>> avoid checkpoint corruption
>>>>>>>> SPARK-25843 Redesign rangeBetween API
>>>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>>>>>>> produce named output from CleanupAliases
>>>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>>>>>> aggregate
>>>>>>>> SPARK-25531 new write APIs for data source v2
>>>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>>>>>> MesosFineGrainedSchedulerBackend
>>>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>>>>>>> execution mode
>>>>>>>> SPARK-25390 data source V2 API refactoring
>>>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>>>>>>> Spec
>>>>>>>> SPARK-15691 Refactor and improve Hive support
>>>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>>>>>> SPARK-16217 Support SELECT INTO statement
>>>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>>>>>> SPARK-18245 Improving support for bucketed table
>>>>>>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>>>>>>> Spark
>>>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>>>>>>> list of structures
>>>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>>>>>>>> respect session timezone
>>>>>>>> SPARK-22386 Data Source V2 improvements
>>>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>>>>> 
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>> 
>> 
>> 
>> 
>> 
>> --
>> John Zhuge
>> 
> 
> 
> 
> 
> --
> Twitter: https://twitter.com/holdenkarau
> 
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
