+1! Long overdue for a preview release.

On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <[email protected]> wrote:
>
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to make calls around this.
>
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzhuge@apache.org> wrote:
>
>> +1 Like the idea as a user and a DSv2 contributor.
>>
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabhwan@gmail.com> wrote:
>>
>>> +1 (as a contributor) from me to have a preview release of Spark 3, as it
>>> would help to test the features. When to cut the preview release is
>>> questionable, as major work should ideally be done before that - if we
>>> intend to introduce new features before the official release, that
>>> should work regardless of this, but if we intend to have an opportunity
>>> to test earlier, ideally it should.
>>>
>>> As one of the contributors in the structured streaming area, I'd like to
>>> add some items for Spark 3.0, both "must be done" and "better to have". For
>>> "better to have", I picked some new-feature items which committers
>>> reviewed for a couple of rounds and then dropped without a soft reject
>>> (no valid reason to stop). For Spark 2.4 users, the only added feature for
>>> structured streaming is the Kafka delegation token (assuming we count
>>> revising the Kafka consumer pool as an improvement). I hope we provide some
>>> gifts for structured streaming users in the Spark 3.0 envelope.
>>>
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent output
>>>
>>> It's a correctness issue that multiple users have reported, first reported
>>> in Nov. 2018. There's a way to reproduce it consistently, and a patch to
>>> fix it was submitted in Jan. 2019.
>>>
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>>
>>> There are some more new feature/improvement items in SS, but given we're
>>> talking about ramping down, the above list might be a realistic one.
>>>
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jgp@jgp.net> wrote:
>>>
>>>> As a user/non-committer, +1
>>>>
>>>> I love the idea of an early 3.0.0 so we can test current dev against it. I
>>>> know the final 3.x will probably need another round of testing when it
>>>> gets out, but less for sure... I know I could check out and compile, but
>>>> having a “packaged” pre-version is great if it does not take too much of
>>>> the team's time...
>>>>
>>>> jg
>>>>
>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>>>
>>>>> +1 from me too but I would like to know what other people think too.
>>>>>
>>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>>>>
>>>>>> Thank you, Sean.
>>>>>>
>>>>>> I'm also +1 for the following three.
>>>>>>
>>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>>>>> 2. Apache Spark 3.0.0-preview in 2019
>>>>>> 3. Apache Spark 3.0.0 in early 2020
>>>>>>
>>>>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
>>>>>> it a lot.
>>>>>>
>>>>>> After this discussion, can we have some timeline for `Spark 3.0 Release
>>>>>> Window` in our versioning-policy page?
>>>>>>
>>>>>> - https://spark.apache.org/versioning-policy.html
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heuermh@gmail.com> wrote:
>>>>>>
>>>>>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>>>>>>> problems resolved, e.g.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>>>>>
>>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x. As far
>>>>>>> as I know, Parquet has not cut a release based on this new version.
>>>>>>>
>>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/24851
>>>>>>> https://github.com/apache/spark/pull/24297
>>>>>>>
>>>>>>> michael
>>>>>>>
>>>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <srowen@apache.org> wrote:
>>>>>>>>
>>>>>>>> I'm curious what current feelings are about ramping down towards a
>>>>>>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>>>>>>> though in the past we had informally tossed around "back end of 2019".
>>>>>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>>>>>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>>>>>>> due.
>>>>>>>>
>>>>>>>> What are the few major items that must get done for Spark 3, in your
>>>>>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>>>>>>> should feel free to update with things that aren't really needed for
>>>>>>>> Spark 3; I already triaged some).
>>>>>>>>
>>>>>>>> For me, it's:
>>>>>>>> - DSv2?
>>>>>>>> - Finishing touches on the Hive, JDK 11 update
>>>>>>>>
>>>>>>>> What about considering a preview release earlier, as happened for
>>>>>>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>>>>>>> even happen ... about now?
>>>>>>>>
>>>>>>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>>>>>>> guess is quite early 2020, from here.
>>>>>>>>
>>>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
>>>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
>>>>>>>> SPARK-28588 Build a SQL reference doc
>>>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>>>>>> SPARK-28684 Hive module support JDK 11
>>>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after some operations
>>>>>>>> SPARK-28372 Document Spark WEB UI
>>>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>>>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>>>>>> SPARK-28103 Cannot infer filters from union table with empty local relation table properly
>>>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable smoother upgrade
>>>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
>>>>>>>> SPARK-27471 Reorganize public v2 catalog API
>>>>>>>> SPARK-27520 Introduce a global config system
>>>>>>>> to replace hadoopConfiguration
>>>>>>>> SPARK-24625 put all the backward compatible behavior change configs under spark.sql.legacy.*
>>>>>>>> SPARK-24640 size(null) returns null
>>>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>>>>>> SPARK-25383 Image data source supports sample pushdown
>>>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
>>>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
>>>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
>>>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>>>>>> SPARK-24942 Improve cluster resource management with jobs containing barrier stage
>>>>>>>> SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
>>>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>>>>>> SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
>>>>>>>> SPARK-25843 Redesign rangeBetween API
>>>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
>>>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped
>>>>>>>> aggregate and window aggregate
>>>>>>>> SPARK-25531 new write APIs for data source v2
>>>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
>>>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
>>>>>>>> SPARK-25390 data source V2 API refactoring
>>>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
>>>>>>>> SPARK-15691 Refactor and improve Hive support
>>>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>>>>>> SPARK-16217 Support SELECT INTO statement
>>>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>>>>>> SPARK-18245 Improving support for bucketed table
>>>>>>>> SPARK-19842 Informational Referential Integrity Constraints Support in Spark
>>>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
>>>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
>>>>>>>> SPARK-22386 Data Source V2 improvements
>>>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>> --
>> John Zhuge
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
