I’d like to see an independent Spark Catalyst without Spark Core and Hadoop 
dependencies in Spark 3.0.


I created Enzyme (a Spark SQL compatible SQL engine that depends on Spark 
Catalyst) at Wacai for performance reasons in a non-distributed scenario.


Enzyme is a simplified version of Spark SQL, similar to liancheng’s toy 
project https://github.com/liancheng/spear , but it aims to stay compatible with 
Spark SQL and DataFrame, with Hive UDF support.


The implementation of Enzyme shamelessly mimics existing code from Spark SQL. 
Besides, I tuned it for better performance and lower memory and CPU usage.


We mainly use Enzyme to offer SQL as a DSL in our internal product for data 
analysts. People from other companies in China are interested in using Enzyme 
for ML serving. My colleagues are trying to use Enzyme in Flink Streaming 
because we can reuse our existing Hive UDFs with Enzyme.


This is my reason for making Spark Catalyst independent. We will open source 
Enzyme in several months.


Spark Catalyst is awesome. Personally, I hope it goes beyond Spark and finally 
becomes a great alternative to Calcite.




Best Regards,
Darcy Shen


Original message
From: Xiao Li gatorsm...@gmail.com
To: vaquar khan vaquar.k...@gmail.com
Cc: Reynold Xin r...@databricks.com; Mridul Muralidharan mri...@gmail.com; Mark 
Hamstra m...@clearstorydata.com; 银狐 andyye...@gmail.com; 
user@spark.apache.org...@spark.apache.org
Sent: Thursday, September 6, 2018, 23:59
Subject: Re: time for Apache Spark 3.0?


Yesterday, the 2.4 branch was created. Based on the above discussion, I think 
we can bump the master branch to 3.0.0-SNAPSHOT. Any concerns?


Thanks,


Xiao


On Sat, Jun 16, 2018 at 10:21 AM, vaquar khan vaquar.k...@gmail.com wrote:

+1 for 2.4 next, followed by 3.0.

Where can we get the Apache Spark roadmap for 2.4, 2.5 ... 3.0?
Is it possible to share a proposed specification for future releases, similar 
to the release notes (https://spark.apache.org/releases/spark-release-2-3-0.html)?


Regards,
Vaquar khan


On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan vaquar.k...@gmail.com wrote:

Please ignore the link (YouTube) in my last email; I am not sure how it got added.
Apologies, I am not sure how to delete it.




On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan vaquar.k...@gmail.com wrote:

+1


https://www.youtube.com/watch?v=-ik7aJ5U6kg



Regards,
Vaquar khan


On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin r...@databricks.com wrote:

Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.




On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan mri...@gmail.com wrote:

I agree; I don't see a pressing need for a major version bump either.
 
 
 Regards,
 Mridul
 On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra m...@clearstorydata.com wrote:
 
  Changing major version numbers is not about new features or a vague notion 
that it is time to do something that will be seen to be a significant release. 
It is about breaking stable public APIs.
 
  I still remain unconvinced that the next version can't be 2.4.0.
 
  On Fri, Jun 15, 2018 at 1:34 AM Andy andyye...@gmail.com wrote:
 
  Dear all:
 
  It has been 2 months since this topic was proposed. Any progress now? About 
half of 2018 has already passed.
 
  I agree that the new version should have some exciting new features. How 
about this one:
 
  6. ML/DL framework to be integrated as core component and feature. (Such as 
Angel / BigDL / ……)
 
  3.0 is a very important version for a good open source project. It would be 
better to shed the historical burden and focus on new areas. Spark has 
been widely used all over the world as a successful big data framework. And it 
can be better than that.
 
  Andy
 
 
  On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin r...@databricks.com wrote:
 
  There was a discussion thread on scala-contributors about Apache Spark not 
yet supporting Scala 2.12, and that got me to think perhaps it is about time 
for Spark to work towards the 3.0 release. By the time it comes out, it will be 
more than 2 years since Spark 2.0.
 
  For contributors less familiar with Spark’s history, I want to give more 
context on Spark releases:
 
  1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If we 
were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 in 2018.
 
  2. Spark’s versioning policy promises that Spark does not break stable APIs 
in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a 
necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0).
 
  3. That said, a major version isn’t necessarily the playground for disruptive 
API changes to make it painful for users to update. The main purpose of a major 
release is an opportunity to fix things that are broken in the current API and 
remove certain deprecated APIs.
 
  4. Spark as a project has a culture of evolving architecture and developing 
major new features incrementally, so major releases are not the only time for 
exciting new features. For example, the bulk of the work in the move towards 
the DataFrame API was done in Spark 1.3, and Continuous Processing was 
introduced in Spark 2.3. Both were feature releases rather than major releases.
 
 
  You can find more background in the thread discussing Spark 2.0: 
http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
 
 
  The primary motivating factor IMO for a major version bump is to support 
Scala 2.12, which requires minor API breaking changes to Spark’s APIs. Similar 
to Spark 2.0, I think there are also opportunities for other changes that we 
know have been biting us for a long time but can’t be changed in feature 
releases (to be clear, I’m actually not sure they are all good ideas, but I’m 
writing them down as candidates for consideration):
 
  1. Support Scala 2.12.
 
  2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 
2.x.
 
  3. Shade all dependencies.
 
  4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, 
to prevent users from shooting themselves in the foot, e.g. “SELECT 2 SECOND” 
-- is “SECOND” an interval unit or an alias? To make it less painful for users 
to upgrade here, I’d suggest creating a flag for backward compatibility mode.
 
  5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard 
compliant, and have a flag for backward compatibility.
 
  6. Miscellaneous other small changes documented in JIRA already (e.g. 
“JavaPairRDD flatMapValues requires function returning Iterable, not Iterator”, 
“Prevent column name duplication in temporary view”).
 
 
  Now the reality of a major version bump is that the world often thinks in 
terms of what exciting features are coming. I do think there are a number of 
major changes happening already that can be part of the 3.0 release, if they 
make it in:
 
  1. Scala 2.12 support (listing it twice)
  2. Continuous Processing non-experimental
  3. Kubernetes support non-experimental
  4. A more fleshed out version of data source API v2 (I don’t think it is 
realistic to stabilize that in one release)
  5. Hadoop 3.0 support
  6. ...
 
 
 
  Similar to the 2.0 discussion, this thread should focus on the framework and 
whether it’d make sense to create Spark 3.0 as the next release, rather than 
the individual feature requests. Those are important but are best done in their 
own separate threads.
 
 
 
 






-- 

Regards,
Vaquar Khan

+1 -224-436-0783
Greater Chicago





