Hi all,

Currently CarbonData only works with Spark 1.5 and Spark 1.6. As the Apache Spark
community is moving to 2.1, more and more users will deploy Spark 2.x in
production environments. To make CarbonData even more popular, I think
now is a good time to start considering Spark 2.x integration with
CarbonData.

Moreover, we can take this as a chance to refactor CarbonData to make it
both easier to use and higher performing.

Usability:
Instead of using CarbonContext, in the Spark 2 integration the user should be
able to
1. use the native SparkSession in a Spark application to create and query
tables backed by CarbonData files, with full feature support including indexes
and late-decode optimization.

2. use CarbonData's API and tools to accomplish carbon-specific tasks, like
compaction, deleting segments, etc.
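To make these usability goals concrete, here is a hypothetical sketch of what the user-facing flow might look like. Note that the "carbondata" format name, the table schema, and the compaction SQL syntax below are all assumptions of this proposal for illustration, not an existing API:

```scala
// Hypothetical sketch: a plain SparkSession, no CarbonContext needed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("carbon-example")
  .getOrCreate()

// Create and query a table backed by CarbonData files, with index and
// late-decode optimizations applied transparently by the planner.
// The "carbondata" format name is an assumption of this proposal.
spark.sql("CREATE TABLE sales (id INT, amount DOUBLE) USING carbondata")
spark.sql("INSERT INTO sales VALUES (1, 10.5)")
spark.sql("SELECT id, sum(amount) FROM sales GROUP BY id").show()

// Carbon-specific maintenance tasks exposed through SQL or an API,
// e.g. compaction (the syntax here is illustrative only).
spark.sql("ALTER TABLE sales COMPACT 'minor'")
```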

Performance:
1. deep integration with the DataSource API, leveraging Spark 2's whole-stage
codegen feature.

2. provide an implementation of a vectorized record reader to improve scan
performance.

Since Spark 2 changes a lot compared to Spark 1.6, it may take some time to
complete all these features. With the help of contributors and committers, I
hope we can have the basic features working in the next CarbonData release.

What do you think about this idea? All kinds of contribution and suggestions
are welcomed.

Regards,
Jacky Li




--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Feature-Proposal-Spark-2-integration-with-CarbonData-tp3236.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.