[Discussion] Support GlobalSort in the CDC Flow

2020-05-12 Thread haomarch
If there is a GlobalSort table in the CDC flow, the following exception is
thrown:

Exception in thread "main" java.lang.RuntimeException: column: id specified in sort columns does not exist in schema
    at org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildTableSchema(CarbonWriterBuilder.java:828)
    at org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildCarbonTable(CarbonWriterBuilder.java:794)
    at org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildLoadModel(CarbonWriterBuilder.java:720)
    at org.apache.spark.sql.carbondata.execution.datasources.CarbonSparkDataSourceUtil$.prepareLoadModel(CarbonSparkDataSourceUtil.scala:281)
    at org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat.prepareWrite(SparkCarbonFileFormat.scala:141)
    at org.apache.spark.sql.execution.command.mutation.merge.CarbonMergeDataSetCommand.processIUD(CarbonMergeDataSetCommand.scala:269)
    at org.apache.spark.sql.execution.command.mutation.merge.CarbonMergeDataSetCommand.processData(CarbonMergeDataSetCommand.scala:152)



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


[Discussion] Remove 'Create Stream'

2020-05-12 Thread haomarch
There are several reasons that make 'Create Stream' hard to apply to
production systems:

1. No recovery.
2. Lack of data integrity assurance, especially when the stream is restarted.
3. Users do not understand the details of 'Create Stream', so it is difficult
to trace and debug.

Shall we remove the 'create stream' flow? 





Re: [Discussion] Remove 'Create Stream'

2020-05-12 Thread David CaiQiang
How about marking the stream SQL as experimental?

Now in some cases, it is an easy way for the user to understand the
streaming table.

We can improve it in the future.



-
Best Regards
David Cai


Re: [Discussion] Support GlobalSort in the CDC Flow

2020-05-12 Thread David CaiQiang
In my opinion, this is an issue if it can't work.

Better to change the topic title to use 'question'/'issue' instead of
'discussion'.



-
Best Regards
David Cai


[Discussion] Optimize the Update Performance

2020-05-12 Thread haomarch
There is an interesting paper "L-Store: A Real-time OLTP and OLAP System",
which uses a creative way to improve update performance.

The idea is:
*1. Store the updated column value in the tail page.*
When any column of a record is updated, a new tail page is created and
appended to the page dictionary.
Only the updated column value is stored in the tail page. By contrast, the
current CarbonData implementation writes the whole row even when only a few
columns are updated, so L-Store's approach effectively avoids write
amplification.
The tail page also stores the rowid and updatedcolumnid together with the
updated column value.
Based on the updatedcolumnid, the row data can be reconstructed by reading
the base page and the tail pages during query processing.
*2. Incremental update in the tail page.*
Assume that we update 2 columns, 1 column per update. There are two ways to
store the updated columns in the tail page:

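 2.1: Non-incremental Update:
basepage:  original row
tailpage1: value written by update 1
tailpage2: value written by update 2

 2.2: Incremental Update:
basepage:  original row
tailpage1: value written by update 1
tailpage2: values written by update 1 and update 2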

Non-incremental update stores only the column value changed by the current
update, which gives lower write amplification but worse query performance.
Incremental update stores the column value changed by the current update
together with the column values changed by previous updates, which gives
higher write amplification but better query performance.
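
To make the trade-off concrete, here is a minimal Python sketch of the two
tail-page layouts as described in this mail. It is not L-Store's or
CarbonData's code: the names Table, TailRecord, update, read and run are
hypothetical, a "page" is reduced to one in-memory record per update, and
details of the paper such as the page dictionary are left out.

```python
from dataclasses import dataclass, field


@dataclass
class TailRecord:
    row_id: int
    values: dict  # updated column id -> updated value


@dataclass
class Table:
    base: dict                 # row id -> list of column values (base page)
    incremental: bool = False  # True: carry earlier updates forward
    tail: list = field(default_factory=list)  # append-only tail records
    cells_written: int = 0     # column values written by updates so far

    def _latest(self, row_id):
        # Newest tail record for this row, or None if it was never updated.
        for rec in reversed(self.tail):
            if rec.row_id == row_id:
                return rec
        return None

    def update(self, row_id, col_id, value):
        values = {col_id: value}
        if self.incremental:
            # Incremental: copy the values of previous updates as well.
            prev = self._latest(row_id)
            if prev is not None:
                values = {**prev.values, col_id: value}
        self.tail.append(TailRecord(row_id, values))
        self.cells_written += len(values)

    def read(self, row_id):
        # Returns (row, number of tail records merged with the base row).
        row = list(self.base[row_id])
        if self.incremental:
            latest = self._latest(row_id)
            records = [latest] if latest is not None else []
        else:
            records = [r for r in self.tail if r.row_id == row_id]
        for rec in records:  # oldest first, so later updates win
            for col_id, value in rec.values.items():
                row[col_id] = value
        return row, len(records)


def run(incremental):
    # Three-column row, two updates, one column per update.
    table = Table(base={0: ["a", "b", "c"]}, incremental=incremental)
    table.update(0, 0, "A")
    table.update(0, 1, "B")
    row, merged = table.read(0)
    return row, merged, table.cells_written
```

In this toy setup both variants return the row ['A', 'B', 'c']. run(False)
writes 2 values and merges 2 tail records on read, while run(True) writes 3
values and merges only 1. Rewriting the whole three-column row on each of the
two updates would write 6 values.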

We should study the work of L-Store and optimize the update performance; it
will improve CarbonData's competitiveness.



