Hi Ajantha,

Thanks for taking the initiative. I have a couple of questions, though.

a) As per your explanation, the dataset validation is already done as part
of the source table, is that what you mean? My understanding is that
insert-select queries will get some benefit since we skip those additional
steps.

b) What if the destination table has some different table properties, for
example a few columns are non-nullable, or the date format, decimal
precision, or scale is different from the source?
In that case you may still need bad record support, so how are you going to
handle such scenarios? Correct me if I have misinterpreted your points.
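
To make the concern concrete, here is a rough sketch of the kind of mismatch
I mean (made-up tables and columns; the exact DDL may differ with the carbon
version):

  // Source keeps a wider decimal and a string date; the destination is
  // narrower and strongly typed, so a plain copy can produce rows that
  // violate the destination schema.
  spark.sql("""
    CREATE TABLE src_sales (id INT, amount DECIMAL(12,4), sale_date STRING)
    STORED AS carbondata
  """)
  spark.sql("""
    CREATE TABLE dst_sales (id INT, amount DECIMAL(8,2), sale_date DATE)
    STORED AS carbondata
  """)

  // Null ids, out-of-range decimals, or unparsable dates would still need
  // a conversion step or a bad record path during this insert:
  spark.sql("INSERT INTO dst_sales SELECT id, amount, sale_date FROM src_sales")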

Regards,
Sujith


On Fri, 20 Dec 2019 at 5:25 AM, Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Currently, carbondata "insert into" uses CarbonLoadDataCommand itself.
> The load process has steps such as parsing and a converter step with bad
> record support.
> Insert into doesn't require these steps, as the data is already validated
> and converted by the source table or dataframe.
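>
> To make the difference concrete, the two entry points look roughly like
> this (illustrative table names, path, and options; see the DML docs for
> the exact load options):
>
>   // LOAD DATA ingests raw files, so it needs parsing, the converter
>   // step, and bad record handling:
>   spark.sql("""
>     LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE target_t
>     OPTIONS('DELIMITER'=',', 'BAD_RECORDS_ACTION'='REDIRECT')
>   """)
>
>   // INSERT INTO reads rows that are already typed and validated by the
>   // source table, so those steps are redundant:
>   spark.sql("INSERT INTO target_t SELECT * FROM source_t")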
>
> Some of the identified changes are listed below.
>
> 1. Need to refactor and separate load and insert at the driver side to
> skip the converter step and unify the flow for no-sort and global-sort
> inserts.
> 2. Need to avoid reordering each row, by changing the select dataframe's
> projection order itself during the insert into (a rough sketch follows
> this list).
> 3. For carbon-to-carbon insert, need to provide the ReadSupport and use
> the RecordReader (the vector reader currently doesn't support ReadSupport)
> to handle null values and the timestamp cutoff (direct dictionary) from
> the scanRDD result.
> 4. Need to handle insert into partitioned/non-partitioned tables in the
> local sort, global sort, no sort, range column, and compaction flows.
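>
> A rough sketch of point 2, assuming the Spark DataFrame API (table,
> column, and dataframe names are only illustrative):
>
>   import org.apache.spark.sql.functions.col
>
>   // Column order of the destination table (illustrative).
>   val targetColumnOrder = Seq("id", "name", "amount", "sale_date")
>
>   // sourceDf is the dataframe produced by the select side of the insert.
>   // Reorder the projection once on the source dataframe instead of
>   // reordering every row later in the write path.
>   val reordered = sourceDf.select(targetColumnOrder.map(col): _*)
>   reordered.createOrReplaceTempView("insert_src")
>   spark.sql("INSERT INTO target_t SELECT * FROM insert_src")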
>
> The final goal is to improve insert performance by keeping only the
> required logic, and also to reduce the memory footprint.
>
> If you have any other suggestions or optimizations related to this, let
> me know.
>
> Thanks,
> Ajantha
>
