Re: Propose configurable page size in MB (via carbon property)

2018-10-22 Thread xuchuanyin
OK. Either way, please take care of the loading performance. The validation
only needs to run for fields that may cross the boundary (e.g. varchar
and complex); for ordinary fields, just skip it.



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Propose configurable page size in MB (via carbon property)

2018-10-22 Thread Ajantha Bhat
Hi xuchuanyin,

Thanks for your inputs. Please find some details below.

1. There was already a size-based validation in the code for each row
processed, in the 'isVarCharColumnFul()' method. It checked only varchar columns.
Now I am checking complex and string columns as well.

2. The logic for dividing the complex byte array into flat byte arrays is taken
from TablePage.addComplexColumn(). This computation will be moved to my new
method and will no longer be done there.
So there is no extra computation.

3. Yes, I will make it a create table property instead of a carbon property.
I will also measure load performance once the changes are made.
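
To illustrate points 1 and 2, here is a minimal sketch of a per-row size
check that touches only the columns that can grow large. It is written in
Python for brevity and is not the CarbonData code: the class name, the
column kind labels and the per-column byte limit are all assumptions made
for the example.

```python
# Hypothetical column kinds; only these can grow large enough to matter.
VARIABLE_LENGTH_KINDS = {"string", "varchar", "complex"}


class PageSizeTracker:
    """Tracks bytes accumulated per variable-length column in the current page."""

    def __init__(self, column_kinds, limit_bytes):
        self.limit_bytes = limit_bytes  # assumed per-column limit for one page
        # Decide once which column positions need checking, so the per-row
        # loop never looks at fixed-width columns.
        self.checked = [i for i, kind in enumerate(column_kinds)
                        if kind in VARIABLE_LENGTH_KINDS]
        self.used_bytes = {i: 0 for i in self.checked}

    def would_overflow(self, row):
        """True if adding this row pushes any checked column over the limit."""
        return any(self.used_bytes[i] + len(row[i]) > self.limit_bytes
                   for i in self.checked)

    def add(self, row):
        """Account for a row that has been added to the current page."""
        for i in self.checked:
            self.used_bytes[i] += len(row[i])

    def reset(self):
        """Start a new page."""
        for i in self.checked:
            self.used_bytes[i] = 0
```

For a table with no string, varchar or complex columns, `checked` is empty
and `would_overflow` returns False without inspecting the row at all.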

Thanks,
Ajantha


On Fri, Oct 19, 2018 at 1:56 PM xuchuanyin  wrote:

> Hi, ajantha.
>
> I just went through your PR and think we may need to rethink this
> feature, especially its impact. I left a comment under your PR and will
> paste it here for further discussion in the community.
>
> I'm afraid that in common scenarios, even when we do not face page size
> problems and stay in the safe area, carbondata will still call this method
> to check the boundaries, which will degrade data loading performance.
> So is there a way to avoid the unnecessary checking here?
>
> In my opinion, to determine the upper bound of the number of rows in a
> page, the default strategy is 'number based' (32000 as the upper bound).
> Now you are adding another strategy, 'capacity based' (xxMB as the upper
> bound).
>
> There can be multiple strategies per load. The default is [number
> based], but the user can also configure [number based, capacity based]. So
> before loading, we can get the strategies and apply them while processing.
> At the same time, if the strategies are just [number based], we do not need
> to check the capacity, thus avoiding the problem I mentioned above.
>
> Note that we store the rowId in each page as a short, which means that the
> number based strategy is both the default and always required.
>
> Also, by default, the capacity based strategy is not configured. The user
> can add this strategy in:
> 1. TBLProperties in creating table
> 2. Options in loading data
> 3. Options in SdkWriter
> 4. Options in creating table using spark file format
> 5. Options in DataFrameWriter
>
> In any case, we should not configure it as a system property, because only
> a few tables use this feature, and adding it as a system property would
> decrease loading performance for the other tables as well.
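
The strategy idea in the quoted message (number based always on, capacity
based only when configured) could be sketched roughly as follows. This is a
Python illustration under stated assumptions, not the CarbonData
implementation: the class names, the `page_size_mb` option key and the
`split_into_pages` helper are hypothetical. Only the 32000 row bound and the
short rowId come from the thread.

```python
MAX_ROWS_PER_PAGE = 32000  # number-based upper bound mentioned in the thread
SHORT_MAX = 32767          # largest rowId a signed 16-bit short can hold


class NumberBased:
    """Always-on strategy: cut the page at a fixed row count."""

    def __init__(self, max_rows=MAX_ROWS_PER_PAGE):
        if max_rows > SHORT_MAX:
            raise ValueError("rowId is stored as a short")
        self.max_rows = max_rows

    def page_full(self, row_count, page_bytes):
        return row_count >= self.max_rows


class CapacityBased:
    """Optional strategy: cut the page once it reaches a size in MB."""

    def __init__(self, max_mb):
        self.max_bytes = max_mb * 1024 * 1024

    def page_full(self, row_count, page_bytes):
        return page_bytes >= self.max_bytes


def build_strategies(options):
    """Resolve the strategies once, before loading starts.

    'page_size_mb' is a hypothetical table/load option key.
    """
    strategies = [NumberBased()]
    page_size_mb = options.get("page_size_mb")
    if page_size_mb is not None:
        strategies.append(CapacityBased(int(page_size_mb)))
    return strategies


def split_into_pages(row_sizes, strategies):
    """Return the number of rows placed in each page.

    A page is closed after the row that makes any strategy report full,
    so a capacity-based page may slightly exceed its configured size.
    """
    pages, rows, size = [], 0, 0
    for row_size in row_sizes:
        rows += 1
        size += row_size
        if any(s.page_full(rows, size) for s in strategies):
            pages.append(rows)
            rows, size = 0, 0
    if rows:
        pages.append(rows)
    return pages
```

With no option set, only the row count is compared per row, so tables that
do not need the capacity limit pay nothing for it. For example,
`split_into_pages([100] * 64001, build_strategies({}))` yields pages of
32000, 32000 and 1 rows.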