[
https://issues.apache.org/jira/browse/MADLIB-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426356#comment-16426356
]
Jingyi Mei commented on MADLIB-1224:
------------------------------------
Some math:
# For row size: each row from the source table is first put into an array, and
the arrays are then aggregated using madlib.matrix_agg into a double
precision[] value (8 bytes per array element), so the estimated size s of
one super-row is:
_s = 8 bytes * num_of_element_in_an_array * buffer_size_
and s <= 1GB.
# For data distribution: each segment gets k super-rows:
_k = total_num_of_rows_in_source_table / (buffer_size * num_of_segment)_
and k >= p,
where p is the minimum number of super-rows stored on each segment (a threshold).
We use a more conservative limit for constraint 1, i.e., s <= 600MB
(600MB / 8 bytes = 75 million elements). Simplifying 1 and 2 then gives:
_buffer_size <= min(75 million / num_of_element_in_an_array,
total_num_of_rows_in_source_table / (num_of_segment * p))_
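
The bound above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the MADlib implementation: the function name, the parameter names, and the floor of 1 are assumptions, and the comment does not say what value of p would be used.

```python
BYTES_PER_DOUBLE = 8                 # one double precision array element
MAX_SUPER_ROW_BYTES = 600 * 10**6    # conservative 600MB cap (decimal MB)


def default_buffer_size(num_elements_per_row, total_rows,
                        num_segments, min_super_rows_per_segment):
    """Largest buffer_size satisfying both constraints (hypothetical helper).

    Constraint 1: 8 * num_elements_per_row * buffer_size <= 600MB
    Constraint 2: total_rows / (buffer_size * num_segments) >= p
    """
    # Constraint 1 rearranged: buffer_size <= 75 million / num_elements_per_row
    size_limit = MAX_SUPER_ROW_BYTES // (BYTES_PER_DOUBLE * num_elements_per_row)
    # Constraint 2 rearranged: buffer_size <= total_rows / (num_segments * p)
    distribution_limit = total_rows // (num_segments * min_super_rows_per_segment)
    # Assumption: never return less than 1 row per buffer, even for tiny tables.
    return max(1, min(size_limit, distribution_limit))


# 100 elements per row, 1M rows, 20 segments, p = 10:
# size limit is 750000, distribution limit is 5000, so the result is 5000.
print(default_buffer_size(100, 1_000_000, 20, 10))
```

With narrow rows the distribution constraint usually dominates; with very wide rows (for example 1 million elements per row) the 600MB size cap takes over and limits the buffer to 75 rows.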
> Select default buffer size for mini-batch preprocessor
> ------------------------------------------------------
>
> Key: MADLIB-1224
> URL: https://issues.apache.org/jira/browse/MADLIB-1224
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Utilities
> Reporter: Jingyi Mei
> Priority: Major
> Fix For: v1.14
>
>
> As a follow-up to https://issues.apache.org/jira/browse/MADLIB-1200
>
> In minibatch_preprocessor, we made buffer_size an optional parameter. If
> it is not set, a default value is assigned. Current considerations
> are:
> # Within a segment, each cell has a 1GB limit, so we can't pack so many rows
> into one super-row that it exceeds the limit.
> # Across segments, data should be distributed as evenly as possible to avoid
> data skew, so that GPDB can work more efficiently.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)