nsivabalan commented on issue #3676: URL: https://github.com/apache/hudi/issues/3676#issuecomment-922508543
I found the root cause. In MOR, when the index in use cannot index log files (which is the case for all out-of-the-box indexes in Hudi), we just choose the smallest parquet file for every commit; the idea is that, over time, every file will grow to its full size. Source: [link](https://github.com/apache/hudi/blob/3354fac42f9a2c4dbc8ac73ca4749160e9b9459b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java#L66)

@vinothchandar: do you know what the reason is here? My understanding is that new inserts will not go into delta logs at all, and only updates go into delta logs. If that is true, we don't need any special handling based on whether the HoodieIndex can index log files or not. Or am I missing something here?

@FelixKJose: so, if you keep adding more and more commits, each parquet data file will slowly grow to its full size; it's just that for any one commit, only one base file per partition will be chosen.

Also, a suggestion w.r.t. some other configs, since we are dealing with small file handling: do pay attention to [recordsizeestimate](https://hudi.apache.org/docs/configurations#hoodiecopyonwriterecordsizeestimate), because only those files whose size is > ([recordsizeestimationthreshold](https://hudi.apache.org/docs/configurations#hoodierecordsizeestimationthreshold) * [smallfilelimit](https://hudi.apache.org/docs/configurations#hoodieclusteringplanstrategysmallfilelimit)) will be considered while determining the average record size.

To illustrate with an example: for the first commit, Hudi relies on [recordsizeestimate](https://hudi.apache.org/docs/configurations#hoodiecopyonwriterecordsizeestimate) to pack records into data files. After that, Hudi can calculate the average record size from the previous commit's stats, but even then, only files above a minimum size threshold are considered. With the default values, the record size estimation threshold is 1.0 and the parquet small file size is 100MB.
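The threshold check described above can be sketched roughly as follows. This is a minimal, hypothetical simplification, not Hudi's actual implementation: the class and method names are illustrative, and real Hudi derives the average from commit metadata rather than per-file arrays as done here.

```java
// Hypothetical sketch of the record-size estimation rule described above.
// Assumed defaults: threshold 1.0, small-file limit 100MB, estimate 1024 bytes.
public class Main {
    static final double RECORD_SIZE_ESTIMATION_THRESHOLD = 1.0;
    static final long SMALL_FILE_LIMIT_BYTES = 100L * 1024 * 1024;
    static final long RECORD_SIZE_ESTIMATE_BYTES = 1024; // static config fallback

    /**
     * Returns the average record size computed from previous-commit file stats,
     * falling back to the static estimate when no file is large enough to trust.
     */
    static long avgRecordSize(long[] fileSizes, long[] recordCounts) {
        long bytes = 0, records = 0;
        for (int i = 0; i < fileSizes.length; i++) {
            // Only files above threshold * smallFileLimit count toward the average.
            if (fileSizes[i] > RECORD_SIZE_ESTIMATION_THRESHOLD * SMALL_FILE_LIMIT_BYTES) {
                bytes += fileSizes[i];
                records += recordCounts[i];
            }
        }
        return records == 0 ? RECORD_SIZE_ESTIMATE_BYTES : bytes / records;
    }

    public static void main(String[] args) {
        // 50MB and 80MB files: neither exceeds 100MB, so the static
        // estimate (1024 bytes) wins even if records are really ~200KB.
        System.out.println(avgRecordSize(
                new long[]{50L << 20, 80L << 20}, new long[]{256, 409})); // 1024
        // A 150MB file with 768 records: its stats now drive the average.
        System.out.println(avgRecordSize(
                new long[]{150L << 20, 50L << 20}, new long[]{768, 256})); // 204800 (~200KB)
    }
}
```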
So only those data files whose size is > 100MB will be looked at to determine the average record size; until then, Hudi just takes the value from [recordsizeestimate](https://hudi.apache.org/docs/configurations#hoodiecopyonwriterecordsizeestimate). For example, say the actual average record size in your case is 200KB, but you did not set the right value for [recordsizeestimate](https://hudi.apache.org/docs/configurations#hoodiecopyonwriterecordsizeestimate). Until there is at least one data file larger than 100MB, Hudi will keep assuming the average record size is the default of 1024 bytes and will assign records to data files based on that estimate.
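To see why an underestimated record size matters, here is a hypothetical sketch of how the estimate drives small-file packing. The 120MB target file size and the helper method are assumptions for illustration, not Hudi's actual partitioner logic:

```java
// Hypothetical simplification: how an average-record-size estimate determines
// how many incoming records get packed into a chosen small base file.
public class Main {
    static final long MAX_FILE_SIZE = 120L * 1024 * 1024; // assumed target file size

    /**
     * Number of incoming records assigned to a small base file, given its
     * current size and the (possibly wrong) average record size estimate.
     */
    static long recordsToAssign(long currentFileSize, long avgRecordSizeEstimate) {
        return Math.max(0, (MAX_FILE_SIZE - currentFileSize) / avgRecordSizeEstimate);
    }

    public static void main(String[] args) {
        long current = 20L * 1024 * 1024; // a 20MB small file
        // With the default 1024-byte estimate, ~102k records are assigned...
        System.out.println(recordsToAssign(current, 1024)); // 102400
        // ...but if records are really ~200KB each, only ~512 actually fit,
        // so the written file balloons far past the target size.
        System.out.println(recordsToAssign(current, 200L * 1024)); // 512
    }
}
```

In other words, with a 200x underestimate the writer plans for 200x more records per file than actually fit, which is why setting [recordsizeestimate](https://hudi.apache.org/docs/configurations#hoodiecopyonwriterecordsizeestimate) correctly matters until real commit stats kick in.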