gfphoenix78 commented on code in PR #1364:
URL: https://github.com/apache/cloudberry/pull/1364#discussion_r2370823806
##########
contrib/pax_storage/src/cpp/storage/pax.cc:
##########
@@ -280,17 +282,22 @@ void TableWriter::Open() {
// insert tuple into the aux table before inserting any tuples.
cbdb::InsertMicroPartitionPlaceHolder(RelationGetRelid(relation_),
current_blockno_);
+ cur_physical_size_ = 0;
}
void TableWriter::WriteTuple(TupleTableSlot *slot) {
Assert(writer_);
Assert(strategy_);
- // should check split strategy before write tuple
- // otherwise, may got a empty file in the disk
- if (strategy_->ShouldSplit(writer_->PhysicalSize(), num_tuples_)) {
- writer_->Close();
- writer_ = nullptr;
- Open();
+ // Sampled split check to reduce PhysicalSize() overhead
+ // We first perform a sampled pre-write check to avoid empty files.
+ if ((num_tuples_ % PAX_SPLIT_CHECK_INTERVAL) == 0) {
+ cur_physical_size_ = writer_->PhysicalSize();
+ if (strategy_->ShouldSplit(cur_physical_size_, num_tuples_)) {
+ writer_->Close();
+ writer_ = nullptr;
+ Open();
+ cur_physical_size_ = 0;
+ }
Review Comment:
Be careful: the file size is a soft limit, but the maximum number of tuples
is a hard limit, because the maximum number of tuples is also constrained by the
layout of `offset` in the CTID, defined in
https://github.com/apache/cloudberry/blob/main/contrib/pax_storage/src/cpp/storage/pax_itemptr.h#L38
A sampled check must therefore never let the tuple count exceed that limit.
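
One way to keep the sampling while respecting the hard limit is to check the tuple count on every write and only sample the size lookup. The following is a minimal sketch, not the PR's code: `SplitState`, `ShouldSplitBeforeWrite`, and the three constants are hypothetical names and values chosen for illustration, and the `physical_size` callable merely stands in for `writer_->PhysicalSize()`.

```cpp
#include <cstddef>

// Hypothetical values for illustration only. The real limits come from the
// split strategy and from the CTID `offset` layout in pax_itemptr.h.
constexpr std::size_t kSplitCheckInterval = 16;
constexpr std::size_t kMaxTuplesPerFile = 100;
constexpr std::size_t kMaxFileSize = 1024;

// Hypothetical stand-in for the writer's bookkeeping.
struct SplitState {
  std::size_t num_tuples = 0;            // tuples already written to the file
  std::size_t cached_physical_size = 0;  // last sampled file size
};

// Returns true when the current file should be closed before the next write.
// `physical_size` is any callable returning the current file size.
template <typename SizeFn>
bool ShouldSplitBeforeWrite(SplitState &state, SizeFn physical_size) {
  // Hard limit: cheap to evaluate, so it is checked on every tuple.
  if (state.num_tuples >= kMaxTuplesPerFile) {
    return true;
  }
  // Soft limit: the size lookup is the expensive part, so it is sampled.
  // Skipping num_tuples == 0 avoids splitting an empty file.
  if (state.num_tuples > 0 && state.num_tuples % kSplitCheckInterval == 0) {
    state.cached_physical_size = physical_size();
    return state.cached_physical_size >= kMaxFileSize;
  }
  return false;
}
```

With these illustrative values, the tuple limit (100) is not a multiple of the sampling interval (16), which is exactly the case where a purely sampled check would overshoot the hard limit.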
##########
contrib/pax_storage/src/cpp/storage/orc/orc_writer.cc:
##########
@@ -361,7 +361,11 @@ std::vector<std::pair<int, Datum>>
OrcWriter::PrepareWriteTuple(
// Numeric always need ensure that with the 4B header, otherwise it will
// be converted twice in the vectorization path.
if (required_stats_cols[i] || VARATT_IS_COMPRESSED(tts_value_vl) ||
- VARATT_IS_EXTERNAL(tts_value_vl) || attrs->atttypid == NUMERICOID) {
+ VARATT_IS_EXTERNAL(tts_value_vl)
+#ifdef VEC_BUILD
+ || attrs->atttypid == NUMERICOID
+#endif
Review Comment:
Why is this affected by the macro? This function only takes effect on the write path,
so it behaves the same whether vectorization is on or off.