mapleFU commented on code in PR #45498:
URL: https://github.com/apache/arrow/pull/45498#discussion_r1951932595
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -158,7 +173,7 @@ class ResizableValueDescWriter : public ValueDescWriter<ResizableValueDescWriter
void PushValue(ParsedValueDesc v) {
if (ARROW_PREDICT_FALSE(values_size_ == values_capacity_)) {
values_capacity_ = values_capacity_ * 2;
-      ARROW_CHECK_OK(values_buffer_->Resize(values_capacity_ * sizeof(*values_)));
+ status_ &= values_buffer_->Resize(values_capacity_ * sizeof(*values_));
Review Comment:
When `!status_.ok()`, isn't `values_buffer_` no longer guaranteed to be valid?
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -171,12 +186,26 @@ class ResizableValueDescWriter : public ValueDescWriter<ResizableValueDescWriter
// faster CSV parsing code.
class PresizedValueDescWriter : public ValueDescWriter<PresizedValueDescWriter> {
public:
+ // The number of offsets being written will be `1 + num_rows * num_cols`,
+ // however we allow for one extraneous write in case of excessive columns,
+ // hence `2 + num_rows * num_cols` (see explanation in PushValue below).
PresizedValueDescWriter(MemoryPool* pool, int32_t num_rows, int32_t num_cols)
- : ValueDescWriter(pool, /*values_capacity=*/1 + num_rows * num_cols) {}
+ : ValueDescWriter(pool, /*values_capacity=*/2 + num_rows * num_cols) {}
void PushValue(ParsedValueDesc v) {
DCHECK_LT(values_size_, values_capacity_);
- values_[values_size_++] = v;
+ values_[values_size_] = v;
+ // We must take care not to write past the buffer's end if the line being
+ // parsed has more than `num_cols` columns. The obvious solution of setting
+ // an error status hurts too much on benchmarks, which is why we instead
+ // cap `values_size_` to stay inside the buffer.
+ //
+ // Not setting an error immediately is not a problem since the `num_cols`
+ // mismatch is detected later in ParseLine.
+ //
+ // Note that we want `values_size_` to reflect the number of written values
+    // in the nominal case, which is why we choose a slightly larger `values_capacity_`.
+ values_size_ += (values_size_ != values_capacity_ - 1);
Review Comment:
So the extra last slot is there to absorb the newest write without clobbering the previously written valid values. But I'm not sure: wouldn't `1 + num_rows * num_cols` have been enough, given that the last value (`values_[num_rows * num_cols]`) can still be written?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]