That's really helpful, and your ExtendTable function helps me understand the
internals a little better. You're right that my sample code only adds one
column, although I also call the function multiple times on the same table
pointer.
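
To check my understanding of "copying the shared_ptr, but not the data
itself", here is a minimal sketch of a column bind. It deliberately uses
plain standard-library types as stand-ins rather than Arrow itself:
MiniTable, Column and ColumnBind are hypothetical names, not Arrow's API and
not the linked ExtendTable code. The names vector plays the role of the
schema and the pointer vector plays the role of the ChunkedArrayVector.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for arrow::ChunkedArray and arrow::Table.
// This models the sharing behaviour only; it is NOT the Arrow API.
using Column = std::vector<double>;
using ColumnPtr = std::shared_ptr<const Column>;

struct MiniTable {
  std::vector<std::string> names;  // stands in for the schema's fields
  std::vector<ColumnPtr> columns;  // stands in for the vector of columns
};

// Column bind: returns a new table holding base's columns followed by ext's.
// Only the shared_ptr handles are copied; the column buffers are shared
// between the input tables and the result.
MiniTable ColumnBind(const MiniTable& base, const MiniTable& ext) {
  if (!base.columns.empty() && !ext.columns.empty() &&
      base.columns.front()->size() != ext.columns.front()->size()) {
    throw std::invalid_argument("row counts differ");
  }
  MiniTable out = base;  // copies names and pointers, not column data
  out.names.insert(out.names.end(), ext.names.begin(), ext.names.end());
  out.columns.insert(out.columns.end(), ext.columns.begin(),
                     ext.columns.end());
  return out;
}
```

In this model the result's columns point at the same buffers as the inputs,
so building the "new" table costs one pointer copy per column regardless of
how many rows there are.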

On Wed, 28 Feb 2024 at 06:35, Aldrin <octalene....@pm.me> wrote:

> There may be something now, but I wrote this a few years ago and it may be
> helpful [1].
>
> The function, ExtendTable, takes a base_table and adds the columns from
> ext_table to it as a "column bind". FieldVec is an arrow::FieldVector which
> is a std::vector<std::shared_ptr<arrow::Field>> [2]. Similarly,
> ChunkedArrVec is an arrow::ChunkedArrayVector which is a
> std::vector<std::shared_ptr<arrow::ChunkedArray>> [3]. My relevant header
> is datatypes.hpp [4].
>
> This is a zero-copy approach in the sense that I'm copying shared_ptr, but
> not the data itself. This requires extending both the schema and the vector
> of columns (which I think should be self-explanatory in the code).
>
> Otherwise, I'm quite sure adding columns [5] is a zero-copy function (your
> sample code doesn't seem to add more than one column).
>
> [1]:
> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/develop/src/cpp/processing/dataops.cpp#L305-L337
> [2]:
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/type_fwd.h#L68
> [3]:
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/type_fwd.h#L88
> [4]:
> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/develop/src/cpp/headers/datatypes.hpp
> [5]:
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow5Table9AddColumnEiNSt10shared_ptrI5FieldEENSt10shared_ptrI12ChunkedArrayEE
>
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Tuesday, February 27th, 2024 at 09:16, Blair Azzopardi <
> blai...@gmail.com> wrote:
>
> Hi
>
> I'm curious whether there's a way of creating a zero-copy union of two
> tables. Currently, I'm augmenting an existing table by adding new columns
> (with, say, moving averages; see the snippet below). I do a pointer swap at
> the end and release the memory of the old table (reset).
>
> I wonder whether it would be more efficient to create a new table with the
> new columns and then create some kind of "zero-copy table union" of the new
> table with the old table. Does that exist?
>
> That said, perhaps the AddColumn method does re-use the existing table
> memory location when it creates a "new Table".
>
> arrow::Status AddMovingAverage(std::shared_ptr<arrow::Table>& table,
>                                const std::string& colNameIn, int n,
>                                const std::string& colNameOut) {
>   auto vals = table->GetColumnByName(colNameIn);
>
>   // calculate moving average vector
>   std::vector<double> ma
>   ....
>
>   // convert vector to arrow array
>   std::shared_ptr<arrow::Array> ma_arr;
>   arrow::DoubleBuilder dbl_builder;
>
>   ARROW_RETURN_NOT_OK(dbl_builder.AppendValues(ma.begin(), ma.end()));
>   ARROW_ASSIGN_OR_RAISE(ma_arr, dbl_builder.Finish());
>   // LOG(INFO) << ma_arr->ToString() << std::endl;
>
>   // add new column to table (need to convert to chunked array first)
>   auto f0 = arrow::field(colNameOut, arrow::float64());
>   auto ma_chunked_arr = std::make_shared<arrow::ChunkedArray>(ma_arr);
>
>   // Can this be done more efficiently, without copying the original table
>   // to a new memory location?
>   ARROW_ASSIGN_OR_RAISE(auto new_table,
>                         table->AddColumn(0, f0, ma_chunked_arr));
>
>   // swap pointer to new table and clean up
>   table.swap(new_table);
>   new_table.reset();
>
>   return arrow::Status::OK();
> }
>
>
>
