There may be a built-in way to do this now, but I wrote this function a few
years ago and it may be helpful [1].

The function, `ExtendTable`, takes a `base_table` and adds the columns from
`ext_table` to it as a "column bind". `FieldVec` is my alias for
`arrow::FieldVector`, which is a `std::vector<std::shared_ptr<arrow::Field>>`
[2]. Similarly, `ChunkedArrVec` is my alias for `arrow::ChunkedArrayVector`,
which is a `std::vector<std::shared_ptr<arrow::ChunkedArray>>` [3]. My relevant
header is datatypes.hpp [4].

This is a zero-copy approach in the sense that it copies each shared_ptr, but
not the data it points to. It requires extending both the schema and the vector
of columns, which I think should be self-explanatory in the code.

Otherwise, I'm quite sure AddColumn [5] is itself zero-copy, which should be
enough here since your sample code doesn't seem to add more than one column.

[1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/develop/src/cpp/processing/dataops.cpp#L305-L337

[2]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/type_fwd.h#L68

[3]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/type_fwd.h#L88

[4]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/develop/src/cpp/headers/datatypes.hpp

[5]: https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4NK5arrow5Table9AddColumnEiNSt10shared_ptrI5FieldEENSt10shared_ptrI12ChunkedArrayEE




# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Tuesday, February 27th, 2024 at 09:16, Blair Azzopardi <blai...@gmail.com> 
wrote:

> Hi
>
> I'm curious if there's a way of creating a zero copy union of two tables.
> Currently, I'm augmenting an existing table by adding new columns (with say
> moving averages - see snippet below). I do a pointer swap at the end and
> release the memory of the old table (reset).
>
> I wonder if it's more efficient if I created a new table with the new columns
> and then created some kind of "zero-copy table union" of the new table with
> the old table. Does that exist?
>
> That said, perhaps the AddColumn method does re-use the existing table memory
> location when it creates a "new Table".
>
> arrow::Status AddMovingAverage(shared_ptr<arrow::Table>& table,
>                                const std::string& colNameIn, int n,
>                                const std::string& colNameOut) {
>   auto vals = table->GetColumnByName(colNameIn);
>
>   // calculate moving average vector
>   vector<double> ma....
>
>   // convert vector to arrow array
>   shared_ptr<arrow::Array> ma_arr;
>   arrow::DoubleBuilder dbl_builder = arrow::DoubleBuilder();
>
>   ARROW_RETURN_NOT_OK(dbl_builder.AppendValues(ma.begin(), ma.end()));
>   ARROW_ASSIGN_OR_RAISE(ma_arr, dbl_builder.Finish());
>   // LOG(INFO) << ma_arr->ToString() << std::endl;
>
>   // add new column to table (need to convert to chunked array first)
>   auto f0 = arrow::field(colNameOut, arrow::float64());
>   auto ma_chunked_arr = std::make_shared<arrow::ChunkedArray>(ma_arr);
>
>   // Can this be done more efficiently with copying the original table to
>   // a new memory location?
>   ARROW_ASSIGN_OR_RAISE(auto new_table,
>                         table->AddColumn(0, f0, ma_chunked_arr));
>
>   // swap pointer to new table and clean up
>   table.swap(new_table);
>   new_table.reset();
>
>   return arrow::Status::OK();
> }
