Re: Need help on ArraySpan and writing C++ udf

Wenbo Hu Mon, 17 Jul 2023 05:08:55 -0700

Hi Jin,

> but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards?
    I'm implementing sha256 as a scalar function, but the kernel always
receives an array as input. With the visitor pattern, each visit writes a
32-byte hash at the output pointer and then advances the pointer to the
next slot for the following visit.
    Something like:
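```
#include <cstdint>
#include <cstring>
#include <string_view>

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/visit_data_inline.h>

namespace cp = arrow::compute;

struct BinarySha256Visitor {
    explicit BinarySha256Visitor(uint8_t *out) : out(out) {}

    arrow::Status VisitNull() {
        // A null slot still occupies 32 bytes in the data buffer; skip it.
        out += 32;
        return arrow::Status::OK();
    }

    arrow::Status VisitValue(std::string_view v) {
        uint8_t hash[32];
        sha256(v, hash);  // sha256() is defined elsewhere, not part of Arrow

        memcpy(out, hash, 32);
        out += 32;  // advance to the next 32-byte slot
        return arrow::Status::OK();
    }

    uint8_t *out;  // write cursor into the output data buffer
};

arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan &batch,
                         cp::ExecResult *out) {
    // Buffer 1 is the data buffer (buffer 0 is validity).
    // Assumes the output span's offset is 0.
    auto *out_values = out->array_span_mutable()->GetValues<uint8_t>(1);
    BinarySha256Visitor visit(out_values);
    ARROW_RETURN_NOT_OK(
        arrow::ArraySpanVisitor<arrow::BinaryType>::Visit(batch[0].array, &visit));

    return arrow::Status::OK();
}
```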
Is it as expected?
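
To make the pointer arithmetic concrete without depending on Arrow, here is
a minimal standalone sketch of the same idea: one 32-byte slot per input,
written through a plain `uint8_t *` cursor that advances by 32 bytes for
values and nulls alike. `kWidth`, `FakeDigest` and `WriteSlots` are
hypothetical names for illustration only, and `FakeDigest` is a placeholder
pattern, not a real SHA-256.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

// Byte width of one fixed_size_binary(32) slot.
constexpr std::size_t kWidth = 32;

// Hypothetical stand-in for a real SHA-256 routine: fills the 32-byte digest
// by repeating the input bytes. This is NOT a cryptographic hash.
void FakeDigest(std::string_view v, uint8_t* digest) {
  for (std::size_t i = 0; i < kWidth; ++i) {
    digest[i] = v.empty() ? 0 : static_cast<uint8_t>(v[i % v.size()]);
  }
}

// Writes one 32-byte slot per input and returns the packed data buffer.
// A null input (std::nullopt) still occupies a slot, so the cursor advances
// in both cases.
std::vector<uint8_t> WriteSlots(
    const std::vector<std::optional<std::string_view>>& inputs) {
  std::vector<uint8_t> data(inputs.size() * kWidth, 0);
  uint8_t* out = data.data();  // plays the role of the output data pointer
  for (const auto& in : inputs) {
    if (in.has_value()) {
      uint8_t digest[kWidth];
      FakeDigest(*in, digest);
      std::memcpy(out, digest, kWidth);
    }
    out += kWidth;  // advance in bytes, not in pointer-sized steps
  }
  // The cursor must end exactly at the end of the buffer.
  assert(out == data.data() + data.size());
  return data;
}
```

With a `uint8_t **` cursor, `out++` would instead advance by
`sizeof(uint8_t *)` bytes and `*out` would read buffer contents as an
address, which is why the cursor here is a plain `uint8_t *`.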

On Mon, Jul 17, 2023 at 19:44, Jin Shang <shangjin1...@gmail.com> wrote:
>
> Hi Wenbo,
>
> > I'd like to know what the *three* `buffers` in ArraySpan are. What does
> > `1` mean when `GetValues` is called?
>
> The meaning of buffers in an ArraySpan depends on the layout of its data
> type. FixedSizeBinary is a fixed-size primitive type, so it has two
> buffers, one validity buffer and one data buffer. So GetValues(1) would
> return a pointer to the data buffer.
> Layouts of data types can be found here[1].
>
> > What is the actual type I should get from `GetValues`?
>
> Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted as
> any type to suit your need. The template parameter for `GetValues` is simply
> forwarded to reinterpret_cast. There are discussions[2] on the soundness of
> using uint8_t to represent bytes but it is what we use now. Since you are
> only doing a memcpy, uint8_t should be good.
>
> > Maybe, `auto *out_values =
> > out->array_span_mutable()->GetValues<uint8_t *>(1);` and
> > `memcpy(*out_values++, some_ptr, 32);`?
>
> I may be missing something, but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards? Otherwise I agree this is
> the way to go.
>
> [1]
> https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
> [2] https://github.com/apache/arrow/issues/36123
>
>
> On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu <huwenbo1...@gmail.com> wrote:
>
> > Hi,
> >     I'm using Acero as the stream executor to run large-scale data
> > transformations. The core data structure used in a UDF is the `ArraySpan`
> > inside `ExecSpan`, but there is not much documentation on ArraySpan. I'd
> > like to know what the *three* `buffers` in ArraySpan are. What does `1`
> > mean when `GetValues` is called?
> >     For input data, I can use an `ArraySpanVisitor` to iterate over
> > different input types. But for output data, I don't know how to write
> > to `array_span_mutable()` if the type is not a simple c_type.
> >     For example, I'm implementing a sha256 UDF whose input is
> > `arrow::utf8()` and whose output is `arrow::fixed_size_binary(32)`. How
> > can I write directly to the output buffers, and what is the actual
> > type I should get from `GetValues`?
> >     Maybe, `auto *out_values =
> > out->array_span_mutable()->GetValues<uint8_t *>(1);` and
> > `memcpy(*out_values++, some_ptr, 32);`?
> >
> > --
> > ---------------------
> > Best Regards,
> > Wenbo Hu,
> >



-- 
---------------------
Best Regards,
Wenbo Hu,
