Hi Wenbo,

Sorry I wasn't clear enough. There are two issues with the code snippet.

As Aldrin pointed out, you are working with uint8_t*, which I don't think
is right. The buffer stores bytes, not pointers.

auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1);
>
It should be GetValues<uint8_t>(1), not GetValues<uint8_t*>(1)

memcpy(*out++, hash, 32);
>
It should be memcpy(out, hash, 32); out += 32;

Also, I see that you are not handling nulls at all. I assume you want to
propagate nulls from the input. In that case you'll want to set
kernel.null_handling=NullHandling::INTERSECTION, so that the executor will
handle the validity bitmap for you.

Best,
Jin

On Mon, Jul 17, 2023 at 10:41 PM Weston Pace <weston.p...@gmail.com> wrote:

> > I may be missing something, but why copy to *out_values++ instead of
> > *out_values and add 32 to out_values afterwards? Otherwise I agree this
> is
> > the way to go.
>
> I agree with Jin.  You should probably be incrementing `out` by 32 each
> time `VisitValue` is called.
>
> On Mon, Jul 17, 2023 at 6:38 AM Aldrin <octalene....@pm.me.invalid> wrote:
>
> > Oh wait, I see now that you're incrementing with a uint8_t*. That could
> be
> > fine for your own use, but you might want to make sure it aligns with the
> > type of your output (Int64Array vs Int32Array).
> >
> > Sent from Proton Mail for iOS
> >
> >
> > On Mon, Jul 17, 2023 at 06:20, Aldrin <octalene....@pm.me.INVALID
> > <On+Mon,+Jul+17,+2023+at+06:20,+Aldrin+%3C%3Ca+href=>> wrote:
> >
> > Hi Wenbo,
> >
> > An ArraySpan is like an ArrayData but does not own the data, so the
> > ColumnarFormat doc that Jon shared is relevant for both.
> >
> > In the case of a binary format, the output ArraySpan must have at least 2
> > buffers: the offsets and the contiguous binary data (values). If the
> output
> > of your UDF is something like an Int32Array with no nulls, then I think
> > you're writing output correctly.
> >
> > But, since your pointer is a uint8_t, I think Jin is right and `++` is
> > going to move your pointer 1 byte instead of 32 bytes like you intend.
> >
> > Sent from Proton Mail for iOS
> >
> >
> > On Mon, Jul 17, 2023 at 05:06, Wenbo Hu <huwenbo1...@gmail.com
> > <On+Mon,+Jul+17,+2023+at+05:06,+Wenbo+Hu+%3C%3Ca+href=>> wrote:
> >
> > Hi Jin,
> >
> > > but why copy to *out_values++ instead of
> > > *out_values and add 32 to out_values afterwards?
> > I'm implementing the sha256 function as a scalar function, but it
> > always inputs with an array, so on visitor pattern, I'll write a 32
> > byte hash into the pointer and move to the next for next visit.
> > Something like:
> > ```
> >
> > struct BinarySha256Visitor {
> > BinarySha256Visitor(uint8_t **out) {
> > this->out = out;
> > }
> > arrow::Status VisitNull() {
> > return arrow::Status::OK();
> > }
> >
> > arrow::Status VisitValue(std::string_view v) {
> >
> > uint8_t hash[32];
> > sha256(v, hash);
> >
> > memcpy(*out++, hash, 32);
> >
> > return arrow::Status::OK();
> > }
> >
> > uint8_t ** out;
> > };
> >
> > arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan
> > &batch, cp::ExecResult *out) {
> > arrow::ArraySpanVisitor<arrow::BinaryType> visitor;
> >
> > auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1);
> > BinarySha256Visitor visit(out_values);
> > ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit));
> >
> > return arrow::Status::OK();
> > }
> > ```
> > Is it as expected?
> >
> > Jin Shang <shangjin1...@gmail.com> 于2023年7月17日周一 19:44写道:
> > >
> > > Hi Wenbo,
> > >
> > > I'd like to known what's the *three* `buffers` are in ArraySpan. What
> are
> > > > `1` means when `GetValues` called?
> > >
> > > The meaning of buffers in an ArraySpan depends on the layout of its
> data
> > > type. FixedSizeBinary is a fixed-size primitive type, so it has two
> > > buffers, one validity buffer and one data buffer. So GetValues(1) would
> > > return a pointer to the data buffer.
> > > Layouts of data types can be found here[1].
> > >
> > > what is the actual type should I get from `GetValues`?
> > > >
> > > Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted
> as
> > > any type to suit your need. The template parameter for GetValue is
> simply
> > > forwarded to reinterpret_cast. There are discussions[2] on the
> soundness
> > of
> > > using uint8_t to represent bytes but it is what we use now. Since you
> are
> > > only doing a memcpy, uint8_t should be good.
> > >
> > > Maybe, `auto *out_values = out->array_span_mutable()->GetValues(uint8_t
> > > > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`?
> > > >
> > > I may be missing something, but why copy to *out_values++ instead of
> > > *out_values and add 32 to out_values afterwards? Otherwise I agree this
> > is
> > > the way to go.
> > >
> > > [1]
> > >
> >
> https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
> > > [2] https://github.com/apache/arrow/issues/36123
> > >
> > >
> > > On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu <huwenbo1...@gmail.com>
> wrote:
> > >
> > > > Hi,
> > > > I'm using Acero as the stream executor to run large scale data
> > > > transformation. The core data used in UDF is `ArraySpan` in
> > > > `ExecSpan`, but not much document on ArraySpan. I'd like to known
> > > > what's the *three* `buffers` are in ArraySpan. What are `1` means
> when
> > > > `GetValues` called?
> > > > For input data, I can use a `ArraySpanVisitor` to iterator over
> > > > different input types. But for output data, I don't know how to write
> > > > to the`array_span_mutable()` if it is not a simple c_type.
> > > > For example, I'm implementing a sha256 udf, which input is
> > > > `arrow::utf8()` and the output is `arrow::fixed_size_binary(32)`,
> then
> > > > how can I directly write to the out buffers and what is the actual
> > > > type should I get from `GetValues`?
> > > > Maybe, `auto *out_values =
> > > > out->array_span_mutable()->GetValues(uint8_t *>(1);` and
> > > > `memcpy(*out_values++, some_ptr, 32);`?
> > > >
> > > > --
> > > > ---------------------
> > > > Best Regards,
> > > > Wenbo Hu,
> > > >
> >
> >
> >
> > --
> > ---------------------
> > Best Regards,
> > Wenbo Hu,
> >
> >
>

Reply via email to