Re: [C++] How to add user defined functions to arrow compute

Prateem Mandal Tue, 09 Jul 2024 08:23:20 -0700

Hi Felipe,

I made some progress. Here is what I came up with to traverse the array
for each column in the span and build up the output array inside my custom
function:

    arrow::Int64Builder builder(ctx->memory_pool());
    RETURN_NOT_OK(arrow::VisitArraySpanInline<int64_t>(
        batch[0].array,
        [&](int64_t v) { return builder.Append(v); },
        [&]() { return builder.AppendNull(); }));
    std::shared_ptr<arrow::Array> outvalues_;
    RETURN_NOT_OK(builder.Finish(&outvalues_));


I am using an int64_t array as the return value for now, but this will
eventually be binary. I think the builder is the right pattern, as most
columns are variable-length strings and the encrypted data will be the same
size plus an initialization vector prepended to it.
However, this did not compile. The error I got is:

    error: ‘VisitStatus’ is not a member of
    ‘arrow::internal::ArraySpanInlineVisitor<long int, void>’


I guess this is because ArraySpanInlineVisitor is not exported. So I guess
I will have to replicate the logic within ArraySpanInlineVisitor. Can you
confirm if I am on the correct path here?
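
One possible reading of the error above, offered as an assumption rather than a confirmed diagnosis: the template argument of `VisitArraySpanInline` may need to be an Arrow type class such as `arrow::Int64Type` rather than the C type `int64_t`. The error names `ArraySpanInlineVisitor<long int, void>`, which looks like an unspecialized primary template with no members. The sketch below imitates that pattern using only the standard library; `Int64Type`, `Column`, `InlineVisitor`, `VisitInline`, `Counts`, and `Summarize` are hypothetical stand-ins, not Arrow's definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow type class that names its C type.
struct Int64Type {
  using c_type = std::int64_t;
};

// Hypothetical stand-in for a nullable int64 column.
using Column = std::vector<std::optional<std::int64_t>>;

// Primary template: intentionally empty, so it has no Visit member.
template <typename T, typename Enable = void>
struct InlineVisitor {};

// Specialization selected only when T has a nested c_type.
template <typename T>
struct InlineVisitor<T, std::void_t<typename T::c_type>> {
  template <typename ValidFunc, typename NullFunc>
  static void Visit(const Column& values, ValidFunc&& on_valid,
                    NullFunc&& on_null) {
    for (const auto& v : values) {
      if (v.has_value()) {
        on_valid(static_cast<typename T::c_type>(*v));
      } else {
        on_null();
      }
    }
  }
};

template <typename T, typename ValidFunc, typename NullFunc>
void VisitInline(const Column& values, ValidFunc&& on_valid,
                 NullFunc&& on_null) {
  InlineVisitor<T>::Visit(values, std::forward<ValidFunc>(on_valid),
                          std::forward<NullFunc>(on_null));
}

struct Counts {
  int valid = 0;
  int nulls = 0;
  std::int64_t sum = 0;
};

// Visits every slot, passing the type class (not the C type) as T.
Counts Summarize(const Column& values) {
  Counts c;
  VisitInline<Int64Type>(
      values,
      [&](std::int64_t v) { ++c.valid; c.sum += v; },
      [&]() { ++c.nulls; });
  // VisitInline<std::int64_t>(...) would fail to compile here, because
  // InlineVisitor<std::int64_t, void> is the empty primary template.
  assert(c.valid + c.nulls == static_cast<int>(values.size()));
  return c;
}
```

If the same holds for Arrow, changing the template argument in the snippet above would be worth trying before replicating the visitor logic.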

Thanks
Prateem

On Tue, Jul 9, 2024 at 12:35 AM Prateem Mandal <[email protected]> wrote:

> Hi Felipe,
>
> Thank you for your response. I also found "example/arrow/udf_example.cc".
> Between that and your example, I was able to build, register and use my own
> trivial increment function. However, your solution template did clarify and
> simultaneously raise a few issues for me.
>
> The first is allocating an output binary array (in the case of encryption).
> Your suggested pseudocode
>
>     auto out_data = PrealocateBinaryArrayForMyEncryption(arg0,
>         /*source_data_size*/sizeof(double));
>
> takes arg0 to get the number of elements in the array and, if the input
> array holds variable-length elements, also the size of each entry, in
> order to calculate the total allocation size. Am I correct? What function
> should I use to allocate a BinaryArray (assuming BinaryArray is the right
> choice for an array of binaries)? Can you please point me to example code
> that does this?
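
As a rough illustration of the size calculation asked about above: Arrow's variable-length binary layout uses a 32-bit offsets buffer with one more entry than there are values, plus one contiguous data buffer. The sketch below only plans those sizes and calls no Arrow API. It assumes a 16-byte IV prepended to each value and ciphertext the same length as the plaintext; `kIvSize`, `BinaryLayout`, and `PlanEncryptedLayout` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <vector>

constexpr std::size_t kIvSize = 16;  // assumed IV length in bytes

struct BinaryLayout {
  std::vector<std::int32_t> offsets;  // values.size() + 1 entries
  std::size_t data_size = 0;          // total bytes in the data buffer
};

// Computes the offsets and total data size for the encrypted output.
BinaryLayout PlanEncryptedLayout(
    const std::vector<std::optional<std::string>>& values) {
  BinaryLayout layout;
  layout.offsets.reserve(values.size() + 1);
  layout.offsets.push_back(0);
  std::size_t total = 0;
  for (const auto& v : values) {
    // A null slot occupies zero bytes, so its end offset repeats the
    // previous one; a valid slot holds the IV followed by the ciphertext.
    if (v.has_value()) total += kIvSize + v->size();
    assert(total <= static_cast<std::size_t>(
                        std::numeric_limits<std::int32_t>::max()));
    layout.offsets.push_back(static_cast<std::int32_t>(total));
  }
  layout.data_size = total;
  return layout;
}
```

For a fixed-width input such as float64, every valid slot would have the same size, so the total reduces to the number of valid values times `kIvSize + sizeof(double)`.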
>
>
> The second question is about handling nulls. Could you please point me to
> some code that handles nulls by first checking the validity buffer? For
> example, I do not know where the validity buffer is in the ExecSpan.
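
For reference on the null-handling question above: in the Arrow columnar format the validity bitmap is the first buffer of an array (index 0), bits are numbered least-significant first, and an absent bitmap means there are no nulls. The sketch below shows that bit test on raw pointers with the standard library only; `IsValid` and `SumValid` are hypothetical helpers, and both buffers are indexed from their start so the array offset is applied explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when slot i is valid. A null bitmap pointer means "no nulls".
bool IsValid(const std::uint8_t* bitmap, std::int64_t offset, std::int64_t i) {
  if (bitmap == nullptr) return true;
  const std::int64_t bit = offset + i;
  return ((bitmap[bit / 8] >> (bit % 8)) & 1) != 0;
}

// Sums the valid int64 values in [offset, offset + length), skipping nulls.
std::int64_t SumValid(const std::uint8_t* bitmap, const std::int64_t* values,
                      std::int64_t offset, std::int64_t length) {
  assert(length >= 0);
  std::int64_t sum = 0;
  for (std::int64_t i = 0; i < length; ++i) {
    if (IsValid(bitmap, offset, i)) sum += values[offset + i];
  }
  return sum;
}
```

Passing `nullptr` as the bitmap corresponds to the "no nulls" fast path, where the validity check can be skipped entirely.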
>
> A few additional questions. The example code in
> "example/arrow/udf_example.cc" passes index 1 when reading the values from
> the batch:
>
>     batch[0].array.GetValues<int64_t>(1)
>
> What is special about this index? Presumably there are three arrays in
> every span; what is in the arrays at index 0 and 2?
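
On the index question above: in the Arrow columnar format the index selects a buffer, not an array. For fixed-width primitive arrays, buffer 0 is the validity bitmap and buffer 1 holds the values; for variable-length binary arrays, buffer 1 holds the offsets and buffer 2 holds the bytes. The sketch below mimics that arrangement with a hypothetical `SpanSketch` struct; it is not Arrow's `ArraySpan`, and `SumValues` is likewise an illustrative helper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an array span with up to three buffers.
struct SpanSketch {
  const std::uint8_t* buffers[3];  // [0] validity, [1] values or offsets, [2] data
  std::int64_t offset;             // first logical slot within the buffers
  std::int64_t length;             // number of logical slots

  // Reinterprets buffer i as an array of T and skips `offset` elements.
  template <typename T>
  const T* GetValues(int i) const {
    assert(i >= 0 && i < 3 && buffers[i] != nullptr);
    return reinterpret_cast<const T*>(buffers[i]) + offset;
  }
};

// Sums an int64 span by reading its values from buffer 1.
std::int64_t SumValues(const SpanSketch& span) {
  const std::int64_t* values = span.GetValues<std::int64_t>(1);
  std::int64_t sum = 0;
  for (std::int64_t i = 0; i < span.length; ++i) sum += values[i];
  return sum;
}
```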
>
> Apologies for so many questions.
>
> Thanks,
> Prateem
>
> On Mon, Jul 8, 2024 at 3:55 PM Felipe Oliveira Carvalho <
> [email protected]> wrote:
>
>> Hi,
>>
>> ArrayKernelExec must be a pointer to a C function.
>>
>> using ArrayKernelExec = Status (*)(KernelContext*, const ExecSpan&,
>>                                    ExecResult*);
>>
>> Status EncryptFloat64(KernelContext* ctx, const ExecSpan& batch,
>>                       ExecResult* out) {
>>   auto& arg0 = batch[0];
>>   auto out_data = PrealocateBinaryArrayForMyEncryption(
>>       arg0, /*source_data_size*/sizeof(double));
>>   if (arg0.MayHaveNulls()) {
>>     // specialized loop encrypting all the values (handling nulls)
>>   } else {
>>     // specialized loop encrypting all the values
>>     // (ignoring validity buffer)
>>   }
>>   *out = std::move(out_data);
>>   return Status::OK();
>> }
>>
>> const arrow::Status encrstatus = encryptfunc.AddKernel(
>>     {arrow::float64()}, arrow::binary(), EncryptFloat64);
>>
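
Because the exec callback is a plain function pointer, a lambda can be passed only if it captures nothing; a lambda that captures a key has no conversion to a function pointer. One common workaround is to reach the state through the context argument. The sketch below shows the idea with stand-in types: `Context`, `KeyState`, `ExecFn`, `XorExec`, and `MakeExec` are all hypothetical, and the XOR is only a placeholder for a real cipher such as AES-CTR. How state is attached to Arrow's `KernelContext` is not covered in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the kernel context and its per-kernel state.
struct KeyState {
  std::uint8_t key;
};
struct Context {
  const void* state;
};

using Bytes = std::vector<std::uint8_t>;
using ExecFn = bool (*)(Context*, const Bytes&, Bytes*);

// Free function: reads the key from the context instead of capturing it.
bool XorExec(Context* ctx, const Bytes& in, Bytes* out) {
  assert(ctx != nullptr && ctx->state != nullptr && out != nullptr);
  const auto* st = static_cast<const KeyState*>(ctx->state);
  out->clear();
  for (std::uint8_t b : in) {
    out->push_back(static_cast<std::uint8_t>(b ^ st->key));  // placeholder cipher
  }
  return true;
}

// A captureless lambda converts implicitly to ExecFn; adding a capture
// such as [key] would make this return statement fail to compile.
ExecFn MakeExec() {
  return [](Context* ctx, const Bytes& in, Bytes* out) {
    return XorExec(ctx, in, out);
  };
}
```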
>> __
>> Felipe
>>
>> On Sun, Jul 7, 2024 at 5:51 PM Prateem Mandal <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am implementing an encryption and decryption transformation function
>>> that takes a parquet file as input and encrypts each column of each row
>>> using AES-CTR. (I am aware of Parquet Modular Encryption but that is not an
>>> option right now for various reasons).
>>>
>>> I am using Arrow to read, process and write encrypted/decrypted files. I
>>> am doing this in C++ for performance reasons.
>>>
>>> I was able to read the parquet file, record batch by record batch, and
>>> for each record batch I am attempting to apply encryption/decryption to
>>> each column array using my own compute function. I am following the
>>> pattern shown here
>>> <https://arrow.apache.org/docs/cpp/tutorials/compute_tutorial.html#calculating-element-wise-array-addition-with-callfunction>
>>> .
>>>
>>> In order to proceed I need to add my encryption/decryption function. I
>>> am doing something like the following
>>>
>>>
>>> arrow::compute::ScalarFunction encryptfunc =
>>>     arrow::compute::ScalarFunction("encr", arrow::compute::Arity::Unary(),
>>>                                    arrow::compute::FunctionDoc::Empty());
>>> const arrow::Status &encrstatus =
>>>     encryptfunc.AddKernel({arrow::float64()}, arrow::binary(),
>>>                           arrow::compute::ArrayKernelExec(???));
>>>
>>> Now I am unsure where I can plug in my lambda function that encapsulates
>>> the key and encryption logic. Is this even possible?
>>>
>>> Thanks
>>> Prateem
>>>
>>>
