Re: [C++] How to add user defined functions to arrow compute

Prateem Mandal Mon, 08 Jul 2024 16:35:35 -0700

Hi Felipe,

Thank you for your response. I also found "cpp/examples/arrow/udf_example.cc".
Between that and your example, I was able to build, register and use my own
trivial increment function. Your template clarified several things for me,
but it also raised a few new questions.

The first is allocating the output binary array (for the encryption case).
Your suggested pseudocode

    auto out_data = PrealocateBinaryArrayForMyEncryption(arg0,
        /*source_data_size*/sizeof(double));

uses arg0 to get the number of elements in the array and, if the input
array holds variable-width elements, also the size of each entry, in order
to calculate the total allocation size. Am I correct?
Which function should I use to allocate a BinaryArray (assuming BinaryArray
is the right choice for an array of binary values)? Could you point me to
some example code that does this?
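To make the sizing question concrete, here is a minimal standalone sketch of the variable-length binary layout (an offsets buffer with length + 1 entries plus one concatenated data buffer). It deliberately does not use the Arrow API: BinaryColumn, PreallocateFixedSizeBinary and EncodeDoubles are hypothetical names, and the "encryption" is just a byte copy. It assumes every output value has the same size as the input, which holds for AES-CTR ciphertext as long as no per-value nonce is stored alongside it. In Arrow itself, arrow::BinaryBuilder (Reserve, ReserveData, Append, Finish) is probably the first place to look.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Standalone model of a variable-length binary column; NOT an Arrow type.
struct BinaryColumn {
  std::vector<int32_t> offsets;  // length + 1 entries; value i spans
                                 // data[offsets[i]] .. data[offsets[i + 1]]
  std::vector<uint8_t> data;     // all values concatenated back to back
};

// Preallocate room for `length` values of exactly `value_size` bytes each.
BinaryColumn PreallocateFixedSizeBinary(int64_t length, int32_t value_size) {
  // 32-bit offsets limit the total data size of one column.
  assert(length * value_size <= std::numeric_limits<int32_t>::max());
  BinaryColumn col;
  col.offsets.resize(length + 1);
  for (int64_t i = 0; i <= length; ++i) {
    col.offsets[i] = static_cast<int32_t>(i * value_size);
  }
  col.data.resize(length * value_size);
  return col;
}

// Hypothetical stand-in for encryption: copies the 8 raw bytes of each double
// into its slot. A real cipher would write its output at the same position.
BinaryColumn EncodeDoubles(const std::vector<double>& values) {
  BinaryColumn col = PreallocateFixedSizeBinary(
      static_cast<int64_t>(values.size()), sizeof(double));
  for (size_t i = 0; i < values.size(); ++i) {
    std::memcpy(col.data.data() + col.offsets[i], &values[i], sizeof(double));
  }
  return col;
}
```

For N input doubles this reserves (N + 1) offsets and N * sizeof(double) data bytes in one step, so the per-value loop never reallocates.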

The second question is about handling nulls. Could you also point me to
some code that handles nulls by first checking the validity buffer? I do
not know, for example, where the validity buffer lives in the ExecSpan.
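As a rough illustration of what "checking the validity buffer" means, the sketch below models the bitmap convention of the Arrow columnar format: one bit per element, least-significant bit first, 1 for valid and 0 for null, with a missing bitmap meaning "no nulls". It is a standalone model rather than Arrow code; GetBit and SumValid are names chosen for this example, and the raw pointers stand in for what an ArraySpan would hand you (its validity bitmap is buffer 0, and the array's offset has to be applied when indexing it).

```cpp
#include <cassert>
#include <cstdint>

// Returns bit i of an LSB-first bitmap: true = valid, false = null.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Sums the non-null values in the logical range [offset, offset + length).
// `validity` may be nullptr, which means every value is valid.
double SumValid(const uint8_t* validity, const double* values,
                int64_t offset, int64_t length) {
  double sum = 0.0;
  if (validity == nullptr) {
    // Fast path: no bitmap, so no per-element check is needed.
    for (int64_t i = 0; i < length; ++i) sum += values[offset + i];
  } else {
    // Null-aware path: consult the bitmap before touching each value.
    for (int64_t i = 0; i < length; ++i) {
      if (GetBit(validity, offset + i)) sum += values[offset + i];
    }
  }
  return sum;
}
```

The two branches mirror the MayHaveNulls() split in the pseudocode: one loop that ignores the bitmap and one that tests a bit per element.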

A few additional questions. The example code in
"cpp/examples/arrow/udf_example.cc" calls GetValues with index 1 on an
array in the batch:

    batch[0].array.GetValues<int64_t>(1)

What is special about this index? Presumably there are three arrays in
every span; if so, what is in the arrays at index 0 and 2?
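For context on that index: in the Arrow columnar format the argument selects a buffer of one array, not one of several arrays. Buffer 0 is the validity bitmap, buffer 1 holds the values for fixed-width types such as int64 (or the offsets for binary and string types), and buffer 2 is only used by variable-length types for their data bytes. The sketch below is a simplified standalone model of that idea; SpanModel and SumInt64 are hypothetical names, not the real arrow::ArraySpan, although GetValues here is meant to behave the same way (typed pointer into buffer i, already advanced by the array offset).

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of an array view with up to three raw buffers.
struct SpanModel {
  int64_t length = 0;
  int64_t offset = 0;  // logical start, counted in elements
  // buffers[0]: validity bitmap (may be null when there are no nulls)
  // buffers[1]: fixed-width values, or offsets for binary/string
  // buffers[2]: data bytes for binary/string, unused otherwise
  const uint8_t* buffers[3] = {nullptr, nullptr, nullptr};

  // Typed view of buffer i, advanced by `offset` elements.
  template <typename T>
  const T* GetValues(int i) const {
    return reinterpret_cast<const T*>(buffers[i]) + offset;
  }
};

// Sums an int64 column by reading its values buffer (index 1).
int64_t SumInt64(const SpanModel& span) {
  const int64_t* values = span.GetValues<int64_t>(1);
  int64_t sum = 0;
  for (int64_t i = 0; i < span.length; ++i) sum += values[i];
  return sum;
}
```

Because the returned pointer already accounts for the offset, the loop indexes from 0 to length - 1 even when the span is a slice of a larger array.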

Apologies for so many questions.

Thanks,
Prateem

On Mon, Jul 8, 2024 at 3:55 PM Felipe Oliveira Carvalho <[email protected]>
wrote:

> Hi,
>
> ArrayKernelExec must be a pointer to a C function.
>
> using ArrayKernelExec = Status (*)(KernelContext*, const ExecSpan&,
> ExecResult*);
>
> Status EncryptFloat64(KernelContext* ctx, const ExecSpan& batch,
> ExecResult* out) {
>   auto& arg0 = batch[0];
>   auto out_data = PrealocateBinaryArrayForMyEncryption(arg0,
> /*source_data_size*/sizeof(double));
>   if (arg0.MayHaveNulls()) {
>     // specialized loop encrypting all the values (handling nulls)
>   } else {
>     // specialized loop encrypting all the values (ignoring validity buffer)
>   }
>   *out = std::move(out_data);
>   return Status::OK();
> }
>
> const arrow::Status encrstatus = encryptfunc.AddKernel({arrow::float64()},
> arrow::binary(), EncryptFloat64);
>
> __
> Felipe
>
> On Sun, Jul 7, 2024 at 5:51 PM Prateem Mandal <[email protected]> wrote:
>
>> Hello,
>>
>> I am implementing an encryption and decryption transformation function
>> that takes a parquet file as input and encrypts each column of each row
>> using AES-CTR. (I am aware of Parquet Modular Encryption but that is not an
>> option right now for various reasons).
>>
>> I am using Arrow to read, process and write encrypted/decrypted files. I
>> am doing this in C++ due to performance reasons.
>>
>> I was able to read the parquet file, recordbatch by record batch and for
>> each record batch, I am attempting to apply encryption/decryption to each
>> column array using my own provided compute function. I am following the
>> pattern shown here
>> <https://arrow.apache.org/docs/cpp/tutorials/compute_tutorial.html#calculating-element-wise-array-addition-with-callfunction>
>> .
>>
>> In order to proceed I need to add my encryption/decryption function. I am
>> doing something like the following
>>
>>
>> *arrow::compute::ScalarFunction encryptfunc =
>> arrow::compute::ScalarFunction("encr", arrow::compute::Arity::Unary(),
>> arrow::compute::FunctionDoc::Empty());*
>> *const arrow::Status &encrstatus =
>> encryptfunc.AddKernel({arrow::float64()}, arrow::binary(),
>> arrow::compute::ArrayKernelExec(???));*
>>
>> Now I am unsure where I can plug in my lambda function that encapsulates
>> the key and encryption logic? Is this even possible?
>>
>> Thanks
>> Prateem
>>
>>
