Re: [C++] How to add user defined functions to arrow compute

Felipe Oliveira Carvalho Wed, 10 Jul 2024 07:27:09 -0700

The parameter to VisitArraySpanInline<T> should be Int64Type and not
int64_t. Are you going to keep nulls in plain text?
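
To illustrate why the raw C type fails: the call would read `arrow::VisitArraySpanInline<arrow::Int64Type>(...)`, and the error quoted below most likely comes from template dispatch rather than from a symbol not being exported. The sketch below is a standard-library-only model of that dispatch, not Arrow's actual code; `Int64TypeTag`, `InlineVisitor`, `HasCType` and `SumWithTag` are hypothetical names made up for illustration. It assumes the visitor's primary template is empty and only specializations for type classes carry members.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for an Arrow type class such as arrow::Int64Type:
// a tag type that names the C type used to store its values.
struct Int64TypeTag {
  using c_type = std::int64_t;
};

// Primary template: deliberately empty, so an unsupported T has no members.
template <typename T, typename Enable = void>
struct InlineVisitor {};

// Specialization selected only when T exposes a nested c_type.
template <typename T>
struct InlineVisitor<T, std::void_t<typename T::c_type>> {
  template <typename ValidFunc>
  static void Visit(const std::vector<typename T::c_type>& values,
                    ValidFunc&& valid_func) {
    for (const auto v : values) valid_func(v);
  }
};

// Trait mirroring the dispatch condition above.
template <typename T, typename Enable = void>
struct HasCType : std::false_type {};
template <typename T>
struct HasCType<T, std::void_t<typename T::c_type>> : std::true_type {};

// Works: the tag type selects the populated specialization.
std::int64_t SumWithTag(const std::vector<std::int64_t>& values) {
  std::int64_t total = 0;
  InlineVisitor<Int64TypeTag>::Visit(values,
                                     [&](std::int64_t v) { total += v; });
  return total;
}

// InlineVisitor<std::int64_t>::Visit(...) would not compile: the raw C type
// has no nested c_type, so the empty primary template is chosen.
static_assert(HasCType<Int64TypeTag>::value, "tag type is visitable");
static_assert(!HasCType<std::int64_t>::value, "raw C type is not");
```

In this model, instantiating the visitor with `std::int64_t` gives a "not a member" error of the same shape as the one reported.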


PrealocateBinaryArrayForMyEncryption() was just a placeholder. You can call
`Reserve()` on the builder to preallocate its buffers so that the
Append*() calls don't have to grow them as they go.
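
As a rough sketch of how the sizes to reserve could be computed up front: the ciphertext is the same length as the input, with an IV prepended to each value. The code below uses only the standard library and does not call Arrow; `ReservePlan`, `PlanReserve` and the 16-byte `kIvSize` are assumptions made for illustration. For a binary builder, the slot count and the value bytes are typically reserved separately (I believe via `Reserve()` and `ReserveData()` respectively, but please check the builder documentation).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Assumption: a 16-byte IV is prepended to every encrypted value.
constexpr std::size_t kIvSize = 16;

// Hypothetical summary of what to reserve before appending.
struct ReservePlan {
  std::size_t slots;       // one slot per input element, nulls included
  std::size_t data_bytes;  // IV + ciphertext bytes for non-null elements
};

// Walks a column (nullopt models a null) and sums up the output size.
ReservePlan PlanReserve(const std::vector<std::optional<std::string>>& column) {
  ReservePlan plan{column.size(), 0};
  for (const auto& value : column) {
    // Nulls take a slot but contribute no value bytes.
    if (value.has_value()) plan.data_bytes += kIvSize + value->size();
  }
  return plan;
}
```

For a column holding "ab", a null and "cde", this gives 3 slots and 16 + 2 + 16 + 3 = 37 value bytes.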

__
Felipe

On Tue, Jul 9, 2024 at 12:23 PM Prateem Mandal <[email protected]> wrote:

> Hi Felipe,
>
> I made some progress. Here is what I came up with to traverse through the
> array structure for each column in the span and build up the output array
> inside my custom function
>
>
>   arrow::Int64Builder builder(ctx->memory_pool());
>   RETURN_NOT_OK(arrow::VisitArraySpanInline<int64_t>(
>       batch[0].array,
>       [&](int64_t v) { return builder.Append(v); },
>       [&]() { return builder.AppendNull(); }));
>   std::shared_ptr<arrow::Array> outvalues_;
>   RETURN_NOT_OK(builder.Finish(&outvalues_));
>
>
> I am using an int64 array as the return value for now, but this will
> eventually be a binary array. I think the builder is the right pattern, as
> most columns are variable-length strings and the encrypted data will be the
> same size as the input plus an initialization vector prepended to it.
> This did not compile, however. The error I got is:
>
>   error: ‘VisitStatus’ is not a member of
>   ‘arrow::internal::ArraySpanInlineVisitor<long int, void>’
>
>
> I guess this is because ArraySpanInlineVisitor is not exported. So I guess
> I will have to replicate the logic within ArraySpanInlineVisitor. Can you
> confirm if I am on the correct path here?
>
> Thanks
> Prateem
>
> On Tue, Jul 9, 2024 at 12:35 AM Prateem Mandal <[email protected]> wrote:
>
>> Hi Felipe,
>>
>> Thank you for your response. I also found "example/arrow/udf_example.cc".
>> Between that and your example, I was able to build, register and use my own
>> trivial increment function. However, your solution template did clarify and
>> simultaneously raise a few issues for me.
>>
>> The first is allocating an output binary array (in case of encryption).
>> Your suggested pseudocode
>>   auto out_data = PrealocateBinaryArrayForMyEncryption(
>>       arg0, /*source_data_size*/sizeof(double));
>> takes arg0 to get the number of elements in the array and, if the input
>> array holds variable-width elements, also the size of each entry, in
>> order to calculate the total allocation size. Am I correct? What function
>> should I use to allocate a BinaryArray (assuming BinaryArray is the right
>> choice for an array of binaries)? Can you please point me to some example
>> code that does this?
>>
>>
>> The second question is about handling nulls. Could you please point me
>> to some code that handles nulls by first checking the validity buffer?
>> I do not know, for example, where the validity buffer is in the
>> ExecSpan.
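
On where the validity buffer lives: in the Arrow columnar layout it is buffer 0 of each array, so for an ExecSpan argument it should be reachable through `batch[i].array.buffers[0]` (please verify against the ArraySpan header). It is a bitmap with one bit per element, least-significant bit first, and it may be absent when the array has no nulls. The sketch below models that check with plain pointers. `IsValid` and `CountValid` are hypothetical helpers written for illustration, not Arrow functions.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if element i is non-null. `validity` may be nullptr, which
// means "no bitmap, every element is valid". `offset` is the array's
// logical offset into the bitmap, counted in elements.
bool IsValid(const std::uint8_t* validity, std::int64_t offset,
             std::int64_t i) {
  assert(offset >= 0 && i >= 0);
  if (validity == nullptr) return true;
  const std::int64_t bit = offset + i;
  // Bit `bit` lives in byte bit / 8 at position bit % 8 (LSB first).
  return ((validity[bit / 8] >> (bit % 8)) & 1) != 0;
}

// Counts non-null elements, with a fast path when there is no bitmap.
std::int64_t CountValid(const std::uint8_t* validity, std::int64_t offset,
                        std::int64_t length) {
  if (validity == nullptr) return length;  // nothing to check
  std::int64_t count = 0;
  for (std::int64_t i = 0; i < length; ++i) {
    if (IsValid(validity, offset, i)) ++count;
  }
  return count;
}
```

The two branches of `CountValid` correspond to the "handling nulls" and "ignoring validity buffer" loops in the kernel outline quoted further down.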
>>
>> A few additional questions. The example code in
>> "example/arrow/udf_example.cc" dereferences the array with index 1 in
>> the batch:
>>   batch[0].array.GetValues<int64_t>(1)
>> What is special about this index? There are probably three arrays in
>> every span; what is in the arrays with index 0 and 2?
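
On the index: the argument to `GetValues<T>(i)` selects a buffer within one array, not an array within the span. In the Arrow columnar layout, buffer 0 is the validity bitmap, buffer 1 holds the values for fixed-width types (or the offsets for variable-size binary and string types), and buffer 2 holds the value bytes for variable-size binary types. The sketch below models the three-buffer binary layout with standard containers. `BinaryBuffers` and `ValueAt` are hypothetical names for illustration, and 32-bit offsets are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of a variable-size binary array's three buffers.
struct BinaryBuffers {
  std::vector<std::uint8_t> validity;  // buffer 0: one bit per element
  std::vector<std::int32_t> offsets;   // buffer 1: length + 1 entries
  std::string data;                    // buffer 2: concatenated value bytes
};

// Element i spans data[offsets[i], offsets[i + 1]).
std::string ValueAt(const BinaryBuffers& b, std::size_t i) {
  assert(i + 1 < b.offsets.size());
  const auto begin = static_cast<std::size_t>(b.offsets[i]);
  const auto end = static_cast<std::size_t>(b.offsets[i + 1]);
  return b.data.substr(begin, end - begin);
}
```

With offsets {0, 2, 2, 5} over the bytes "abcde", the elements are "ab", an empty slot and "cde". Whether the middle slot is null or an empty value is decided by the validity bitmap, not by the offsets.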
>>
>> Apologies for too many questions.
>>
>> Thanks,
>> Prateem
>>
>> On Mon, Jul 8, 2024 at 3:55 PM Felipe Oliveira Carvalho <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> ArrayKernelExec must be a pointer to a C function.
>>>
>>> using ArrayKernelExec = Status (*)(KernelContext*, const ExecSpan&,
>>>                                    ExecResult*);
>>>
>>> Status EncryptFloat64(KernelContext* ctx, const ExecSpan& batch,
>>>                       ExecResult* out) {
>>>   auto& arg0 = batch[0];
>>>   auto out_data = PrealocateBinaryArrayForMyEncryption(
>>>       arg0, /*source_data_size*/sizeof(double));
>>>   if (arg0.MayHaveNulls()) {
>>>     // specialized loop encrypting all the values (handling nulls)
>>>   } else {
>>>     // specialized loop encrypting all the values (ignoring validity buffer)
>>>   }
>>>   *out = std::move(out_data);
>>>   return Status::OK();
>>> }
>>>
>>> const arrow::Status encrstatus = encryptfunc.AddKernel(
>>>     {arrow::float64()}, arrow::binary(), EncryptFloat64);
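
Since a plain function pointer cannot carry a capturing lambda, the key has to reach the kernel another way. One common pattern is to hang it off the context argument. In Arrow, I believe the intended hook is a kernel init function that produces a `KernelState` which the exec function reads back through `ctx->state()`, but treat that as an assumption to check against the compute headers. The sketch below is a standard-library-only model of the pattern. `Context`, `ExecFn`, `KeyState` and `XorExec` are hypothetical, and the XOR is a toy placeholder for illustration, not encryption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a kernel context that carries opaque state.
struct Context {
  const void* state;
};

// Plain function pointer type: no room for lambda captures.
using ExecFn = int (*)(Context*, std::uint8_t in, std::uint8_t* out);

// State created once up front (this is where a real key would live).
struct KeyState {
  std::uint8_t key;
};

// Free function matching ExecFn. It recovers the key from the context
// instead of capturing it. Returns 0 on success.
int XorExec(Context* ctx, std::uint8_t in, std::uint8_t* out) {
  assert(ctx != nullptr && ctx->state != nullptr && out != nullptr);
  const auto* state = static_cast<const KeyState*>(ctx->state);
  *out = static_cast<std::uint8_t>(in ^ state->key);  // toy transform only
  return 0;
}
```

The design point is that the function itself stays stateless and therefore convertible to a plain pointer, while per-call data such as the key travels through the context.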
>>>
>>> __
>>> Felipe
>>>
>>> On Sun, Jul 7, 2024 at 5:51 PM Prateem Mandal <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am implementing an encryption and decryption transformation function
>>>> that takes a parquet file as input and encrypts each column of each row
>>>> using AES-CTR. (I am aware of Parquet Modular Encryption but that is not an
>>>> option right now for various reasons).
>>>>
>>>> I am using Arrow to read, process and write encrypted/decrypted files.
>>>> I am doing this in C++ for performance reasons.
>>>>
>>>> I was able to read the parquet file record batch by record batch, and
>>>> for each record batch I am attempting to apply encryption/decryption to
>>>> each column array using my own compute function. I am following
>>>> the pattern shown here
>>>> <https://arrow.apache.org/docs/cpp/tutorials/compute_tutorial.html#calculating-element-wise-array-addition-with-callfunction>
>>>> .
>>>>
>>>> In order to proceed I need to add my encryption/decryption function. I
>>>> am doing something like the following
>>>>
>>>>
>>>>   arrow::compute::ScalarFunction encryptfunc =
>>>>       arrow::compute::ScalarFunction("encr", arrow::compute::Arity::Unary(),
>>>>                                      arrow::compute::FunctionDoc::Empty());
>>>>   const arrow::Status &encrstatus =
>>>>       encryptfunc.AddKernel({arrow::float64()}, arrow::binary(),
>>>>                             arrow::compute::ArrayKernelExec(???));
>>>>
>>>> Now I am unsure where I can plug in the lambda function that
>>>> encapsulates the key and encryption logic. Is this even possible?
>>>>
>>>> Thanks
>>>> Prateem
>>>>
>>>>
