The template parameter to VisitArraySpanInline<T> should be the Arrow type class Int64Type, not the C type int64_t. Are you going to keep nulls in plain text?
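To make the valid/null callback pattern concrete, here is a rough standalone sketch that does not depend on the Arrow headers. It assumes the bitmap layout from the Arrow columnar format (bit i lives in byte i / 8 at position i % 8, least-significant bit first). `IsValid`, `VisitWithValidity`, `Output` and `IncrementKeepingNulls` are hypothetical names made up for illustration, not Arrow API, and the "+ 1" is only a placeholder for the real per-value transformation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Arrow API): true if slot `i` is valid according to
// an Arrow-style validity bitmap. A null bitmap pointer means "no nulls".
inline bool IsValid(const uint8_t* validity, int64_t i) {
  return validity == nullptr || ((validity[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical stand-in for the valid/null callback pattern: calls
// `valid_func(value)` for valid slots and `null_func()` for null slots.
template <typename T, typename ValidFunc, typename NullFunc>
void VisitWithValidity(const T* values, const uint8_t* validity, int64_t length,
                       ValidFunc&& valid_func, NullFunc&& null_func) {
  assert(length >= 0);
  for (int64_t i = 0; i < length; ++i) {
    if (IsValid(validity, i)) {
      valid_func(values[i]);
    } else {
      null_func();
    }
  }
}

// Illustration-only output container: one entry per input slot.
struct Output {
  std::vector<int64_t> values;  // transformed value (0 for null slots)
  std::vector<bool> is_valid;   // false where the input slot was null
};

// Transforms valid values and passes nulls through untouched.
// The "+ 1" is a placeholder for the real transformation (e.g. encryption).
inline Output IncrementKeepingNulls(const std::vector<int64_t>& values,
                                    const std::vector<uint8_t>& validity) {
  Output out;
  out.values.reserve(values.size());  // preallocate once, like Reserve() on a builder
  out.is_valid.reserve(values.size());
  VisitWithValidity(
      values.data(), validity.empty() ? nullptr : validity.data(),
      static_cast<int64_t>(values.size()),
      [&](int64_t v) {
        out.values.push_back(v + 1);
        out.is_valid.push_back(true);
      },
      [&]() {
        out.values.push_back(0);
        out.is_valid.push_back(false);
      });
  return out;
}
```

For example, calling `IncrementKeepingNulls({10, 20, 30, 40}, {0x0B})` treats slot 2 as null, because bit 2 of 0x0B is clear, and increments the other three slots. With Arrow itself, the two lambdas would call `Append()` and `AppendNull()` on the builder instead.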
PrealocateBinaryArrayForMyEncryption() was just a placeholder. You can call
`Reserve()` on the builder to make the builder preallocate buffers so that
Append*() don't have to grow the buffers from time to time.

__
Felipe

On Tue, Jul 9, 2024 at 12:23 PM Prateem Mandal <[email protected]> wrote:

> Hi Felipe,
>
> I made some progress. Here is what I came up with to traverse through the
> array structure for each column in the span and build up the output array
> inside my custom function
>
>> arrow::Int64Builder builder(ctx->memory_pool());
>> RETURN_NOT_OK(arrow::VisitArraySpanInline<int64_t>(batch[0].array,
>>     [&](int64_t v){ return builder.Append(v); },
>>     [&](){ return builder.AppendNull(); }));
>> std::shared_ptr<arrow::Array> outvalues_;
>> RETURN_NOT_OK(builder.Finish(&outvalues_));
>
> I am using int64_t array as return value but this will eventually be a
> binary. I think the builder is the right pattern as most columns are string
> of variable length and encrypted data will be same size + some
> initialization vector prepended to it.
> This did not compile however. The error I got is
>
>> error: ‘VisitStatus’ is not a member of
>> ‘arrow::internal::ArraySpanInlineVisitor<long int, void>’
>
> I guess this is because ArraySpanInlineVisitor is not exported. So I guess
> I will have to replicate the logic within ArraySpanInlineVisitor. Can you
> confirm if I am on the correct path here?
>
> Thanks
> Prateem
>
> On Tue, Jul 9, 2024 at 12:35 AM Prateem Mandal <[email protected]> wrote:
>
>> Hi Felipe,
>>
>> Thank you for your response. I also found "example/arrow/udf_example.cc".
>> Between that and your example, I was able to build, register and use my own
>> trivial increment function. However, your solution template did clarify and
>> simultaneously raise a few issues for me.
>>
>> The first is allocating an output binary array (in case of encryption).
>> Your suggested pseudocode
>> |> auto out_data = PrealocateBinaryArrayForMyEncryption(arg0,
>>        /*source_data_size*/sizeof(double));
>> takes arg0 to get the number of elements in the array and, in case the
>> input array contains elements of a non-fixed-width type, also the size of
>> each entry in the array, in order to calculate the total allocation size.
>> Am I correct? What function should I use to allocate a BinaryArray (assuming
>> BinaryArray is the right choice for an array of binaries)? Can you please
>> provide me with some references to example code that does this?
>>
>> The second question is about handling nulls. Could you please again point
>> me to some code that handles nulls by first checking the validity
>> buffer? I, for example, do not know where the validity buffer is in the
>> ExecSpan.
>>
>> A few additional questions. In the example code in
>> "example/arrow/udf_example.cc", it dereferences the array with index 1 in
>> the batch.
>> |> batch[0].array.GetValues<int64_t>(1)
>> What is special about this index? Probably there are three arrays in
>> every span; what is in the arrays with index 0 and 2?
>>
>> Apologies for too many questions.
>>
>> Thanks,
>> Prateem
>>
>> On Mon, Jul 8, 2024 at 3:55 PM Felipe Oliveira Carvalho <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> ArrayKernelExec must be a pointer to a C function.
>>>
>>> using ArrayKernelExec = Status (*)(KernelContext*, const ExecSpan&,
>>>                                    ExecResult*);
>>>
>>> Status EncryptFloat64(KernelContext* ctx, const ExecSpan& batch,
>>>                       ExecResult* out) {
>>>   auto& arg0 = batch[0];
>>>   auto out_data = PrealocateBinaryArrayForMyEncryption(arg0,
>>>       /*source_data_size*/sizeof(double));
>>>   if (arg0.MayHaveNulls()) {
>>>     // specialized loop encrypting all the values (handling nulls)
>>>   } else {
>>>     // specialized loop encrypting all the values (ignoring validity buffer)
>>>   }
>>>   *out = std::move(out_data);
>>>   return Status::OK();
>>> }
>>>
>>> const arrow::Status encrstatus =
>>>     encryptfunc.AddKernel({arrow::float64()}, arrow::binary(), EncryptFloat64);
>>>
>>> __
>>> Felipe
>>>
>>> On Sun, Jul 7, 2024 at 5:51 PM Prateem Mandal <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am implementing an encryption and decryption transformation function
>>>> that takes a parquet file as input and encrypts each column of each row
>>>> using AES-CTR. (I am aware of Parquet Modular Encryption but that is not an
>>>> option right now for various reasons).
>>>>
>>>> I am using Arrow to read, process and write encrypted/decrypted files.
>>>> I am doing this in C++ due to performance reasons.
>>>>
>>>> I was able to read the parquet file, record batch by record batch, and
>>>> for each record batch, I am attempting to apply encryption/decryption to
>>>> each column array using my own provided compute function. I am following
>>>> the pattern shown here
>>>> <https://arrow.apache.org/docs/cpp/tutorials/compute_tutorial.html#calculating-element-wise-array-addition-with-callfunction>
>>>> .
>>>>
>>>> In order to proceed I need to add my encryption/decryption function.
>>>> I am doing something like the following
>>>>
>>>> arrow::compute::ScalarFunction encryptfunc =
>>>>     arrow::compute::ScalarFunction("encr", arrow::compute::Arity::Unary(),
>>>>                                    arrow::compute::FunctionDoc::Empty());
>>>> const arrow::Status &encrstatus =
>>>>     encryptfunc.AddKernel({arrow::float64()}, arrow::binary(),
>>>>                           arrow::compute::ArrayKernelExec(???));
>>>>
>>>> Now I am unsure where I can plug in my lambda function that
>>>> encapsulates the key and encryption logic? Is this even possible?
>>>>
>>>> Thanks
>>>> Prateem
