robreeves commented on code in PR #1938:
URL: https://github.com/apache/auron/pull/1938#discussion_r3035089073
##########
native-engine/datafusion-ext-exprs/src/spark_randn.rs:
##########
@@ -0,0 +1,303 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+use std::{
+    any::Any,
+    fmt::{Debug, Display, Formatter},
+    hash::{Hash, Hasher},
+    sync::Arc,
+};
+
+use arrow::{
+    array::{Float64Array, RecordBatch},
+    datatypes::{DataType, Schema},
+};
+use datafusion::{
+    common::Result,
+    logical_expr::ColumnarValue,
+    physical_expr::{PhysicalExpr, PhysicalExprRef},
+};
+use parking_lot::Mutex;
+use rand::{SeedableRng, rngs::StdRng};
+use rand_distr::{Distribution, StandardNormal};
+
+use crate::down_cast_any_ref;
+
+/// Returns random values with independent and identically distributed (i.e.d.)
+/// samples drawn from the standard normal distribution.
+///
+/// Matches Spark's behavior:
+/// - RNG is seeded with `seed + partition_id`
+/// - RNG state advances for each row (stateful across batches)
+pub struct SparkRandnExpr {
+    seed: i64,
+    partition_id: usize,
+    rng: Mutex<StdRng>,
+}
+
+impl SparkRandnExpr {
+    pub fn new(seed: i64, partition_id: usize) -> Self {
+        let effective_seed = (seed as u64).wrapping_add(partition_id as u64);
+        Self {
+            seed,
+            partition_id,
+            rng: Mutex::new(StdRng::seed_from_u64(effective_seed)),
+        }

Review Comment:
   It is preferable to use the built-in `rand` implementations to keep this simple. `randn` is non-deterministic, so it doesn't need to match Spark's output exactly.
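
   For reference, here is a minimal sketch of the seed derivation used in `SparkRandnExpr::new` above, pulled out on its own so the wrapping behavior is easy to see. The function name `effective_seed` is a hypothetical helper chosen for illustration; it is not part of the PR. Only the expression inside it is taken from the diff.

```rust
/// Hypothetical helper (not in the PR) that mirrors the expression in
/// `SparkRandnExpr::new`: the signed seed is reinterpreted as a `u64`
/// and the partition id is added with wrapping (modulo 2^64) arithmetic,
/// so the addition cannot panic on overflow.
pub fn effective_seed(seed: i64, partition_id: usize) -> u64 {
    (seed as u64).wrapping_add(partition_id as u64)
}
```

   With this derivation a negative seed is still valid: `seed = -1` reinterprets to `u64::MAX`, and adding `partition_id = 1` wraps around to `0`. Note that different `(seed, partition_id)` pairs with the same sum, such as `(42, 3)` and `(41, 4)`, map to the same effective seed.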
