Re: [VOTE] [Rust] Move Ballista to new arrow-ballista repository
+1

On Tue, May 17, 2022 at 1:35 PM QP Hou wrote:
> +1 (binding)
>
> On Tue, May 17, 2022 at 1:27 PM David Li wrote:
> >
> > +1 (binding)
> >
> > On Tue, May 17, 2022, at 16:00, Neal Richardson wrote:
> > > +1
> > >
> > > On Tue, May 17, 2022 at 12:46 PM Andrew Lamb wrote:
> > >
> > >> +1 (binding)
> > >>
> > >> On Mon, May 16, 2022 at 9:56 AM Andy Grove wrote:
> > >>
> > >> > I would like to propose that we move the Ballista project to a new
> > >> > top-level *arrow-ballista* repository.
> > >> >
> > >> > The rationale for this (copied from the GitHub issue [1]) is:
> > >> >
> > >> > - Decouple the release process for DataFusion and Ballista
> > >> > - Allow each project to have top-level documentation and user guides
> > >> >   that target the appropriate audience
> > >> > - Reduce the issue-tracking and PR-review burden for DataFusion
> > >> >   maintainers who are not as interested in Ballista
> > >> > - Help avoid accidental circular dependencies being introduced
> > >> >   between the projects (such as [2])
> > >> > - Help formalize the public API for DataFusion that other query
> > >> >   engines should be using
> > >> >
> > >> > There is a design document [3] that outlines the plan for
> > >> > implementing this.
> > >> >
> > >> > Only votes from PMC members are binding, but all members of the
> > >> > community are encouraged to test the release and vote with
> > >> > "(non-binding)". The vote will run for at least 72 hours.
> > >> >
> > >> > [ ] +1 Proceed with moving Ballista to a new arrow-ballista repository
> > >> > [ ] +0
> > >> > [ ] -1 Do not proceed with moving Ballista to a new arrow-ballista
> > >> >        repository because ...
> > >> >
> > >> > Here is my vote:
> > >> >
> > >> > +1 (binding)
> > >> >
> > >> > [1] https://github.com/apache/arrow-datafusion/issues/2502
> > >> > [2] https://github.com/apache/arrow-datafusion/issues/2433
> > >> > [3] https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing
Re: [VOTE][RUST] Release Apache Arrow Rust 11.0.0 RC1
+1 (non-binding)

Verified on macOS 12.3 on M1 Max.

Thanks,
Lin

On Fri, Mar 18, 2022 at 5:52 PM QP Hou wrote:
> +1 (binding)
> Thanks,
> QP Hou
>
> On Fri, Mar 18, 2022 at 1:01 AM Andrew Lamb wrote:
> >
> > Hi,
> >
> > I would like to propose a release of the Apache Arrow Rust Implementation,
> > version 11.0.0.
> >
> > This release candidate is based on commit:
> > 5d6b638111e3f9c72dc8504ea98e46914fc93af5 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust because...
> >
> > [1]: https://github.com/apache/arrow-rs/tree/5d6b638111e3f9c72dc8504ea98e46914fc93af5
> > [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-11.0.0-rc1
> > [3]: https://github.com/apache/arrow-rs/blob/5d6b638111e3f9c72dc8504ea98e46914fc93af5/CHANGELOG.md
> > [4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
Re: [ANNOUNCE] New Arrow committers: Raphael Taylor-Davies, Wang Xudong, Yijie Shen, and Kun Liu
Congrats to you all!

On Wed, Mar 9, 2022 at 9:19 AM Chao Sun wrote:
> Congrats all!
>
> On Wed, Mar 9, 2022 at 9:16 AM Micah Kornfield wrote:
> >
> > Congrats!
> >
> > On Wed, Mar 9, 2022 at 8:36 AM Weston Pace wrote:
> > >
> > > Congratulations to all of you!
> > >
> > > On Wed, Mar 9, 2022, 4:52 AM Matthew Turner <matthew.m.tur...@outlook.com> wrote:
> > >
> > > > Congrats all and thank you for your contributions! It's been great
> > > > to work with and learn from you all.
> > > >
> > > > -----Original Message-----
> > > > From: Andrew Lamb
> > > > Sent: Wednesday, March 9, 2022 8:59 AM
> > > > To: dev
> > > > Subject: [ANNOUNCE] New Arrow committers: Raphael Taylor-Davies,
> > > > Wang Xudong, Yijie Shen, and Kun Liu
> > > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that
> > > >
> > > > Raphael Taylor-Davies
> > > > Wang Xudong
> > > > Yijie Shen
> > > > Kun Liu
> > > >
> > > > have all accepted invitations to become committers on Apache Arrow!
> > > > Welcome, thank you for all your contributions so far, and we look
> > > > forward to continuing to drive Apache Arrow forward to an even
> > > > better place in the future.
> > > >
> > > > This exciting growth in committers mirrors the growth of the Arrow
> > > > Rust community.
> > > >
> > > > Andrew
> > > >
> > > > p.s. Sorry for the somewhat impersonal email; I was trying to avoid
> > > > several very similar emails. I am truly excited for each of these
> > > > individuals.
Re: [DataFusion] Question about Accumulator API and maybe potential bugs
Hi Jorge,

That makes sense, thanks for the clarification.

Thanks,
Lin

On Mon, 3 Jan 2022 at 23:49, Jorge Cardoso Leitão wrote:
> Hi,
>
> The accumulator API is designed to accept multiple columns (e.g. the
> Pearson correlation takes 2 columns, not one). values[0] corresponds to
> the first column passed to the accumulator. All concrete implementations
> of accumulators in DataFusion atm only accept one column (Sum, Avg,
> Count, Min, Max), but the API is designed to work with multiple columns.
>
> So, update_batch(&mut self, values: &[ArrayRef]) corresponds to: update
> the accumulator from n columns. For sum, this would be 1; for Pearson
> correlation this would be 2; for e.g. an ML model whose weights are
> computed over all columns, this would be the number of input columns N
> of the model. For stddev, you should use 1, since stddev is a function
> of a single column.
>
> `update(&mut self, values: &[ScalarValue])` corresponds to updating the
> state with intermediary states. In a HashAggregate, we reduce each
> partition, and use `update` to compute the final value from the
> intermediary (scalar) states.
>
> Hope this helps,
> Jorge
>
> On Tue, Jan 4, 2022 at 5:55 AM LM wrote:
>
> > Hi All,
> >
> > I just started looking into DataFusion and am considering using it as
> > the platform for our next-gen analytics solution. To get started, I
> > tried to add a few functions such as stddev. While writing the code I
> > noticed some discrepancies (it may also be my unfamiliarity with the
> > code base) between the Accumulator API and the implementation of some
> > functions. The API is defined as follows:
> >
> > pub trait Accumulator: Send + Sync + Debug {
> >     /// Returns the state of the accumulator at the end of the accumulation.
> >     // in the case of an average on which we track `sum` and `n`, this
> >     // function should return a vector of two values, sum and n.
> >     fn state(&self) -> Result<Vec<ScalarValue>>;
> >
> >     /// updates the accumulator's state from a vector of scalars.
> >     fn update(&mut self, values: &[ScalarValue]) -> Result<()>;
> >
> >     /// updates the accumulator's state from a vector of arrays.
> >     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
> >         if values.is_empty() {
> >             return Ok(());
> >         };
> >         (0..values[0].len()).try_for_each(|index| {
> >             let v = values
> >                 .iter()
> >                 .map(|array| ScalarValue::try_from_array(array, index))
> >                 .collect::<Result<Vec<_>>>()?;
> >             self.update(&v)
> >         })
> >     }
> >
> > I am only quoting the update and update_batch functions for brevity;
> > the same applies to the merge functions. So here the API indicates
> > that update takes a *vector* of scalars and update_batch takes a
> > *vector of arrays*.
> >
> > When reading the code of some actual implementations, for example
> > *sum* and *average*, both assume that when update is called *only one*
> > value is passed in, and when update_batch is called *only one* array
> > is passed in.
> >
> > impl Accumulator for AvgAccumulator {
> >     fn state(&self) -> Result<Vec<ScalarValue>> {
> >         Ok(vec![ScalarValue::from(self.count), self.sum.clone()])
> >     }
> >
> >     fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
> >         let values = &values[0];
> >         self.count += (!values.is_null()) as u64;
> >         self.sum = sum::sum(&self.sum, values)?;
> >         Ok(())
> >     }
> >
> >     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
> >         let values = &values[0];
> >         self.count += (values.len() - values.data().null_count()) as u64;
> >         self.sum = sum::sum(&self.sum, &sum::sum_batch(values)?)?;
> >         Ok(())
> >     }
> > }
> >
> > impl Accumulator for SumAccumulator {
> >     fn state(&self) -> Result<Vec<ScalarValue>> {
> >         Ok(vec![self.sum.clone()])
> >     }
> >
> >     fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
> >         // sum(v1, v2, v3) = v1 + v2 + v3
> >         self.sum = sum(&self.sum, &values[0])?;
> >         Ok(())
> >     }
> >
> >     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
> >         let values = &values[0];
> >         self.sum = sum(&self.sum, &sum_batch(values)?)?;
> >         Ok(())
> >     }
> > }
> >
> > Could someone shed some light in case I missed anything?
> >
> > Regards,
> > Lin
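[Editor's note] Jorge's point that `values` holds one entry per input column can be illustrated without any DataFusion dependencies. The sketch below is an assumption-laden stand-in: plain Vec<f64> replaces ArrayRef, and the types SumAcc / CorrAcc are made-up names for illustration, not DataFusion types.

```rust
// Standalone sketch of the "one slice entry per input column" convention
// discussed above. A single-column aggregate reads only values[0]; a
// two-column aggregate such as Pearson correlation reads values[0] and
// values[1]. Columns are plain Vec<f64> here instead of ArrayRef.

/// Single-column aggregate: SUM is a function of one column, so N = 1.
#[derive(Default)]
struct SumAcc {
    sum: f64,
}

impl SumAcc {
    fn update_batch(&mut self, values: &[Vec<f64>]) {
        // Only the first column is consulted.
        if let Some(col) = values.first() {
            self.sum += col.iter().sum::<f64>();
        }
    }
}

/// Two-column aggregate: Pearson correlation needs N = 2 columns.
#[derive(Default)]
struct CorrAcc {
    n: f64,
    sum_x: f64,
    sum_y: f64,
    sum_xx: f64,
    sum_yy: f64,
    sum_xy: f64,
}

impl CorrAcc {
    fn update_batch(&mut self, values: &[Vec<f64>]) {
        // values[0] is the x column, values[1] is the y column.
        let (xs, ys) = (&values[0], &values[1]);
        for (&x, &y) in xs.iter().zip(ys.iter()) {
            self.n += 1.0;
            self.sum_x += x;
            self.sum_y += y;
            self.sum_xx += x * x;
            self.sum_yy += y * y;
            self.sum_xy += x * y;
        }
    }

    fn evaluate(&self) -> f64 {
        let cov = self.sum_xy - self.sum_x * self.sum_y / self.n;
        let var_x = self.sum_xx - self.sum_x * self.sum_x / self.n;
        let var_y = self.sum_yy - self.sum_y * self.sum_y / self.n;
        cov / (var_x * var_y).sqrt()
    }
}

fn main() {
    let mut sum = SumAcc::default();
    sum.update_batch(&[vec![1.0, 2.0, 3.0]]);
    println!("sum = {}", sum.sum);

    let mut corr = CorrAcc::default();
    // y = 2x is perfectly correlated, so evaluate() returns 1.0.
    corr.update_batch(&[vec![1.0, 2.0, 3.0], vec![2.0, 4.0, 6.0]]);
    println!("corr = {}", corr.evaluate());
}
```

The design point is only the slice shape: the same `update_batch(&[...])` signature serves N = 1 and N = 2 aggregates; each implementation decides how many columns it reads.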
[DataFusion] Question about Accumulator API and maybe potential bugs
Hi All,

I just started looking into DataFusion and am considering using it as the
platform for our next-gen analytics solution. To get started, I tried to
add a few functions such as stddev. While writing the code I noticed some
discrepancies (it may also be my unfamiliarity with the code base) between
the Accumulator API and the implementation of some functions. The API is
defined as follows:

pub trait Accumulator: Send + Sync + Debug {
    /// Returns the state of the accumulator at the end of the accumulation.
    // in the case of an average on which we track `sum` and `n`, this
    // function should return a vector of two values, sum and n.
    fn state(&self) -> Result<Vec<ScalarValue>>;

    /// updates the accumulator's state from a vector of scalars.
    fn update(&mut self, values: &[ScalarValue]) -> Result<()>;

    /// updates the accumulator's state from a vector of arrays.
    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        if values.is_empty() {
            return Ok(());
        };
        (0..values[0].len()).try_for_each(|index| {
            let v = values
                .iter()
                .map(|array| ScalarValue::try_from_array(array, index))
                .collect::<Result<Vec<_>>>()?;
            self.update(&v)
        })
    }

I am only quoting the update and update_batch functions for brevity; the
same applies to the merge functions. So here the API indicates that update
takes a *vector* of scalars and update_batch takes a *vector of arrays*.

When reading the code of some actual implementations, for example *sum*
and *average*, both assume that when update is called *only one* value is
passed in, and when update_batch is called *only one* array is passed in.

impl Accumulator for AvgAccumulator {
    fn state(&self) -> Result<Vec<ScalarValue>> {
        Ok(vec![ScalarValue::from(self.count), self.sum.clone()])
    }

    fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
        let values = &values[0];
        self.count += (!values.is_null()) as u64;
        self.sum = sum::sum(&self.sum, values)?;
        Ok(())
    }

    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let values = &values[0];
        self.count += (values.len() - values.data().null_count()) as u64;
        self.sum = sum::sum(&self.sum, &sum::sum_batch(values)?)?;
        Ok(())
    }
}

impl Accumulator for SumAccumulator {
    fn state(&self) -> Result<Vec<ScalarValue>> {
        Ok(vec![self.sum.clone()])
    }

    fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
        // sum(v1, v2, v3) = v1 + v2 + v3
        self.sum = sum(&self.sum, &values[0])?;
        Ok(())
    }

    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let values = &values[0];
        self.sum = sum(&self.sum, &sum_batch(values)?)?;
        Ok(())
    }
}

Could someone shed some light in case I missed anything?

Regards,
Lin
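[Editor's note] Since the question above was motivated by adding stddev, here is a self-contained sketch of what a stddev accumulator's state/update_batch/merge/evaluate lifecycle could look like. It uses no DataFusion types (StddevAcc and its method names are hypothetical, not the project's API) and Welford's online algorithm, whose (count, mean, m2) state merges correctly across partitions, matching Jorge's advice that stddev is a single-column aggregate.

```rust
// Standalone sketch of a stddev accumulator using Welford's online
// algorithm. The (count, mean, m2) triple plays the role of state():
// it is the intermediate state each partition would ship for merging.

#[derive(Default)]
struct StddevAcc {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl StddevAcc {
    /// Fold one batch of a single column into the running state.
    fn update_batch(&mut self, values: &[Vec<f64>]) {
        // stddev is a function of a single column, so only values[0] is used.
        for &x in &values[0] {
            self.count += 1;
            let delta = x - self.mean;
            self.mean += delta / self.count as f64;
            self.m2 += delta * (x - self.mean);
        }
    }

    /// Intermediate state, analogous to the trait's state() method.
    fn state(&self) -> (u64, f64, f64) {
        (self.count, self.mean, self.m2)
    }

    /// Combine another partition's state (parallel Welford variant).
    fn merge(&mut self, (count, mean, m2): (u64, f64, f64)) {
        if count == 0 {
            return;
        }
        let n_a = self.count as f64;
        let n_b = count as f64;
        let n = n_a + n_b;
        let delta = mean - self.mean;
        self.mean += delta * n_b / n;
        self.m2 += m2 + delta * delta * n_a * n_b / n;
        self.count += count;
    }

    /// Sample standard deviation (n - 1 denominator); None if count < 2.
    fn evaluate(&self) -> Option<f64> {
        if self.count < 2 {
            return None;
        }
        Some((self.m2 / (self.count - 1) as f64).sqrt())
    }
}

fn main() {
    // Same data split across two "partitions", merged at the end.
    let mut a = StddevAcc::default();
    let mut b = StddevAcc::default();
    a.update_batch(&[vec![1.0, 2.0, 3.0]]);
    b.update_batch(&[vec![4.0, 5.0]]);
    a.merge(b.state());
    // Sample stddev of [1, 2, 3, 4, 5] is sqrt(2.5) ≈ 1.5811.
    println!("stddev = {:?}", a.evaluate());
}
```

Merging through the (count, mean, m2) triple, rather than raw values, is what lets a HashAggregate reduce partitions independently and still produce the exact global stddev.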