Re: [VOTE] [Rust] Move Ballista to new arrow-ballista repository

2022-05-17 Thread LM
+1

On Tue, May 17, 2022 at 1:35 PM QP Hou  wrote:

> +1 (binding)
>
> On Tue, May 17, 2022 at 1:27 PM David Li  wrote:
> >
> > +1 (binding)
> >
> > On Tue, May 17, 2022, at 16:00, Neal Richardson wrote:
> > > +1
> > >
> > > On Tue, May 17, 2022 at 12:46 PM Andrew Lamb 
> wrote:
> > >
> > >> +1 (binding)
> > >>
> > >> On Mon, May 16, 2022 at 9:56 AM Andy Grove 
> wrote:
> > >>
> > >> > I would like to propose that we move the Ballista project to a new
> > >> > top-level *arrow-ballista* repository.
> > >> >
> > >> > The rationale for this (copied from the GitHub issue [1]) is:
> > >> >
> > >> > - Decouple release process for DataFusion and Ballista
> > >> > - Allow each project to have top-level documentation and user guides
> > >> >   that are targeting the appropriate audience
> > >> > - Reduce issue tracking and PR review burden for DataFusion maintainers
> > >> >   who are not as interested in Ballista
> > >> > - Help avoid accidental circular dependencies being introduced between
> > >> >   the projects (such as [3])
> > >> > - Helps formalize the public API for DataFusion that other query engines
> > >> >   should be using
> > >> >
> > >> > There is a design document [3] that outlines the plan for implementing this.
> > >> >
> > >> > Only votes from PMC members are binding, but all members of the community
> > >> > are encouraged to test the release and vote with "(non-binding)". The vote
> > >> > will run for at least 72 hours.
> > >> >
> > >> > [ ] +1 Proceed with moving Ballista to a new arrow-ballista repository
> > >> > [ ] +0
> > >> >
> > >> > [ ] -1 Do not proceed with moving Ballista to a new arrow-ballista
> > >> > repository because ...
> > >> >
> > >> > Here is my vote:
> > >> >
> > >> > +1 (binding)
> > >> >
> > >> > [1] https://github.com/apache/arrow-datafusion/issues/2502
> > >> >
> > >> > [2] https://github.com/apache/arrow-datafusion/issues/2433
> > >> >
> > >> > [3] https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing
> > >> >
> > >>
>


Re: [VOTE][RUST] Release Apache Arrow Rust 11.0.0 RC1

2022-03-18 Thread LM
+1 (non-binding)

Verified on macOS 12.3 on M1Max

Thanks,
Lin

On Fri, Mar 18, 2022 at 5:52 PM QP Hou  wrote:

> +1 (binding)
> Thanks,
> QP Hou
>
> On Fri, Mar 18, 2022 at 1:01 AM Andrew Lamb  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 11.0.0.
> >
> > This release candidate is based on commit:
> > 5d6b638111e3f9c72dc8504ea98e46914fc93af5 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
> > [1]: https://github.com/apache/arrow-rs/tree/5d6b638111e3f9c72dc8504ea98e46914fc93af5
> > [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-11.0.0-rc1
> > [3]: https://github.com/apache/arrow-rs/blob/5d6b638111e3f9c72dc8504ea98e46914fc93af5/CHANGELOG.md
> > [4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
>


Re: [ANNOUNCE] New Arrow committers: Raphael Taylor-Davies, Wang Xudong, Yijie Shen, and Kun Liu

2022-03-09 Thread LM
Congrats to you all!

On Wed, Mar 9, 2022 at 9:19 AM Chao Sun  wrote:

> Congrats all!
>
> On Wed, Mar 9, 2022 at 9:16 AM Micah Kornfield 
> wrote:
> >
> > Congrats!
> >
> > On Wed, Mar 9, 2022 at 8:36 AM Weston Pace 
> wrote:
> >
> > > Congratulations to all of you!
> > >
> > > On Wed, Mar 9, 2022, 4:52 AM Matthew Turner <
> matthew.m.tur...@outlook.com>
> > > wrote:
> > >
> > > > Congrats all and thank you for your contributions! It's been great to
> > > work
> > > > with and learn from you all.
> > > >
> > > > -Original Message-
> > > > From: Andrew Lamb 
> > > > Sent: Wednesday, March 9, 2022 8:59 AM
> > > > To: dev 
> > > > Subject: [ANNOUNCE] New Arrow committers: Raphael Taylor-Davies, Wang
> > > > Xudong, Yijie Shen, and Kun Liu
> > > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that
> > > >
> > > > Raphael Taylor-Davies
> > > > Wang Xudong
> > > > Yijie Shen
> > > > Kun Liu
> > > >
> > > > Have all accepted invitations to become committers on Apache Arrow!
> > > > Welcome, thank you for all your contributions so far, and we look
> forward
> > > > to continuing to drive Apache Arrow forward to an even better place
> in
> > > the
> > > > future.
> > > >
> > > > This exciting growth in committers mirrors the growth of the Arrow
> Rust
> > > > community.
> > > >
> > > > Andrew
> > > >
> > > > p.s. sorry for the somewhat impersonal email; I was trying to avoid
> > > > several very similar emails. I am truly excited for each of these
> > > > individuals.
> > > >
> > >
>


Re: [DataFusion] Question about Accumulator API and maybe potential bugs

2022-01-04 Thread LM
Hi Jorge,

That makes sense, thanks for the clarification.

Thanks,
Lin

On Mon, 3 Jan 2022 at 23:49, Jorge Cardoso Leitão 
wrote:

> Hi,
>
> The accumulator API is designed to accept multiple columns (e.g. the
> Pearson correlation takes 2 columns, not one). `values[0]` corresponds to
> the first column passed to the accumulator. All concrete implementations of
> accumulators in DataFusion atm only accept one column (Sum, Avg, Count,
> Min, Max), but the API is designed to accept multiple columns.
>
> So, update_batch(&mut self, values: &[ArrayRef]) corresponds to: update the
> accumulator from n columns. For sum, this would be 1, for Pearson
> correlation this would be 2, and for e.g. an ML model whose weights are computed
> over all columns, this would be the number of input columns N of the model.
> For stddev, you should use 1, since stddev is a function of a single
> column.
>
> `update(&mut self, values: &[ScalarValue])` corresponds to updating the
> state with intermediary states. In a HashAggregate, we reduce each
> partition, and use `update` to compute the final value from the
> intermediary (scalar) states.
>
> Hope this helps,
> Jorge
>
>
>
> On Tue, Jan 4, 2022 at 5:55 AM LM  wrote:
>
> > Hi All,
> >
> > I just started looking into DataFusion and am considering using it as the
> > platform for our next gen analytics solution. To get started, I tried to
> > add a few functions such as stddev. While writing the code I noticed some
> > discrepancies (it may also be my unfamiliarity with the code base) in the
> > Accumulator API and the implementation of some functions. The API is
> > defined as the following:
> >
> > pub trait Accumulator: Send + Sync + Debug {
> >     /// Returns the state of the accumulator at the end of the accumulation.
> >     // in the case of an average on which we track `sum` and `n`, this
> >     // function should return a vector of two values, sum and n.
> >     fn state(&self) -> Result<Vec<ScalarValue>>;
> >
> >     /// updates the accumulator's state from a vector of scalars.
> >     fn update(&mut self, values: &[ScalarValue]) -> Result<()>;
> >
> >     /// updates the accumulator's state from a vector of arrays.
> >     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
> >         if values.is_empty() {
> >             return Ok(());
> >         };
> >         (0..values[0].len()).try_for_each(|index| {
> >             let v = values
> >                 .iter()
> >                 .map(|array| ScalarValue::try_from_array(array, index))
> >                 .collect::<Result<Vec<_>>>()?;
> >             self.update(&v)
> >         })
> > I am only quoting the update and update_batch functions for brevity; the same
> > applies to the merge functions. So here it indicates that the update function
> > takes a *vector of scalars* and update_batch takes a *vector of arrays*.
> >
> > When reading the code of some actual implementations, for example *sum* and
> > *average*, both implementations assume that when update is called *only one*
> > value is passed in, and that when update_batch is called *only one* array is
> > passed in.
> >
> > impl Accumulator for AvgAccumulator {
> >     fn state(&self) -> Result<Vec<ScalarValue>> {
> >         Ok(vec![ScalarValue::from(self.count), self.sum.clone()])
> >     }
> >
> >     fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
> >         let values = &values[0];
> >
> >         self.count += (!values.is_null()) as u64;
> >         self.sum = sum::sum(&self.sum, values)?;
> >
> >         Ok(())
> >     }
> >
> >     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
> >         let values = &values[0];
> >
> >         self.count += (values.len() - values.data().null_count()) as u64;
> >         self.sum = sum::sum(&self.sum, &sum::sum_batch(values)?)?;
> >         Ok(())
> >
> > impl Accumulator for SumAccumulator {
> >     fn state(&self) -> Result<Vec<ScalarValue>> {
> >         Ok(vec![self.sum.clone()])
> >     }
> >
> >     fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
> >         // sum(v1, v2, v3) = v1 + v2 + v3
> >         self.sum = sum(&self.sum, &values[0])?;
> >         Ok(())
> >     }
> >
> >     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
> >         let values = &values[0];
> >         self.sum = sum(&self.sum, &sum_batch(values)?)?;
> >         Ok(())
> >     }
> >
> > Could someone shed some light in case I missed anything?
> >
> > Regards,
> > Lin
> >
>
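
To make the multi-column design discussed above concrete, here is a minimal sketch (not part of the original thread) of what a hypothetical two-column accumulator might look like against this API. The CovarianceAccumulator name, its fields, and the import paths are assumptions for illustration (paths roughly as of DataFusion 6.x; they may differ in other versions); the merge and evaluate methods are written against signatures the quoted snippets elide, merge_batch is left to its assumed default, and real code would return an error instead of calling expect on the downcasts.

// Hypothetical two-column accumulator (illustrative only, not the DataFusion
// implementation): it tracks the running sums a covariance-style aggregate needs.
use arrow::array::{Array, ArrayRef, Float64Array};
use datafusion::error::Result;
use datafusion::physical_plan::Accumulator;
use datafusion::scalar::ScalarValue;

#[derive(Debug, Default)]
struct CovarianceAccumulator {
    count: u64,
    sum_x: f64,
    sum_y: f64,
    sum_xy: f64,
}

impl Accumulator for CovarianceAccumulator {
    fn state(&self) -> Result<Vec<ScalarValue>> {
        // Intermediary state: everything needed to merge partial results later.
        Ok(vec![
            ScalarValue::from(self.count),
            ScalarValue::from(self.sum_x),
            ScalarValue::from(self.sum_y),
            ScalarValue::from(self.sum_xy),
        ])
    }

    fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
        // values[0] is this row's value from the first column, values[1] from the second.
        if let (ScalarValue::Float64(Some(x)), ScalarValue::Float64(Some(y))) =
            (&values[0], &values[1])
        {
            self.count += 1;
            self.sum_x += x;
            self.sum_y += y;
            self.sum_xy += x * y;
        }
        Ok(())
    }

    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        // values[0] and values[1] are whole columns for this batch.
        let xs = values[0].as_any().downcast_ref::<Float64Array>().expect("f64 column");
        let ys = values[1].as_any().downcast_ref::<Float64Array>().expect("f64 column");
        for i in 0..xs.len() {
            if xs.is_valid(i) && ys.is_valid(i) {
                self.count += 1;
                self.sum_x += xs.value(i);
                self.sum_y += ys.value(i);
                self.sum_xy += xs.value(i) * ys.value(i);
            }
        }
        Ok(())
    }

    fn merge(&mut self, states: &[ScalarValue]) -> Result<()> {
        // states holds one ScalarValue per element returned by state().
        if let (
            ScalarValue::UInt64(Some(count)),
            ScalarValue::Float64(Some(sx)),
            ScalarValue::Float64(Some(sy)),
            ScalarValue::Float64(Some(sxy)),
        ) = (&states[0], &states[1], &states[2], &states[3])
        {
            self.count += count;
            self.sum_x += sx;
            self.sum_y += sy;
            self.sum_xy += sxy;
        }
        Ok(())
    }

    fn evaluate(&self) -> Result<ScalarValue> {
        // Population covariance: E[xy] - E[x] * E[y].
        let cov = if self.count > 0 {
            let n = self.count as f64;
            Some(self.sum_xy / n - (self.sum_x / n) * (self.sum_y / n))
        } else {
            None
        };
        Ok(ScalarValue::Float64(cov))
    }
}

For a single-column function such as the stddev mentioned in this thread, the shape would be the same but only values[0] is read, which is why the existing Sum and Avg accumulators index values[0] directly.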


[DataFusion] Question about Accumulator API and maybe potential bugs

2022-01-03 Thread LM
Hi All,

I just started looking into DataFusion and am considering using it as the
platform for our next gen analytics solution. To get started, I tried to
add a few functions such as stddev. While writing the code I noticed some
discrepancies (it may also be my unfamiliarity with the code base) in the
Accumulator API and the implementation of some functions. The API is
defined as the following:

pub trait Accumulator: Send + Sync + Debug {
    /// Returns the state of the accumulator at the end of the accumulation.
    // in the case of an average on which we track `sum` and `n`, this function
    // should return a vector of two values, sum and n.
    fn state(&self) -> Result<Vec<ScalarValue>>;

    /// updates the accumulator's state from a vector of scalars.
    fn update(&mut self, values: &[ScalarValue]) -> Result<()>;

    /// updates the accumulator's state from a vector of arrays.
    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        if values.is_empty() {
            return Ok(());
        };
        (0..values[0].len()).try_for_each(|index| {
            let v = values
                .iter()
                .map(|array| ScalarValue::try_from_array(array, index))
                .collect::<Result<Vec<_>>>()?;
            self.update(&v)
        })
I am only quoting the update and update_batch functions for brevity; the same
applies to the merge functions. So here it indicates that the update function
takes a *vector of scalars* and update_batch takes a *vector of arrays*.

When reading the code of some actual implementations, for example *sum* and
*average*, both implementations assume that when update is called *only one*
value is passed in, and that when update_batch is called *only one* array is
passed in.

impl Accumulator for AvgAccumulator {
    fn state(&self) -> Result<Vec<ScalarValue>> {
        Ok(vec![ScalarValue::from(self.count), self.sum.clone()])
    }

    fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
        let values = &values[0];

        self.count += (!values.is_null()) as u64;
        self.sum = sum::sum(&self.sum, values)?;

        Ok(())
    }

    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let values = &values[0];

        self.count += (values.len() - values.data().null_count()) as u64;
        self.sum = sum::sum(&self.sum, &sum::sum_batch(values)?)?;
        Ok(())

impl Accumulator for SumAccumulator {
    fn state(&self) -> Result<Vec<ScalarValue>> {
        Ok(vec![self.sum.clone()])
    }

    fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
        // sum(v1, v2, v3) = v1 + v2 + v3
        self.sum = sum(&self.sum, &values[0])?;
        Ok(())
    }

    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let values = &values[0];
        self.sum = sum(&self.sum, &sum_batch(values)?)?;
        Ok(())
    }

Could someone shed some light in case I missed anything?

Regards,
Lin