C Data interface landed on Rust

2020-12-05 Thread Jorge Cardoso Leitão
Hi, Just to let you know that with #8401 merged, Rust's implementation now has basic support for the C data interface. This enables near zero copy between Rust and other implementations, which

Re: Arrow sync call November 25 at 12:00 US/Eastern, 17:00 UTC

2020-11-25 Thread Jorge Cardoso Leitão
Hi, There was no topic to discuss, so we called it a day. Have a good thanksgiving folks in the US and elsewhere :) Best, Jorge On Wed, Nov 25, 2020 at 5:44 PM Neal Richardson wrote: > Hi all, > Reminder that our biweekly call is coming up at > https://meet.google.com/vtm-teks-phx. All are

[Governance] [Proposal] Stop force-pushing to PRs after release?

2020-11-24 Thread Jorge Cardoso Leitão
Hi, Based on a discussion on PR #8481, I would like to raise a concern around git and the post-actions of a release. The background is that I was really confused that someone has force-pushed to a PR that I fielded, re-writing its history and causing the PR to break. @wes and @kszucs quickly

Re: [DISCUSS] Memory alignment in rust - what to do?

2020-11-24 Thread Jorge Cardoso Leitão
To bring closure to this thread: #8401 implements the necessary functionality to import and export from and to the C data interface, includes an integration test to run these against the API provided by pyarrow, and the CI is green. Special thanks to

Re: [CI] Github Actions statistics

2020-11-24 Thread Jorge Cardoso Leitão
Hi, Thanks a lot for sharing these. I am looking through the tests that we run, and how we run them, as I would really like to take a hit at it. However, I can't commit to this without some agreement. I took a hard look at archery and most of our builds, and these are my observations: * we

Re: [DISCUSS] Extend specification with the definition of equality?

2020-11-13 Thread Jorge Cardoso Leitão
s. Do you want to maybe try to make a PR? > > > > One small edge case to consider is how NaN float values are compared. > > I think at the specification level, it should only be bit/byte-level > binary equality without respect to the semantics of the logical data > ty
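To make the NaN edge case in this exchange concrete, here is a minimal Rust sketch contrasting semantic (IEEE 754) comparison with the bit-level comparison suggested in the quoted reply. The function names are hypothetical and are not part of any Arrow API.

```rust
/// Semantic (IEEE 754) equality: NaN never compares equal, not even to itself,
/// while 0.0 and -0.0 compare equal.
pub fn semantic_eq(a: f64, b: f64) -> bool {
    a == b
}

/// Bit-level equality: compares the raw 64-bit patterns of the two floats.
pub fn bitwise_eq(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}

/// Bit-level equality of two slices of floats (hypothetical helper).
pub fn slices_bitwise_eq(a: &[f64], b: &[f64]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| bitwise_eq(*x, *y))
}
```

Under bit-level equality a NaN equals itself as long as the payload bits match, but 0.0 and -0.0 become different values; that is the trade-off the two definitions make.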

Re: [DISCUSS] Extend specification with the definition of equality?

2020-11-12 Thread Jorge Cardoso Leitão
onary value, even if the dictionaries are different, > then they are equal because the data they represent is the same) > > - Wes > > On Thu, Nov 5, 2020 at 1:13 AM Jorge Cardoso Leitão > wrote: > > > > Hi, > > > > Recently, I revisited the code for array equalit
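The dictionary case quoted above (equal logical values despite different dictionaries) can be sketched as follows. `DictArray` is a toy type invented for this illustration, not Arrow's dictionary array, and it ignores nulls for brevity.

```rust
/// A toy dictionary-encoded array: `keys[i]` is an index into `values`.
/// Hypothetical type for illustration only; nulls are not modelled.
pub struct DictArray {
    pub keys: Vec<usize>,
    pub values: Vec<String>,
}

impl DictArray {
    pub fn new(keys: Vec<usize>, values: &[&str]) -> Self {
        DictArray {
            keys,
            values: values.iter().map(|v| v.to_string()).collect(),
        }
    }

    /// Number of logical elements in the array.
    pub fn len(&self) -> usize {
        self.keys.len()
    }

    /// Logical value at position `i`, resolved through the dictionary.
    pub fn get(&self, i: usize) -> &str {
        &self.values[self.keys[i]]
    }
}

/// Two arrays are logically equal when they resolve to the same values,
/// regardless of how each dictionary happens to be ordered.
pub fn logical_eq(a: &DictArray, b: &DictArray) -> bool {
    a.len() == b.len() && (0..a.len()).all(|i| a.get(i) == b.get(i))
}
```

Comparing keys and dictionaries directly would report two such arrays as different whenever their dictionaries are ordered differently, which is why the comparison resolves each key first.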

Re: [ANNOUNCE] New Arrow committer: Andrew Lamb

2020-11-10 Thread Jorge Cardoso Leitão
Congrats, Andrew! Andrew has been doing an amazing job, both on the implementation and in reviewing and helping others. He taught me a lot, I am having a great time working with him and I am thus really happy about this. Best, Jorge On Tue, Nov 10, 2020 at 4:42 PM Andy Grove wrote: > On

[DISCUSS] Extend specification with the definition of equality?

2020-11-04 Thread Jorge Cardoso Leitão
Hi, Recently, I revisited the code for array equality in Rust. While going through it, I observed some assumptions about how we conclude that two elements of an arrow array are equal, and when two arrays are equal. The notion of equality is also used throughout the document e.g. when we offer

Re: [ANNOUNCE] New Arrow PMC chair: Wes McKinney

2020-10-25 Thread Jorge Cardoso Leitão
Thanks a lot Jacques for taking the flag until now, and congratulations, Wes! On Sun, Oct 25, 2020 at 2:58 PM Wes McKinney wrote: > Thanks all! > > On Sun, Oct 25, 2020 at 6:29 AM Krisztián Szűcs > wrote: > > > > Congrats Wes! > > > > On Sun, Oct 25, 2020 at 2:40 AM David Li wrote: > > > > >

Experiment with DataFusion + Pyarrow

2020-10-20 Thread Jorge Cardoso Leitão
Hi, Over the past few weeks I have been running an experiment whose main goal is to run a query in (Rust's) DataFusion and use Python on it so that we can embed Python's ecosystem in the query (à la pyspark) (details here). I am super

Re: [Rust] Blog post for 2.0.0

2020-10-16 Thread Jorge Cardoso Leitão
Hi, I would like to thank Fernando for raising this concern here: I also think that we still do not put enough effort in the documentation :) I admit that when I started in the project, I also had that need and just had some time to go through the code. First, I find it useful to distinguish

Re: [VOTE] Release Apache Arrow 2.0.0 - RC0

2020-10-12 Thread Jorge Cardoso Leitão
I have also built and tested the Rust implementation and browsed through the built documentation and it all looks good to me. On Mon, Oct 12, 2020 at 4:48 PM Andy Grove wrote: > +1 (binding) based on testing the Rust implementation only. > > On Sun, Oct 11, 2020 at 4:17 AM Krisztián Szűcs > >

Re: Permission denied on github

2020-10-03 Thread Jorge Cardoso Leitão
> > I think those two things plus completing the Gitbox setup are what's > needed to get added to the Apache org on GitHub. > > On Sat, Oct 3, 2020 at 12:31 PM Jorge Cardoso Leitão > wrote: > > > > Hi, > > > > Today I was trying together with the help of An

Permission denied on github

2020-10-03 Thread Jorge Cardoso Leitão
Hi, Today I was trying, with the help of Andy Grove, to merge a small PR on GitHub via dev/merge_arrow_pr.py. I am getting a permission denied error, and I am kind of blocked. The result (full output below): ERROR: Permission to apache/arrow.git denied to jorgecarleitao. fatal: Could not

Re: [ANNOUNCE] New Arrow committer: Jorge Leitão

2020-09-30 Thread Jorge Cardoso Leitão
I just wanna say that it has been an awesome experience working with you folks: IMO a great balance between pragmatism and design choices, together with a very friendly atmosphere and impressive knowledge sharing over a wide range of CS topics. So, thank you. :-) Best, Jorge On Wed, Sep 30,

Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Jorge Cardoso Leitão
That would be awesome! I agree with this, and it would be really useful, as it would leverage all the goodies that RDBMSs have wrt transitions, etc. I would probably go for having database-specifics outside of the arrow project, so that they can be used by other folks beyond arrow, and keep the

Re: [DISCUSS] Memory alignment in rust - what to do?

2020-09-24 Thread Jorge Cardoso Leitão
22/09/2020 à 19:16, Jorge Cardoso Leitão a écrit : > > Hi, > > > > I had some time to look at > https://issues.apache.org/jira/browse/ARROW-10039, > > wrt to the alignment requirements that rust implementation currently > > imposes. > > > > The gist is that it

[DISCUSS] Memory alignment in rust - what to do?

2020-09-22 Thread Jorge Cardoso Leitão
Hi, I had some time to look at https://issues.apache.org/jira/browse/ARROW-10039, wrt the alignment requirements that the Rust implementation currently imposes. The gist is that it is not that easy, and I would like to request some guidance. Some facts: 1. Our current implementation does not

Re: Help with memory alignment between Rust and C/Python

2020-09-18 Thread Jorge Cardoso Leitão
.0. Best, Jorge On Fri, Sep 18, 2020 at 6:08 PM Antoine Pitrou wrote: > > Le 18/09/2020 à 18:03, Jorge Cardoso Leitão a écrit : > > // panics with "memory not aligned" > > Buffer::from_raw_parts(address as *const u8, size, size) > > > > I get an address such

Help with memory alignment between Rust and C/Python

2020-09-18 Thread Jorge Cardoso Leitão
Hi, I am trying to convert pyarrow buffers into Rust buffers and vice-versa, to perform zero-copy from and to pyarrow, to and from Rust's library. I was able to perform the operation rust -> pyarrow, using something along the lines of // 64 bits system let pointer = buffer.raw_data() as i64;
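As context for the "memory not aligned" panic discussed in this thread, the following is a minimal sketch of how an address can be tested against an alignment requirement before it is handed to a buffer constructor. The helper names are hypothetical, and the alignment is left as a parameter because the snippet does not state the value the Rust implementation required.

```rust
/// Returns true when `address` is a multiple of `alignment`.
/// `alignment` must be a power of two (hypothetical helper).
pub fn is_aligned(address: usize, alignment: usize) -> bool {
    assert!(alignment.is_power_of_two(), "alignment must be a power of two");
    // For a power of two, the low bits of an aligned address are all zero.
    (address & (alignment - 1)) == 0
}

/// Same check for a raw pointer, e.g. one received from another runtime.
pub fn ptr_is_aligned<T>(ptr: *const T, alignment: usize) -> bool {
    is_aligned(ptr as usize, alignment)
}
```

A check like this only reports the mismatch; memory allocated elsewhere with a weaker alignment still has to be either accepted as-is or copied into an aligned allocation.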

Re: [C++][Compute] question about aggregate kernels

2020-09-17 Thread Jorge Cardoso Leitão
Hi, > I think what everyone else was potentially stating implicitly is that for combining details about arrays, for std. dev. and average there needs to be more state kept that is different from the elements that one is actually dealing with. For std. dev. you need to keep two numbers (same
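The point that standard deviation needs extra per-chunk state can be sketched with a mergeable accumulator. This version keeps a count, a running mean, and the sum of squared deviations, and combines chunks with the standard pairwise update; it is one common formulation, chosen here as an assumption, and not necessarily the state the thread's participants had in mind. All names are hypothetical.

```rust
/// Mergeable state for a population standard deviation (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct VarianceState {
    pub count: u64,
    pub mean: f64,
    /// Sum of squared deviations from the mean.
    pub m2: f64,
}

impl VarianceState {
    /// Builds the state for a single chunk with a two-pass computation.
    pub fn from_chunk(chunk: &[f64]) -> Self {
        let count = chunk.len() as u64;
        if count == 0 {
            return VarianceState { count: 0, mean: 0.0, m2: 0.0 };
        }
        let mean = chunk.iter().sum::<f64>() / count as f64;
        let m2: f64 = chunk.iter().map(|x| (x - mean).powi(2)).sum();
        VarianceState { count, mean, m2 }
    }

    /// Combines the states of two disjoint chunks.
    pub fn merge(self, other: Self) -> Self {
        if self.count == 0 {
            return other;
        }
        if other.count == 0 {
            return self;
        }
        let (na, nb) = (self.count as f64, other.count as f64);
        let n = na + nb;
        let delta = other.mean - self.mean;
        VarianceState {
            count: self.count + other.count,
            mean: self.mean + delta * nb / n,
            m2: self.m2 + other.m2 + delta * delta * na * nb / n,
        }
    }

    /// Population standard deviation, or None for an empty input.
    pub fn std_dev(self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some((self.m2 / self.count as f64).sqrt())
        }
    }
}
```

Merging the states of {1} and {2, 3} gives the same standard deviation as processing {1, 2, 3} in one pass, which combining per-chunk standard deviations directly cannot do.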

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Jorge Cardoso Leitão
Hi, I am not sure I fully understand, so I will try to give an example to check: we have a simple query whose result we want to write to some place: SELECT t1.b * t2.b FROM t1 JOIN t2 ON t1.a = t2.a At the physical plan level, we need to 1. read each file in batches 2. join the batches 3.

Re: [C++][Compute] question about aggregate kernels

2020-09-16 Thread Jorge Cardoso Leitão
Hi Yibo, That is correct. The simplest example is an average of 3 elements {x1,x2,x3}, in two chunks: {x1} and {x2,x3}. The average of the average is not equal to the average: avg({avg({x1}), avg({x2,x3})}) = ((x1) + (x2 + x3)/2)/2 != (x1 + x2 + x3) / 3 = avg({x1,x2,x3}) We are solving this in
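The {x1}, {x2,x3} example above can be reproduced with a small mergeable state that carries the sum and the count per chunk. This is a sketch of one common approach under assumed names (`MeanState`, `mean_of_means`); the message is truncated before it says how the project actually solved it.

```rust
/// Partial state for an average: the sum and the number of elements seen.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MeanState {
    pub sum: f64,
    pub count: u64,
}

impl MeanState {
    /// Builds the partial state for one chunk.
    pub fn from_chunk(chunk: &[f64]) -> Self {
        MeanState { sum: chunk.iter().sum(), count: chunk.len() as u64 }
    }

    /// Combines the states of two chunks; sums and counts simply add up.
    pub fn merge(self, other: Self) -> Self {
        MeanState { sum: self.sum + other.sum, count: self.count + other.count }
    }

    /// Final average, or None when no elements were seen.
    pub fn finish(self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

/// The incorrect shortcut: averaging the per-chunk averages.
/// Assumes every chunk is non-empty.
pub fn mean_of_means(chunks: &[&[f64]]) -> f64 {
    let means: Vec<f64> = chunks
        .iter()
        .map(|c| c.iter().sum::<f64>() / c.len() as f64)
        .collect();
    means.iter().sum::<f64>() / means.len() as f64
}
```

With x1 = 1, x2 = 2, x3 = 3, merging the two states yields 6 / 3 = 2, while the average of the averages is (1 + 2.5) / 2 = 1.75.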

Thank you

2020-08-27 Thread Jorge Cardoso Leitão
Hi, I am writing to just thank all those involved in the release process. Sometimes the work of releases is not fully appreciated within development (where are the PRs ^_^?), but I find it impressive that the release is so smooth for such a complex project, and IMO that is to a large extent due

[DataFusion] Proposal to change how UDFs are called in DataFrame API

2020-08-23 Thread Jorge Cardoso Leitão
Hi, I came across a limitation that I would like to propose a resolution to. TL;DR: currently, users plan UDF calls via a call of the form let e = scalar_functions("my_udf", vec![col("a")], DataType::Float64); df.select(vec![e]) The proposal is to use instead: let f = df.registry(); let e =

Re: [Rust] [DataFusion] Proposal for User Defined PlanNode / Operator API

2020-08-22 Thread Jorge Cardoso Leitão
Hi Andrew, I carefully went through the document and the PR. Thank you for this! I believe that the improvements on the PR alone are a major benefit, as it supports custom logical plans out of the box, which opens a lot of possibilities to users. I also like the idea of migrating from enum to a

Re: Polymorphism in DataFusion

2020-08-21 Thread Jorge Cardoso Leitão
think I > will have time to do this any time soon (unless it becomes directly > important for the project I am working on) > > Thanks for taking the initiative on this, > Andrew > > On Wed, Aug 19, 2020 at 2:29 PM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote:

Re: Polymorphism in DataFusion

2020-08-19 Thread Jorge Cardoso Leitão
doing that at a different level than the UDF (perhaps via > `register_alias("sum", "sum_i32)` or something), again for both clarity of > DataFusion implementation as well as UDF specification. > > Andrew > > On Mon, Aug 17, 2020 at 4:52 PM Jorge Cardoso Leitão < >

Re: Polymorphism in DataFusion

2020-08-17 Thread Jorge Cardoso Leitão
semantics of "1 return type > for each UDF" to make it easier on people writing UDFs as well as > simplifying the implementation of DataFusion itself. > > Andrew > > [1] https://dev.mysql.com/doc/refman/8.0/en/create-function-udf.html > [2] https://www.p

Polymorphism in DataFusion

2020-08-17 Thread Jorge Cardoso Leitão
Hi, Recently, I have been contributing to DataFusion, and I would like to bring to your attention a question that I faced while PRing to DataFusion that IMO needs some alignment :) DataFusion supports scalar UDFs: functions that expect a type, return a type, and perform some operation on the

Re: Versioning of arrow

2020-07-29 Thread Jorge Cardoso Leitão
Hi, That makes a lot of sense. I am sorry that I did not understand that from the versioning document and the discussion on this thread. Best, Jorge On Tue, Jul 28, 2020 at 8:30 PM Wes McKinney wrote: > On Tue, Jul 28, 2020 at 8:49 AM Jorge Cardoso Leitão > wrote: > >

Re: Versioning of arrow

2020-07-28 Thread Jorge Cardoso Leitão
n whether the next release for the library I'm > > working on "should" have a major or minor version bump, I'm skeptical > that > > having that autonomy is worth the maintenance cost. > > > > Neal > > > > > > On Mon, Jul 27, 2020 at 9:37 AM Jorge C

Versioning of arrow

2020-07-27 Thread Jorge Cardoso Leitão
Hi, First off, congrats on the 1.0.0 release! I am writing because I am trying to understand the versioning scheme we will use going forward. As far as I understand, 1.0.0 was assigned to all subcomponents of arrow. I.e. I can now use pyarrow and assign something like >=1,<2 on a setup.py. However,
