Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andy Grove
Thanks for the feedback so far on this proposal. I really appreciate
everyone taking the time to put so much thought (and passion!) into this.

So far, I don't think anyone is opposed to the idea of donating Ballista
but there are clearly concerns about an increased burden on current
maintainers.

We have also restarted discussions around tooling and release processes.
It seems that there is no objection to Rust / DataFusion / Ballista having
more control over the release process, but we have to put in the work to
make that happen. I am certainly motivated to help with this, but I think
that is a separate conversation from donating Ballista.

To reduce the burden on existing maintainers, we could consider initially
adding Ballista in such a way that it doesn't slow down momentum on Arrow &
DataFusion by adding it as a separate Rust subproject that is not part of
the Rust workspace, and have it depend on pinned commits initially. This
would be a lightweight way of incubating the project within the mono-repo
and at some point, we can add it to the main workspace. This would be no
worse than the current situation, and it would be better because it is at
least under Arrow governance.

I would like to talk a bit more specifically about the donation at this
point now that there is some feedback.

What I propose we donate from Ballista is:

   - The ballista.proto file that defines an encoding for logical and
     physical query plans as well as cluster meta-data (this protobuf file
     could eventually be split into separate files for each area)
   - The Rust source code, which consists of these main areas:
      - serde code for translating between protobuf and
        Arrow/DataFusion/Ballista data structures
      - Distributed query planner
      - Scheduler process that coordinates query execution across available
        executors
      - Executor process that implements Flight protocol and executes query
        partitions and serializes results in Arrow IPC format

I am proposing that we specifically exclude the following parts of the
Ballista repo from the donation:

   - The work-in-progress JDBC driver which is not currently functional
   - The Spark benchmark code that I have been using for comparing
     performance
   - The Python bindings, which as far as I know are pretty much a fork of
     Jorge's datafusion-python project.

I think it is also worth mentioning that Ballista is currently only ~8k
lines of code, which is pretty small in contrast to the >100k lines of code
in the Arrow Rust project currently.

Let's keep the conversation going and see what other feedback there is
regarding the merits of donating Ballista, or not.

Thanks,

Andy.

On Wed, Mar 10, 2021 at 3:13 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Wes, thanks a lot for your reply. Let me try to answer:
>
> > 1. If the purpose of Ballista is to support multiple language
> > executors, what does segregating it from the other PL's (where
> > executors are being developed, too) serve to facilitate this goal?
> >
>
> It facilitates this because the stronger the coupling is, the more
> entropic the setup is, and the more energy is required to develop and
> maintain it. In this particular case, I imagine that each executor would
> depend on specific versions of each implementation, just like any other
> dependent that is not maintained by Apache Arrow does.
>
> Or is the idea that every dependent should be on the mono-repo? If we need
> to control our dependents like that, that usually indicates that we can't
> guarantee a stable API (which IMO is the root cause).
>
> > 2. Use of the monorepo does not require a synchronized release cycle,
> > just as Rust does not require it now either. The only reason there
> > have not been independent Rust releases is because someone has not
> > volunteered to do it. Likewise, if DataFusion and Ballista are in the
> > same git repository, they don't have to release at the same time as
> > the core arrow / parquet crates.
> >
>
> I thought that Rust needed to be synchronized with the major release of the
> repo. Is this no longer the case?
>
> > 3. On an incremental basis, I do not believe the increased complexity
> > is significant. A multi-repository setup can be actively worse when
> > development work involves both repositories at the same time. This can
> > be mitigated by pinning the arrow / parquet crates as you point out,
> > but that creates other issues.
>
> Could you enumerate parts from DataFusion or Ballista that would require
> work on Arrow at the same time? I proposed that division because I am
> reasonably confident they will not need to be developed at the same time.
> I am
> confident of this because a) the APIs used by DataFusion are written to
> minimize public surfaces, so that arrow can mutate without affecting those
> APIs; b) I designed and implemented most of the DataFusion code around
> built-in functions, aggregate functions, UDFs and UDA

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Jorge Cardoso Leitão
Hi,

Wes, thanks a lot for your reply. Let me try to answer:

> 1. If the purpose of Ballista is to support multiple language
> executors, what does segregating it from the other PL's (where
> executors are being developed, too) serve to facilitate this goal?
>

It facilitates this because the stronger the coupling is, the more entropic
the setup is, and the more energy is required to develop and maintain it.
In this particular case, I imagine that each executor would depend on
specific versions of each implementation, just like any other dependent
that is not maintained by Apache Arrow does.

Or is the idea that every dependent should be on the mono-repo? If we need
to control our dependents like that, that usually indicates that we can't
guarantee a stable API (which IMO is the root cause).

> 2. Use of the monorepo does not require a synchronized release cycle,
> just as Rust does not require it now either. The only reason there
> have not been independent Rust releases is because someone has not
> volunteered to do it. Likewise, if DataFusion and Ballista are in the
> same git repository, they don't have to release at the same time as
> the core arrow / parquet crates.
>

I thought that Rust needed to be synchronized with the major release of the
repo. Is this no longer the case?

> 3. On an incremental basis, I do not believe the increased complexity
> is significant. A multi-repository setup can be actively worse when
> development work involves both repositories at the same time. This can
> be mitigated by pinning the arrow / parquet crates as you point out,
> but that creates other issues.

Could you enumerate parts from DataFusion or Ballista that would require
work on Arrow at the same time? I proposed that division because I am
reasonably confident they will not need to be developed at the same time.
I am confident of this because a) the APIs used by DataFusion are written
to minimize public surfaces, so that arrow can mutate without affecting
those APIs; b) I designed and implemented most of the DataFusion code
around built-in functions, aggregate functions, UDFs and UDAFs.

But maybe we can validate this here: Andy, during the development of
Ballista, which is where the largest changes to the Arrow repo would have
been needed, did you have to change anything in the arrow crate or the
parquet crate, or was everything done in DataFusion? If you did, was there
a significant burden in doing so?

> 4. Even without Jira, there is still the expectation for contributors
> to communicate in a way that is compatible with the Apache Way. So
> even without Jira, PMCs have an obligation to establish an alternative
> structure to have consistently open dialogue / planning about what
> people are working on or planning to work on in the future. If
> contributors are extensively discussing / planning privately, these
> discussions must be moved into the open, whether with design documents
> or issues or e-mail discussions. This was discussed ad nauseam in the
> other thread so I won't rehash those arguments.
>

I fully agree, even though I think it is a bit difficult to operationalize.
So let me put it this way: would you consider discussions happening on
github PRs and issues, as airflow does, to be open under the definition
used above?

> Aside from these issues, the biggest lost opportunity I see if
> DF/Ballista are "cast away", as it were, is that it becomes unattractive
> for the rest of us to build anything on top of these platforms
> (because at that point we have a circular dependency, which is the
> hellscape we escaped from with Parquet C++). I used the
> datafusion-python project as an example — if that were in the Arrow
> project I might consider using it in various ways or contribute to it,
> but as an external project it's less interesting to me as something to
> build on.
>

My feelings about transferring datafusion-python to arrow are shared above:
I see picking something that is well encapsulated and decoupled from the
rest and blending it into something large and less decoupled as an
entropy-generating activity, one that requires more energy to maintain.
Operationally, the way I would merge a project like datafusion-python into
Apache would be by transferring ownership of the repo on github,
transferring ownership of the pypi project, and creating some
secrets on github to keep twine working. Just like I mentioned for
Ballista. If people lose interest in the project, then deprecating it would
be trivial (archive the repo). If people gain interest in it, growth is
also trivial (there is already a house in place and the goals are well
defined). The interfaces are the API contracts declared as pinned
dependencies (in Cargo.toml / setup.py).

Best,
Jorge




On Wed, Mar 10, 2021 at 7:50 PM Wes McKinney wrote:

> hi Jorge,
>
> I have some thoughts / questions on your arguments against use of the
> monorepo:
>
> 1. If the purpose of Ballista is to support multiple language
> executors, what do

Re: Is Zulip still the preferred chat application for Arrow?

2021-03-10 Thread Andy Grove
I totally agree that we should be diligent to "move discussions to ASF
channels when something of relevance to the community is being discussed".

I already have informal communications with contributors over various
channels (email, private slack groups, discord, etc), and moving these
interactions to ASF slack seems like a step towards more open collaboration.


On Wed, Mar 10, 2021 at 11:29 AM Wes McKinney wrote:

> The Ursa Zulip is not an "official" channel, which is to say that
> community discussions there about what to build, whether there is
> consensus about something, etc., are not valid from a governance /
> Openness standpoint. Those need to take place either on the mailing
> list or the issue tracker (which at present is Jira).
>
> Arrow had a Slack instance early in its life but it led to behavioral
> antipatterns — people were asking questions or having discussions
> about the project in a place where only a small fraction of the
> community was present. We discussed and deemed that the presence of an
> Arrow Slack channel was harmful to the community as it was then
> operating, and so we shut it down. I personally will not use Slack if
> I have any alternative.
>
> The best way to communicate whether someone is working on an issue is
> to assign it to themselves, and if it is not assigned then it can be
> assumed to be free to pick up.
>
> If any of you want to use a Slack instance somewhere to chat in an IRC
> like fashion that's completely fine, just please move discussions to
> ASF channels when something of relevance to the community is being
> discussed.
>
> On Wed, Mar 10, 2021 at 12:15 PM Antoine Pitrou wrote:
> >
> >
> >
> > On 10/03/2021 at 19:15, Antoine Pitrou wrote:
> > >
> > > On 10/03/2021 at 19:04, Antoine Pitrou wrote:
> > >>
> > >> Hi Andy,
> > >>
> > >> On 10/03/2021 at 19:00, Andy Grove wrote:
> > >>> We had a discussion on the Arrow Rust Sync call about the best place to
> > >>> co-ordinate on work. For example, quick questions like "is anyone working
> > >>> on ARROW-12345? should I pick this up?".
> > >>>
> > >>> I know that Ursa Lab hosts Zulip and I have used that in the past for
> > >>> these types of discussion.
> > >>>
> > >>> I also found out today that there is an official ASF slack with multiple
> > >>> Arrow channels, but this is only open to people who already have an
> > >>> apache.org email address (committers / PMC).
> > >>
> > >> I didn't know that the ASF had an official Slack.  I suppose that's the
> > >> Apache way of favoring open source software.
> > >>
> > >> I find Slack uncomfortable and annoying to deal with, and I wouldn't go
> > >> there.
> > >
> > > That said, and to answer your question a bit more completely, I don't
> > > there a requirement that all Arrow implementations use the same chat system.
> >
> > Wow, sorry: I don't think there's a requirement that ...
> >
>


Re: Is Zulip still the preferred chat application for Arrow?

2021-03-10 Thread Nate Bauernfeind
> I also found out today that there is an official ASF slack with multiple
> Arrow channels, but this is only open to people who already have an
> apache.org email address (committers / PMC).

FYI, non committers / PMC members can join the slack using this link:
https://s.apache.org/slack-invite

On Wed, Mar 10, 2021 at 11:04 AM Antoine Pitrou wrote:

>
> Hi Andy,
>
> On 10/03/2021 at 19:00, Andy Grove wrote:
> > We had a discussion on the Arrow Rust Sync call about the best place to
> > co-ordinate on work. For example, quick questions like "is anyone working
> > on ARROW-12345? should I pick this up?".
> >
> > I know that Ursa Lab hosts Zulip and I have used that in the past for
> > these types of discussion.
> >
> > I also found out today that there is an official ASF slack with multiple
> > Arrow channels, but this is only open to people who already have an
> > apache.org email address (committers / PMC).
>
> I didn't know that the ASF had an official Slack.  I suppose that's the
> Apache way of favoring open source software.
>
> I find Slack uncomfortable and annoying to deal with, and I wouldn't go
> there.
>
> Regards
>
> Antoine.
>




Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
hi Jorge,

I have some thoughts / questions on your arguments against use of the monorepo:

1. If the purpose of Ballista is to support multiple language
executors, what does segregating it from the other PL's (where
executors are being developed, too) serve to facilitate this goal?

2. Use of the monorepo does not require a synchronized release cycle,
just as Rust does not require it now either. The only reason there
have not been independent Rust releases is because someone has not
volunteered to do it. Likewise, if DataFusion and Ballista are in the
same git repository, they don't have to release at the same time as
the core arrow / parquet crates.

3. On an incremental basis, I do not believe the increased complexity
is significant. A multi-repository setup can be actively worse when
development work involves both repositories at the same time. This can
be mitigated by pinning the arrow / parquet crates as you point out,
but that creates other issues.

4. Even without Jira, there is still the expectation for contributors
to communicate in a way that is compatible with the Apache Way. So
even without Jira, PMCs have an obligation to establish an alternative
structure to have consistently open dialogue / planning about what
people are working on or planning to work on in the future. If
contributors are extensively discussing / planning privately, these
discussions must be moved into the open, whether with design documents
or issues or e-mail discussions. This was discussed ad nauseam in the
other thread so I won't rehash those arguments.

Aside from these issues, the biggest lost opportunity I see if
DF/Ballista are "cast away", as it were, is that it becomes unattractive
for the rest of us to build anything on top of these platforms
(because at that point we have a circular dependency, which is the
hellscape we escaped from with Parquet C++). I used the
datafusion-python project as an example — if that were in the Arrow
project I might consider using it in various ways or contribute to it,
but as an external project it's less interesting to me as something to
build on.

On Wed, Mar 10, 2021 at 12:13 PM Jorge Cardoso Leitão wrote:
>
> Hi,
>
> First of all, I want to thank you very much for your work on Ballista and
> for doing it in an open source environment. It is something that should be
> emphasised and celebrated.
>
> Secondly, with regard to considering donating it to the Apache Foundation and
> Apache project in particular, I would say that we should be honored by such
> consideration. In this context, my immediate reaction is: how can we best
> support Ballista's community?
>
> My initial thoughts in this direction are:
>
> * create a new git repo for DataFusion and Ballista to reside on (e.g.
> arrow/ballista)
> * do not require the release cycle and versioning to be aligned with
> arrow's release cycle
> * do not require the usage of JIRA
> * pin the dependency of Datafusion on Arrow and parquet crate (e.g. to a
> specific commit)
>
> I feel that this setup would keep Ballista under the Foundation and Apache
> Arrow's umbrella and aligned with its goals, while at the same time put the
> least amount of burden on its community, both in terms of keeping a strict
> release schedule, tooling and CI.
>
> The rationale for the above is that whenever something is released on
> DataFusion (which hosts most of the physical ops), people will also want it
> quickly available on Ballista. Thus, having the two release cycles more
> closely related and independent of the arrow implementation's cycle is
> good. DataFusion does not have integration tests against other arrow
> implementations, and thus the integration tests are not relevant.
>
> There are 4 main reasons I would not recommend placing it in the mono-repo:
>
> 1. It would not add much
> 2. It would place Ballista on the same release schedule and git system as
> the rest of Arrow's implementation, which may not suit Ballista's own
> development pace (in either direction)
> 3. It further increases the complexity of the current repo
> 4. It would force its community to use JIRA, merge process, components,
> etc, which may not be what its community wishes for
>
> The main risk I see is that because arrow's release cycle is slow and major
> releases only, DataFusion risks missing arrow features from time to time.
> We can mitigate this with cargo and pins to commit hashes. IMO this risk
> exists in any dependency relationship and is usually a sign that there is
> an API contract and thus a trust relationship involved, which is a good
> thing.
>
> Best,
> Jorge
>
> On Tue, Mar 9, 2021 at 6:31 PM Andy Grove wrote:
>
> > As many of you know, the reason that I got involved in Arrow back in 2018
> > was that I wanted to build a distributed compute platform in Rust, with
> > capabilities similar to Apache Spark. This led to the creation of the
> > DataFusion query engine, which is an in-memory query engine and is now part
> > of the Arrow repo.
> >
> > Over the past co

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andrew Lamb
> think that the problem of "there are too many PRs in the review
> queue that are not relevant to me" has straightforward solutions

For sure -- I welcome any and all technical assistance in improving
efficiency.

> Andrew - do you have more specific concerns that I am missing here?

I think burden on existing maintainers is my primary concern in adding
another major project to the same repo.

I certainly didn't mean to restart a discussion soliciting opinions about
our current tools / process -- it has all been articulated well in previous
threads :)

On Wed, Mar 10, 2021 at 1:13 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> First of all, I want to thank you very much for your work on Ballista and
> for doing it in an open source environment. It is something that should be
> emphasised and celebrated.
>
> Secondly, with regard to considering donating it to the Apache Foundation and
> Apache project in particular, I would say that we should be honored by such
> consideration. In this context, my immediate reaction is: how can we best
> support Ballista's community?
>
> My initial thoughts in this direction are:
>
> * create a new git repo for DataFusion and Ballista to reside on (e.g.
> arrow/ballista)
> * do not require the release cycle and versioning to be aligned with
> arrow's release cycle
> * do not require the usage of JIRA
> * pin the dependency of Datafusion on Arrow and parquet crate (e.g. to a
> specific commit)
>
> I feel that this setup would keep Ballista under the Foundation and Apache
> Arrow's umbrella and aligned with its goals, while at the same time put the
> least amount of burden on its community, both in terms of keeping a strict
> release schedule, tooling and CI.
>
> The rationale for the above is that whenever something is released on
> DataFusion (which hosts most of the physical ops), people will also want it
> quickly available on Ballista. Thus, having the two release cycles more
> closely related and independent of the arrow implementation's cycle is
> good. DataFusion does not have integration tests against other arrow
> implementations, and thus the integration tests are not relevant.
>
> There are 4 main reasons I would not recommend placing it in the mono-repo:
>
> 1. It would not add much
> 2. It would place Ballista on the same release schedule and git system as
> the rest of Arrow's implementation, which may not suit Ballista's own
> development pace (in either direction)
> 3. It further increases the complexity of the current repo
> 4. It would force its community to use JIRA, merge process, components,
> etc, which may not be what its community wishes for
>
> The main risk I see is that because arrow's release cycle is slow and major
> releases only, DataFusion risks missing arrow features from time to time.
> We can mitigate this with cargo and pins to commit hashes. IMO this risk
> exists in any dependency relationship and is usually a sign that there is
> an API contract and thus a trust relationship involved, which is a good
> thing.
>
> Best,
> Jorge
>
> On Tue, Mar 9, 2021 at 6:31 PM Andy Grove wrote:
>
> > As many of you know, the reason that I got involved in Arrow back in 2018
> > was that I wanted to build a distributed compute platform in Rust, with
> > capabilities similar to Apache Spark. This led to the creation of the
> > DataFusion query engine, which is an in-memory query engine and is now
> > part of the Arrow repo.
> >
> > Over the past couple of years, I have been working outside of Arrow on a
> > project named “Ballista” [1] to continue the journey of trying to build a
> > distributed version. Due to the pandemic, I have had time over the winter
> > to put more effort into this project and have managed to build a small
> > community around it over the past few months and the project has now
> > reached a point where the basic architecture has been proven and it is
> > now getting a lot of attention (more than 2k stars on GitHub just
> > recently) and I think that it would now make sense to donate some or all
> > of the project to Apache Arrow and continue its growth here.
> >
> > For an overview of the project, please see the talk I recently gave at
> > the New York Open Statistical Programming Meetup [2].
> >
> > Some of the benefits that I see in donating the project to Arrow are:
> >
> >    - DataFusion also needs a scheduler and it would probably make sense to
> >      push some parts of the Ballista scheduler down a level in the stack so
> >      that the same approach is used to scale across cores in DataFusion and
> >      to scale across nodes in Ballista.
> >    - Ballista provides preliminary support for spill-to-disk functionality
> >      (in Arrow IPC format) which could also benefit DataFusion and provide
> >      better scalability through out-of-core processing.
> >    - Although the Ballista scheduler is implemented in Rust, it is possible
> >      to implement executo

Re: Is Zulip still the preferred chat application for Arrow?

2021-03-10 Thread Wes McKinney
The Ursa Zulip is not an "official" channel, which is to say that
community discussions there about what to build, whether there is
consensus about something, etc., are not valid from a governance /
Openness standpoint. Those need to take place either on the mailing
list or the issue tracker (which at present is Jira).

Arrow had a Slack instance early in its life but it led to behavioral
antipatterns — people were asking questions or having discussions
about the project in a place where only a small fraction of the
community was present. We discussed and deemed that the presence of an
Arrow Slack channel was harmful to the community as it was then
operating, and so we shut it down. I personally will not use Slack if
I have any alternative.

The best way to communicate whether someone is working on an issue is
to assign it to themselves, and if it is not assigned then it can be
assumed to be free to pick up.

If any of you want to use a Slack instance somewhere to chat in an IRC
like fashion that's completely fine, just please move discussions to
ASF channels when something of relevance to the community is being
discussed.

On Wed, Mar 10, 2021 at 12:15 PM Antoine Pitrou wrote:
>
>
>
> On 10/03/2021 at 19:15, Antoine Pitrou wrote:
> >
> > On 10/03/2021 at 19:04, Antoine Pitrou wrote:
> >>
> >> Hi Andy,
> >>
> >> On 10/03/2021 at 19:00, Andy Grove wrote:
> >>> We had a discussion on the Arrow Rust Sync call about the best place to
> >>> co-ordinate on work. For example, quick questions like "is anyone working
> >>> on ARROW-12345? should I pick this up?".
> >>>
> >>> I know that Ursa Lab hosts Zulip and I have used that in the past for
> >>> these types of discussion.
> >>>
> >>> I also found out today that there is an official ASF slack with multiple
> >>> Arrow channels, but this is only open to people who already have an
> >>> apache.org email address (committers / PMC).
> >>
> >> I didn't know that the ASF had an official Slack.  I suppose that's the
> >> Apache way of favoring open source software.
> >>
> >> I find Slack uncomfortable and annoying to deal with, and I wouldn't go
> >> there.
> >
> > That said, and to answer your question a bit more completely, I don't
> > there a requirement that all Arrow implementations use the same chat system.
>
> Wow, sorry: I don't think there's a requirement that ...
>


Re: Is Zulip still the preferred chat application for Arrow?

2021-03-10 Thread Antoine Pitrou




On 10/03/2021 at 19:15, Antoine Pitrou wrote:
>
> On 10/03/2021 at 19:04, Antoine Pitrou wrote:
>>
>> Hi Andy,
>>
>> On 10/03/2021 at 19:00, Andy Grove wrote:
>>> We had a discussion on the Arrow Rust Sync call about the best place to
>>> co-ordinate on work. For example, quick questions like "is anyone working
>>> on ARROW-12345? should I pick this up?".
>>>
>>> I know that Ursa Lab hosts Zulip and I have used that in the past for these
>>> types of discussion.
>>>
>>> I also found out today that there is an official ASF slack with multiple
>>> Arrow channels, but this is only open to people who already have an
>>> apache.org email address (committers / PMC).
>>
>> I didn't know that the ASF had an official Slack.  I suppose that's the
>> Apache way of favoring open source software.
>>
>> I find Slack uncomfortable and annoying to deal with, and I wouldn't go
>> there.
>
> That said, and to answer your question a bit more completely, I don't
> there a requirement that all Arrow implementations use the same chat system.

Wow, sorry: I don't think there's a requirement that ...



Re: Is Zulip still the preferred chat application for Arrow?

2021-03-10 Thread Antoine Pitrou



On 10/03/2021 at 19:04, Antoine Pitrou wrote:
>
> Hi Andy,
>
> On 10/03/2021 at 19:00, Andy Grove wrote:
>> We had a discussion on the Arrow Rust Sync call about the best place to
>> co-ordinate on work. For example, quick questions like "is anyone working
>> on ARROW-12345? should I pick this up?".
>>
>> I know that Ursa Lab hosts Zulip and I have used that in the past for these
>> types of discussion.
>>
>> I also found out today that there is an official ASF slack with multiple
>> Arrow channels, but this is only open to people who already have an
>> apache.org email address (committers / PMC).
>
> I didn't know that the ASF had an official Slack.  I suppose that's the
> Apache way of favoring open source software.
>
> I find Slack uncomfortable and annoying to deal with, and I wouldn't go
> there.

That said, and to answer your question a bit more completely, I don't
there a requirement that all Arrow implementations use the same chat system.


Regards

Antoine.


Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Jorge Cardoso Leitão
Hi,

First of all, I want to thank you very much for your work on Ballista and
for doing it in an open source environment. It is something that should be
emphasised and celebrated.

Secondly, with regard to considering donating it to the Apache Foundation and
Apache project in particular, I would say that we should be honored by such
consideration. In this context, my immediate reaction is: how can we best
support Ballista's community?

My initial thoughts in this direction are:

* create a new git repo for DataFusion and Ballista to reside on (e.g.
arrow/ballista)
* do not require the release cycle and versioning to be aligned with
arrow's release cycle
* do not require the usage of JIRA
* pin the dependency of Datafusion on Arrow and parquet crate (e.g. to a
specific commit)

I feel that this setup would keep Ballista under the Foundation and Apache
Arrow's umbrella and aligned with its goals, while at the same time put the
least amount of burden on its community, both in terms of keeping a strict
release schedule, tooling and CI.

The rationale for the above is that whenever something is released on
DataFusion (which hosts most of the physical ops), people will also want it
quickly available on Ballista. Thus, having the two release cycles more
closely related and independent of the arrow implementation's cycle is
good. DataFusion does not have integration tests against other arrow
implementations, and thus the integration tests are not relevant.

There are 4 main reasons I would not recommend placing it in the mono-repo:

1. It would not add much
2. It would place Ballista on the same release schedule and git system as
the rest of Arrow's implementation, which may not suit Ballista's own
development pace (in either direction)
3. It further increases the complexity of the current repo
4. It would force its community to use JIRA, merge process, components,
etc, which may not be what its community wishes for

The main risk I see is that because arrow's release cycle is slow and major
releases only, DataFusion risks missing arrow features from time to time.
We can mitigate this with cargo and pins to commit hashes. IMO this risk
exists in any dependency relationship and is usually a sign that there is
an API contract and thus a trust relationship involved, which is a good
thing.

Best,
Jorge

On Tue, Mar 9, 2021 at 6:31 PM Andy Grove wrote:

> As many of you know, the reason that I got involved in Arrow back in 2018
> was that I wanted to build a distributed compute platform in Rust, with
> capabilities similar to Apache Spark. This led to the creation of the
> DataFusion query engine, which is an in-memory query engine and is now part
> of the Arrow repo.
>
> Over the past couple of years, I have been working outside of Arrow on a
> project named “Ballista” [1] to continue the journey of trying to build a
> distributed version. Due to the pandemic, I have had time over the winter
> to put more effort into this project and have managed to build a small
> community around it over the past few months and the project has now
> reached a point where the basic architecture has been proven and it is now
> getting a lot of attention (more than 2k stars on GitHub just recently) and
> I think that it would now make sense to donate some or all of the project
> to Apache Arrow and continue its growth here.
>
> For an overview of the project, please see the talk I recently gave at the
> New York Open Statistical Programming Meetup [2].
>
> Some of the benefits that I see in donating the project to Arrow are:
>
>
>    - DataFusion also needs a scheduler and it would probably make sense to
>      push some parts of the Ballista scheduler down a level in the stack so
>      that the same approach is used to scale across cores in DataFusion and
>      to scale across nodes in Ballista.
>    - Ballista provides preliminary support for spill-to-disk functionality
>      (in Arrow IPC format) which could also benefit DataFusion and provide
>      better scalability through out-of-core processing.
>    - Although the Ballista scheduler is implemented in Rust, it is possible
>      to implement executors in other languages due to the use of Flight,
>      gRPC, and protobuf, so this may be of interest to other language
>      implementations of Arrow as well.
>    - There is already some overlap between Arrow and Ballista contributors.
>    - Ballista unit tests will be part of Arrow CI which means that any
>      changes to Arrow or DataFusion APIs that Ballista depends on will also
>      require that the corresponding Ballista code is updated as part of the
>      same PR.
>
>
> My main goal with this email thread is to gauge interest in donating this
> code. If there is interest in doing so then we can have a more detailed
> follow-up conversation on exactly what would be donated and where it would
> go.
>
>
> I have also filed a GitHub issue in Ballista to get feedback from current
> contributors [3].
>
>

Re: Is Zulip still the preferred chat application for Arrow?

2021-03-10 Thread Antoine Pitrou



Hi Andy,

On 10/03/2021 at 19:00, Andy Grove wrote:
> We had a discussion on the Arrow Rust Sync call about the best place to
> co-ordinate on work. For example, quick questions like "is anyone working
> on ARROW-12345? should I pick this up?".
>
> I know that Ursa Labs hosts Zulip and I have used that in the past for
> these types of discussion.
>
> I also found out today that there is an official ASF slack with multiple
> Arrow channels, but this is only open to people who already have an
> apache.org email address (committers / PMC).

I didn't know that the ASF had an official Slack.  I suppose that's the
Apache way of favoring open source software.

I find Slack uncomfortable and annoying to deal with, and I wouldn't go
there.


Regards

Antoine.


Is Zulip still the preferred chat application for Arrow?

2021-03-10 Thread Andy Grove
We had a discussion on the Arrow Rust Sync call about the best place to
co-ordinate on work. For example, quick questions like "is anyone working
on ARROW-12345? should I pick this up?".

I know that Ursa Labs hosts Zulip and I have used that in the past for these
types of discussion.

I also found out today that there is an official ASF slack with multiple
Arrow channels, but this is only open to people who already have an
apache.org email address (committers / PMC).

Looking for advice on what we should be using.

Thanks,

Andy.


Apache Arrow Rust Sync Call 3/10/2021

2021-03-10 Thread Andy Grove
Attendees


   - Andy Grove
   - Mike Seddon
   - Andrew Lamb
   - Fernando Herrera
   - Neville Dipale
   - Colin Alworth
   - Dominik Moritz
   - Jorge Leitao
   - Nate Bauernfiend
   - Patrick Horan
   - Ruan Pearce-Authers
   - Wayne Xia


Topics Discussed


   - Dominik gave a demo of his Arrow WebAssembly project
     (https://github.com/domoritz/arrow-wasm)
   - Mike discussed CAST and the fact that we currently silently reject bad
     data and return null and do not offer the choice of returning errors
     instead. Potential solutions discussed, starting with adding this option
     at the compute kernel level
   - We talked briefly about the potential donation of Ballista. Conversation
     to continue on the mailing list for now
   - We talked about Jorge’s work on “Arrow2”. There is a desire to implement
     this new design in Arrow and we talked about creating a branch in the
     Arrow repo to continue this work.
   - Fernando mentioned that he was looking for new items to work on but it
     wasn’t clear from looking at JIRA whether other people are already
     working on items or not because JIRA isn’t very actively maintained. This
     led to a conversation about other ways for us to communicate more
     informally. There is now an official ASF slack but this seems to only be
     available for people with an apache.org email address (committers / PMC).
     There is a Zulip chat hosted by Ursa Labs as another option
     (https://ursalabs.zulipchat.com). I will start a discussion on the
     mailing list about this.
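
As a rough illustration of the CAST item above (this is not code from the
Arrow crate), the sketch below shows what an option at the kernel level could
look like: a flag that chooses between the current behaviour of turning
unparseable values into nulls and returning an error instead. The names
CastOptions, CastError and cast_utf8_to_i32 are hypothetical, and plain
slices stand in for Arrow arrays.

```rust
/// Hypothetical options struct; `safe = true` mirrors the behaviour
/// described in the notes (bad values silently become null).
#[derive(Debug, Clone, Copy)]
pub struct CastOptions {
    pub safe: bool,
}

/// Hypothetical error carrying the position and text of the bad value.
#[derive(Debug, PartialEq)]
pub struct CastError {
    pub index: usize,
    pub value: String,
}

/// Cast optional strings to optional i32 values.
/// `None` inputs (nulls) always stay `None`.
pub fn cast_utf8_to_i32(
    input: &[Option<&str>],
    options: CastOptions,
) -> Result<Vec<Option<i32>>, CastError> {
    let mut out = Vec::with_capacity(input.len());
    for (index, item) in input.iter().enumerate() {
        match item {
            None => out.push(None),
            Some(text) => match text.trim().parse::<i32>() {
                Ok(v) => out.push(Some(v)),
                // Safe mode: swallow the parse failure and emit a null.
                Err(_) if options.safe => out.push(None),
                // Strict mode: stop at the first bad value and report it.
                Err(_) => {
                    return Err(CastError {
                        index,
                        value: text.to_string(),
                    })
                }
            },
        }
    }
    Ok(out)
}
```

With safe set to true, the input ["1", "x", null] casts to [1, null, null];
with safe set to false, the same input yields an error that names index 1.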


Re: Fwd: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-03-10 Thread Antoine Pitrou



Hi Gidon,

I'm currently looking at this.

Regards

Antoine.


On 10/03/2021 at 08:58, Gidon Gershinsky wrote:

Hi Antoine,

All comments have been handled. Can we ask you to shepherd this PR for the
remainder of its lifecycle? (hopefully, most of this is already behind us).
https://github.com/apache/arrow/pull/8023


Cheers, Gidon


---------- Forwarded message ---------
From: Gidon Gershinsky 
Date: Thu, Feb 18, 2021 at 6:25 PM
Subject: Re: Exposing low-level Parquet encryption to Python user (or,
maybe not)
To: dev 


Thanks, then we'll just go ahead and address the remaining comments.

Cheers, Gidon


On Thu, Feb 18, 2021 at 5:45 PM Antoine Pitrou  wrote:



I don't think there's any concern around having a process-global shared
key cache.  The discussion was just around the implementation.

Also, FTR, a standalone LRU cache class is proposed here, which may
reduce the amount of original code in the Parquet encryption PR:
https://github.com/apache/arrow/pull/8716

Best regards

Antoine.
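
For readers following the shared key cache point, here is a minimal sketch of
"locks around shared structures", written in Rust rather than C++ to match
the other examples in this archive. It is not the code from the PR: KeyCache
and get_or_fetch are hypothetical names, and the closure stands in for a
round trip to the KMS server.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical cache of decryption keys shared by all reader threads.
pub struct KeyCache {
    keys: Mutex<HashMap<String, Vec<u8>>>,
}

impl KeyCache {
    pub fn new() -> Self {
        KeyCache {
            keys: Mutex::new(HashMap::new()),
        }
    }

    /// Return the cached key for `key_id`, calling `fetch` (the stand-in
    /// for a KMS request) only if the key is not cached yet.
    pub fn get_or_fetch<F>(&self, key_id: &str, fetch: F) -> Vec<u8>
    where
        F: FnOnce() -> Vec<u8>,
    {
        // The lock is held across the fetch, so concurrent callers asking
        // for the same key wait here instead of issuing duplicate requests.
        let mut keys = self.keys.lock().unwrap();
        keys.entry(key_id.to_string()).or_insert_with(fetch).clone()
    }
}
```

Holding the lock during the fetch keeps the sketch short, at the cost of
serialising fetches for different keys; no thread is special, which is the
point made above about not needing a "main thread".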


On 18/02/2021 at 16:40, Gidon Gershinsky wrote:

I believe the shared structures that were debated are the key caches.

Cheers, Gidon


On Thu, Feb 18, 2021 at 6:37 AM Micah Kornfield wrote:

> I don't think any notion of threading should be present in the
> implementation, except for the required locks around shared structures.

I seem to recall the debate was how to model some class interactions to
determine what should be considered shared structures and what should not.


On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky wrote:



This certainly sounds good to me.

Cheers, Gidon


On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou wrote:

I don't think any notion of threading should be present in the
implementation, except for the required locks around shared structures.
I don't know where the idea of a "main thread" comes from, but it
probably shouldn't exist in a C++ library.

Regards

Antoine.



On 17/02/2021 at 18:34, Gidon Gershinsky wrote:

Just to clarify. There are two options, which one do you refer to? A design
with a main thread that handles projections and the keys (relevant for the
projected columns); or the current code with any thread allowed to handle
full file reading, including the footer, column projections and their keys?
Can you finalize this with Micah?
The good news is, Tham is still interested in resuming this work, and is ok
with either option. Please let her know whether the current threading model
stays, or should be modified with the changes proposed in the doc (for the
latter, some guidance with the details would be needed).

Cheers, Gidon


On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou wrote:





On 17/02/2021 at 12:47, Gidon Gershinsky wrote:

> From the doc,
> "To maintain consistency with the style of parquet-cpp, the above
> structures should not be explicitly synchronized with individual mutexes.
> In the case of a parquet::arrow::FileReader, the request to read a given
> selection of row groups and columns is issued from a single main thread.
> Note that this does require that all keys required for a read are
> assembled on the main thread so that DecryptionKeyRetriever objects are
> not directly accessing any caches"
>
> The current PR code doesn't require a single main thread. Any thread can
> read any file, both footer and pages. So the key cache is shared, to save
> N-1 interactions with the KMS server.

I don't think there's any contention on this.  IMHO the only concerns
are about the implementation, not the semantics.

Best regards

Antoine.

> Cheers, Gidon


On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou <anto...@python.org> wrote:

I'm not sure a threading model is expected for an encryption layer. Am
I missing something?

Regards

Antoine.


On 17/02/2021 at 06:59, Gidon Gershinsky wrote:

Precisely, the main change is in the threading model. Afaik, the document
proposes a model that fits pandas, but might be problematic for other users
of this library.
Technically, this is not a showstopper though; if the community decides on
this model, it will be compatible with the high-level encryption design;
but the change implementation would need to be done by pandas experts (not
us; but we'll help where we can).
Micah, you know this subject (and the community) better than we do - we'd
much appreciate it if you'd take a lead on removing this roadblock.


Cheers, Gidon


On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

I think some of the comments might be conflicting.  One of the concerns
(that I would need to refresh myself on to offer an opinion which was
covered in Ben's doc) was the threading model we expect in the library.


On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> wrote:

Hi Gidon,

On 16/02/2021 at 16:42, Gidon Gershinsky wrote:

Regarding the high-level layer, I think it is waiting on progress at
https://docs.google.co

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

2021-03-10 Thread Wes McKinney
Regarding https://github.com/jorgecarleitao/arrow2 — please (looking
at the PMC members who work on Rust) be careful about IP provenance
considerations with code developed outside the Foundation.

On Wed, Mar 10, 2021 at 11:15 AM Dominik Moritz  wrote:
>
>  I have a talk prepared to talk about my Arrow implementation in
> WebAssembly.
>
> On Mar 10, 2021 at 04:38:21, Andrew Lamb  wrote:
>
> > Reminder that today is the next Rust sync call
> >
> > Potential topics for discussion:
> > * Ballista / DataFusion / etc
> > * I remember that someone else was going to demo the use of Arrow but I
> > can't remember exactly what that was now
> >
> > On Tue, Feb 16, 2021 at 10:59 AM Dominik Moritz  wrote:
> >
> > Somewhat related, I tried to compile DataFusion to WASM and it didn’t work
> > because of some dependencies:
> > https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11615. I wonder
> > whether DataFusion could have a feature flag for only shipping what is WASM
> > compatible?
> >
> > On Feb 15, 2021 at 12:13:04, Andrew Lamb wrote:
> >
> > > Also, unrelated, is there a schedule for the sync calls? Will try and
> > > carve out some free time for the next one :)
> > >
> > > It is every other Wednesday at noon EST. Here is the original
> > > announcement with more details:
> > >
> > > https://lists.apache.org/thread.html/raa72e1a8a3ad5dbb8366e9609a041eccca87f85545c3bc3d85170cfc%40%3Cdev.arrow.apache.org%3E
> > >
> > > On Sun, Feb 14, 2021 at 8:29 AM Ruan Pearce-Authers
> > > <r...@reservoirdb.com> wrote:
> > >
> > > I'd be interested in helping spec this out, it's especially tricky atm to
> > > track down issues when integrating DataFusion into the same binary as
> > > other medium/large dependencies.
> > >
> > > Recently hit a really specific issue where DataFusion depends on Parquet,
> > > which supports various compression algs, including Brotli, and actix-web
> > > also depends on a slightly different Rust implementation of Brotli. Both
> > > of these Brotli libs package the same underlying C lib separately,
> > > resulting in multiply-defined symbols compiling using msvc (and maybe on
> > > other platforms? didn't test in CI in the end).
> > >
> > > Got a quick interim hack [1] in place for my use case which doesn't
> > > really use Parquet, so it's not pressing, but would be awesome to sort
> > > this properly upstream.
> > >
> > > I guess the only major tradeoff of having a comprehensive feature setup
> > > is that it could make testing slightly harder, in terms of making sure
> > > no-one breaks the build for specific feature combinations; this can
> > > always be mitigated with more CI though (yay, unlimited Actions minutes
> > > for public repos).
> > >
> > > Also, unrelated, is there a schedule for the sync calls? Will try and
> > > carve out some free time for the next one :)
> > >
> > > [1]
> > > https://github.com/reservoirdb/arrow/commit/e63e157927a552ecf1a6f63ec401f0b6157b5468
> > >
> > > -----Original Message-----
> > > From: Andrew Lamb
> > > Sent: 14 February 2021 11:14
> > > To: dev
> > > Subject: [Rust] [DataFusion] Topic for next Rust Sync Call
> > >
> > > I would like to add the following item to the agenda call for the next
> > > Rust sync call:
> > >
> > > Dependencies
> > >
> > > Background: As the dependency stack gets larger, it will be harder to use
> > > DataFusion as an embedded query engine and the compile / dev times will
> > > get higher.
> > >
> > > As we expand the supported functions of DataFusion this problem is likely
> > > to get worse. For example
> > > https://github.com/apache/arrow/pull/9243#discussion_r575716759 and
> > > https://github.com/apache/arrow/pull/9139
> > >
> > > Proposal: Add Rust "features" to the datafusion crate and make many of
> > > the new dependencies optional (so that we had features like regex and
> > > unicode and hash which would only pull in the dependencies / have those
> > > functions if the features were enabled.) This approach has worked well
> > > for Arrow (which has only chrono and num as required dependencies)
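
To make the "features" proposal quoted above more concrete, here is a minimal
sketch of how a function can be gated behind a Cargo feature. The feature
name unicode_expressions and both functions are hypothetical (the thread only
mentions "features like regex and unicode and hash"), and counting chars with
the standard library stands in for whatever optional dependency a real build
would pull in.

```rust
/// Compiled only when the (hypothetical) `unicode_expressions` feature is
/// enabled; an optional dependency would be used here in a real crate.
#[cfg(feature = "unicode_expressions")]
pub fn character_length(s: &str) -> Result<usize, String> {
    Ok(s.chars().count())
}

/// Fallback compiled when the feature is off: the function still exists,
/// but reports that support was not compiled in.
#[cfg(not(feature = "unicode_expressions"))]
pub fn character_length(_s: &str) -> Result<usize, String> {
    Err("character_length requires the `unicode_expressions` feature".to_string())
}

/// List the function groups available in this build.
pub fn enabled_function_groups() -> Vec<&'static str> {
    let mut groups = vec!["core"];
    if cfg!(feature = "unicode_expressions") {
        groups.push("unicode");
    }
    groups
}
```

In a crate laid out this way, the feature would be declared in the [features]
section of Cargo.toml and switched on with
"cargo build --features unicode_expressions"; a default build would compile
only the fallback and skip the optional dependency entirely.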


Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
I think that the problem of "there are too many PRs in the review
queue that are not relevant to me" has straightforward solutions (like
what Spark did https://spark-prs.appspot.com — if someone wants to
fork this and make it work for Arrow that would be awesome, I would be
willing to help if no one else volunteers). So let's try to consider
all the possible solutions before letting the "GitHub is bad for large
projects" observation harm us in unintended and potentially worse
ways.

On Wed, Mar 10, 2021 at 10:12 AM Andy Grove  wrote:
>
> Wes - thanks for the clarification around possibilities for having multiple
> repositories within Arrow governance. I agree that having separate repos
> increases burdens around integration testing and dependency / release
> management and having a monorepo makes those things much simpler.
>
> I think it is worth digging more into this point from Andrew.
>
> > 2. I think the arrow github project and the unified workflow process in
> > particular is reaching its limits. Adding another cool, but non trivial
> > project like Ballista will likely exacerbate the challenges even more.
>
> I do see that adding Ballista will increase the burden on current
> DataFusion maintainers, and they may not be all that interested in Ballista
> itself. Ballista would potentially bring along additional contributors as
> well, increasing the burden of reviewing PRs in the short term (hopefully
> at some point we would have additional committers that are motivated to
> work on Ballista).
>
> I would certainly try and handle most of the Ballista PR reviews to start
> with until we reach a point where more people can do that, and this would
> lead to me being more closely involved in DataFusion reviews as well.
>
> Andrew - do you have more specific concerns that I am missing here?
>
> Thanks,
>
> Andy.
>
>
>
>
>
>
> On Wed, Mar 10, 2021 at 9:01 AM Andrew Lamb  wrote:
>
> > Thanks Wes -- I agree. I think moving datafusion out of the main arrow repo
> > only makes sense when the interfaces it depends on (in arrow and parquet)
> > have stabilized as that will minimize the mess / pain you describe.
> >
> > Andrew
> >
> >
> >
> > On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney  wrote:
> >
> > > To give you an example of what I’m talking about. Jorge has been building
> > > this project
> > >
> > > https://github.com/jorgecarleitao/datafusion-python
> > >
> > > I think it would actually be preferable to build projects like this in
> > the
> > > monorepo because of the challenges and opportunities that arise in long
> > > term project interdependence (API changes, integration testing, etc). The
> > > more you split up interdependent projects into different GitHub
> > > repositories, the more difficult it becomes to develop and test them — we
> > > had this exact problem (it was awful) with Parquet in C++ which is why
> > the
> > > code lives in this repository now.
> > >
> > > On Wed, Mar 10, 2021 at 9:02 AM Wes McKinney 
> > wrote:
> > >
> > > > There is no problem with having multiple code-containing repositories
> > in
> > > > Apache Arrow, and the project can produce different release artifacts
> > > (for
> > > > example, Parquet has Parquet-format and Parquet-mr and these release
> > > > separately). I don’t think it’s a good idea to fragment the project
> > > > governance / set up a new PMC unless you have two distinct groups of
> > > people
> > > > who are moving in different directions.
> > > >
> > > > As an example, Arrow was initially split off from Apache Drill. Arrow
> > now
> > > > has little relationship with Drill. DataFusion and Ballista are not
> > > > analogous to that.
> > > >
> > > > Different releases can come from the same git repository also. I would
> > > > just want to make sure you have a proper debate about the long term
> > > > pros/cons of developing within a monorepo (which again are independent
> > > from
> > > > release logistics, so if these concepts are coupled in any person’s
> > mind
> > > > please decouple them).
> > > >
> > > > On Wed, Mar 10, 2021 at 8:42 AM Andy Grove 
> > > wrote:
> > > >
> > > >> Thanks, Andrew.
> > > >>
> > > >> I agree with your points and I do see the argument for
> > > DataFusion/Ballista
> > > >> being in their own repo. When I first donated DataFusion there was a
> > > >> discussion about the fact that it could be moved back out later on
> > once
> > > it
> > > >> was more mature. I will go see if I can find that conversation.
> > > >>
> > > >> Another option here would be to propose creating a new top-level
> > Apache
> > > >> project but I don't know if these components would qualify or what the
> > > >> process would be. I imagine they would need to be much more mature
> > > before
> > > >> this would be an option.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Andy.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Mar 10, 2021 at 4:13 AM Andrew Lamb 
> > > wrote:
> > > >>
> > > >> > My thoughts are:
> > > >> >
> > > >> > 1. T

Re: [Rust] [DataFusion] Topic for next Rust Sync Call

2021-03-10 Thread Dominik Moritz
 I have a talk prepared to talk about my Arrow implementation in
WebAssembly.

On Mar 10, 2021 at 04:38:21, Andrew Lamb  wrote:

> Reminder that today is the next Rust sync call
>
> Potential topics for discussion:
> * Ballista / DataFusion / etc
> * I remember that someone else was going to demo the use of Arrow but I
> can't remember exactly what that was now
>
> On Tue, Feb 16, 2021 at 10:59 AM Dominik Moritz  wrote:
>
> Somewhat related, I tried to compile DataFusion to WASM and it didn’t work
> because of some dependencies:
> https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11615. I wonder
> whether DataFusion could have a feature flag for only shipping what is WASM
> compatible?
>
> On Feb 15, 2021 at 12:13:04, Andrew Lamb wrote:
>
> > Also, unrelated, is there a schedule for the sync calls? Will try and
> > carve out some free time for the next one :)
> >
> > It is every other Wednesday at noon EST. Here is the original
> > announcement with more details:
> >
> > https://lists.apache.org/thread.html/raa72e1a8a3ad5dbb8366e9609a041eccca87f85545c3bc3d85170cfc%40%3Cdev.arrow.apache.org%3E
> >
> > On Sun, Feb 14, 2021 at 8:29 AM Ruan Pearce-Authers
> > <r...@reservoirdb.com> wrote:
> >
> > I'd be interested in helping spec this out, it's especially tricky atm to
> > track down issues when integrating DataFusion into the same binary as
> > other medium/large dependencies.
> >
> > Recently hit a really specific issue where DataFusion depends on Parquet,
> > which supports various compression algs, including Brotli, and actix-web
> > also depends on a slightly different Rust implementation of Brotli. Both
> > of these Brotli libs package the same underlying C lib separately,
> > resulting in multiply-defined symbols compiling using msvc (and maybe on
> > other platforms? didn't test in CI in the end).
> >
> > Got a quick interim hack [1] in place for my use case which doesn't
> > really use Parquet, so it's not pressing, but would be awesome to sort
> > this properly upstream.
> >
> > I guess the only major tradeoff of having a comprehensive feature setup
> > is that it could make testing slightly harder, in terms of making sure
> > no-one breaks the build for specific feature combinations; this can
> > always be mitigated with more CI though (yay, unlimited Actions minutes
> > for public repos).
> >
> > Also, unrelated, is there a schedule for the sync calls? Will try and
> > carve out some free time for the next one :)
> >
> > [1]
> > https://github.com/reservoirdb/arrow/commit/e63e157927a552ecf1a6f63ec401f0b6157b5468
> >
> > -----Original Message-----
> > From: Andrew Lamb
> > Sent: 14 February 2021 11:14
> > To: dev
> > Subject: [Rust] [DataFusion] Topic for next Rust Sync Call
> >
> > I would like to add the following item to the agenda call for the next
> > Rust sync call:
> >
> > Dependencies
> >
> > Background: As the dependency stack gets larger, it will be harder to use
> > DataFusion as an embedded query engine and the compile / dev times will
> > get higher.
> >
> > As we expand the supported functions of DataFusion this problem is likely
> > to get worse. For example
> > https://github.com/apache/arrow/pull/9243#discussion_r575716759 and
> > https://github.com/apache/arrow/pull/9139
> >
> > Proposal: Add Rust "features" to the datafusion crate and make many of
> > the new dependencies optional (so that we had features like regex and
> > unicode and hash which would only pull in the dependencies / have those
> > functions if the features were enabled.) This approach has worked well
> > for Arrow (which has only chrono and num as required dependencies)


Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andy Grove
Wes - thanks for the clarification around possibilities for having multiple
repositories within Arrow governance. I agree that having separate repos
increases burdens around integration testing and dependency / release
management and having a monorepo makes those things much simpler.

I think it is worth digging more into this point from Andrew.

> 2. I think the arrow github project and the unified workflow process in
> particular is reaching its limits. Adding another cool, but non trivial
> project like Ballista will likely exacerbate the challenges even more.

I do see that adding Ballista will increase the burden on current
DataFusion maintainers, and they may not be all that interested in Ballista
itself. Ballista would potentially bring along additional contributors as
well, increasing the burden of reviewing PRs in the short term (hopefully
at some point we would have additional committers that are motivated to
work on Ballista).

I would certainly try and handle most of the Ballista PR reviews to start
with until we reach a point where more people can do that, and this would
lead to me being more closely involved in DataFusion reviews as well.

Andrew - do you have more specific concerns that I am missing here?

Thanks,

Andy.






On Wed, Mar 10, 2021 at 9:01 AM Andrew Lamb  wrote:

> Thanks Wes -- I agree. I think moving datafusion out of the main arrow repo
> only makes sense when the interfaces it depends on (in arrow and parquet)
> have stabilized as that will minimize the mess / pain you describe.
>
> Andrew
>
>
>
> On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney  wrote:
>
> > To give you an example of what I’m talking about. Jorge has been building
> > this project
> >
> > https://github.com/jorgecarleitao/datafusion-python
> >
> > I think it would actually be preferable to build projects like this in
> the
> > monorepo because of the challenges and opportunities that arise in long
> > term project interdependence (API changes, integration testing, etc). The
> > more you split up interdependent projects into different GitHub
> > repositories, the more difficult it becomes to develop and test them — we
> > had this exact problem (it was awful) with Parquet in C++ which is why
> the
> > code lives in this repository now.
> >
> > On Wed, Mar 10, 2021 at 9:02 AM Wes McKinney 
> wrote:
> >
> > > There is no problem with having multiple code-containing repositories
> in
> > > Apache Arrow, and the project can produce different release artifacts
> > (for
> > > example, Parquet has Parquet-format and Parquet-mr and these release
> > > separately). I don’t think it’s a good idea to fragment the project
> > > governance / set up a new PMC unless you have two distinct groups of
> > people
> > > who are moving in different directions.
> > >
> > > As an example, Arrow was initially split off from Apache Drill. Arrow
> now
> > > has little relationship with Drill. DataFusion and Ballista are not
> > > analogous to that.
> > >
> > > Different releases can come from the same git repository also. I would
> > > just want to make sure you have a proper debate about the long term
> > > pros/cons of developing within a monorepo (which again are independent
> > from
> > > release logistics, so if these concepts are coupled in any person’s
> mind
> > > please decouple them).
> > >
> > > On Wed, Mar 10, 2021 at 8:42 AM Andy Grove 
> > wrote:
> > >
> > >> Thanks, Andrew.
> > >>
> > >> I agree with your points and I do see the argument for
> > DataFusion/Ballista
> > >> being in their own repo. When I first donated DataFusion there was a
> > >> discussion about the fact that it could be moved back out later on
> once
> > it
> > >> was more mature. I will go see if I can find that conversation.
> > >>
> > >> Another option here would be to propose creating a new top-level
> Apache
> > >> project but I don't know if these components would qualify or what the
> > >> process would be. I imagine they would need to be much more mature
> > before
> > >> this would be an option.
> > >>
> > >> Thanks,
> > >>
> > >> Andy.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Mar 10, 2021 at 4:13 AM Andrew Lamb 
> > wrote:
> > >>
> > >> > My thoughts are:
> > >> >
> > >> > 1. The scheduler and spill-to-disk/out of core operations sound very
> > >> good
> > >> > to bring into DataFusion and many people would benefit
> > >> >
> > >> > 2. I think the arrow github project and the unified workflow process
> > in
> > >> > particular is reaching its limits. Adding another cool, but non
> > trivial
> > >> > project like Ballista will likely exacerbate the challenges even
> more.
> > >> >
> > >> > 3. My sense is that the Rust arrow implementation is nearing feature
> > >> > completion (though we may still have one last big revamp, depending
> on
> > >> > Jorge's plans) and so I expect breaking API changes there to slow
> > down,
> > > >> > lessening the value of keeping everything in the same repo.
> > >> >
> > >> > 4.  What would you think ab

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andrew Lamb
Thanks Wes -- I agree. I think moving datafusion out of the main arrow repo
only makes sense when the interfaces it depends on (in arrow and parquet)
have stabilized as that will minimize the mess / pain you describe.

Andrew



On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney  wrote:


Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
To give you an example of what I’m talking about, Jorge has been building
this project:

https://github.com/jorgecarleitao/datafusion-python

I think it would actually be preferable to build projects like this in the
monorepo because of the challenges and opportunities that arise in
long-term project interdependence (API changes, integration testing, etc).
The more you split up interdependent projects into different GitHub
repositories, the more difficult it becomes to develop and test them. We
had this exact problem (it was awful) with Parquet in C++, which is why the
code lives in this repository now.


Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
There is no problem with having multiple code-containing repositories in
Apache Arrow, and the project can produce different release artifacts (for
example, Parquet has Parquet-format and Parquet-mr and these release
separately). I don’t think it’s a good idea to fragment the project
governance / set up a new PMC unless you have two distinct groups of people
who are moving in different directions.

As an example, Arrow was initially split off from Apache Drill. Arrow now
has little relationship with Drill. DataFusion and Ballista are not
analogous to that.

Different releases can also come from the same git repository. I would just
want to make sure you have a proper debate about the long-term pros/cons of
developing within a monorepo (which, again, are independent of release
logistics, so if these concepts are coupled in anyone’s mind, please
decouple them).


Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andy Grove
Thanks, Andrew.

I agree with your points and I do see the argument for DataFusion/Ballista
being in their own repo. When I first donated DataFusion there was a
discussion about the fact that it could be moved back out later on once it
was more mature. I will go see if I can find that conversation.

Another option here would be to propose creating a new top-level Apache
project but I don't know if these components would qualify or what the
process would be. I imagine they would need to be much more mature before
this would be an option.

Thanks,

Andy.







Re: [Rust] [DataFusion] Topic for next Rust Sync Call

2021-03-10 Thread Jorge Cardoso Leitão
Hi,

If there is time available, I would like to present the status of the
experimental arrow2 repo and gather feedback on the best way to proceed.
Would 10-15 minutes work?

Best,
Jorge




Re: [Rust] [DataFusion] Topic for next Rust Sync Call

2021-03-10 Thread Andrew Lamb
Also:
* Semantics for CAST and what to do on failure (return NULL or error)
  [Mike S]
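
To make the two options concrete, here is a minimal sketch in plain Rust. It does not use any Arrow or DataFusion API; `CastFailure` and `cast_utf8_to_i64` are hypothetical names invented for this illustration, and a nullable column is modelled as a slice of `Option` values.

```rust
/// Two possible behaviours when a CAST cannot convert a value.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum CastFailure {
    /// Replace each value that cannot be converted with NULL.
    ReturnNull,
    /// Abort the whole cast with an error on the first bad value.
    ReturnError,
}

/// Cast a nullable string column to a nullable i64 column.
/// Existing NULLs always stay NULL; only unparseable values
/// are affected by `on_failure`.
fn cast_utf8_to_i64(
    column: &[Option<&str>],
    on_failure: CastFailure,
) -> Result<Vec<Option<i64>>, String> {
    column
        .iter()
        .map(|value| match value {
            None => Ok(None),
            Some(text) => match text.parse::<i64>() {
                Ok(number) => Ok(Some(number)),
                Err(_) if on_failure == CastFailure::ReturnNull => Ok(None),
                Err(e) => Err(format!("cannot cast '{}' to Int64: {}", text, e)),
            },
        })
        // Collecting into a Result stops at the first Err.
        .collect()
}

fn main() {
    let column = [Some("1"), Some("abc"), None, Some("42")];
    println!("{:?}", cast_utf8_to_i64(&column, CastFailure::ReturnNull));
    println!("{:?}", cast_utf8_to_i64(&column, CastFailure::ReturnError));
}
```

With `ReturnNull` the caller cannot tell a failed conversion from an original NULL; with `ReturnError` one bad value fails the whole query. That trade-off is the thing to settle on the call.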



Re: [Rust] [DataFusion] Topic for next Rust Sync Call

2021-03-10 Thread Andrew Lamb
Reminder that today is the next Rust sync call.

Potential topics for discussion:
* Ballista / DataFusion / etc
* I remember that someone else was going to demo the use of Arrow but I
can't remember exactly what that was now

On Tue, Feb 16, 2021 at 10:59 AM Dominik Moritz  wrote:

>  Somewhat related, I tried to compile DataFusion to WASM and it didn’t work
> because of some dependencies:
> https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11615. I wonder
> whether DataFusion could have a feature flag for only shipping what is WASM
> compatible?
>
> On Feb 15, 2021 at 12:13:04, Andrew Lamb  wrote:
>
> > > Also, unrelated, is there a schedule for the sync calls? Will try and
> > > carve out some free time for the next one :)
> >
> > It is every other Wednesday at noon EST. Here is the original announcement
> > with more details:
> >
> > https://lists.apache.org/thread.html/raa72e1a8a3ad5dbb8366e9609a041eccca87f85545c3bc3d85170cfc%40%3Cdev.arrow.apache.org%3E
> >
> > On Sun, Feb 14, 2021 at 8:29 AM Ruan Pearce-Authers <r...@reservoirdb.com>
> > wrote:
> >
> > I'd be interested in helping spec this out, it's especially tricky atm to
> > track down issues when integrating DataFusion into the same binary as
> > other medium/large dependencies.
> >
> > Recently hit a really specific issue where DataFusion depends on Parquet,
> > which supports various compression algs, including Brotli, and actix-web
> > also depends on a slightly different Rust implementation of Brotli. Both
> > of these Brotli libs package the same underlying C lib separately,
> > resulting in multiply-defined symbols compiling using msvc (and maybe on
> > other platforms? didn't test in CI in the end).
> >
> > Got a quick interim hack [1] in place for my use case which doesn't really
> > use Parquet, so it's not pressing, but would be awesome to sort this
> > properly upstream.
> >
> > I guess the only major tradeoff of having a comprehensive feature setup is
> > that it could make testing slightly harder, in terms of making sure no-one
> > breaks the build for specific feature combinations; this can always be
> > mitigated with more CI though (yay, unlimited Actions minutes for public
> > repos).
> >
> > Also, unrelated, is there a schedule for the sync calls? Will try and
> > carve out some free time for the next one :)
> >
> > [1] https://github.com/reservoirdb/arrow/commit/e63e157927a552ecf1a6f63ec401f0b6157b5468
> >
> > -----Original Message-----
> > From: Andrew Lamb
> > Sent: 14 February 2021 11:14
> > To: dev
> > Subject: [Rust] [DataFusion] Topic for next Rust Sync Call
> >
> > I would like to add the following item to the agenda call for the next
> > Rust sync call:
> >
> > Dependencies
> >
> > Background: As the dependency stack gets larger, it will be harder to use
> > DataFusion as an embedded query engine and the compile / dev times will
> > get higher.
> >
> > As we expand the supported functions of DataFusion this problem is likely
> > to get worse. For example
> > https://github.com/apache/arrow/pull/9243#discussion_r575716759 and
> > https://github.com/apache/arrow/pull/9139
> >
> > Proposal: Add Rust "features" to the datafusion crate and make many of the
> > new dependencies optional (so that we had features like regex and unicode
> > and hash which would only pull in the dependencies / have those functions
> > if the features were enabled.) This approach has worked well for Arrow
> > (which has only chrono and num as required dependencies)
> >
> >
> >
>
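
For anyone who has not used Cargo features, here is a rough sketch of the proposal quoted above. It is an illustration, not the actual DataFusion manifest or code: the feature name `regex_expressions`, the function names, and the Cargo.toml fragment in the comments are all assumptions made up for this example.

```rust
// Hypothetical Cargo.toml fragment (names are illustrative only):
//
//   [features]
//   default = []
//   regex_expressions = ["regex"]
//
//   [dependencies]
//   regex = { version = "1", optional = true }

/// Functions that need no optional dependency are always compiled.
fn builtin_functions() -> Vec<&'static str> {
    vec!["abs", "sqrt"]
}

/// Compiled only when the (hypothetical) `regex_expressions` feature is
/// enabled. Real code would call into the optional `regex` crate here.
#[cfg(feature = "regex_expressions")]
fn regex_functions() -> Vec<&'static str> {
    vec!["regexp_match"]
}

/// Fallback used when the feature is disabled, so that callers do not
/// need their own `cfg` checks.
#[cfg(not(feature = "regex_expressions"))]
fn regex_functions() -> Vec<&'static str> {
    Vec::new()
}

/// The set of functions this build of the crate exposes.
fn available_functions() -> Vec<&'static str> {
    let mut names = builtin_functions();
    names.extend(regex_functions());
    names
}

fn main() {
    println!("{:?}", available_functions());
}
```

A downstream crate that never uses regular expressions would then leave the feature off and not compile the extra dependency at all. As noted in the quoted thread, the cost is that CI has to build and test more than one feature combination.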


Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andrew Lamb
My thoughts are:

1. The scheduler and spill-to-disk/out-of-core operations sound very good
to bring into DataFusion, and many people would benefit.

2. I think the arrow GitHub project, and the unified workflow process in
particular, is reaching its limits. Adding another cool but non-trivial
project like Ballista will likely exacerbate the challenges even more.

3. My sense is that the Rust arrow implementation is nearing feature
completion (though we may still have one last big revamp, depending on
Jorge's plans) and so I expect breaking API changes there to slow down,
lessening the value of keeping everything in the same repo.

4. What would you think about pulling DataFusion out of the arrow crate in
the medium term (2-3 releases from now) and putting it into a new place
(alongside Ballista)?

Andrew

On Tue, Mar 9, 2021 at 12:30 PM Andy Grove  wrote:

> As many of you know, the reason that I got involved in Arrow back in 2018
> was that I wanted to build a distributed compute platform in Rust, with
> capabilities similar to Apache Spark. This led to the creation of the
> DataFusion query engine, which is an in-memory query engine and is now part
> of the Arrow repo.
>
> Over the past couple of years, I have been working outside of Arrow on a
> project named “Ballista” [1] to continue the journey of trying to build a
> distributed version. Due to the pandemic, I have had time over the winter
> to put more effort into this project and have managed to build a small
> community around it over the past few months. The project has now
> reached a point where the basic architecture has been proven, and it is
> getting a lot of attention (more than 2k stars on GitHub just recently).
> I think that it would now make sense to donate some or all of the project
> to Apache Arrow and continue its growth here.
>
> For an overview of the project, please see the talk I recently gave at the
> New York Open Statistical Programming Meetup [2].
>
> Some of the benefits that I see in donating the project to Arrow are:
>
>
> - DataFusion also needs a scheduler and it would probably make sense to
>   push some parts of the Ballista scheduler down a level in the stack so
>   that the same approach is used to scale across cores in DataFusion and
>   to scale across nodes in Ballista.
>
> - Ballista provides preliminary support for spill-to-disk functionality
>   (in Arrow IPC format) which could also benefit DataFusion and provide
>   better scalability through out-of-core processing.
>
> - Although the Ballista scheduler is implemented in Rust, it is possible
>   to implement executors in other languages due to the use of Flight,
>   gRPC, and protobuf, so this may be of interest to other language
>   implementations of Arrow as well.
>
> - There is already some overlap between Arrow and Ballista contributors.
>
> - Ballista unit tests will be part of Arrow CI which means that any
>   changes to Arrow or DataFusion APIs that Ballista depends on will also
>   require that the corresponding Ballista code is updated as part of the
>   same PR.
>
>
> My main goal with this email thread is to gauge interest in donating this
> code. If there is interest in doing so then we can have a more detailed
> follow-up conversation on exactly what would be donated and where it would
> go.
>
>
> I have also filed a GitHub issue in Ballista to get feedback from current
> contributors [3].
>
>
> I'm looking forward to hearing opinions on this!
>
>
> Thanks,
>
> Andy.
>
> [1] https://github.com/ballista-compute/ballista
>
> [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
>
> [3] https://github.com/ballista-compute/ballista/issues/646
>


[NIGHTLY] Arrow Build Report for Job nightly-2021-03-10-0

2021-03-10 Thread Crossbow


Arrow Build Report for Job nightly-2021-03-10-0

All tasks: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0

Failed Tasks:
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-test-conda-python-3.8-jpype
- test-r-versions:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-test-r-versions
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-test-ubuntu-18.04-docs
- test-ubuntu-18.04-r-sanitizer:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-test-ubuntu-18.04-r-sanitizer
- wheel-osx-high-sierra-cp36m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-high-sierra-cp36m
- wheel-osx-high-sierra-cp37m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-high-sierra-cp37m
- wheel-osx-high-sierra-cp38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-high-sierra-cp38
- wheel-osx-high-sierra-cp39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-high-sierra-cp39
- wheel-osx-mavericks-cp36m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-mavericks-cp36m
- wheel-osx-mavericks-cp37m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-mavericks-cp37m
- wheel-osx-mavericks-cp38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-mavericks-cp38
- wheel-osx-mavericks-cp39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-wheel-osx-mavericks-cp39

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-clean
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-drone-conda-linux-gcc-py39-aarch64
- conda-linux-gcc-py39-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-10-0-azure-conda-linux-gcc-py39-cuda
- conda-osx-clang-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-

Re: [C++] libarrow isolation

2021-03-10 Thread Antoine Pitrou



There's perhaps a simpler way for MATLAB to solve this:

1. link Arrow statically inside your own libmatlab_arrow.dll
2. have libmatlab_arrow.dll expose its own API corresponding to MATLAB needs
3. arrange for Arrow symbols to not be exposed by libmatlab_arrow.dll

I suppose it depends whether the API needed by MATLAB is small or large.
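
A rough sketch of what steps 2 and 3 could look like is below. It is only an
illustration under assumptions: the names MATLAB_ARROW_EXPORT, detail::SumColumn
and mwarrow_sum_int64 are made up for this example, and the helper merely
stands in for code that would call into the statically linked Arrow. The idea
is that the library is compiled with hidden default visibility, so only the
functions explicitly marked for export are visible to other modules in the
process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical export macro. Building with -fvisibility=hidden (GCC/Clang)
// keeps every symbol private unless it is marked with this macro. On GNU ld,
// -Wl,--exclude-libs,ALL is one option for also hiding symbols that come
// from static archives.
#if defined(_WIN32)
#define MATLAB_ARROW_EXPORT __declspec(dllexport)
#else
#define MATLAB_ARROW_EXPORT __attribute__((visibility("default")))
#endif

namespace detail {
// Stand-in for internal code that would use the statically linked Arrow.
// It is not marked for export, so it stays private to the library.
inline std::int64_t SumColumn(const std::vector<std::int64_t>& values) {
  return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}
}  // namespace detail

// The only exported entry point: a small C API with no Arrow types in its
// signature, so callers never see (or clash with) Arrow symbols.
extern "C" MATLAB_ARROW_EXPORT std::int64_t mwarrow_sum_int64(
    const std::int64_t* data, std::size_t length) {
  assert(data != nullptr || length == 0);
  if (length == 0) return 0;
  std::vector<std::int64_t> column(data, data + length);
  return detail::SumColumn(column);
}
```

The smaller this exported surface is, the easier it is to keep Arrow types out
of it, which is why the size of the API matters.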

Regards

Antoine.


On 10/03/2021 at 04:36, Tahsin Hassan wrote:

Hi Wes,

Thanks for providing the feedback.

   1.  Regarding the symbol versioning change
As a next step, should I open a Jira issue and assign it to myself to see 
whether I can automate adding the BEGIN_ARROW_NS / END_ARROW_NS macros using 
LLVM clang AST matchers?
We can submit this as a one-time mechanical change, as you mentioned.

However, should “setting up a coding guideline ensuring that all headers 
include `arrow/config.h`” also be part of the same issue?
Is there some other buy-in process for such guidelines?



   2.  There is another issue, library versioning, which comes even before the 
symbol versioning issue.
Right now, both pyarrow and MATLAB ship a DLL that is named arrow.dll on 
Windows. If pyarrow/python brings in the arrow.dll, the MATLAB process will 
not load our shipped arrow.dll to begin with, due to RTLD_GLOBAL.

We can handle the library versioning issue by changing some CMake 
infrastructure to give users the ability to specify their own:
SOVERSION           - on Linux/macOS
VERSION/OUTPUT_NAME - on Windows (NOT sure yet, just a hunch for Windows)

Currently, these CMake variables are located here:
https://github.com/apache/arrow/blob/0f72bcfc67bb4db72c27f9c3282fe5020490f214/cpp/cmake_modules/BuildUtils.cmake#L369
Once again, I would be happy to create another Jira issue and see whether I 
could work through the change.

Regards,
Tahsin


From: Wes McKinney 
Date: Sunday, March 7, 2021 at 4:12 PM
To: dev 
Subject: Re: [C++] libarrow isolation
I took a look at the document. Basically you want to have two
different versions of the Arrow shared library loaded into the same
process, with some code linked to one library and some code linked to
another. This is very similar to the problem that Boost addresses with
the `bcp --namespace=$MY_PRIVATE_BOOST_NAMESPACE ...` operation.

To obtain strict symbol isolation you have to change the "arrow::"
namespace in the C++ libraries you are shipping in your application.
AFAIK, this is a bit of a nuisance to do. One way to achieve it is to
replace every use of

namespace arrow {
...
}

with

BEGIN_ARROW_NS
...
END_ARROW_NS

or similar. Then you would do something like `cmake
-DARROW_NAMESPACE=mwarrow ...` when configuring your build.

This change could certainly be implemented in the Arrow library, which
is a one-time mechanical operation but will create some ongoing
non-intuitiveness anytime new files are created. All headers must also
therefore depend on a central `arrow/config.h` which contains the
needed namespace macros.
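
As a minimal sketch of what such a header could contain (an assumption for
illustration only, not the actual contents of any Arrow header), the macros
might look like the following. The ARROW_STRINGIFY helpers and the
NamespaceName function are hypothetical and exist only to make the effect of
the rename visible. The #define at the top stands in for what
`-DARROW_NAMESPACE=mwarrow` would pass on the compiler command line.

```cpp
#include <cassert>
#include <string>

// Normally supplied by the build system (e.g. -DARROW_NAMESPACE=mwarrow);
// defined inline here only so the sketch is self-contained.
#define ARROW_NAMESPACE mwarrow

// --- hypothetical contents of arrow/config.h ---
#ifndef ARROW_NAMESPACE
#define ARROW_NAMESPACE arrow  // default when the build does not override it
#endif
#define BEGIN_ARROW_NS namespace ARROW_NAMESPACE {
#define END_ARROW_NS }

// Two-level stringification so the macro is expanded before being quoted.
#define ARROW_STRINGIFY_IMPL(x) #x
#define ARROW_STRINGIFY(x) ARROW_STRINGIFY_IMPL(x)

// --- what a library source file would look like after the change ---
BEGIN_ARROW_NS
inline std::string NamespaceName() { return ARROW_STRINGIFY(ARROW_NAMESPACE); }
END_ARROW_NS

// Code built against this configuration refers to the renamed namespace, so
// its mangled symbols no longer collide with a stock "arrow::" build.
inline void CheckRenamed() { assert(mwarrow::NamespaceName() == "mwarrow"); }
```

With the default left in place, the same sources would still compile into the
usual `arrow` namespace, so the rename only affects builds that opt in.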

On Fri, Mar 5, 2021 at 9:19 AM Antoine Pitrou wrote:



Hi Tahsin,

Sorry for the lack of response. Unfortunately, I feel a bit out of my
depth on linker issues. I hope someone else can give advice.

Regards

Antoine.


On 05/03/2021 at 16:09, Tahsin Hassan wrote:

Hi,

I was wondering whether folks had a chance to look over the material and had 
any pointers for the proposed approach.
If I should post in some other format or clarify something, please let me know.
In the meantime, I will try out the steps we propose.

Regards,
Tahsin


From: Tahsin Hassan 
Date: Thursday, February 25, 2021 at 11:43 AM
To: dev@arrow.apache.org 
Subject: Re: [C++] libarrow isolation
Hi Antoine,

I struggled a bit to put all my thoughts into an email format that would be 
easily consumable.
So I wrote up a GitHub markdown document to add some more detail on the issue 
we are facing.

Could you take a look, and let us know your thoughts?
https://github.com/mathworks/matlab-arrow-support-files/blob/main/libarrowclash.md

Regards,
Tahsin

From: Antoine Pitrou 
Date: Tuesday, February 23, 2021 at 1:21 PM
To: dev@arrow.apache.org 
Subject: Re: [C++] libarrow isolation

Hi Tahsin,

I see. So the error happens when loading PyArrow into MATLAB, I
suppose? What kind of error do you get?

Regards

Antoine.


On 23/02/2021 at 18:12, Tahsin Hassan wrote:

Hi Antoine,

MATLAB is using RTLD_GLOBAL. Hope that helps in clarifying the workflow.

Regards,
Tahsin


From: Antoine Pitrou 
Sent: Monday, February 22, 2021 9:41 AM
To: dev@arrow.apache.org 
Subject: Re: [C++] libarrow isolation


On 22/02/2021 at 15:29, Tahsin Hassan wrote:

Hi all,

MATLAB uses the Arrow C++ libraries (i.e. libarrow.so) to read and write 
Parquet files (https://www.mathworks.com/help/matlab/ref/parquetread.html). 
While exploring ways to integrate more tightly with Arrow, we've run into a 
symbol/library naming clash issue.

When running p