Re: [VOTE][Format][Flight] Result set expiration support

2023-06-27 Thread Sutou Kouhei
+1


[VOTE][Format][Flight] Result set expiration support

2023-06-27 Thread Sutou Kouhei
Hi,

I would like to propose result set expiration support for
Flight RPC.

See the following pull request and discussion for details:

* GH-35500: [C++][Go][Java][FlightRPC] Add support for result set expiration
  https://github.com/apache/arrow/pull/36009

* [DISCUSS][Format][Flight] Result set expiration support
  https://lists.apache.org/thread/48fqd554gkqrrld8k13l3b8trz5gk7ow

This is based on one of the proposals discussed in:

  [DISCUSS] Flight RPC/Flight SQL/ADBC enhancements
  https://lists.apache.org/thread/247z3t06mf132nocngc1jkp3oqglz7jp

  Google Docs: (Arrow ML) Arrow Flight RPC/Flight SQL Proposals
  https://docs.google.com/document/d/1jhPyPZSOo2iy0LqIJVUs9KWPyFULVFJXTILDfkadx2g/edit#heading=h.h2ein4otvhtq

Summary:

* Background: Currently, it is undefined whether a client
  can call DoGet more than once. Clients may want to retry
  requests, and servers may not want to persist a query
  result forever.

* Proposal: Add an expiration time to FlightEndpoint. If
  present, clients may assume they can retry DoGet
  requests. Otherwise, clients should avoid retrying DoGet
  requests.

  NOTE: This proposal is "not" a full retry protocol.

* Changes:
  * Add FlightEndpoint.expiration_time field

  * Add the following pre-defined actions:
* CancelFlightInfo: Asynchronously cancel the execution
  of a distributed query. (Replaces the equivalent
  Flight SQL action.)
* RenewFlightEndpoint: Request an extension of the
  expiration of a FlightEndpoint.

* This proposal does NOT break backward
  compatibility:

  * Flight RPC: Because clients can ignore
FlightEndpoint.expiration_time.

  * Flight SQL: Because we deprecate the existing
    CancelQuery action, but it is still available.

* The pull request includes reference implementations for
  C++, Go and Java.
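
To make the proposed client behavior concrete, here is a minimal
Python sketch. It assumes the PR's FlightEndpoint.expiration_time
field and the RenewFlightEndpoint action end up exposed to clients
under those names; the exact Python bindings are an assumption, not
part of this vote.

```python
import pyarrow.flight as flight

# Hedged sketch: assumes the proposal's expiration_time field and
# pre-defined actions are available in the client bindings.
client = flight.connect("grpc://localhost:8815")
info = client.get_flight_info(flight.FlightDescriptor.for_command(b"SELECT 1"))

for endpoint in info.endpoints:
    if endpoint.expiration_time is not None:
        # An expiration time is present: DoGet may be retried until
        # the endpoint expires. A client could also ask the server to
        # extend the expiration via the new RenewFlightEndpoint action.
        reader = client.do_get(endpoint.ticket)
        table = reader.read_all()
    else:
        # No expiration time: avoid retrying; call DoGet at most once.
        table = client.do_get(endpoint.ticket).read_all()
```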


The vote will be open for at least 72 hours.

[ ] +1 Accept this proposal
[ ] +0
[ ] -1 Do not accept this proposal because...


Thanks,
-- 
kou


Enabling apache/arrow GitHub dependency graph with vcpkg

2023-06-27 Thread Michael Price
Hello Apache Arrow project,

The Microsoft C++ team has been working with our partners at GitHub to improve 
the C and C++ user experience on their platform. As a part of that effort, we 
have added vcpkg support for the GitHub dependency graph feature. We are 
looking for feedback from GitHub repositories, like apache/arrow, that are 
using vcpkg so we can identify improvements to this new feature.

Enabling this feature for your repositories brings a number of benefits, now 
and in the future:

  *   Visibility - Users can easily see which packages you depend on and their 
versions. This includes transitive dependencies not listed in your vcpkg.json 
manifest file.
  *   Compliance - Generate an SBOM from GitHub that includes C and C++ 
dependencies as well as other supported ecosystems.
  *   Networking - A fully functional dependency graph allows you to not only 
see your dependencies, but also other GitHub projects that depend on you, 
letting you get an idea of how many people depend on your efforts. We want to 
hear from you if we should prioritize enabling this.
  *   Security - The intention is to enable GitHub's secure supply chain 
features. Those features are not available yet, but when they are, you'll 
already be ready to use them on day one.

What's Involved?

If you decide to help us out, here's how that would look:

  *   Enable the integration following our documentation. See GitHub 
integrations - The GitHub dependency graph for more information.
  *   Send us a follow-up email letting us know if the documentation worked and 
was clear, and what missing functionality is most important to you.
  *   If you have problems enabling the integration, we'll work directly with 
you to resolve your issue.
  *   We will schedule a brief follow-up call (15-20 minutes) with you after 
the feature is enabled to discuss your feedback.
  *   When we make improvements, we'd like you to try them out to let us know 
if we are solving the important problems.
  *   Eventually, we'd like to get a "thumbs up" or "thumbs down" on whether or 
not you think the feature is complete enough to no longer be an experiment.
  *   We'll credit you for your help when we make the move out of experimental 
and blog about the transition to fully supported.



If you are interested in collaborating with us, let us know by replying to this 
email.

Thanks,

Michael Price
Product Manager, Microsoft C++ Team


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-27 Thread Ian Cook
> I think there's three routes we can go here:
>
> 1. We keep PyArrow expressions in the API initially, but once we have
> Substrait-based alternatives we deprecate the PyArrow expression support.
> This is what I intended with the current design, and I think it provides
> the most obvious migration paths for existing producers and consumers.
> 2. We keep the overall dataset API, but don't introduce the filter and
> projection arguments until we have Substrait support. I'm not sure what the
> migration path looks like for producers and consumers, but I think this
> just implicitly becomes the same as (1), but with worse documentation.
> 3. We write a protocol completely from scratch, that doesn't try to
> describe the existing dataset API. Producers and consumers would then
> migrate to use the new protocol and deprecate their existing dataset
> integrations. We could introduce a dunder method in that API (sort of like
> __arrow_array__) that would make the migration seamless from the end-user
> perspective.
>
> *Which do you all think is the best path forward?*

I favor option 2 out of concern that option 1 could create a
temptation for users of this protocol to depend on a feature that we
intend to deprecate. I think option 2 also creates a stronger
motivation to complete the Substrait expression integration work,
which is underway in https://github.com/apache/arrow/pull/34834.
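
For concreteness, here is a hypothetical sketch of the kind of surface
option 2 implies: a dataset protocol with no filter or projection
arguments until the Substrait work lands. Every name below is invented
for illustration and is not part of any accepted design:

```python
from typing import Iterator, Protocol

import pyarrow as pa


class DatasetLike(Protocol):
    """Hypothetical protocol; names and signatures are illustrative only."""

    @property
    def schema(self) -> pa.Schema:
        """The schema shared by all fragments."""
        ...

    def get_fragments(self) -> Iterator["FragmentLike"]:
        """Fragments should be serializable so they can be distributed
        amongst processes / workers."""
        ...


class FragmentLike(Protocol):
    """Hypothetical fragment interface: eager reads only, no expressions."""

    def to_batches(self) -> Iterator[pa.RecordBatch]:
        ...
```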

Ian


On Fri, Jun 23, 2023 at 1:25 PM Weston Pace  wrote:
>
> > The trouble is that Dataset was not designed to serve as a
> > general-purpose unmaterialized dataframe. For example, the PyArrow
> > Dataset constructor [5] exposes options for specifying a list of
> > source files and a partitioning scheme, which are irrelevant for many
> > of the applications that Will anticipates. And some work is needed to
> > reconcile the methods of the PyArrow Dataset object [6] with the
> > methods of the Table object. Some methods like filter() are exposed by
> > both and behave lazily on Datasets and eagerly on Tables, as a user
> > might expect. But many other Table methods are not implemented for
> > Dataset though they potentially could be, and it is unclear where we
> > should draw the line between adding methods to Dataset vs. encouraging
> > new scanner implementations to expose options controlling what lazy
> > operations should be performed as they see fit.
>
> In my mind there is a distinction between the "compute domain" (e.g. a
> pandas dataframe or something like ibis or SQL) and the "data domain" (e.g.
> pyarrow datasets).  I think, in a perfect world, you could push any and all
> compute up and down the chain as far as possible.  However, in practice, I
> think there is a healthy set of tools and libraries that say "simple column
> projection and filtering is good enough".  I would argue that there is room
> for both APIs, and while the temptation is always present to "shove as much
> compute as you can", I think pyarrow datasets seem to have found a balance
> between the two that users like.
>
> So I would argue that this protocol may never become a general-purpose
> unmaterialized dataframe and that isn't necessarily a bad thing.
>
> > they are splittable and serializable, so that fragments can be distributed
> > amongst processes / workers.
>
> Just to clarify, the proposal currently only requires the fragments to be
> serializable, correct?
>
> On Fri, Jun 23, 2023 at 11:48 AM Will Jones  wrote:
>
> > Thanks Ian for your extensive feedback.
> >
> > I strongly agree with the comments made by David,
> > > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > > expressions in this API. Expressions are an implementation detail of
> > > PyArrow, not a part of the Arrow standard. It would be much safer for
> > > the initial version of this protocol to not define *any*
> > > methods/arguments that take expressions.
> > >
> >
> > I would agree with this point, if we were starting from scratch. But one of
> > my goals is for this protocol to be descriptive of the existing dataset
> > integrations in the ecosystem, which all currently rely on PyArrow
> > expressions. For example, you'll notice in the PR that there are unit tests
> > to verify the current PyArrow Dataset classes conform to this protocol,
> > without changes.
> >
> > I think there's three routes we can go here:
> >
> > 1. We keep PyArrow expressions in the API initially, but once we have
> > Substrait-based alternatives we deprecate the PyArrow expression support.
> > This is what I intended with the current design, and I think it provides
> > the most obvious migration paths for existing producers and consumers.
> > 2. We keep the overall dataset API, but don't introduce the filter and
> > projection arguments until we have Substrait support. I'm not sure what the
> > migration path looks like for producers and consumers, but I think this
> > just implicitly becomes the same as (1), but with worse documentation.
> > 3. We write a protocol completely from scratch, that doesn't try to
> > describe the existing dataset API. Producers and consumers would then
> > migrate to use the new protocol and deprecate their existing dataset
> > integrations. We could introduce a dunder method in that API (sort of like
> > __arrow_array__) that would make the migration seamless from the end-user
> > perspective.
> >
> > *Which do you all think is the best path forward?*

[RUST][Ballista] UDF/UDAF in Ballista

2023-06-27 Thread Jaroslaw Nowosad
Hi,
Quick question: is UDF/UDAF working in Ballista?
I saw a "TODO" in the executor part:

```rust
// TODO add logic to dynamically load UDF/UDAFs libs from files
scalar_functions: HashMap::new(),
aggregate_functions: HashMap::new(),
```

Creating an example library and adding loading functionality here looks
"simple enough"; however, I don't know how this would work with the
client/scheduler - resolving the logical plan and then converting logical to
physical. Not to mention how that would be passed through gRPC?
I feel confused.

For " hack version" - should I register the udf library in all
client/scheduler/executors?

I'd appreciate any help / pointers - learning is fun ;-)


Cheers,
Jaro