Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Andy Grove
I didn't get a chance yet to really read this thread in detail but I am
definitely very interested in this conversation and will make time this
week to add my thoughts.

Thanks,

Andy.


Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Adam Lippai
Hi Neville,

yes, my concern with common row-based DB APIs is that I use
Arrow/Parquet for OLAP too.
What https://turbodbc.readthedocs.io/en/latest/ (Python) and
https://github.com/pacman82/odbc-api#state (Rust) do is read
large blocks of data instead of processing rows one by one, but indeed, the
ODBC and PostgreSQL wire protocols are still row-based.

ClickHouse is an interesting example, as it directly supports Arrow and
Parquet *server-side* (I haven't tried it yet, just read it in the docs).

Best regards,
Adam Lippai

Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Neville Dipale
Thanks for the feedback

My interest is mainly in the narrow use case of reading and writing batch
data,
so I wouldn't want to deal with producing and consuming rows per se.
Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the
row-based or OLTP case,
and I'm considering something more suitable for the OLAP case.

@Wes I'll have a read through the Python DB API, I've also been looking at
JDBC
as well as how Apache Spark manages to get such good performance from JDBC.

I haven't been an ODBC fan, but mainly because of historic struggles with
getting it to work
on Linux envs where I don't have system control. With that said, we could
still support ODBC.

@Jorge, I have an implementation at rust-dataframe (
https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres)
which uses rust-postgres. I however don't use the row-based API as that
comes at
a serialization cost (going from bytes > Rust types > Arrow).
I instead use the
Postgres binary format (
https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204
).
That postgres module would be the starting point of such a separate crate.

For Postgres <> Arrow type conversions, I leverage 2 methods:

1. When reading a table, I get the schema from the *information_schema* system
table
2. When reading a query, I issue the query with a 1-row limit, and convert
the row's schema to an Arrow schema
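The two schema-probing strategies above can be sketched roughly as follows. `ArrowType`, `pg_type_to_arrow`, and `probe_query` are hypothetical names for illustration, not the arrow crate's API, and the type mapping is deliberately partial:

```rust
// Illustrative sketch of the two schema-inference strategies described above.
// None of these names come from the arrow crate; they are stand-ins.

#[derive(Debug, PartialEq)]
enum ArrowType {
    Int32,
    Int64,
    Float64,
    Utf8,
    Boolean,
}

/// Map a Postgres type name (as reported by information_schema.columns,
/// strategy 1) to an Arrow data type. Partial, for illustration only.
fn pg_type_to_arrow(pg_type: &str) -> Option<ArrowType> {
    match pg_type {
        "integer" | "int4" => Some(ArrowType::Int32),
        "bigint" | "int8" => Some(ArrowType::Int64),
        "double precision" | "float8" => Some(ArrowType::Float64),
        "text" | "varchar" => Some(ArrowType::Utf8),
        "boolean" | "bool" => Some(ArrowType::Boolean),
        _ => None,
    }
}

/// Wrap an arbitrary user query so that executing it returns at most one
/// row, letting us read the result metadata cheaply (strategy 2).
fn probe_query(user_query: &str) -> String {
    format!(
        "SELECT * FROM ({}) AS probe LIMIT 1",
        user_query.trim_end_matches(';')
    )
}

fn main() {
    assert_eq!(pg_type_to_arrow("int8"), Some(ArrowType::Int64));
    let q = probe_query("SELECT id, name FROM users;");
    assert_eq!(q, "SELECT * FROM (SELECT id, name FROM users) AS probe LIMIT 1");
    println!("{}", q);
}
```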

@Adam I think async and pooling would be attainable, yes; if an underlying
SQL crate
uses r2d2 for pooling, an API that supports that could be provided.

In summary, I'm thinking along the lines of:

* A reader that takes connection parameters & a query or table
* The reader can handle partitioning if need be (similar to how Spark does
it)
* The reader returns a Schema, and can be iterated on to return data in
batches

* A writer that takes connection parameters and a table
* The writer writes batches to a table, and is able to write batches in
parallel
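The reader/writer surface sketched in the bullets above could look something like this. The trait names and the `Schema`/`RecordBatch` stand-ins are hypothetical; a real adapter would use arrow's own `Schema` and `RecordBatch` types:

```rust
// Hypothetical shape for the proposed SQL reader/writer. All names here are
// illustrative stand-ins, not existing arrow APIs.

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<(String, String)>, // (column name, type name) stand-in
}

#[derive(Debug)]
struct RecordBatch {
    num_rows: usize, // stand-in for an actual columnar batch
}

/// Reader surface: a schema up front, then data in batches via iteration.
trait SqlBatchReader: Iterator<Item = RecordBatch> {
    fn schema(&self) -> Schema;
}

/// Writer surface: consume batches one at a time (append or overwrite).
#[allow(dead_code)]
trait SqlBatchWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<(), String>;
}

/// Toy reader serving a fixed set of batches (e.g. one per partition).
struct FixedReader {
    schema: Schema,
    batches: std::vec::IntoIter<RecordBatch>,
}

impl Iterator for FixedReader {
    type Item = RecordBatch;
    fn next(&mut self) -> Option<RecordBatch> {
        self.batches.next()
    }
}

impl SqlBatchReader for FixedReader {
    fn schema(&self) -> Schema {
        self.schema.clone()
    }
}

fn main() {
    let reader = FixedReader {
        schema: Schema { fields: vec![("id".to_string(), "Int64".to_string())] },
        batches: vec![RecordBatch { num_rows: 1024 }, RecordBatch { num_rows: 512 }]
            .into_iter(),
    };
    assert_eq!(reader.schema().fields.len(), 1);
    let total: usize = reader.map(|b| b.num_rows).sum();
    assert_eq!(total, 1536);
    println!("read {} rows", total);
}
```

Because the reader is just an `Iterator` of batches, a partitioned scan could hand one such reader per partition to parallel workers.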

In the case of hypothetically interfacing with columnar databases like
ClickHouse,
we would be able to leverage materialising Arrow arrays directly from columns,
instead of
the column-wise conversions that have to be performed on top of row-based APIs.
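The conversion cost being discussed can be sketched like this: a row-based driver forces a row-to-column transpose before batches can be built, a step a column-native source would let us skip. The row type and helper below are illustrative only:

```rust
/// One row as delivered by a row-oriented driver (e.g. an ODBC-style fetch).
/// Illustrative stand-in: a real row would carry driver-specific values.
type Row = (i64, f64);

/// Transpose fetched rows into per-column buffers — the extra step that a
/// column-native protocol (like ClickHouse's) would make unnecessary.
fn rows_to_columns(rows: &[Row]) -> (Vec<i64>, Vec<f64>) {
    let mut ints = Vec::with_capacity(rows.len());
    let mut floats = Vec::with_capacity(rows.len());
    for &(a, b) in rows {
        ints.push(a);
        floats.push(b);
    }
    (ints, floats)
}

fn main() {
    let rows = vec![(1, 1.5), (2, 2.5), (3, 3.5)];
    let (ints, floats) = rows_to_columns(&rows);
    assert_eq!(ints, vec![1, 2, 3]);
    assert_eq!(floats, vec![1.5, 2.5, 3.5]);
    println!("{} rows transposed", ints.len());
}
```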

Neville



Re: [Python/C-Glib] writing IPC file format column-by-column

2020-09-27 Thread Ishan Anand
Hi

Updating the thread for people with a similar use case. A new project called
[DuckDB](https://github.com/cwida/duckdb) allows using Arrow memory-mapped
files as virtual tables, so a lot of pandas functionality can be covered using
its SQL equivalents. DuckDB works equally well with chunked tables, which
alleviates the need for contiguous columns in the Arrow file.

Thank you,
Ishan

From: Sutou Kouhei 
Sent: Friday, September 11, 2020 3:23 AM
To: u...@arrow.apache.org ; dev@arrow.apache.org 

Subject: Re: [Python/C-Glib] writing IPC file format column-by-column

Hi,

I add dev@ because this may need to improve Apache Arrow C++.

It seems that we need the following new feature for this use
case (combining chunks with small memory to process large
data with pandas, mmap and small memory):

  * Writing the chunks in an arrow::Table as one large
arrow::RecordBatch without creating intermediate
combined chunks

The current arrow::ipc::RecordBatchWriter::WriteTable()
always splits the given arrow::Table into one or more
arrow::RecordBatch instances. We may be able to add a feature that
writes the given arrow::Table as one combined
arrow::RecordBatch without creating intermediate combined
chunks.
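A toy sketch of the proposed write path: visit each column's chunks in order at write time, so no intermediate combined column has to be allocated first. The types below are illustrative stand-ins, not the Arrow C++ (or Rust) API:

```rust
/// Illustrative stand-in for a chunked column in a table.
struct ChunkedColumn {
    chunks: Vec<Vec<i64>>,
}

/// Write one column by streaming its chunks in order into the sink, so the
/// combined column never needs to exist in memory as a single allocation.
fn write_column<W: FnMut(&[i64])>(col: &ChunkedColumn, mut sink: W) {
    for chunk in &col.chunks {
        sink(chunk); // the writer appends this chunk's values for the column
    }
}

fn main() {
    let col = ChunkedColumn {
        chunks: vec![vec![1, 2], vec![3], vec![4, 5]],
    };
    let mut out = Vec::new();
    write_column(&col, |c| out.extend_from_slice(c));
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
    println!("{:?}", out);
}
```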


Do C++ developers have any opinion on this?


Thanks,
--
kou

In
  "[Python/C-Glib] writing IPC file format column-by-column" on Wed, 9 Sep
  2020 10:11:54 +,
  Ishan Anand wrote:

> Hi
>
> I'm looking at using Arrow primarily on low-resource instances with out of 
> memory datasets. This is the workflow I'm trying to implement.
>
>
>   *   Write record batches in IPC streaming format to a file from a C runtime.
>   *   Consume it one row at a time from python/C by loading the file in 
> chunks.
>   *   If the schema is simple enough to support zero copy operations, make 
> the table readable from pandas. This needs me to,
>  *   convert it into a Table with a single chunk per column (since pandas 
> can't use mmap with chunked arrays).
>  *   write the table in IPC random access format to a file.
>
> PyArrow provides a method `combine_chunks` to combine chunks into a single 
> chunk. However, it needs to create the entire table in memory (I suspect it 
> is 2x, since it loads both versions of the table in memory but that can be 
> avoided).
>
> Since the Arrow layout is columnar, I'm curious if it is possible to write 
> the table one column at a time. And if the existing glib/python APIs support 
> it? The C++ file writer objects seem to go down to serializing a single 
> record batch at a time and not per column.
>
>
> Thank you,
> Ishan


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-09-27-0

2020-09-27 Thread Uwe L. Korn
I'm working on a fix for the conda failures in 
https://github.com/apache/arrow/pull/8282

On Sun, Sep 27, 2020, at 12:20 PM, Crossbow wrote:
> 
> Arrow Build Report for Job nightly-2020-09-27-0
> 
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0
> 
> Failed Tasks:
> - conda-linux-gcc-py36-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py36-aarch64
> - conda-linux-gcc-py36-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py36-cpu
> - conda-linux-gcc-py36-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py36-cuda
> - conda-linux-gcc-py37-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py37-aarch64
> - conda-linux-gcc-py37-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py37-cpu
> - conda-linux-gcc-py37-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py37-cuda
> - conda-linux-gcc-py38-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py38-aarch64
> - conda-linux-gcc-py38-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py38-cpu
> - conda-linux-gcc-py38-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py38-cuda
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py38
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-gandiva-jar-osx
> - gandiva-jar-xenial:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-gandiva-jar-xenial
> - homebrew-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-homebrew-cpp
> - test-conda-cpp-valgrind:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-cpp-valgrind
> - test-conda-python-3.7-hdfs-2.9.2:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.7-hdfs-2.9.2
> - test-conda-python-3.7-spark-branch-3.0:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.7-spark-branch-3.0
> - test-conda-python-3.8-spark-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.8-spark-master
> - wheel-osx-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp35m
> - wheel-osx-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp36m
> - wheel-osx-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp37m
> - wheel-osx-cp38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp38
> - wheel-win-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-appveyor-wheel-win-cp36m
> 
> Succeeded Tasks:
> - centos-6-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-6-amd64
> - centos-7-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-centos-7-aarch64
> - centos-7-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-7-amd64
> - centos-8-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-centos-8-aarch64
> - centos-8-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-8-amd64
> - conda-clean:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-clean
> - conda-win-vs2017-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-win-vs2017-py36
> - conda-win-vs2017-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-win-vs2017-py37
> - conda-win-vs2017-py

Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Adam Lippai
One more universal approach is to use ODBC, this is a recent Rust
conversation (with example) on the topic:
https://github.com/Koka/odbc-rs/issues/140

Honestly I find the Python DB API too simple, all it provides is a
row-by-row API. I miss four things:

   - Batched or bulk processing both for data loading and dumping.
   - Async support (python has asyncio and async web frameworks, but no
   async DB spec). SQLAlchemy async support is coming soon and there is
   https://github.com/encode/databases
   - Connection pooling (it's common to use TLS, connection reuse would be
   nice as TLS 1.3 is not here yet)
   - Failover / load balancing support (this is connected to the previous)

Best regards,
Adam Lippai

On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> That would be awesome! I agree with this, and would be really useful, as it
> would leverage all the goodies that RDBMSs have wrt transactions, etc.
>
> I would probably go for having database-specifics outside of the arrow
> project, so that they can be used by other folks beyond arrow, and keep the
> arrow-specifics (i.e. conversion from the format from the specific
> databases to arrow) as part of the arrow crate. Ideally as Wes wrote, with
> some standard to be easier to handle different DBs.
>
> I think that there are two layers: one is how to connect to a database, the
> other is how to serialize/deserialize. AFAIK PEP 249 covers both layers, as
> it standardizes things like `connect` and `tpc_begin`, as well as how
> things should be serialized to Python objects (e.g. dates should be
> datetime.date). This split is done by postgres for Rust,
> as it offers 5 crates:
> * postgres-async
> * postgres-sync (a blocking wrapper of postgres-async)
> * postgres-types (to convert to native Rust types <-- IMO this one is what we
> want to offer in Arrow)
> * postgres-TLS
> * postgres-openssl
>
> `postgres-sync` implements Iterator (`client.query`), and postgres-async
> implements Stream.
>
> One idea is to have a generic iterator/stream adapter, that yields
> RecordBatches. The implementation of this trait by different providers
> would give support to be used in Arrow and DataFusion.
>
> Besides postgres, one idea is to pick the top from this list
> :
>
> * Oracle
> * MySQL
> * MsSQL
>
> Another idea is to start by supporting SQLite, which is a good
> development env to work with relational databases.
>
> Best,
> Jorge
>
>
>
>
>
> On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale 
> wrote:
>
> > Hi Arrow developers
> >
> > I would like to gauge the appetite for an Arrow SQL connector that:
> >
> > * Reads and writes Arrow data to and from SQL databases
> > * Reads tables and queries into record batches, and writes batches to
> > tables (either append or overwrite)
> > * Leverages binary SQL formats where available (e.g. PostgreSQL format is
> > relatively easy and well-documented)
> > * Provides a batch interface that abstracts away the different database
> > semantics, and exposes a RecordBatchReader (
> >
> https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html
> > ),
> > and perhaps a RecordBatchWriter
> > * Resides in the Rust repo as either an arrow::sql module (like
> arrow::csv,
> > arrow::json, arrow::ipc) or alternatively is a separate crate in the
> > workspace  (*arrow-sql*?)
> >
> > I would be able to contribute a Postgres reader/writer as a start.
> > I could make this a separate crate, but to drive adoption I would prefer
> > this living in Arrow, also it can remain updated (sometimes we reorganise
> > modules and end up breaking dependencies).
> >
> > Also, being developed next to DataFusion could allow DF to support SQL
> > databases, as this would be yet another datasource.
> >
> > Some questions:
> > * Should such a library support async, sync, or both IO methods?
> > * Other than postgres, what other databases would be interesting? Here
> I'm
> > hoping that once we've established a suitable API, it could be easier to
> > natively support more database types.
> >
> > Potential concerns:
> >
> > * Sparse database support
> > It's a lot of effort to write database connectors, especially if starting
> > from scratch (unlike with say JDBC). What if we end up supporting 1 or 2
> > database servers?
> > Perhaps in that case we could keep the module without publishing it to
> > crates.io until we're happy with database support, or even its usage.
> >
> > * Dependency bloat
> > We could feature-gate database types to reduce the number of dependencies
> > if one only wants certain DB connectors
> >
> > * Why not use Java's JDBC adapter?
> > I already do this, but sometimes if working on a Rust project, creating a
> > separate JVM service solely to extract Arrow data is a lot of effort.
> > I also don't think it's currently possible to use the adapter to save
> Arrow
> > data in a database.
> >
> > * What about Flight 

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-27 Thread Wes McKinney
Hi Weston -- this is a really interesting analysis.

1. I have been under the assumption that the current libraries work
poorly on high latency file systems, and your analysis provides the
proof, so thank you.

2. This shows that we have a lot of work to do to retool many of our
IO libraries (Parquet, CSV, etc.) to be implemented (at least
internally) with asynchronous concepts, and provide synchronous
wrappers that preserve the existing APIs. Given the importance of fast
data access this is time well spent

3. The problem as I see it with the way that Executor::Then is being
implemented here is that the follow-on spawned tasks go to "the back
of the line" in the pending task queue (which is a single deque for
the whole thread pool). An obvious thing to do would be to evolve the
built-in ThreadPool to have a per-CPU-core task queue and implement
work stealing [1].

4. It seems like if we are diligent about creating future-chains and
move the IO-related futures to a separate thread pool (or at least
different threads from the CPU thread pool) then this solves the
blocking IO and idle CPU problem that we have now.

5. It seems to me that the thread pool should have some awareness of
how many disjoint workloads (where "workload" is some function that
may spawn subtasks) are running in process. For example, suppose that
we are on an 8-core machine and there are 16 workloads submitted
around the same time. Even if we have CPU-core-specific task queues,
there are still some bad scenarios where a workload could end up
queuing for a long time waiting for other workloads to complete. So,
up to some point it may make sense to make sure there is at least
one thread + task queue for each workload to reduce the risk of this
even if it results temporarily in thread oversubscription (empirical
research could help determine the heuristics around this). We could
add some kind of ThreadPoolConsumer API where the ThreadPool ensures
that each consumer (up to some limit on the number of threads spawned)
is spawning tasks into different queues. If there are more threads
than CPU cores, then when a consumer finishes, the pool can shut down its
thread.

Thanks,
Wes

[1]: https://en.wikipedia.org/wiki/Work_stealing

On Fri, Sep 25, 2020 at 4:22 PM Weston Pace  wrote:
>
> So this may be a return to the details, I think the larger discussion
> is a good discussion to have but I don't know enough of the code base
> to comment further.
>
> I finished playing around with the CSV reader.  The code for this
> experiment can be found here
> (https://github.com/westonpace/arrow/tree/feature/composable-futures),
> it is pretty rough as I was just trying to get things working well
> enough to run some experiments.  In particular the futures stuff is
> not fleshed out well and does not handle invalid statuses correctly.
> Most of the experiment is in nested-read-table-deadlock-example.cc.
>
> # Key Observations
>
> * To keep I/O waits off the thread pool you can use a dedicated thread
> for I/O instead of moving to non-blocking I/O (Matthias did mention
> this and the CSV reader was already doing this somewhat).  However,
> without futures / asynchronous, you can still have a thread pool task
> wait on the I/O thread to populate the channel with another block of
> data.  So this doesn't prevent I/O on the thread pool all by itself.
> * Since arrow already has futures and a thread pool it isn't too much
> additional work to add continuations / promises.  Although this may be
> something of a sunk cost fallacy.
> * The current CSV reader implementation does not do well with a high
> latency filesystem, an asynchronous solution can work around this
> * The current thread pool implementation deadlocks when used in a
> "nested" case, an asynchronous solution can work around this
> * Work still needs to be done so that the asynchronous solution works
> as well as the synchronous multi threaded solution in low latency
> environments
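The dedicated-I/O-thread pattern in the first bullet can be sketched with a bounded channel: one thread reads blocks ahead while the caller's thread processes them, though the consumer still blocks when the channel runs dry. This is a minimal illustration, not the CSV reader's actual code:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Simulate reading `n_blocks` blocks of `block_len` bytes on a dedicated
/// I/O thread, buffering at most 2 blocks ahead, while "processing" them
/// (here: summing their lengths) on the caller's thread.
fn prefetch_sum(n_blocks: u8, block_len: usize) -> usize {
    // Bounded channel: the I/O thread can run at most 2 blocks ahead.
    let (tx, rx) = sync_channel::<Vec<u8>>(2);

    let io = thread::spawn(move || {
        for i in 0..n_blocks {
            let block = vec![i; block_len]; // stand-in for a (slow) file read
            if tx.send(block).is_err() {
                break; // consumer went away
            }
        }
        // tx is dropped here, closing the channel and ending the consumer loop.
    });

    let mut total = 0;
    for block in rx {
        total += block.len(); // compute overlaps with the next read
    }
    io.join().unwrap();
    total
}

fn main() {
    assert_eq!(prefetch_sum(5, 4), 20);
    println!("processed {} bytes", prefetch_sum(5, 4));
}
```

Note this only hides I/O latency behind one consumer; it does not by itself keep blocked waits off a shared thread pool, which is the deadlock concern raised above.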
>
> # Description
>
> I created 20 CSV files, each which has 1,000,000 rows and 4 columns (1
> integral and 3 decimal).  The files are each around 63MB.  Rather than
> use the dataset stuff directly I simulated a dataset scan by reading
> the files in a loop.  So this experiment only directly involved the
> csv reader.  The CSV reader supports parallelism in a few places.
> First, both the serial and threaded CSV readers have a dedicated
> thread for I/O.  So in the serial case, while one thread is computing
> the results of a chunk, the I/O thread is fetching the next block.
> The threaded CSV reader will process up to X blocks at once where X is
> the capacity of the thread pool.  Also, when processing a block, the
> threaded CSV reader will launch a task for converting each column.
> These conversion tasks may fail and need to be rerun.  So it is
> possible for there to be more than one task per column per block.
>
> My tests ran on an m5.large EC2 instance connected to EBS storage
> (where the CSV files were stored) so d
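
Although the experiment above is in C++, the dedicated-I/O-thread pattern described in the quoted message (one thread prefetching blocks into a bounded channel while another parses) can be sketched with just Rust's standard library. This is a toy model under my own assumptions (4 fake blocks, a channel of capacity 1), not Arrow's actual CSV reader:

```rust
// Toy model of a dedicated I/O thread that prefetches blocks into a
// bounded channel while a compute thread parses them.
use std::sync::mpsc::sync_channel;
use std::thread;

fn run_pipeline() -> usize {
    // Capacity 1: the I/O thread stays at most one block ahead of parsing.
    let (tx, rx) = sync_channel::<Vec<u8>>(1);

    // Dedicated I/O thread: fake "reading" 4 blocks of a file.
    let io_thread = thread::spawn(move || {
        for block_id in 0..4u8 {
            let block = vec![block_id; 8]; // stand-in for a CSV block
            if tx.send(block).is_err() {
                break; // consumer hung up
            }
        }
        // Dropping tx closes the channel, ending the consumer loop below.
    });

    // Compute side: handle blocks as they arrive; while one block is being
    // "parsed" here, the I/O thread is already fetching the next.
    let mut blocks_processed = 0;
    for block in rx {
        assert_eq!(block.len(), 8); // parsing would happen here
        blocks_processed += 1;
    }
    io_thread.join().unwrap();
    blocks_processed
}

fn main() {
    println!("processed {} blocks", run_pipeline()); // prints: processed 4 blocks
}
```

Note that this sketch still exhibits the limitation in the first bullet above: the compute side blocks on the channel when the I/O thread falls behind, which is exactly what futures/continuations are meant to avoid.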

Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Jorge Cardoso Leitão
That would be awesome! I agree with this; it would be really useful, as it
would leverage all the goodies that RDBMSs offer with respect to transactions, etc.

I would probably go for having database-specifics outside of the arrow
project, so that they can be used by other folks beyond arrow, and keep the
arrow-specifics (i.e. conversion from the specific databases' formats to
arrow) as part of the arrow crate. Ideally, as Wes wrote, with some standard
to make it easier to handle different DBs.

I think that there are two layers: one is how to connect to a database, the
other is how to serialize/deserialize. AFAIK PEP 249 covers both layers, as
it standardizes things like `connect` and `tpc_begin`, as well as how
things should be serialized to Python objects (e.g. dates should be
datetime.date). This split is done by postgres for Rust, as it offers 5
crates:
* postgres-async
* postgres-sync (a blocking wrapper of postgres-async)
* postgres-types (to convert to native Rust types <- IMO this one is what we
want to offer in Arrow)
* postgres-TLS
* postgres-openssl

`postgres-sync` implements Iterator (`client.query`), and postgres-async
implements Stream.

One idea is to have a generic iterator/stream adapter that yields
RecordBatches. Implementations of this trait by different providers
would allow them to be used in Arrow and DataFusion.
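
One way this could look in the synchronous case (a rough sketch with a stand-in `RecordBatch` struct and hypothetical trait and driver names, not the arrow crate's actual API):

```rust
// Stand-in for arrow's RecordBatch, just enough for the sketch.
struct RecordBatch {
    num_rows: usize,
}

// Hypothetical adapter trait: any database driver that can produce
// fallible batches implements Iterator (or Stream, in the async case).
trait RecordBatchSource: Iterator<Item = Result<RecordBatch, String>> {}
impl<T: Iterator<Item = Result<RecordBatch, String>>> RecordBatchSource for T {}

// A toy "driver" that yields two fixed-size batches.
struct ToyDriver {
    remaining: usize,
}

impl Iterator for ToyDriver {
    type Item = Result<RecordBatch, String>;
    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        Some(Ok(RecordBatch { num_rows: 1024 }))
    }
}

// Consumers (e.g. DataFusion) depend only on the trait, not the driver.
fn total_rows(source: impl RecordBatchSource) -> usize {
    source.filter_map(|b| b.ok()).map(|b| b.num_rows).sum()
}

fn main() {
    let driver = ToyDriver { remaining: 2 };
    println!("{}", total_rows(driver)); // prints: 2048
}
```

The point of the trait is that a Postgres, MySQL, or SQLite provider could each implement it, and downstream code would not need to know which one it got.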

Besides postgres, one idea is to pick the top databases from this list:

* Oracle
* MySQL
* MsSQL

Another idea is to start by supporting SQLite, which is a good
development env to work with relational databases.

Best,
Jorge





On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale 
wrote:

> Hi Arrow developers
>
> I would like to gauge the appetite for an Arrow SQL connector that:
>
> * Reads and writes Arrow data to and from SQL databases
> * Reads tables and queries into record batches, and writes batches to
> tables (either append or overwrite)
> * Leverages binary SQL formats where available (e.g. PostgreSQL format is
> relatively easy and well-documented)
> * Provides a batch interface that abstracts away the different database
> semantics, and exposes a RecordBatchReader (
> https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html
> ),
> and perhaps a RecordBatchWriter
> * Resides in the Rust repo as either an arrow::sql module (like arrow::csv,
> arrow::json, arrow::ipc) or alternatively is a separate crate in the
> workspace  (*arrow-sql*?)
>
> I would be able to contribute a Postgres reader/writer as a start.
> I could make this a separate crate, but to drive adoption I would prefer
> this living in Arrow, also it can remain updated (sometimes we reorganise
> modules and end up breaking dependencies).
>
> Also, being developed next to DataFusion could allow DF to support SQL
> databases, as this would be yet another datasource.
>
> Some questions:
> * Should such library support async, sync or both IO methods?
> * Other than postgres, what other databases would be interesting? Here I'm
> hoping that once we've established a suitable API, it could be easier to
> natively support more database types.
>
> Potential concerns:
>
> * Sparse database support
> It's a lot of effort to write database connectors, especially if starting
> from scratch (unlike with say JDBC). What if we end up supporting 1 or 2
> database servers?
> Perhaps in that case we could keep the module without publishing it to
> crates.io until we're happy with database support, or even its usage.
>
> * Dependency bloat
> We could feature-gate database types to reduce the number of dependencies
> if one only wants certain DB connectors
>
> * Why not use Java's JDBC adapter?
> I already do this, but sometimes if working on a Rust project, creating a
> separate JVM service solely to extract Arrow data is a lot of effort.
> I also don't think it's currently possible to use the adapter to save Arrow
> data in a database.
>
> * What about Flight SQL extensions?
> There have been discussions around creating Flight SQL extensions, and the
> Rust SQL adapter could implement that and co-exist well.
> From a crate dependency, *arrow-flight* depends on *arrow*, so it could
> also depend on this *arrow-sql* crate.
>
> Please let me know what you think
>
> Regards
> Neville
>


Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Wes McKinney
hi Neville,

In Python we have something called the DB API 2.0 (PEP 249) that
defines an API standard for SQL databases in Python, including an
expectation around the data format of result sets. It sounds like you
need to create the equivalent of that in Rust with Arrow as the API /
format returned by fetch/fetchall operations.

Once you define a standard API for SQL database interactions, then you
can start creating multiple implementations of that API that can be
passed interchangeably into applications. Applications of course are
responsible for knowing which version of SQL is supported by a driver,
but this layer is agnostic to the SQL strings that get passed in for
queries. In Python it's a little easier to manage because of duck
typing (an implementation of the DB API 2.0 does not need to depend on
any libraries) and there's a standard test suite you can use to verify
compliance with the API.
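
For illustration, a PEP 249-style contract with Arrow as the result format might look roughly like this in Rust (all names are hypothetical; `RecordBatch` is a stand-in struct, not the arrow crate's type):

```rust
// Stand-in for arrow's RecordBatch.
struct RecordBatch {
    num_rows: usize,
}

// The hypothetical standard trait: every driver implements this, and
// applications depend only on the trait. SQL dialect differences remain
// the caller's problem, exactly as in PEP 249.
trait ArrowConnection {
    fn execute(&mut self, sql: &str) -> Result<(), String>;
    /// Like DB API's fetchall, but returning columnar batches.
    fn fetch_all(&mut self) -> Result<Vec<RecordBatch>, String>;
}

// One interchangeable implementation: an in-memory fake driver.
struct FakeDriver {
    pending: Vec<RecordBatch>,
}

impl ArrowConnection for FakeDriver {
    fn execute(&mut self, _sql: &str) -> Result<(), String> {
        // A real driver would send the query; here we stage a fake result.
        self.pending = vec![RecordBatch { num_rows: 3 }];
        Ok(())
    }
    fn fetch_all(&mut self) -> Result<Vec<RecordBatch>, String> {
        Ok(std::mem::take(&mut self.pending))
    }
}

// Application code is driver-agnostic: it only sees the trait object.
fn query_rows(conn: &mut dyn ArrowConnection, sql: &str) -> Result<usize, String> {
    conn.execute(sql)?;
    Ok(conn.fetch_all()?.iter().map(|b| b.num_rows).sum())
}

fn main() {
    let mut conn = FakeDriver { pending: vec![] };
    println!("{:?}", query_rows(&mut conn, "SELECT * FROM t"));
}
```

A standard test suite, analogous to the one mentioned for the Python DB API, could then be written once against the trait and run against every driver.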

FWIW, I would like to do the same thing for Arrow C++, to create a
DBAPI 2.0-like API that can be implemented by database driver
interfaces maintained by the community as well as the third party
projects. It might make sense for us to create a version of this API
that can be used from C/CFFI with the Arrow C data interface. I
suspect we'll get to this eventually -- PEP 249 came 10 years into the
Python programming language, and I often liken Arrow's growth pattern
to that of a programming language.

- Wes

On Sat, Sep 26, 2020 at 9:22 PM Neville Dipale  wrote:
>
> Hi Arrow developers
>
> I would like to gauge the appetite for an Arrow SQL connector that:
>
> * Reads and writes Arrow data to and from SQL databases
> * Reads tables and queries into record batches, and writes batches to
> tables (either append or overwrite)
> * Leverages binary SQL formats where available (e.g. PostgreSQL format is
> relatively easy and well-documented)
> * Provides a batch interface that abstracts away the different database
> semantics, and exposes a RecordBatchReader (
> https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html),
> and perhaps a RecordBatchWriter
> * Resides in the Rust repo as either an arrow::sql module (like arrow::csv,
> arrow::json, arrow::ipc) or alternatively is a separate crate in the
> workspace  (*arrow-sql*?)
>
> I would be able to contribute a Postgres reader/writer as a start.
> I could make this a separate crate, but to drive adoption I would prefer
> this living in Arrow, also it can remain updated (sometimes we reorganise
> modules and end up breaking dependencies).
>
> Also, being developed next to DataFusion could allow DF to support SQL
> databases, as this would be yet another datasource.
>
> Some questions:
> * Should such library support async, sync or both IO methods?
> * Other than postgres, what other databases would be interesting? Here I'm
> hoping that once we've established a suitable API, it could be easier to
> natively support more database types.
>
> Potential concerns:
>
> * Sparse database support
> It's a lot of effort to write database connectors, especially if starting
> from scratch (unlike with say JDBC). What if we end up supporting 1 or 2
> database servers?
> Perhaps in that case we could keep the module without publishing it to
> crates.io until we're happy with database support, or even its usage.
>
> * Dependency bloat
> We could feature-gate database types to reduce the number of dependencies
> if one only wants certain DB connectors
>
> * Why not use Java's JDBC adapter?
> I already do this, but sometimes if working on a Rust project, creating a
> separate JVM service solely to extract Arrow data is a lot of effort.
> I also don't think it's currently possible to use the adapter to save Arrow
> data in a database.
>
> * What about Flight SQL extensions?
> There have been discussions around creating Flight SQL extensions, and the
> Rust SQL adapter could implement that and co-exist well.
> From a crate dependency, *arrow-flight* depends on *arrow*, so it could
> also depend on this *arrow-sql* crate.
>
> Please let me know what you think
>
> Regards
> Neville


Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-09-27 Thread Wes McKinney
To be clear, if someone wants to step up as the Plasma maintainer in
Apache Arrow, that's completely fine -- that would be a good outcome.
Many of us had already been concerned for a while about Plasma's
maintenance status -- lots of stale PRs and low engagement on JIRA
issues and mailing list discussions, so now that the hard fork has
happened we want to make sure that we aren't creating the wrong
expectations by shipping a piece of software that has lost its
erstwhile maintainers.

On Sun, Sep 27, 2020 at 6:35 AM Niklas B  wrote:
>
> We too rely heavily on Plasma (we use Ray as well, but also Plasma independent
> of Ray). I’ve started a thread on the ray dev list to see if Ray’s plasma can
> be used standalone outside of ray as well. That would allow us who use Plasma
> to move to a standalone “ray plasma” when/if it’s removed from Arrow.
>
> > On 26 Sep 2020, at 00:30, Wes McKinney  wrote:
> >
> > I'd suggest as a preliminary that we stop packaging Plasma for 1-2
> > releases to see who is affected by the component's removal. Usage may
> > be more widespread than we realize, and we don't have much telemetry
> > to know for certain.
> >
> > On Tue, Aug 18, 2020 at 1:26 PM Antoine Pitrou  wrote:
> >>
> >>
> >> Also, the fact that Ray has forked Plasma means their implementation
> >> becomes potentially incompatible with Arrow's.  So even if we keep
> >> Plasma in our codebase, we can't guarantee interoperability with Ray.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 18/08/2020 à 19:51, Wes McKinney a écrit :
> >>> I do not think there is an urgency to remove Plasma from the Arrow
> >>> codebase (as it currently does not cause much maintenance burden), but
> >>> the reality is that Ray has already hard-forked and so new maintainers
> >>> will need to come out of the woodwork to help support the project if
> >>> it is to continue having a life of its own. I started this thread to
> >>> create more awareness of the issue so that existing Plasma
> >>> stakeholders can make themselves known and possibly volunteer their
> >>> time to develop and maintain the codebase.
> >>>
> >>> On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
> >>>  wrote:
> 
>  We are very interested in Plasma as a stand-alone project. The fork would
>  hit us doubly hard, because it reduces both the appeal of an Arrow-specific
>  use case as well as our planned Ray integration.
> 
>  We are developing effectively a database for network activity data that
>  runs with Arrow as data plane. See https://github.com/tenzir/vast for
>  details. One of our upcoming features is supporting a 1:N output channel
>  using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
>  process the same data set that's exactly materialized in memory once. We
>  currently don't have the developer bandwidth to prioritize this effort, but
>  the concurrent, multi-tool processing capability was one of the main
>  strategic reasons to go with Arrow as data plane. If Plasma has no future,
>  Arrow has a reduced appeal for us in the medium term.
> 
>  We also have Ray as a data consumer on our roadmap, but the dependency
>  chain seems now inverted. If we have to do costly custom plumbing for Ray,
>  with a custom version of Plasma, the Ray integration will lose quite a bit
>  of appeal because it doesn't fit into the existing 1:N model. That is, even
>  though the fork may make sense from a Ray-internal point of view, it
>  decreases the appeal of Ray from the outside. (Again, only speaking shared
>  data plane here.)
> 
>  In the future, we're happy to contribute cycles when it comes to keeping
>  Plasma as a useful standalone project. We recently made sure that static
>  builds work as expected. As of now, we unfortunately cannot commit to
>  anything specific though, but our interest extends to Gandiva, Flight, and
>  lots of other parts of the Arrow ecosystem.
> 
>  On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara wrote:
> 
> > To answer Wes's question, the Plasma inside of Ray is not currently usable
> > in a C++ library context, though it wouldn't be impossible to make that
> > happen.
> >
> > I (or someone) could conduct a simple poll via Google Forms on the user
> > mailing list to gauge demand if we are concerned about breaking a lot of
> > people's workflow.
> >
> > On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou wrote:
> >
> >> Le 15/08/2020 à 17:56, Wes McKinney a écrit :
> >>
> >>> What is

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-09-27 Thread Niklas B
We too rely heavily on Plasma (we use Ray as well, but also Plasma independent
of Ray). I’ve started a thread on the ray dev list to see if Ray’s plasma can
be used standalone outside of ray as well. That would allow us who use Plasma
to move to a standalone “ray plasma” when/if it’s removed from Arrow.

> On 26 Sep 2020, at 00:30, Wes McKinney  wrote:
> 
> I'd suggest as a preliminary that we stop packaging Plasma for 1-2
> releases to see who is affected by the component's removal. Usage may
> be more widespread than we realize, and we don't have much telemetry
> to know for certain.
> 
> On Tue, Aug 18, 2020 at 1:26 PM Antoine Pitrou  wrote:
>> 
>> 
>> Also, the fact that Ray has forked Plasma means their implementation
>> becomes potentially incompatible with Arrow's.  So even if we keep
>> Plasma in our codebase, we can't guarantee interoperability with Ray.
>> 
>> Regards
>> 
>> Antoine.
>> 
>> 
>> Le 18/08/2020 à 19:51, Wes McKinney a écrit :
>>> I do not think there is an urgency to remove Plasma from the Arrow
>>> codebase (as it currently does not cause much maintenance burden), but
>>> the reality is that Ray has already hard-forked and so new maintainers
>>> will need to come out of the woodwork to help support the project if
>>> it is to continue having a life of its own. I started this thread to
>>> create more awareness of the issue so that existing Plasma
>>> stakeholders can make themselves known and possibly volunteer their
>>> time to develop and maintain the codebase.
>>> 
>>> On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin
>>>  wrote:
 
 We are very interested in Plasma as a stand-alone project. The fork would
 hit us doubly hard, because it reduces both the appeal of an Arrow-specific
 use case as well as our planned Ray integration.
 
 We are developing effectively a database for network activity data that
 runs with Arrow as data plane. See https://github.com/tenzir/vast for
 details. One of our upcoming features is supporting a 1:N output channel
 using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
 process the same data set that's exactly materialized in memory once. We
 currently don't have the developer bandwidth to prioritize this effort, but
 the concurrent, multi-tool processing capability was one of the main
 strategic reasons to go with Arrow as data plane. If Plasma has no future,
 Arrow has a reduced appeal for us in the medium term.
 
 We also have Ray as a data consumer on our roadmap, but the dependency
 chain seems now inverted. If we have to do costly custom plumbing for Ray,
 with a custom version of Plasma, the Ray integration will lose quite a bit
 of appeal because it doesn't fit into the existing 1:N model. That is, even
 though the fork may make sense from a Ray-internal point of view, it
 decreases the appeal of Ray from the outside. (Again, only speaking shared
 data plane here.)
 
 In the future, we're happy to contribute cycles when it comes to keeping
 Plasma as a useful standalone project. We recently made sure that static
 builds work as expected. As of
 now, we unfortunately cannot commit to anything specific though, but our
 interest extends to Gandiva, Flight, and lots of other parts of the Arrow
 ecosystem.
 
 On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara wrote:
 
> To answer Wes's question, the Plasma inside of Ray is not currently usable
> in a C++ library context, though it wouldn't be impossible to make that
> happen.
> 
> I (or someone) could conduct a simple poll via Google Forms on the user
> mailing list to gauge demand if we are concerned about breaking a lot of
> people's workflow.
> 
> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou wrote:
> 
>> Le 15/08/2020 à 17:56, Wes McKinney a écrit :
>> 
>>> What isn't clear is whether the Plasma that's in Ray is usable in a
>>> C++ library context (e.g. what we currently ship as libplasma-dev e.g.
>>> on Ubuntu/Debian). That seems still useful, but if the project isn't
>>> being actively maintained / developed (which, given the series of
>>> stale PRs over the last year or two, it doesn't seem to be) it's
>>> unclear whether we want to keep shipping it.
>> 
>> At least on GitHub, the C++ API seems to be getting little use.  Most
>> search results below are forks/copies of the Arrow or Ray codebases.
>> There are also a couple stale experiments:
>> https://github.com/search?l=

[NIGHTLY] Arrow Build Report for Job nightly-2020-09-27-0

2020-09-27 Thread Crossbow


Arrow Build Report for Job nightly-2020-09-27-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0

Failed Tasks:
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py36-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py37-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py38
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-homebrew-cpp
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-spark-branch-3.0:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.7-spark-branch-3.0
- test-conda-python-3.8-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.8-spark-master
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp38
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-appveyor-wheel-win-cp36m

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-clean
- conda-win-vs2017-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-win-vs2017-py36
- conda-win-vs2017-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-win-vs2017-py37
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-202