Re: [DISCUSS] Support of higher bit-width Decimal type

2020-09-25 Thread Micah Kornfield
The decimal256 branch now contains sufficient implementations in Java and
C++ to pass round-trip integration tests.  Some of the Python interop is
missing (but is actively being worked on).

I'll plan on creating a PR to update the specification, with a
corresponding vote, over the next couple of days.

One thing Antoine brought up was whether it makes sense to merge the
contents of the decimal256 branch to master sooner rather than later, to
avoid accumulating a much larger PR.

Thoughts?

Thanks,
Micah

On Sat, Aug 15, 2020 at 8:48 AM Wes McKinney  wrote:

> On Fri, Aug 14, 2020 at 11:17 PM Micah Kornfield 
> wrote:
> >
> > Hi Jacques,
> >
> > Do we have a good definition of what is necessary to add a new data type?
> > > Adding a type but not pulling it through most of the code seems less
> than
> > > ideal since it means one part of Arrow doesn't work with another
> (providing
> > > a less optimal end-user experience).
> >
> > I think what I proposed below is a minimum viable integration plan (and
> > covers previously discussed requirements for new types). It demonstrates
> > interop between two reference implementations and allows conversion
> to/from
> > idiomatic language analogues.  So it covers the basic goal of "arrow
> > interop".
> >
> >
> > For example, would this work include making Gandiva and all the kernels
> > > support this new data type where appropriate?
> >
> > Not initially.  There needs to be a stepping stone to start supporting
> new
> > types. I don't think it is feasible to try to land all of this
> > functionality in one PR.  I'll lend a hand at trying to get support for
> > built-in compute after we get the first part done.
>
> Since (I think?) there are other data types that Gandiva already does
> not support, trying to use decimal256 data with Gandiva would raise
> the same exception that it would raise with an unsupported type.
> Another option would be to insert an implicit cast to decimal128 as a
> stopgap.
>
> > Thanks,
> > Micah
> >
> >
> >
> > On Fri, Aug 14, 2020 at 5:08 PM Jacques Nadeau 
> wrote:
> >
> > > Do we have a good definition of what is necessary to add a new data
> type?
> > > Adding a type but not pulling it through most of the code seems less
> than
> > > ideal since it means one part of Arrow doesn't work with another
> (providing
> > > a less optimal end-user experience).
> > >
> > > For example, would this work include making Gandiva and all the kernels
> > > support this new data type where appropriate?
> > >
> > > On Wed, Aug 5, 2020 at 12:01 PM Wes McKinney 
> wrote:
> > >
> > > > Sounds fine to me. I guess one question is what needs to be
> formalized
> > > > in the Schema.fbs files or elsewhere in the columnar format
> > > > documentation (and we will need to hold an associated vote for that,
> > > > I think).
> > > >
> > > > On Mon, Aug 3, 2020 at 11:30 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Given no objections, we'll go ahead and start implementing support
> for
> > > > 256-bit decimals.
> > > > >
> > > > > I'm considering setting up another branch to develop all the
> components
> > > > so they can be merged to master atomically.
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Tue, Jul 28, 2020 at 6:39 AM Wes McKinney 
> > > > wrote:
> > > > >>
> > > > >> Generally this sounds fine to me. At some point it would be good
> to
> > > > >> add 32-bit and 64-bit decimal support but this can be done in the
> > > > >> future.
> > > > >>
> > > > >> On Tue, Jul 28, 2020 at 7:28 AM Fan Liya 
> > > wrote:
> > > > >> >
> > > > >> > Hi Micah,
> > > > >> >
> > > > >> > Thanks for opening the discussion.
> > > > >> > I am aware of some scenarios where decimal requires more than 16
> > > > bytes, so
> > > > >> > I think it would be beneficial to support this in Arrow.
> > > > >> >
> > > > >> > Best,
> > > > >> > Liya Fan
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jul 28, 2020 at 11:12 AM Micah Kornfield <
> > > > emkornfi...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi Arrow Dev,
> > > > >> > > ZetaSQL (Google's open source standard SQL library) recently
> > > > >> > > introduced a BigNumeric [1] type which requires a 256-bit width
> > > > >> > > to properly support it.  I'd like (possibly in collaboration
> > > > >> > > with some of my colleagues) to add support for 256-bit-width
> > > > >> > > Decimals in Arrow, to support a type corresponding to BigNumeric.
> > > > >> > >
> > > > >> > > In past discussions on this, I don't think we established a
> > > minimum
> > > > bar for
> > > > >> > > supporting additional bit-widths within Arrow.
> > > > >> > >
> > > > >> > > I'd like to propose the following requirements:
> > > > >> > > 1.  A vote agreeing on adding support for a new bitwidth (we
> > > > >> > > can discuss any objections here).
> > > > >> > > 2.  Support in Java and C++ for integration tests verifying
> > > > >> > > the ability to roun
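As background for the bit-width discussion above, here is a minimal sketch of what a 256-bit decimal's unscaled integer could look like: four 64-bit little-endian limbs with carry-propagating addition. This is purely illustrative (the struct and function names are mine); the actual Decimal256 layout is whatever the specification PR defines.

```cpp
#include <array>
#include <cstdint>

// Illustrative 256-bit decimal storage: the unscaled integer held as four
// 64-bit limbs, limbs[0] being least significant (little-endian limb order).
struct Decimal256Sketch {
  std::array<uint64_t, 4> limbs{};
};

// Schoolbook addition with carry propagation across the four limbs.
// Wraps modulo 2^256, like two's-complement machine arithmetic.
Decimal256Sketch Add(const Decimal256Sketch& a, const Decimal256Sketch& b) {
  Decimal256Sketch out;
  uint64_t carry = 0;
  for (int i = 0; i < 4; ++i) {
    uint64_t sum = a.limbs[i] + b.limbs[i];
    uint64_t carry_out = (sum < a.limbs[i]) ? 1 : 0;   // overflow on a+b
    uint64_t sum2 = sum + carry;
    carry_out += (sum2 < sum) ? 1 : 0;                 // overflow on +carry
    out.limbs[i] = sum2;
    carry = carry_out;
  }
  return out;
}
```

Wider widths mostly mean more limbs; the harder parts of a real implementation (rescaling, multiplication, overflow checks against precision) are not shown here.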

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-09-25 Thread Wes McKinney
I'd suggest as a preliminary that we stop packaging Plasma for 1-2
releases to see who is affected by the component's removal. Usage may
be more widespread than we realize, and we don't have much telemetry
to know for certain.

On Tue, Aug 18, 2020 at 1:26 PM Antoine Pitrou  wrote:
>
>
> Also, the fact that Ray has forked Plasma means their implementation
> becomes potentially incompatible with Arrow's.  So even if we keep
> Plasma in our codebase, we can't guarantee interoperability with Ray.
>
> Regards
>
> Antoine.
>
>
> > On 18/08/2020 at 19:51, Wes McKinney wrote:
> > I do not think there is an urgency to remove Plasma from the Arrow
> > codebase (as it currently does not cause much maintenance burden), but
> > the reality is that Ray has already hard-forked and so new maintainers
> > will need to come out of the woodwork to help support the project if
> > it is to continue having a life of its own. I started this thread to
> > create more awareness of the issue so that existing Plasma
> > stakeholders can make themselves known and possibly volunteer their
> > time to develop and maintain the codebase.
> >
> > On Tue, Aug 18, 2020 at 12:02 PM Matthias Vallentin wrote:
> >>
> >> We are very interested in Plasma as a stand-alone project. The fork would
> >> hit us doubly hard, because it reduces both the appeal of an Arrow-specific
> >> use case and that of our planned Ray integration.
> >>
> >> We are effectively developing a database for network activity data that
> >> uses Arrow as its data plane. See https://github.com/tenzir/vast for
> >> details. One of our upcoming features is supporting a 1:N output channel
> >> using Plasma, where multiple downstream tools (Python/Pandas, R, Spark) can
> >> process the same data set, materialized in memory exactly once. We
> >> currently don't have the developer bandwidth to prioritize this effort, but
> >> the concurrent, multi-tool processing capability was one of the main
> >> strategic reasons to go with Arrow as our data plane. If Plasma has no
> >> future, Arrow has reduced appeal for us in the medium term.
> >>
> >> We also have Ray as a data consumer on our roadmap, but the dependency
> >> chain seems now inverted. If we have to do costly custom plumbing for Ray,
> >> with a custom version of Plasma, the Ray integration will lose quite a bit
> >> of appeal because it doesn't fit into the existing 1:N model. That is, even
> >> though the fork may make sense from a Ray-internal point of view, it
> >> decreases the appeal of Ray from the outside. (Again, only speaking shared
> >> data plane here.)
> >>
> >> In the future, we're happy to contribute cycles when it comes to keeping
> >> Plasma a useful standalone project. We recently made sure that static
> >> builds work as expected. As of now, we unfortunately cannot commit to
> >> anything specific, though our interest extends to Gandiva, Flight, and
> >> lots of other parts of the Arrow ecosystem.
> >>
> >> On Tue, Aug 18, 2020 at 4:02 AM Robert Nishihara wrote:
> >>
> >>> To answer Wes's question, the Plasma inside of Ray is not currently usable
> >>> in a C++ library context, though it wouldn't be impossible to make that
> >>> happen.
> >>>
> >>> I (or someone) could conduct a simple poll via Google Forms on the user
> >>> mailing list to gauge demand if we are concerned about breaking a lot of
> >>> people's workflows.
> >>>
> >>> On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou  wrote:
> >>>
>  On 15/08/2020 at 17:56, Wes McKinney wrote:
> >
> > What isn't clear is whether the Plasma that's in Ray is usable in a
> > C++ library context (e.g. what we currently ship as libplasma-dev e.g.
> > on Ubuntu/Debian). That seems still useful, but if the project isn't
> > being actively maintained / developed (which, given the series of
> > stale PRs over the last year or two, it doesn't seem to be) it's
> > unclear whether we want to keep shipping it.
> 
>  At least on GitHub, the C++ API seems to be getting little use.  Most
>  search results below are forks/copies of the Arrow or Ray codebases.
>  There are also a couple of stale experiments:
>  https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
> 
>  Regards
> 
>  Antoine.


Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-25 Thread Weston Pace
So this may be a return to the details.  I think the larger discussion
is a good one to have, but I don't know enough of the code base to
comment on it further.

I finished playing around with the CSV reader.  The code for this
experiment can be found here
(https://github.com/westonpace/arrow/tree/feature/composable-futures).
It is pretty rough, as I was just trying to get things working well
enough to run some experiments.  In particular, the futures code is
not well fleshed out and does not handle invalid statuses correctly.
Most of the experiment is in nested-read-table-deadlock-example.cc.

# Key Observations

* To keep I/O waits off the thread pool, you can use a dedicated thread
for I/O instead of moving to non-blocking I/O (Matthias did mention
this, and the CSV reader was already doing this somewhat).  However,
without futures / asynchronous continuations, you can still have a
thread pool task wait on the I/O thread to populate the channel with
another block of data, so this doesn't keep I/O waits off the thread
pool all by itself.
* Since Arrow already has futures and a thread pool, it isn't too much
additional work to add continuations / promises.  Although this may be
something of a sunk cost fallacy.
* The current CSV reader implementation does not do well with a
high-latency filesystem; an asynchronous solution can work around this.
* The current thread pool implementation deadlocks when used in a
"nested" case; an asynchronous solution can work around this.
* Work still needs to be done so that the asynchronous solution works
as well as the synchronous multi-threaded solution in low-latency
environments.
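The dedicated-I/O-thread pattern from the first observation can be sketched as follows. This is an illustrative stand-in, not Arrow code (`BlockQueue` and the other names are mine): one thread performs all blocking reads and pushes blocks into a bounded queue (the "channel"), so compute threads consume from the queue instead of performing I/O themselves.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// A bounded, closable producer/consumer queue ("channel") of blocks.
class BlockQueue {
 public:
  explicit BlockQueue(size_t capacity) : capacity_(capacity) {}

  // Called by the I/O thread; blocks while the queue is full (backpressure).
  void Push(std::string block) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(std::move(block));
    not_empty_.notify_one();
  }

  // Called by the I/O thread once all blocks have been read.
  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    not_empty_.notify_all();
  }

  // Called by compute threads; returns nullopt once closed and drained.
  std::optional<std::string> Pop() {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return !queue_.empty() || closed_; });
    if (queue_.empty()) return std::nullopt;
    std::string block = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return block;
  }

 private:
  const size_t capacity_;
  std::mutex mu_;
  std::condition_variable not_empty_, not_full_;
  std::queue<std::string> queue_;
  bool closed_ = false;
};

// Simulated reader: an I/O thread produces 5 blocks while one compute
// thread consumes them.  Returns the number of blocks processed.
int RunDemo() {
  BlockQueue queue(/*capacity=*/2);
  std::thread io([&queue] {
    for (int i = 0; i < 5; ++i) queue.Push("block-" + std::to_string(i));
    queue.Close();
  });
  int processed = 0;
  while (queue.Pop()) ++processed;
  io.join();
  return processed;
}
```

Note that this illustrates the caveat in the observation too: a pool task that calls Pop() on an empty queue still blocks a pool thread, which is exactly what continuations avoid.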

# Description

I created 20 CSV files, each of which has 1,000,000 rows and 4 columns (1
integral and 3 decimal).  The files are each around 63MB.  Rather than
use the dataset stuff directly, I simulated a dataset scan by reading
the files in a loop, so this experiment only directly involved the
CSV reader.  The CSV reader supports parallelism in a few places.
First, both the serial and threaded CSV readers have a dedicated
thread for I/O.  So in the serial case, while one thread is computing
the results of a chunk, the I/O thread is fetching the next block.
The threaded CSV reader will process up to X blocks at once, where X is
the capacity of the thread pool.  Also, when processing a block, the
threaded CSV reader will launch a task for converting each column.
These conversion tasks may fail and need to be rerun, so it is
possible for there to be more than one task per column per block.
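The per-block, per-column task decomposition described above can be sketched like this. The names and the use of std::async are illustrative, not the actual Arrow CSV reader internals; the point is just that a failed conversion is resubmitted, which is why there can be more than one task per column per block.

```cpp
#include <future>
#include <stdexcept>
#include <vector>

// Pretend conversion task.  To force a rerun, it fails on the first
// attempt for column 2 (e.g. as if the inferred type was too narrow).
bool ConvertColumn(int column, int attempt) {
  if (column == 2 && attempt == 0) throw std::runtime_error("type mismatch");
  return true;
}

// Launches one conversion task per column, reruns any that failed,
// and returns the total number of tasks launched.
int ConvertBlock(int num_columns) {
  int tasks_launched = 0;
  std::vector<std::future<bool>> futures;
  for (int col = 0; col < num_columns; ++col) {
    futures.push_back(std::async(std::launch::async, ConvertColumn, col, 0));
    ++tasks_launched;
  }
  for (int col = 0; col < num_columns; ++col) {
    try {
      futures[col].get();  // rethrows the task's exception, if any
    } catch (const std::exception&) {
      // Conversion failed; rerun the task for this column.
      auto retry = std::async(std::launch::async, ConvertColumn, col, 1);
      ++tasks_launched;
      retry.get();
    }
  }
  return tasks_launched;
}
```

With 4 columns and one forced failure, ConvertBlock launches 5 tasks, matching the "more than one task per column per block" observation.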

My tests ran on an m5.large EC2 instance connected to EBS storage
(where the CSV files were stored) so disk latency was pretty low (e.g.
compared to something like S3) and there were two dedicated cores
available.  So my thread pool was of size 2.

Converting the CSV columns was fairly processor-intensive.  Each file
took about 0.5 seconds to process, and the experiment ran in about 10
seconds for all 20 files.  The experiment was CPU-bound and very
little time was spent waiting on I/O.  The serial implementation
performed about 20% worse than the threaded implementation.  Top
reported the CPU at pretty much full capacity while the threaded
implementation was running, although this isn't an exact measurement.

The threaded implementation added ~6600 tasks to the thread pool for a
rate of about 660 tasks/second which doesn't seem like it would be
especially taxing to a thread pool.

I then added composability to Arrow futures (a very rough
implementation that would need considerable polish before being used
for anything).  Using this, I created an asynchronous CSV reader.  This
CSV reader acted the same as the multithreaded CSV reader, but it broke
tasks into smaller pieces and used callbacks so that it would not
block on I/O and, as an added bonus, could be nested within an
asynchronous scan.  The asynchronous reader performed about 10% worse
than the multithreaded reader and about 10% better than the serial
reader.  I wouldn't expect it to be faster than the threaded reader
(since the task was not I/O bound), but I didn't expect it to be
slower.  Right now my leading theories are:

* The asynchronous approach read from all 20 files at the same time
instead of one file at a time.  It's possible this slowed down the
I/O; some sort of throttled filesystem could help if this were the
issue.
* My implementation of continuations was pretty sloppy and probably
had too many copies of various data fields (currently my leading
hypothesis).
* The asynchronous implementation created nearly twice as many tasks
as the synchronous implementation, so maybe it was putting stress on
the thread pool.
* There is a fair amount of locking in the threaded CSV reader
algorithm; it's possible the asynchronous implementation interacted
with this locking less favorably.
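For readers unfamiliar with the continuation idea, here is a toy version of "composable futures": a Then() that chains a callback onto a future so downstream work runs when the value arrives, rather than a thread blocking for it. This is my own minimal sketch, not the implementation in the linked branch (which also has to thread through statuses, a thread pool, and so on); all names here are illustrative.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Shared state behind a future: a value that arrives later plus the
// callbacks to run when it does.
template <typename T>
class SharedState {
 public:
  // Completes the future and runs any registered callbacks.
  void MarkFinished(T value) {
    std::vector<std::function<void(const T&)>> callbacks;
    {
      std::lock_guard<std::mutex> lock(mu_);
      value_ = std::move(value);
      callbacks.swap(callbacks_);  // run callbacks outside the lock
    }
    for (auto& cb : callbacks) cb(*value_);
  }

  // Registers a continuation; runs immediately if the value is ready.
  void AddCallback(std::function<void(const T&)> cb) {
    std::unique_lock<std::mutex> lock(mu_);
    if (value_) {
      lock.unlock();
      cb(*value_);
    } else {
      callbacks_.push_back(std::move(cb));
    }
  }

 private:
  std::mutex mu_;
  std::optional<T> value_;
  std::vector<std::function<void(const T&)>> callbacks_;
};

// Then() composes futures: the returned future completes with func's
// result once the source future completes.  No thread ever blocks.
template <typename T, typename F>
auto Then(std::shared_ptr<SharedState<T>> source, F func)
    -> std::shared_ptr<SharedState<decltype(func(std::declval<T>()))>> {
  using U = decltype(func(std::declval<T>()));
  auto next = std::make_shared<SharedState<U>>();
  source->AddCallback(
      [next, func](const T& v) { next->MarkFinished(func(v)); });
  return next;
}
```

In a reader built this way, "read next block" completes a future and Then() schedules parsing/conversion as its continuation, which is what lets the nested scan avoid both blocking on I/O and the nested thread pool deadlock.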

When I added latency to the I/O the serial and threaded
implementations quickly fell behind the asynchronous implementation in
terms of performance.  At about 6-7ms of latency the task starte

Re: discuss about plasma

2020-09-25 Thread Micah Kornfield
>
> Will Plasma be open-sourced as an independent project, or simply dropped
> and left unmaintained?


To my knowledge, the current maintainers have no plans to maintain it as
a separate project.  The source code would still be available from old
git branches, and anyone who wanted to could, in theory, fork it.

In addition, I know that Plasma was born out of the Object store in Ray. Do
> you have any plans for the Object store?


One of the reasons for dropping it from Arrow is that the original
authors forked a version that now lives directly in Ray, and as far as
I know Ray is being actively developed.



On Fri, Sep 25, 2020 at 7:41 AM zhaolong163  wrote:

> Hello, we are using Plasma and are very surprised that it will be removed
> from Arrow.
> If Plasma is removed from Arrow, what will happen to it? Will Plasma
> be open-sourced as an independent project, or simply dropped and left
> unmaintained?
> In addition, I know that Plasma was born out of the Object store in Ray. Do
> you have any plans for the Object store?
>


Re: Thank you

2020-09-25 Thread Krisztián Szűcs
As one of the "orchestrators" thank you very much! :)

On Thu, Aug 27, 2020 at 10:28 PM Lucas Pickup
 wrote:
>
> This is an awesome sentiment. Thank you, release orchestrators and
> contributors!
>
> Cheers,
> Lucas
>
> On Thu, Aug 27, 2020 at 1:26 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > I am writing to just thank all those involved in the release process.
> > Sometimes the work of releases is not fully appreciated within development
> > (where are the PRs ^_^?), but I find it impressive that the release is so
> > smooth for such a complex project, and IMO that is to a large extent due to
> > the team orchestrating the release.
> >
> > Best,
> >
> > Jorge
> >


[NIGHTLY] Arrow Build Report for Job nightly-2020-09-25-0

2020-09-25 Thread Crossbow


Arrow Build Report for Job nightly-2020-09-25-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0

Failed Tasks:
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-drone-conda-linux-gcc-py38-aarch64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-homebrew-cpp
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-spark-branch-3.0:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-test-conda-python-3.7-spark-branch-3.0
- test-conda-python-3.8-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-test-conda-python-3.8-spark-master
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-test-ubuntu-18.04-docs
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-clean
- conda-linux-gcc-py36-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-osx-clang-py38
- conda-win-vs2017-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-win-vs2017-py36
- conda-win-vs2017-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-win-vs2017-py37
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-25-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=