Re: Computational Kernels: the project overview

2021-02-05 Thread Wes McKinney
Sure, feel free to open a Jira issue and / or submit a PR.

On Fri, Feb 5, 2021 at 12:48 PM Ying Zhou  wrote:
>
> Hi,
>
> Speaking of the computational kernels, I found that Cast needs significant
> improvement. Right now it cannot cast a FixedSizeBinary array to a Binary
> one, which caused my ORC tests to be unusually long. I plan to significantly
> expand it within 2 months to include nested types and make ORC (and maybe
> Parquet, Feather, CSV, etc.) testing much simpler. (In case you wonder why
> this is needed: Arrow generally has a lot more types than other formats, hence
> to_arrow(from_arrow(table)) and table are usually not equal and casting is
> necessary.) Is this something we want to work on?
>
> Ying
>
> > On Nov 21, 2020, at 6:08 AM, Kirill Lykov  wrote:
> >
> > Hi,
> >
> > There are some computational kernels in Arrow and it looks like this part
> > is in active development right now. I wonder if there is a document / some
> > emails describing the goal and use cases for this part of the code
> > base. It would be very interesting to know a bit more and I would like to
> > contribute at some point.
> > I'm interested because I am developing a proof of concept for a
> > declarative language to perform statistical computations on top of Gandiva.
> >
> > --
> > Best regards,
> > Kirill Lykov
>


Re: JIRA grooming

2021-02-05 Thread Joris Van den Bossche
Personally, I watch the "iss...@arrow.apache.org" mailing list, which only
sends one email for a new JIRA creation to have an overview of only new
JIRAs (and based on that I subscribe to JIRAs that interest me to get more
email notifications).
So I have a slight preference that people keep using the [C++]/[Rust]/..
manually in their JIRA title instead of relying on a bot for that part, as
it would make the issues mailing list less useful.

Joris

On Fri, 5 Feb 2021 at 17:19, Neal Richardson 
wrote:

> I am all for automation. I'll see if I can carve out some time to work on
> that next week.
>
> Neal
>
> On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney  wrote:
>
> > It occurs to me we could (relatively) easily program a bot to apply
> > these "title tags" automatically based on what's in the Component
> > field. What do you think?
> >
> > On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson
> >  wrote:
> > >
> > >  Hi folks,
> > > Just a reminder to please make sure your JIRA issue titles start with
> > > the subproject/language in square brackets ([Rust], [Python], etc.).
> > > If you're reviewing or triaging issues that others have reported,
> > > please do clean them up. While it is technically redundant to the
> > > component field, it is greatly helpful for being able to scan the JIRA
> > > activity feed, and it's important for making our changelogs readable.
> > >
> > > Thanks,
> > > Neal
> >
>


Re: Computational Kernels: the project overview

2021-02-05 Thread Ying Zhou
Hi,

Speaking of the computational kernels, I found that Cast needs significant
improvement. Right now it cannot cast a FixedSizeBinary array to a Binary one,
which caused my ORC tests to be unusually long. I plan to significantly expand
it within 2 months to include nested types and make ORC (and maybe Parquet,
Feather, CSV, etc.) testing much simpler. (In case you wonder why this is
needed: Arrow generally has a lot more types than other formats, hence
to_arrow(from_arrow(table)) and table are usually not equal and casting is
necessary.) Is this something we want to work on?
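
For readers unfamiliar with the two layouts, here is a minimal Python sketch
of what such a cast involves, assuming the usual Arrow memory layouts: a
FixedSizeBinary array is one contiguous values buffer with a fixed byte
width, while a Binary array is a values buffer plus length + 1 offsets. The
helper names are hypothetical, validity bitmaps (nulls) are ignored, and this
is not how Arrow's Cast kernel is implemented.

```python
from typing import List, Tuple


def fixed_size_binary_to_binary(data: bytes, byte_width: int) -> Tuple[List[int], bytes]:
    """Re-describe a fixed-width values buffer as (offsets, data).

    Hypothetical helper for illustration only; nulls are not handled.
    """
    if byte_width <= 0:
        raise ValueError("byte_width must be positive in this sketch")
    if len(data) % byte_width != 0:
        raise ValueError("data length is not a multiple of byte_width")
    length = len(data) // byte_width
    # A variable-length binary array keeps length + 1 offsets into the values
    # buffer; for fixed-width input they are just multiples of byte_width.
    offsets = [i * byte_width for i in range(length + 1)]
    # The values buffer itself is reused unchanged.
    return offsets, data


def binary_value(offsets: List[int], data: bytes, i: int) -> bytes:
    """Read the i-th value out of an (offsets, data) pair."""
    return data[offsets[i]:offsets[i + 1]]
```

For example, fixed_size_binary_to_binary(b"abcdefgh", 4) gives the offsets
[0, 4, 8], and binary_value on that result reads back b"abcd" and b"efgh".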

Ying

> On Nov 21, 2020, at 6:08 AM, Kirill Lykov  wrote:
> 
> Hi,
> 
> There are some computational kernels in Arrow and it looks like this part is
> in active development right now. I wonder if there is a document / some
> emails describing the goal and use cases for this part of the code
> base. It would be very interesting to know a bit more and I would like to
> contribute at some point.
> I'm interested because I am developing a proof of concept for a declarative
> language to perform statistical computations on top of Gandiva.
> 
> -- 
> Best regards,
> Kirill Lykov



Re: JIRA grooming

2021-02-05 Thread Neal Richardson
I personally filter all JIRA emails to the trash--agree that it's too noisy
to pay attention to. The Zulip chat app that Ursa Labs hosts (happy to
invite anyone to it, just email me) has a reasonable threaded view of JIRA
activity that I rely on. It's not perfect either but it's much more
manageable.

We have some code somewhere that we used to run when the ASF's JIRA
integration was broken and unmaintained; I'll dig that up. It should be a good
starting point.

Neal

On Fri, Feb 5, 2021 at 8:54 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> I would love that also!
>
> Atm we need to:
> * add the tag to the jira issue
> * add the component in the dropdown for components
> * add the component to the PR
>
> In case of multiple components, the above is per component, per PR.
>
> IMO we should only have to select the component from one place, which IMO
> should be the component dropdown in JIRA, which has the strongest
> validation and is thus less prone to mistakes.
> From that component field, we can derive the squashed commit name (e.g. in
> dev/merge_pr.py, read from it and use it to create the commit message), and
> also use it to populate the changelog accordingly.
>
> My concern with a jira bot is that people are already heavily spammed by
> JIRA. On a new PR, this is roughly my email activity:
>
> * email from github with the bot adding the link to jira
> * email from JIRA with an update that the bot added a link to jira
> [if I forgot to place the jira issue]
> * email from github with the bot adding the message about a missing jira
> issue
> * email from JIRA with an update that the bot added a message about a
> missing jira issue
> [if I forgot to place the jira issue]
> * email from github with coverage
> * email from JIRA with an update that the bot added coverage report
> * email from github that someone commented/reviewed etc
> * email from JIRA with an update that someone commented/reviewed etc
> * [repeat for every activity on the PR]
>
> imo this is way too much verbosity, especially the github+JIRA emails that
> are copies of each other. If we start changing titles on JIRA, there will be
> yet another email from JIRA with that update. Note that emails from JIRA are
> administered globally on the project; the email from github for updates on
> the PR is imo way more relevant (because it is one click from the email to
> the exact comment). I suspect that many people either ignore JIRA, or have
> some filter to ignore it, which imo is bad because important discussions do
> happen in JIRA - they are just a needle in the haystack (I am curious as to
> whether folks have a different setup here!)
>
> Regardless, I am up for pairing with you, Neal, to work on this front to
> alleviate this if others also feel some pain with this.
>
> Best,
> Jorge
>
>
>
> On Fri, Feb 5, 2021 at 5:19 PM Neal Richardson <
> neal.p.richard...@gmail.com>
> wrote:
>
> > I am all for automation. I'll see if I can carve out some time to work on
> > that next week.
> >
> > Neal
> >
> > On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney  wrote:
> >
> > > It occurs to me we could (relatively) easily program a bot to apply
> > > these "title tags" automatically based on what's in the Component
> > > field. What do you think?
> > >
> > > On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson
> > >  wrote:
> > > >
> > > >  Hi folks,
> > > > > Just a reminder to please make sure your JIRA issue titles start
> > > > > with the subproject/language in square brackets ([Rust], [Python],
> > > > > etc.). If you're reviewing or triaging issues that others have
> > > > > reported, please do clean them up. While it is technically redundant
> > > > > to the component field, it is greatly helpful for being able to scan
> > > > > the JIRA activity feed, and it's important for making our changelogs
> > > > > readable.
> > > >
> > > > Thanks,
> > > > Neal
> > >
> >
>


Re: Computational Kernels: the project overview

2021-02-05 Thread Micah Kornfield
Welcome Aldrin,
This sounds like a very reasonable way to start contributing.

-Micah

On Fri, Jan 29, 2021 at 1:53 PM Aldrin  wrote:

> Hello!
>
> I am trying to use the expression and compute APIs for query processing,
> and in my searches so far, this thread seems to be the most relevant.
>
> A lot of the operators and functions that I need in the short-term appear
> to be implemented, but the documentation seems sparse or at least not all
> in the same place. The document that Micah linked has been useful, and I've
> been perusing the source, but I was wondering if some initial contributions
> I can make would be to document the designed model and then propose further
> changes or designs afterwards.
>
> Is anyone already putting effort in (or completed) consolidating or
> expanding documentation on the compute and dataset/expression APIs and how
> they interact, etc.?
>
> Thanks!
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Mon, Nov 30, 2020 at 7:40 AM Wes McKinney  wrote:
>
> > One objective of the precompiled kernels project is to have meaningful
> > computational functionality in a package that does not need to include
> > the LLVM runtime -- to require the LLVM dependency even for simple
> > functions would more than double the size of our Python packages, for
> > example.
> >
> > There is currently little code sharing between functions that do
> > identical work in arrow::compute:: versus gandiva:: -- this has been
> > discussed, but it needs a champion to do something about it. When I
> > was working on the new function framework earlier this year, I spent a
> > day or so perusing src/gandiva/precompiled and reasoned it would be a
> > prohibitive amount of refactoring for me to undertake at that time. In
> > principle many of these functions (e.g. string functions) can be
> > incrementally refactored into reusable inline functions / templates
> > for improved code reuse. We could also explore common infrastructure
> > for unit testing and benchmarking. Anything is possible if enough
> > engineering time is invested.
> >
> > I would hope in the future to see a generalized expression API as part
> > of a logical query plan-type system (for query processing) that has
> > the ability to use Gandiva (if it's available) to compile
> > subexpressions for better performance. I had hoped to spend some time
> > on this myself earlier this year, but I've gotten busy with some other
> > things and won't be able to devote much development time to this
> > myself.
> >
> > - Wes
> >
> > On Sun, Nov 29, 2020 at 11:18 PM Micah Kornfield wrote:
> > >
> > > >
> > > > There are some computational kernels in Arrow and it looks like this
> > > > part is in active development right now. I wonder if there is a
> > > > document / some emails describing the goal and use cases for this
> > > > part of the code base. It would be very interesting to know a bit more
> > > > and I would like to contribute at some point.
> > >
> > >
> > > https://docs.google.com/document/d/1LFk3WRfWGQbJ9uitWwucjiJsZMqLh8lC1vAUOscLtj8/edit
> > > talks about some of the goals of the compute module.
> > >
> > > > I'm interested because I am developing a proof of concept for a
> > > > declarative language to perform statistical computations on top of
> > > > Gandiva.
> > >
> > >
> > > I think upon cursory examination someone (maybe Wes) thought Gandiva
> > > and the compute kernels might not play nicely together, but I can't
> > > find a reference to that at the moment.
> > >
> > >
> > > On Sat, Nov 21, 2020 at 3:09 AM Kirill Lykov wrote:
> > >
> > > > Hi,
> > > >
> > > > There are some computational kernels in Arrow and it looks like this
> > > > part is in active development right now. I wonder if there is a
> > > > document / some emails describing the goal and use cases for this
> > > > part of the code base. It would be very interesting to know a bit more
> > > > and I would like to contribute at some point.
> > > > I'm interested because I am developing a proof of concept for a
> > > > declarative language to perform statistical computations on top of
> > > > Gandiva.
> > > >
> > > > --
> > > > Best regards,
> > > > Kirill Lykov
> > > >
> >
>


Re: JIRA grooming

2021-02-05 Thread Jorge Cardoso Leitão
I would love that also!

Atm we need to:
* add the tag to the jira issue
* add the component in the dropdown for components
* add the component to the PR

In case of multiple components, the above is per component, per PR.

IMO we should only have to select the component from one place, which IMO
should be the component dropdown in JIRA, which has the strongest
validation and is thus less prone to mistakes.
From that component field, we can derive the squashed commit name (e.g. in
dev/merge_pr.py, read from it and use it to create the commit message), and
also use it to populate the changelog accordingly.
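
To make the idea concrete, here is a minimal Python sketch of deriving a
squashed commit title from the component field. It assumes the
"ARROW-NNNN: [Component] Summary" title shape; the function name, the issue
key used in the example, and the choice to wrap each component name in
brackets as-is are assumptions for illustration, and none of this reflects
what dev/merge_pr.py currently does.

```python
import re
from typing import List

# One or more hand-written "[Tag]" groups at the start of a summary.
_LEADING_TAGS = re.compile(r"^\s*(?:\[[^\]]+\]\s*)+")


def squashed_commit_title(issue_key: str, summary: str, components: List[str]) -> str:
    """Build a commit title whose tags come only from the component field."""
    # Drop any manually typed tag prefix so the components are the single
    # source of truth for the tags.
    bare_summary = _LEADING_TAGS.sub("", summary, count=1).strip()
    tags = "".join(f"[{name}]" for name in components)
    # rstrip() avoids a double space when there are no components.
    prefix = f"{issue_key}: {tags}".rstrip()
    return f"{prefix} {bare_summary}"
```

With a placeholder key, squashed_commit_title("ARROW-0000", "[Rust] Fix cast",
["Rust", "Python"]) yields "ARROW-0000: [Rust][Python] Fix cast". The same
tags could then be reused when populating the changelog.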

My concern with a jira bot is that people are already heavily spammed by
JIRA. On a new PR, this is roughly my email activity:

* email from github with the bot adding the link to jira
* email from JIRA with an update that the bot added a link to jira
[if I forgot to place the jira issue]
* email from github with the bot adding the message about a missing jira
issue
* email from JIRA with an update that the bot added a message about a
missing jira issue
[if I forgot to place the jira issue]
* email from github with coverage
* email from JIRA with an update that the bot added coverage report
* email from github that someone commented/reviewed etc
* email from JIRA with an update that someone commented/reviewed etc
* [repeat for every activity on the PR]

imo this is way too much verbosity, especially the github+JIRA emails that
are copies of each other. If we start changing titles on JIRA, there will be
yet another email from JIRA with that update. Note that emails from JIRA are
administered globally on the project; the email from github for updates on
the PR is imo way more relevant (because it is one click from the email to
the exact comment). I suspect that many people either ignore JIRA, or have
some filter to ignore it, which imo is bad because important discussions do
happen in JIRA - they are just a needle in the haystack (I am curious as to
whether folks have a different setup here!)

Regardless, I am up for pairing with you, Neal, to work on this front to
alleviate this if others also feel some pain with this.

Best,
Jorge



On Fri, Feb 5, 2021 at 5:19 PM Neal Richardson 
wrote:

> I am all for automation. I'll see if I can carve out some time to work on
> that next week.
>
> Neal
>
> On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney  wrote:
>
> > It occurs to me we could (relatively) easily program a bot to apply
> > these "title tags" automatically based on what's in the Component
> > field. What do you think?
> >
> > On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson
> >  wrote:
> > >
> > >  Hi folks,
> > > Just a reminder to please make sure your JIRA issue titles start with
> > > the subproject/language in square brackets ([Rust], [Python], etc.).
> > > If you're reviewing or triaging issues that others have reported,
> > > please do clean them up. While it is technically redundant to the
> > > component field, it is greatly helpful for being able to scan the JIRA
> > > activity feed, and it's important for making our changelogs readable.
> > >
> > > Thanks,
> > > Neal
> >
>


Re: JIRA grooming

2021-02-05 Thread Neal Richardson
I am all for automation. I'll see if I can carve out some time to work on
that next week.

Neal

On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney  wrote:

> It occurs to me we could (relatively) easily program a bot to apply
> these "title tags" automatically based on what's in the Component
> field. What do you think?
>
> On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson
>  wrote:
> >
> >  Hi folks,
> > Just a reminder to please make sure your JIRA issue titles start with the
> > subproject/language in square brackets ([Rust], [Python], etc.). If
> > you're reviewing or triaging issues that others have reported, please do
> > clean them up. While it is technically redundant to the component field,
> > it is greatly helpful for being able to scan the JIRA activity feed, and
> > it's important for making our changelogs readable.
> >
> > Thanks,
> > Neal
>


Re: JIRA grooming

2021-02-05 Thread Wes McKinney
It occurs to me we could (relatively) easily program a bot to apply
these "title tags" automatically based on what's in the Component
field. What do you think?
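
A minimal Python sketch of the title-rewriting part of such a bot, assuming
it is handed the issue title and its list of component names. The
component-to-tag mapping and the function names are hypothetical, and
fetching or updating issues through JIRA is left out entirely.

```python
import re
from typing import Dict, List

# Hypothetical mapping from JIRA component names to title tags.
COMPONENT_TAGS: Dict[str, str] = {
    "C++": "[C++]",
    "Python": "[Python]",
    "Rust": "[Rust]",
    "R": "[R]",
}

# One or more "[Tag]" groups at the very start of a title.
_LEADING_TAGS = re.compile(r"^\s*((?:\[[^\]]+\]\s*)+)")


def existing_tags(title: str) -> List[str]:
    """Return the "[Tag]" groups a title already starts with."""
    match = _LEADING_TAGS.match(title)
    if not match:
        return []
    return re.findall(r"\[[^\]]+\]", match.group(1))


def apply_title_tags(title: str, components: List[str]) -> str:
    """Append tags for components whose tag is missing from the title."""
    present = existing_tags(title)
    tags = list(present)
    for component in components:
        tag = COMPONENT_TAGS.get(component)
        if tag is not None and tag not in tags:
            tags.append(tag)
    if tags == present:
        # Nothing to add: leave the title exactly as the reporter wrote it.
        return title
    rest = _LEADING_TAGS.sub("", title, count=1).strip()
    return "".join(tags) + " " + rest
```

For instance, apply_title_tags("Cast fails on nested types", ["C++"]) returns
"[C++] Cast fails on nested types", while a title that already carries the
right tags is returned unchanged, so the bot would not need to touch it.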

On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson
 wrote:
>
>  Hi folks,
> Just a reminder to please make sure your JIRA issue titles start with the
> subproject/language in square brackets ([Rust], [Python], etc.). If you're
> reviewing or triaging issues that others have reported, please do clean
> them up. While it is technically redundant to the component field, it is
> greatly helpful for being able to scan the JIRA activity feed, and it's
> important for making our changelogs readable.
>
> Thanks,
> Neal


JIRA grooming

2021-02-05 Thread Neal Richardson
 Hi folks,
Just a reminder to please make sure your JIRA issue titles start with the
subproject/language in square brackets ([Rust], [Python], etc.). If you're
reviewing or triaging issues that others have reported, please do clean
them up. While it is technically redundant to the component field, it is
greatly helpful for being able to scan the JIRA activity feed, and it's
important for making our changelogs readable.
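
As an illustration of why the prefix helps with changelogs, here is a minimal
Python sketch that groups issue titles by their first square-bracket tag. The
function name and the "Untagged" bucket are assumptions for illustration, not
part of the actual release tooling.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# The first "[Tag]" at the start of a title, e.g. "[Rust]".
_FIRST_TAG = re.compile(r"^\s*\[([^\]]+)\]")


def group_by_subproject(titles: Iterable[str]) -> Dict[str, List[str]]:
    """Group titles under their leading tag; untagged titles share a bucket."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        match = _FIRST_TAG.match(title)
        key = match.group(1) if match else "Untagged"
        groups[key].append(title)
    return dict(groups)
```

Titles without a prefix all land in the "Untagged" bucket, which is the
cleanup work the reminder above is asking reviewers to do.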

Thanks,
Neal


Re: [Rust][DataFusion] DataFusion Overview / Architecture

2021-02-05 Thread Andrew Lamb
Thanks -- I plan to start on my slides next week

On Thu, Feb 4, 2021 at 2:32 PM Fernando Herrera <
fernando.j.herr...@gmail.com> wrote:

> Hi Andy. I would like to take you up on your offer and get a copy of your
> book. It would help me better understand DataFusion and help Andrew with
> the project documentation.
>
> Fernando
>
> On Thu, 4 Feb 2021, 18:01 Andy Grove,  wrote:
>
> > That's correct, Remi. I built the Kotlin query engine from scratch as I
> > was writing the book, and it does follow the same basic design as
> > DataFusion. I think it would be a useful reference for anyone writing up
> > some DataFusion-specific documentation and I am happy to send a free copy
> > to anyone who is working on that.
> >
> > On Thu, Feb 4, 2021 at 10:30 AM Rémi Dettai  wrote:
> >
> > > Hi Andrew!
> > >
> > > The book "How query engines work" (
> > > https://leanpub.com/how-query-engines-work) that Andy wrote is pretty
> > > great! It documents query engine APIs in Kotlin and not Rust, as it was
> > > written during earlier Ballista experimentations, but almost all items
> > > still apply to DataFusion (feel free to correct me if I'm wrong Andy).
> > >
> > > Remi
> > >
> > >
> > >
> > > On Thu, Feb 4, 2021 at 12:33, Andrew Lamb wrote:
> > >
> > > > Does anyone have any high level architectural / overview material
> > > > about DataFusion that they can share or point me at?
> > > >
> > > > I am planning on creating a high level / architectural overview of
> > > > DataFusion (as it exists today) as a set of slides for a Tech Talk
> > > > (will be open to the public) sometime in March.
> > > >
> > > > I was hoping to take a friendly look at any material others may have
> > > > put together, if it exists.
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > >
> >
>


[NIGHTLY] Arrow Build Report for Job nightly-2021-02-05-0

2021-02-05 Thread Crossbow


Arrow Build Report for Job nightly-2021-02-05-0

All tasks: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0

Failed Tasks:
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py39-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py39-aarch64
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-test-conda-python-3.8-jpype
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-test-ubuntu-18.04-docs

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-clean
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py39-cuda
- conda-osx-clang-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py37-r40
- conda-osx-clang-py38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py38
- conda-osx-clang-py39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py39
- conda-win-vs2017-py36-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-win-vs2017-py36-r36
- conda-win-vs2017-py37-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-win-vs2017-py37-r40
- conda-win-vs2017-py38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-debian-buster-amd64
- example-cpp-minimal-build-static-system-dependency:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-example-cpp-minimal-build-static
- gandiva-jar-osx:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-gandiva-jar-osx
- gandiva-jar-ubuntu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-gandiva-jar-ubuntu
- homebrew-cpp:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-homebrew-r-autobrew
- nuget:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-nuget
- python-sdist:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github