Re: Computational Kernels: the project overview
Sure, feel free to open a Jira issue and / or submit a PR. On Fri, Feb 5, 2021 at 12:48 PM Ying Zhou wrote: > > Hi, > > Speaking of the computational kernels I found that Cast needs significant > improvement. Right now it can not cast a FixedSizeBinary array to a Binary > one which caused my ORC tests to be unusually long. I plan to significantly > expand it within 2 months to include nested types and make ORC (and maybe > Parquet, Feather, CSV etc) testing much simpler. (In case you wonder why this > is needed..since Arrow generally have a lot more formats than other hence > to_arrow(from_arrow(table)) and table are usually not equal and casting is > necessary.) Is this something we want to work on? > > Ying > > > On Nov 21, 2020, at 6:08 AM, Kirill Lykov wrote: > > > > Hi, > > > > There are some computations kernels in arrow and it looks that this part is > > in active development right now. I wonder if there is a document / some > > emails describing what is the goal and uses cases for this part of the code > > base. Would be very interesting to know a bit more and I would like to > > contribute at some point. > > I'm interested because I develop a Proof-of-concept for a declarative > > language to perform statistical computations on top of gandiva. > > > > -- > > Best regards, > > Kirill Lykov >
Re: JIRA grooming
Personally, I watch the "iss...@arrow.apache.org" mailing list, which sends only one email per newly created JIRA, so I get an overview of new JIRAs only (and based on that I subscribe to the JIRAs that interest me to get further email notifications). So I have a slight preference that people keep adding the [C++]/[Rust]/.. tags manually in their JIRA title instead of relying on a bot for that part, as it would make the issues mailing list less useful. Joris On Fri, 5 Feb 2021 at 17:19, Neal Richardson wrote: > I am all for automation. I'll see if I can carve out some time to work on > that next week. > > Neal > > On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney wrote: > > > It occurs to me we could (relatively) easily program a bot to apply > > these "title tags" automatically based on what's in the Component > > field. What do you think? > > > > On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson > > wrote: > > > > > > Hi folks, > > > Just a reminder to please make sure your JIRA issue titles start with > the > > > subproject/language in square brackets ([Rust], [Python], etc.). If > > you're > > > reviewing or triaging issues that others have reported, please do clean > > > them up. While it is technically redundant to the component field, it > is > > > greatly helpful for being able to scan the JIRA activity feed, and it's > > > important for making our changelogs readable. > > > > > > Thanks, > > > Neal > > >
Re: Computational Kernels: the project overview
Hi, Speaking of the computational kernels I found that Cast needs significant improvement. Right now it can not cast a FixedSizeBinary array to a Binary one which caused my ORC tests to be unusually long. I plan to significantly expand it within 2 months to include nested types and make ORC (and maybe Parquet, Feather, CSV etc) testing much simpler. (In case you wonder why this is needed..since Arrow generally have a lot more formats than other hence to_arrow(from_arrow(table)) and table are usually not equal and casting is necessary.) Is this something we want to work on? Ying > On Nov 21, 2020, at 6:08 AM, Kirill Lykov wrote: > > Hi, > > There are some computations kernels in arrow and it looks that this part is > in active development right now. I wonder if there is a document / some > emails describing what is the goal and uses cases for this part of the code > base. Would be very interesting to know a bit more and I would like to > contribute at some point. > I'm interested because I develop a Proof-of-concept for a declarative > language to perform statistical computations on top of gandiva. > > -- > Best regards, > Kirill Lykov
Re: JIRA grooming
I personally filter all JIRA emails to the trash--agree that it's too noisy to pay attention to. The zulip chat app that Ursa Labs hosts (happy to invite anyone to it, just email me) has a reasonable threaded view of JIRA activity that I rely on. It's not perfect either but it's much more manageable. We have some code somewhere that we used to run when the ASF's JIRA integration was broken and unmaintained, I'll dig that up. Should be a good starting point. Neal On Fri, Feb 5, 2021 at 8:54 AM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > I would love that also! > > Atm we need to: > * add the tag to the jira issue > * add the component in the dropdown for components > * add the component to the PR > > In case of multiple components, the above is per component, per PR. > > IMO we should only have to select the component from one place, which IMO > should be the component dropdown in JIRA, which has the strongest > validation and is thus less prone to mistakes. > From that component field, we can derive the squashed commit name (e.g. in > dev/merge_pr.py, read from it and use it to create the commit message), and > also use it to populate the changelog accordingly. > > My concern with a jira bot is that people are already heavily spammed by > JIRA. 
On a new PR, this is roughly my email activity: > > * email from github with the bot adding the link to jira > * email from JIRA with an update that the bot added a link to jira > [if I forgot to place the jira issue] > * email from github with the bot adding the message about a missing jira > issue > * email from JIRA with an update that the bot added a message about a > missing jira issue > [if forgot to place the jira issue] > * email from github with coverage > * email from JIRA with an update that the bot added coverage report > * email from github that someone commented/reviewed etc > * email from JIRA with an update that someone commented/reviewed etc > * [repeat for every activity on the PR] > > imo this is way too much verbosity, specially the github+JIRA with copies > of each other. If we start changing titles on JIRA, there will be yet > another email from JIRA with that update. Note that emails from JIRA are > administered globally on the project, the email from github for updates on > the PR is imo way more relevant (because it is a one-click from email to > the exact comment). I suspect that many people either ignore JIRA, or have > some filter to ignore it, which imo is bad because important discussions do > happen in JIRA - they are just a needle in the haystack (I am curious as to > whether folks have a different setup here!) > > Regardless, I am up to pair with you Neal to work on this front to > alleviate this if others also feel some pain with this. > > Best, > Jorge > > > > On Fri, Feb 5, 2021 at 5:19 PM Neal Richardson < > neal.p.richard...@gmail.com> > wrote: > > > I am all for automation. I'll see if I can carve out some time to work on > > that next week. > > > > Neal > > > > On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney wrote: > > > > > It occurs to me we could (relatively) easily program a bot to apply > > > these "title tags" automatically based on what's in the Component > > > field. What do you think? 
> > > > > > On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson > > > wrote: > > > > > > > > Hi folks, > > > > Just a reminder to please make sure your JIRA issue titles start with > > the > > > > subproject/language in square brackets ([Rust], [Python], etc.). If > > > you're > > > > reviewing or triaging issues that others have reported, please do > clean > > > > them up. While it is technically redundant to the component field, it > > is > > > > greatly helpful for being able to scan the JIRA activity feed, and > it's > > > > important for making our changelogs readable. > > > > > > > > Thanks, > > > > Neal > > > > > >
Re: Computational Kernels: the project overview
Welcome Aldrin, This sounds like a very reasonable way to start contributing. -Micah On Fri, Jan 29, 2021 at 1:53 PM Aldrin wrote: > Hello! > > I am trying to use the expression and compute APIs for query processing, > and in my searches so far, this thread seems to be the most relevant. > > A lot of the operators and functions that I need in the short-term appear > to be implemented, but the documentation seems sparse or at least not all > in the same place. The document that Micah linked has been useful, and I've > been perusing the source, but I was wondering if some initial contributions > I can make would be to document the designed model and then propose further > changes or designs afterwards. > > Is anyone already putting effort in (or completed) consolidating or > expanding documentation on the compute and dataset/expression APIs and how > they interact, etc.? > > Thanks! > > Aldrin Montana > Computer Science PhD Student > UC Santa Cruz > > > On Mon, Nov 30, 2020 at 7:40 AM Wes McKinney wrote: > > > One objective of the precompiled kernels project is to have meaningful > > computational functionality in a package that does not need to include > > the LLVM runtime -- to require the LLVM dependency even for simple > > functions would more than double the size of our Python packages, for > > example. > > > > There is currently little code sharing between functions that do > > identical work in arrow::compute:: versus gandiva:: -- this has been > > discussed, but it needs a champion to do something about it. When I > > was working on the new function framework earlier this year, I spent a > > day or so perusing src/gandiva/precompiled and reasoned it would be a > > prohibitive amount of refactoring for me to undertake at that time. In > > principle many of these functions (e.g. string functions) can be > > incrementally refactored into reusable inline functions / templates > > for improved code reuse. 
We could also explore common infrastructure > > for unit testing and benchmarking. Anything is possible if enough > > engineering time is invested. > > > > I would hope in the future to see a generalized expression API as part > > of a logical query plan-type system (for query processing) that has > > the ability to use Gandiva (if it's available) to compile > > subexpressions for better performance. I had hoped to spend some time > > on this myself earlier this year, but I've gotten busy with some other > > things and won't be able to devote much development time to this > > myself. > > > > - Wes > > > > On Sun, Nov 29, 2020 at 11:18 PM Micah Kornfield > > wrote: > > > > > > > > > > > There are some computations kernels in arrow and it looks that this > > part is > > > > in active development right now. I wonder if there is a document / > some > > > > emails describing what is the goal and uses cases for this part of > the > > code > > > > base. Would be very interesting to know a bit more and I would like > to > > > > contribute at some point. > > > > > > > > > > > > https://docs.google.com/document/d/1LFk3WRfWGQbJ9uitWwucjiJsZMqLh8lC1vAUOscLtj8/edit > > > talks about some of the goals of the compute module. > > > > > > I'm interested because I develop a Proof-of-concept for a declarative > > > > language to perform statistical computations on top of gandiva. > > > > > > > > > I think upon cursory examination someone (maybe Wes) thought Gandiva > and > > > the compute kernels might not play nicely together, but I can't find a > > > reference to that at the moment. > > > > > > > > > On Sat, Nov 21, 2020 at 3:09 AM Kirill Lykov > > wrote: > > > > > > > Hi, > > > > > > > > There are some computations kernels in arrow and it looks that this > > part is > > > > in active development right now. I wonder if there is a document / > some > > > > emails describing what is the goal and uses cases for this part of > the > > code > > > > base. 
Would be very interesting to know a bit more and I would like > to > > > > contribute at some point. > > > > I'm interested because I develop a Proof-of-concept for a declarative > > > > language to perform statistical computations on top of gandiva. > > > > > > > > -- > > > > Best regards, > > > > Kirill Lykov > > > > > > >
Re: JIRA grooming
I would love that also! Atm we need to: * add the tag to the jira issue * add the component in the dropdown for components * add the component to the PR In case of multiple components, the above is per component, per PR. IMO we should only have to select the component from one place, which IMO should be the component dropdown in JIRA, which has the strongest validation and is thus less prone to mistakes. From that component field, we can derive the squashed commit name (e.g. in dev/merge_pr.py, read from it and use it to create the commit message), and also use it to populate the changelog accordingly. My concern with a jira bot is that people are already heavily spammed by JIRA. On a new PR, this is roughly my email activity: * email from github with the bot adding the link to jira * email from JIRA with an update that the bot added a link to jira [if I forgot to place the jira issue] * email from github with the bot adding the message about a missing jira issue * email from JIRA with an update that the bot added a message about a missing jira issue [if forgot to place the jira issue] * email from github with coverage * email from JIRA with an update that the bot added coverage report * email from github that someone commented/reviewed etc * email from JIRA with an update that someone commented/reviewed etc * [repeat for every activity on the PR] imo this is way too much verbosity, especially the github+JIRA with copies of each other. If we start changing titles on JIRA, there will be yet another email from JIRA with that update. Note that emails from JIRA are administered globally on the project, the email from github for updates on the PR is imo way more relevant (because it is a one-click from email to the exact comment). 
I suspect that many people either ignore JIRA, or have some filter to ignore it, which imo is bad because important discussions do happen in JIRA - they are just a needle in the haystack (I am curious as to whether folks have a different setup here!) Regardless, I am up to pair with you Neal to work on this front to alleviate this if others also feel some pain with this. Best, Jorge On Fri, Feb 5, 2021 at 5:19 PM Neal Richardson wrote: > I am all for automation. I'll see if I can carve out some time to work on > that next week. > > Neal > > On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney wrote: > > > It occurs to me we could (relatively) easily program a bot to apply > > these "title tags" automatically based on what's in the Component > > field. What do you think? > > > > On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson > > wrote: > > > > > > Hi folks, > > > Just a reminder to please make sure your JIRA issue titles start with > the > > > subproject/language in square brackets ([Rust], [Python], etc.). If > > you're > > > reviewing or triaging issues that others have reported, please do clean > > > them up. While it is technically redundant to the component field, it > is > > > greatly helpful for being able to scan the JIRA activity feed, and it's > > > important for making our changelogs readable. > > > > > > Thanks, > > > Neal > > >
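The single-source idea in the message above (pick the component once in JIRA, then derive the commit title and the changelog from it) could look roughly like the sketch below. It is only an illustration under stated assumptions: `commit_title` and `changelog_by_component` are hypothetical helpers, not functions from dev/merge_pr.py, the issue keys are made up, and the "Other" bucket for issues without a component is an arbitrary choice.

```python
from collections import defaultdict


def commit_title(issue_key, components, summary):
    """Build a squashed-commit title such as 'ARROW-1: [C++] Fix cast'.

    Hypothetical helper: the tags come only from the component list,
    so nobody has to type them into the issue title or the PR.
    """
    tags = "".join(f"[{name}]" for name in components)
    if tags:
        return f"{issue_key}: {tags} {summary}"
    return f"{issue_key}: {summary}"


def changelog_by_component(issues):
    """Group (issue_key, components, summary) tuples by component name.

    An issue with several components is listed under each of them;
    an issue with none goes under the assumed 'Other' heading.
    """
    sections = defaultdict(list)
    for issue_key, components, summary in issues:
        for name in components or ["Other"]:
            sections[name].append(f"{issue_key} - {summary}")
    return dict(sections)
```

Because both outputs are computed from the same component list, a correction made in the JIRA dropdown would propagate to the commit message and the changelog without touching the title.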
Re: JIRA grooming
I am all for automation. I'll see if I can carve out some time to work on that next week. Neal On Fri, Feb 5, 2021 at 8:13 AM Wes McKinney wrote: > It occurs to me we could (relatively) easily program a bot to apply > these "title tags" automatically based on what's in the Component > field. What do you think? > > On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson > wrote: > > > > Hi folks, > > Just a reminder to please make sure your JIRA issue titles start with the > > subproject/language in square brackets ([Rust], [Python], etc.). If > you're > > reviewing or triaging issues that others have reported, please do clean > > them up. While it is technically redundant to the component field, it is > > greatly helpful for being able to scan the JIRA activity feed, and it's > > important for making our changelogs readable. > > > > Thanks, > > Neal >
Re: JIRA grooming
It occurs to me we could (relatively) easily program a bot to apply these "title tags" automatically based on what's in the Component field. What do you think? On Fri, Feb 5, 2021 at 10:09 AM Neal Richardson wrote: > > Hi folks, > Just a reminder to please make sure your JIRA issue titles start with the > subproject/language in square brackets ([Rust], [Python], etc.). If you're > reviewing or triaging issues that others have reported, please do clean > them up. While it is technically redundant to the component field, it is > greatly helpful for being able to scan the JIRA activity feed, and it's > important for making our changelogs readable. > > Thanks, > Neal
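For illustration only, the core of such a bot could be a small pure function that takes an issue title and its Component values and returns the tagged title. Everything in this sketch is an assumption: the `COMPONENT_TAGS` mapping and the `tag_title` name are hypothetical, and fetching or updating issues through JIRA is deliberately left out.

```python
import re

# Hypothetical mapping from JIRA component names to title tags; the
# real component list lives in the Arrow JIRA project, not here.
COMPONENT_TAGS = {
    "C++": "[C++]",
    "Python": "[Python]",
    "Rust": "[Rust]",
    "R": "[R]",
}

# Matches one or more leading "[...]" tags plus trailing whitespace.
LEADING_TAGS = re.compile(r"^(?:\[[^\]]+\]\s*)+")


def tag_title(title, components):
    """Return the title with one tag per known component at the front.

    Tags already present at the start of the title are kept, and
    missing ones are added after them, so the function is idempotent.
    """
    match = LEADING_TAGS.match(title)
    prefix = match.group(0) if match else ""
    existing = re.findall(r"\[[^\]]+\]", prefix)
    rest = title[len(prefix):].strip()

    tags = list(existing)
    for component in components:
        tag = COMPONENT_TAGS.get(component)
        if tag is not None and tag not in tags:
            tags.append(tag)
    separator = " " if tags and rest else ""
    return "".join(tags) + separator + rest
```

Keeping the function idempotent means a bot built on it would only need to write back to JIRA when the returned title differs from the current one, which limits extra notification email.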
JIRA grooming
Hi folks, Just a reminder to please make sure your JIRA issue titles start with the subproject/language in square brackets ([Rust], [Python], etc.). If you're reviewing or triaging issues that others have reported, please do clean them up. While it is technically redundant to the component field, it is greatly helpful for being able to scan the JIRA activity feed, and it's important for making our changelogs readable. Thanks, Neal
Re: [Rust][DataFusion] DataFusion Overview / Architecture
Thanks -- I plan to start on my slides next week On Thu, Feb 4, 2021 at 2:32 PM Fernando Herrera < fernando.j.herr...@gmail.com> wrote: > Hi Andy. I would like to take you offer and get a copy of your book. It > would help me to understand better datafusion and help Andrew with the > project documentation. > > Fernando > > On Thu, 4 Feb 2021, 18:01 Andy Grove, wrote: > > > That's correct, Remi. I built the Kotlin query engine from scratch as I > was > > writing the book, and it does follow the same basic design as > DataFusion. I > > think it would be a useful reference for anyone writing up some > > DataFusion-specific documentation and I am happy to send a free copy to > > anyone who is working on that. > > > > On Thu, Feb 4, 2021 at 10:30 AM Rémi Dettai wrote: > > > > > Hi Andrew! > > > > > > The book "How query engines work" ( > > > https://leanpub.com/how-query-engines-work) that Andy wrote is pretty > > > great! It documents query engine APIs in Kotlin and not Rust, as it was > > > written during earlier Ballista experimentations, but almost all items > > > still apply to DataFusion (feel free to correct me if I'm wrong Andy). > > > > > > Remi > > > > > > > > > > > > Le jeu. 4 févr. 2021 à 12:33, Andrew Lamb a > > écrit : > > > > > > > Does anyone have any high level architectural / overview material > about > > > > DataFusion that they can share or point me at? > > > > > > > > I am planning on creating a high level / architectural overview of > > > > DataFusion (as it exists today) as a set of slides for a Tech Talk > > (will > > > be > > > > open to the public) sometime in March. > > > > > > > > I was hoping to take a friendly look at any material others may have > > been > > > > put together, if it exists. > > > > > > > > Thanks, > > > > Andrew > > > > > > > > > >
[NIGHTLY] Arrow Build Report for Job nightly-2021-02-05-0
Arrow Build Report for Job nightly-2021-02-05-0

All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0

Failed Tasks:
- conda-linux-gcc-py36-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py37-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py38-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py38-aarch64
- conda-linux-gcc-py39-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-drone-conda-linux-gcc-py39-aarch64
- test-conda-python-3.8-jpype:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-test-conda-python-3.8-jpype
- test-ubuntu-18.04-docs:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-test-ubuntu-18.04-docs

Succeeded Tasks:
- centos-7-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-centos-7-amd64
- centos-8-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-centos-8-amd64
- conda-clean:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-clean
- conda-linux-gcc-py36-cpu-r36:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu-r40:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-cpu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-linux-gcc-py39-cuda
- conda-osx-clang-py36-r36:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py36-r36
- conda-osx-clang-py37-r40:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py37-r40
- conda-osx-clang-py38:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py38
- conda-osx-clang-py39:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-osx-clang-py39
- conda-win-vs2017-py36-r36:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-win-vs2017-py36-r36
- conda-win-vs2017-py37-r40:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-win-vs2017-py37-r40
- conda-win-vs2017-py38:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-azure-conda-win-vs2017-py38
- debian-buster-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-debian-buster-amd64
- example-cpp-minimal-build-static-system-dependency:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-example-cpp-minimal-build-static-system-dependency
- example-cpp-minimal-build-static:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-example-cpp-minimal-build-static
- gandiva-jar-osx:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-gandiva-jar-osx
- gandiva-jar-ubuntu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-gandiva-jar-ubuntu
- homebrew-cpp:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-homebrew-cpp
- homebrew-r-autobrew:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-homebrew-r-autobrew
- nuget:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github-nuget
- python-sdist:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-05-0-github