[jira] [Created] (ARROW-8325) [R][CI] Stop including boost in R windows bundle
Neal Richardson created ARROW-8325:

Summary: [R][CI] Stop including boost in R windows bundle
Key: ARROW-8325
URL: https://issues.apache.org/jira/browse/ARROW-8325
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8324) [R] Add read/write_ipc_file separate from _feather
Neal Richardson created ARROW-8324:

Summary: [R] Add read/write_ipc_file separate from _feather
Key: ARROW-8324
URL: https://issues.apache.org/jira/browse/ARROW-8324
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson

See [https://github.com/apache/arrow/pull/6771#issuecomment-608133760]

{quote}Let's add read/write_ipc_file also? I'm wary of the "version" option in "write_feather" and the Feather version inference capability in "read_feather". It's potentially confusing and we may choose to add options to write_ipc_file/read_ipc_file that are more developer centric, having to do with particulars in the IPC format, that are not relevant or appropriate for the Feather APIs. IMHO it's best for "Feather format" to remain an abstracted higher-level concept with its use of the "IPC file format" as an implementation detail, and segregated from the other things.{quote}
Re: Proposal to use Black for automatic formatting of Python code
On Thu, Apr 2, 2020 at 2:19 PM Antoine Pitrou wrote: > > > Le 02/04/2020 à 20:58, Joris Van den Bossche a écrit : > > > > Yes, both autopep8 and black can fix up linting issues to ensure your code > > passes the PEP8 checks (although autopep8 can not fix all issues > > automatically). > > But with autopep8 you *still* need to think about how to format your code, > > as there > > are many different ways you can write code that all satisfy PEP8 / autopep8. > > I don't understand why you need to think. > With black: write your code as it comes and reformat it afterwards. > With autopep8: write your code as it comes and reformat it afterwards. With either solution, you don't have to "think" about PEP8 compliance while programming, because autopep8 will handle it for you. There might be some stylistic issues around where to put line breaks, but I think what we are essentially agreeing to in this discussion is to not make stylistic comments in code reviews so long as the code is PEP8-compliant (which sounds good to me). > So you can pretty much avoid thinking if you don't want to... (which > IMHO is a weird thing to ask for, but hey :-)) > > Regards > > Antoine.
Re: Proposal to use Black for automatic formatting of Python code
Le 02/04/2020 à 20:58, Joris Van den Bossche a écrit : > > Yes, both autopep8 and black can fix up linting issues to ensure your code > passes the PEP8 checks (although autopep8 can not fix all issues > automatically). > But with autopep8 you *still* need to think about how to format your code, > as there > are many different ways you can write code that all satisfy PEP8 / autopep8. I don't understand why you need to think. With black: write your code as it comes and reformat it afterwards. With autopep8: write your code as it comes and reformat it afterwards. So you can pretty much avoid thinking if you don't want to... (which IMHO is a weird thing to ask for, but hey :-)) Regards Antoine.
Re: Proposal to use Black for automatic formatting of Python code
Personally, I don't think autopep8 being less aggressive / more conservative is that relevant. This is only for the single PR that does the reformatting where black gives a much bigger number of changed lines. But once that one-time cost is paid, using black will not give larger diffs or make more invasive changes. Yes, both autopep8 and black can fix up linting issues to ensure your code passes the PEP8 checks (although autopep8 can not fix all issues automatically). But with autopep8 you *still* need to think about how to format your code, as there are many different ways you can write code that all satisfy PEP8 / autopep8. That's IMO an advantage of black over autopep8. Joris On Thu, 2 Apr 2020 at 17:40, Wes McKinney wrote: > I admit that the status quo does not bother me that much, so > `autopep8` as the more conservative / less aggressive option seems > fine to me, and also makes it simple for people to fix up common > linting issues in their PRs. > > On Thu, Apr 2, 2020 at 5:16 AM Antoine Pitrou wrote: > > > > > > I have looked at the kind of reformatting used by black and I've become > > -1 on this. `black` is much too aggressive and actually makes the code > > less readable. > > > > `autopep8` seems much better and less aggressive. Let's use that > > instead. > > > > Regards > > > > Antoine. > > > > > > On Thu, 26 Mar 2020 20:37:01 +0100 > > Joris Van den Bossche wrote: > > > Hi all, > > > > > > I would like to propose adopting Black as code formatter within the > python > > > project. There is an older JIRA issue about this ( > > > https://issues.apache.org/jira/browse/ARROW-5176), but bringing it to > the > > > mailing list for wider attention. > > > > > > Black (https://github.com/ambv/black) is a tool for automatically > > > formatting python code in ways which flake8 and our other linters > approve > > > of (and fill a similar role to clang-format for C++ and cmake-format > for > > > cmake). 
It can also be added to the linting checks on CI and to the > > > pre-commit hooks like we now run flake8. > > > Using it ensures python code will be formatted consistently, and more > > > importantly automates this formatting, letting you focus on more > important > > > matters. > > > > > > Black makes some specific formatting choices, and not everybody (me > > > included) will always like those choices (that's how it goes with > something > > > subjective like formatting). But my experience with using it in some > other > > > big python projects (pandas, dask) has been very positive. You very > quickly > > > get used to how it looks, while it is much nicer to not have to worry > about > > > formatting anymore. > > > > > > Best, > > > Joris > > > > > > > > > >
[jira] [Created] (ARROW-8323) [C++] Pin gRPC at v1.27 to avoid compilation error in its headers
Ben Kietzman created ARROW-8323:

Summary: [C++] Pin gRPC at v1.27 to avoid compilation error in its headers
Key: ARROW-8323
URL: https://issues.apache.org/jira/browse/ARROW-8323
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
Fix For: 0.17.0

[gRPC 1.28|https://github.com/grpc/grpc/releases/tag/v1.28.0] includes a change which introduces an implicit size_t->int conversion in proto_utils.h: https://github.com/grpc/grpc/commit/2748755a4ff9ed940356e78c105f55f839fdf38b

Conversion warnings are treated as errors for example here: https://ci.appveyor.com/project/BenjaminKietzman/arrow/build/job/9cl0vqa8e495knn3#L1126

So IIUC we need to pin gRPC to 1.27 for now.

Upstream PR: https://github.com/grpc/grpc/pull/22557
[jira] [Created] (ARROW-8322) [CI] Fix C# workflow file syntax
Krisztian Szucs created ARROW-8322:

Summary: [CI] Fix C# workflow file syntax
Key: ARROW-8322
URL: https://issues.apache.org/jira/browse/ARROW-8322
Project: Apache Arrow
Issue Type: Task
Components: Continuous Integration
Reporter: Krisztian Szucs

The GitHub Actions expression requires the enclosing "${{ }}".
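The syntax rule behind this ticket: outside of `if:` conditions, a GitHub Actions expression is only evaluated when wrapped in `${{ }}`; without the delimiters the literal text is used. A hypothetical workflow fragment for illustration (the actual C# workflow change is not shown in the ticket):

```yaml
# Illustrative step: github.ref is only evaluated because it is
# enclosed in ${{ }}; without the delimiters the literal text
# "github.ref" would be echoed instead.
steps:
  - name: Print the ref
    run: echo "Building ${{ github.ref }}"
```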
[jira] [Created] (ARROW-8321) [CI] Use bundled thrift in Fedora 30 build
Krisztian Szucs created ARROW-8321:

Summary: [CI] Use bundled thrift in Fedora 30 build
Key: ARROW-8321
URL: https://issues.apache.org/jira/browse/ARROW-8321
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration
Affects Versions: 0.17.0
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs

After unsetting Thrift_SOURCE from AUTO it surfaced that the thrift available on Fedora 30 (0.10) is older than the minimum required version (0.11). Build thrift_ep instead.
[jira] [Created] (ARROW-8320) [Documentation][Format] Clarify (lack of) alignment requirements in C data interface
Wes McKinney created ARROW-8320:

Summary: [Documentation][Format] Clarify (lack of) alignment requirements in C data interface
Key: ARROW-8320
URL: https://issues.apache.org/jira/browse/ARROW-8320
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation, Format
Reporter: Wes McKinney
Fix For: 0.17.0

This document should clarify that memory buffers need not start on aligned pointer offsets.
[jira] [Created] (ARROW-8319) [CI] Install thrift compiler in the debian build
Krisztian Szucs created ARROW-8319:

Summary: [CI] Install thrift compiler in the debian build
Key: ARROW-8319
URL: https://issues.apache.org/jira/browse/ARROW-8319
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
Fix For: 0.17.0

CMake cannot find the thrift compiler after setting Thrift_SOURCE to empty from AUTO; see build: https://github.com/apache/arrow/runs/555631125?check_suite_focus=true#step:6:143
[jira] [Created] (ARROW-8318) [C++][Dataset] Dataset should instantiate Fragment
Francois Saint-Jacques created ARROW-8318:

Summary: [C++][Dataset] Dataset should instantiate Fragment
Key: ARROW-8318
URL: https://issues.apache.org/jira/browse/ARROW-8318
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Dataset
Reporter: Francois Saint-Jacques

Fragments are created on the fly when invoking a Scan. This means that a lot of the auxiliary/ancillary data must be stored by the specialised Dataset, e.g. the FileSystemDataset must hold the path and partition expression. With the advent of more complex Fragments, e.g. ParquetFileFragment, more data must be stored.
[jira] [Created] (ARROW-8317) [C++] grpc-cpp 1.28.0 from conda-forge causing Appveyor build to fail
Wes McKinney created ARROW-8317:

Summary: [C++] grpc-cpp 1.28.0 from conda-forge causing Appveyor build to fail
Key: ARROW-8317
URL: https://issues.apache.org/jira/browse/ARROW-8317
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 0.17.0

This started occurring in the last few hours, since the grpc-cpp 1.28.0 update was just merged on conda-forge: https://ci.appveyor.com/project/wesm/arrow/build/job/8oe0n4epkxegr21x
Re: CPP : arrow symbols.map issue
On Thu, Apr 2, 2020 at 12:06 PM Antoine Pitrou wrote: > > > Hi, > > On Thu, 2 Apr 2020 16:56:06 + > Brian Bowman wrote: > > A new high-performance file system we are working with returns an error > > while writing a .parquet file. The following arrow symbol does not > > resolve properly and the error is masked. > > > > libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev > > > > > nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev > > 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev > > 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev > > For clarity, you should use `nm --demangle`. This will give you the > actual C++ symbol, i.e. "arrow::Status::ToString[abi:cxx11]() const". > > > One of our Linux dev/build experts tracked this down to an issue in arrow > > open source. He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in > > the nm command output is incorrect and it should instead be an uppercase > > ‘T’. > > I have the right output here: > > $ nm --demangle --defined-only --dynamic .../libarrow.so | \ > grep Status::ToString > 012f1ff0 T arrow::Status::ToString[abi:cxx11]() const > > Which toolchain (linker etc.) are you using? My guess is also that you have a mixed-gcc-toolchain problem. What compiler/linker (and gcc toolchain, if you built with Clang) was used to produce libparquet.so (or where did you obtain the package), and which toolchain are you using to build and link your application? > Regards > > Antoine. > >
Re: CPP : arrow symbols.map issue
Hi, On Thu, 2 Apr 2020 16:56:06 + Brian Bowman wrote: > A new high-performance file system we are working with returns an error while > writing a .parquet file. The following arrow symbol does not resolve > properly and the error is masked. > > libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev > > > nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev > 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev > 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev For clarity, you should use `nm --demangle`. This will give you the actual C++ symbol, i.e. "arrow::Status::ToString[abi:cxx11]() const". > One of our Linux dev/build experts tracked this down to an issue in arrow > open source. He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in the > nm command output is incorrect and it should instead be an uppercase ‘T’. I have the right output here: $ nm --demangle --defined-only --dynamic .../libarrow.so | \ grep Status::ToString 012f1ff0 T arrow::Status::ToString[abi:cxx11]() const Which toolchain (linker etc.) are you using? Regards Antoine.
CPP : arrow symbols.map issue
A new high-performance file system we are working with returns an error while writing a .parquet file. The following arrow symbol does not resolve properly and the error is masked.

libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev

> nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev
002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev

One of our Linux dev/build experts tracked this down to an issue in arrow open source. He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in the nm command output is incorrect and it should instead be an uppercase ‘T’. He traced the problem to this file: ../cpp/src/arrow/symbols.map

Here’s an update with his fix. Lines 27-30 are new. Nothing else changes.

 1 # Licensed to the Apache Software Foundation (ASF) under one
 2 # or more contributor license agreements. See the NOTICE file
 3 # distributed with this work for additional information
 4 # regarding copyright ownership. The ASF licenses this file
 5 # to you under the Apache License, Version 2.0 (the
 6 # "License"); you may not use this file except in compliance
 7 # with the License. You may obtain a copy of the License at
 8 #
 9 #   http://www.apache.org/licenses/LICENSE-2.0
10 #
11 # Unless required by applicable law or agreed to in writing,
12 # software distributed under the License is distributed on an
13 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14 # KIND, either express or implied. See the License for the
15 # specific language governing permissions and limitations
16 # under the License.
17
18 {
19   global:
20     extern "C++" {
21       # The leading asterisk is required for symbols such as
22       # "typeinfo for arrow::SomeClass".
23       # Unfortunately this will also catch template specializations
24       # (from e.g. STL or Flatbuffers) involving Arrow types.
25       *arrow::*;
26       *arrow_vendored::*;
27       *ToString*;
28       *key*;
29       *str*;
30       *value*;
31     };
32   # Also export C-level helpers
33   arrow_*;
34   pyarrow_*;
35
36   # Symbols marked as 'local' are not exported by the DSO and thus may not
37   # be used by client applications. Everything except the above falls here.
38   # This ensures we hide symbols of static dependencies.
39   local:
40     *;
41
42 };

We have made these changes in our local clones of the arrow open source repositories. I’m passing this along for the community’s review. Reply with a link and I’ll enter a jira ticket if needed.

-Brian
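The mechanism under discussion can be reproduced with a toy shared library (a minimal sketch; the file and symbol names here are made up for illustration). A version script's `local: *;` clause hides every symbol not matched by the `global:` patterns; exported symbols then show up as uppercase `T` in the dynamic symbol table, while hidden ones appear as lowercase `t` in the static table, which is what `nm` without `--dynamic` prints:

```shell
# A tiny library that exports only arrow::* symbols, like symbols.map does.
cat > demo.cpp <<'EOF'
namespace arrow { int answer() { return 42; } }
namespace other { int hidden() { return 0; } }
EOF
cat > demo.map <<'EOF'
{
  global:
    extern "C++" { *arrow::*; };
  local:
    *;
};
EOF
g++ -shared -fPIC demo.cpp -Wl,--version-script=demo.map -o libdemo.so
# Exported symbols carry an uppercase 'T' in the dynamic symbol table:
nm --demangle --defined-only --dynamic libdemo.so | grep answer
```

Note that `other::hidden` does not appear in the dynamic table at all, while a plain `nm libdemo.so` would list it with a lowercase `t`.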
[jira] [Created] (ARROW-8316) [CI] Set docker-compose to use docker-cli instead of docker-py for building images
Krisztian Szucs created ARROW-8316:

Summary: [CI] Set docker-compose to use docker-cli instead of docker-py for building images
Key: ARROW-8316
URL: https://issues.apache.org/jira/browse/ARROW-8316
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration
Reporter: Krisztian Szucs

The images pushed from the master branch were sometimes producing reusable layers, sometimes not, so the caching was working non-deterministically. The underlying issue is https://github.com/docker/compose/issues/883
Re: Support of more manipulation for Record Batch
hi Chengxin, Yes, if you look at the JIRA tracker and look for past discussions on the mailing list, there are plans to develop comprehensive data manipulation and query processing capabilities in this project for use in Python, R, and any other language that binds to C++, including C/GLib and Ruby. The way that this functionality is exposed in the pyarrow API will almost certainly be different from pandas, though. Rather than have objects with long lists of instance methods, we would opt instead for computational functions that "act" on the data structures, producing one or more data structures as output, more similar to tools like dplyr (an R library). Developers are welcome to create pandas-like convenience layers, of course, should they so choose.

References:

* C++ datasets API project https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
* C++ query engine project https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing
* C++ data frame API project https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Building these things takes time, especially considering the scope of maintenance involved with keeping this project running. If anyone reading is interested in contributing time or money to this effort I'd be happy to speak with you offline about it. If you would like to contribute we would be glad to have you aboard.

Thanks
Wes

On Thu, Apr 2, 2020 at 6:50 AM Chengxin Ma wrote: > > Hi all, > > I am working on a distributed sorting program which runs on multiple > computation nodes. > > In this sorting program, data is represented as pandas DataFrames and key > operations are groupby, concat, and sort_values. For shuffling data among the > computation nodes, the DataFrames are converted to Arrow Record Batches and > communicated via Arrow Flight.
> > What I’ve noticed is that much time was spent on the conversion between > DataFrame and Record Batch. > > The [zero-copy > feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy) > unfortunately cannot be applied to my case, since the DataFrames contain > strings as well. > > I wanted to try replacing DataFrames with Record Batches, so there would be > no need of conversion. However, there seems to be no direct way to do groupby > and sort_values on Record Batches, according to [the > documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html) > > Is there a plan to add such methods to the API of Record Batch in the future? > > Kind Regards > > Chengxin > > Sent with [ProtonMail](https://protonmail.com) Secure Email.
Re: Proposal to use Black for automatic formatting of Python code
I admit that the status quo does not bother me that much, so `autopep8` as the more conservative / less aggressive option seems fine to me, and also makes it simple for people to fix up common linting issues in their PRs. On Thu, Apr 2, 2020 at 5:16 AM Antoine Pitrou wrote: > > > I have looked at the kind of reformatting used by black and I've become > -1 on this. `black` is much too aggressive and actually makes the code > less readable. > > `autopep8` seems much better and less aggressive. Let's use that > instead. > > Regards > > Antoine. > > > On Thu, 26 Mar 2020 20:37:01 +0100 > Joris Van den Bossche wrote: > > Hi all, > > > > I would like to propose adopting Black as code formatter within the python > > project. There is an older JIRA issue about this ( > > https://issues.apache.org/jira/browse/ARROW-5176), but bringing it to the > > mailing list for wider attention. > > > > Black (https://github.com/ambv/black) is a tool for automatically > > formatting python code in ways which flake8 and our other linters approve > > of (and fill a similar role to clang-format for C++ and cmake-format for > > cmake). It can also be added to the linting checks on CI and to the > > pre-commit hooks like we now run flake8. > > Using it ensures python code will be formatted consistently, and more > > importantly automates this formatting, letting you focus on more important > > matters. > > > > Black makes some specific formatting choices, and not everybody (me > > included) will always like those choices (that's how it goes with something > > subjective like formatting). But my experience with using it in some other > > big python projects (pandas, dask) has been very positive. You very quickly > > get used to how it looks, while it is much nicer to not have to worry about > > formatting anymore. > > > > Best, > > Joris > > > > >
Re: [Python] black vs. autopep8
I'm personally fine with the Black changes. After the one-time cost of reformatting the codebase, it will take any personal preferences out of code formatting (I admit that I have several myself, but I don't mind the normalization provided by Black). I hope that Cython support comes soon since a great deal of our code is Cython. On Thu, Apr 2, 2020 at 9:00 AM Jacek Pliszka wrote: > > Hi! > > I believe amount of changes is not that important. > > In my opinion, what matters is which format will allow reviewers to be > more efficient. > > The committer can always reformat as they like. It is harder for the reviewer. > > BR, > > Jacek > > czw., 2 kwi 2020 o 15:32 Antoine Pitrou napisał(a): > > > > > > PS: in both cases, Cython files are not processed. autopep8 is actually > > able to process them, but the comparison wouldn't be apples-to-apples. > > > > (that said, autopep8 gives suboptimal results on Cython files, for > > example it changes "_variable" to "& c_variable" and > > "void* ptr" to "void * ptr") > > > > Regards > > > > Antoine. > > > > Le 02/04/2020 à 15:30, Antoine Pitrou a écrit : > > > > > > Hello, > > > > > > I've put up two PRs to compare the effect of running black vs. autopep8 > > > on the Python codebase. > > > > > > * black: https://github.com/apache/arrow/pull/6810 > > > 65 files changed, 7855 insertions(+), 5215 deletions(-) > > > > > > * autopep8: https://github.com/apache/arrow/pull/6811 > > > 20 files changed, 137 insertions(+), 118 deletions(-) > > > > > > I've configured black to try and minimize changes (for example, avoid > > > normalizing string quoting style). Still, the number of changes is > > > humongous and they add 2600 lines to the codebase (which is a tangible > > > amount of vertical space). > > > > > > Regards > > > > > > Antoine. > > >
[jira] [Created] (ARROW-8315) [Python]
Ben Kietzman created ARROW-8315:

Summary: [Python]
Key: ARROW-8315
URL: https://issues.apache.org/jira/browse/ARROW-8315
Project: Apache Arrow
Issue Type: Bug
Reporter: Ben Kietzman
Re: [Python] black vs. autopep8
Hi! I believe amount of changes is not that important. In my opinion, what matters is which format will allow reviewers to be more efficient. The committer can always reformat as they like. It is harder for the reviewer. BR, Jacek czw., 2 kwi 2020 o 15:32 Antoine Pitrou napisał(a): > > > PS: in both cases, Cython files are not processed. autopep8 is actually > able to process them, but the comparison wouldn't be apples-to-apples. > > (that said, autopep8 gives suboptimal results on Cython files, for > example it changes "_variable" to "& c_variable" and > "void* ptr" to "void * ptr") > > Regards > > Antoine. > > Le 02/04/2020 à 15:30, Antoine Pitrou a écrit : > > > > Hello, > > > > I've put up two PRs to compare the effect of running black vs. autopep8 > > on the Python codebase. > > > > * black: https://github.com/apache/arrow/pull/6810 > > 65 files changed, 7855 insertions(+), 5215 deletions(-) > > > > * autopep8: https://github.com/apache/arrow/pull/6811 > > 20 files changed, 137 insertions(+), 118 deletions(-) > > > > I've configured black to try and minimize changes (for example, avoid > > normalizing string quoting style). Still, the number of changes is > > humongous and they add 2600 lines to the codebase (which is a tangible > > amount of vertical space). > > > > Regards > > > > Antoine. > >
Re: [Python] black vs. autopep8
PS: in both cases, Cython files are not processed. autopep8 is actually able to process them, but the comparison wouldn't be apples-to-apples. (that said, autopep8 gives suboptimal results on Cython files, for example it changes "_variable" to "& c_variable" and "void* ptr" to "void * ptr") Regards Antoine. Le 02/04/2020 à 15:30, Antoine Pitrou a écrit : > > Hello, > > I've put up two PRs to compare the effect of running black vs. autopep8 > on the Python codebase. > > * black: https://github.com/apache/arrow/pull/6810 > 65 files changed, 7855 insertions(+), 5215 deletions(-) > > * autopep8: https://github.com/apache/arrow/pull/6811 > 20 files changed, 137 insertions(+), 118 deletions(-) > > I've configured black to try and minimize changes (for example, avoid > normalizing string quoting style). Still, the number of changes is > humongous and they add 2600 lines to the codebase (which is a tangible > amount of vertical space). > > Regards > > Antoine. >
[Python] black vs. autopep8
Hello, I've put up two PRs to compare the effect of running black vs. autopep8 on the Python codebase. * black: https://github.com/apache/arrow/pull/6810 65 files changed, 7855 insertions(+), 5215 deletions(-) * autopep8: https://github.com/apache/arrow/pull/6811 20 files changed, 137 insertions(+), 118 deletions(-) I've configured black to try and minimize changes (for example, avoid normalizing string quoting style). Still, the number of changes is humongous and they add 2600 lines to the codebase (which is a tangible amount of vertical space). Regards Antoine.
[jira] [Created] (ARROW-8314) [Python] Provide a method to select a subset of columns of a Table
Joris Van den Bossche created ARROW-8314:

Summary: [Python] Provide a method to select a subset of columns of a Table
Key: ARROW-8314
URL: https://issues.apache.org/jira/browse/ARROW-8314
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Reporter: Joris Van den Bossche

I looked through the open issues and in our API, but didn't directly find something about selecting a subset of columns of a table. Assume you have a table like:

{code}
table = pa.table({'a': [1, 2], 'b': [.1, .2], 'c': ['a', 'b']})
{code}

You can select a single column with {{table.column('a')}} or {{table['a']}} to get a chunked array. You can add, append, remove and replace columns (with {{add_column}}, {{append_column}}, {{remove_column}}, {{set_column}}). But an easy way to get a subset of the columns (without manually removing the ones you don't want one by one) doesn't seem possible. I would propose something like:

{code}
table.select(['a', 'c'])
{code}
Re: Join operation on attributes from arrow structs
They're mapped with the StructType/StructArray, which is also a columnar representation, e.g. one buffer per field in the sub-object. If you have varying/incompatible types, a field will be promoted to a UnionType. François On Thu, Apr 2, 2020 at 12:54 AM Micah Kornfield wrote: > > Hi Hasara, > There isn't current functionality in C++/Python to do this ( > https://issues.apache.org/jira/browse/ARROW-4630 is the issue tracking > this). > > Also how nested attributes in json format are mapped into buffers once > > converted in arrow format? > > I'm not sure I understand this question? > > Thanks, > Micah > > On Sun, Mar 22, 2020 at 10:09 PM Hasara Maithree < > hasaramaithreedesi...@gmail.com> wrote: > > > Hi all, > > > > Assume I have a json file named 'my_data.json' as below. > > > > *{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}} > > {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"**}}* > > > > If I need to do a join operation based on attribute d, can I do it > > directly from arrow structs? ( or are there any efficient alternatives?) > > Also how nested attributes in json format are mapped into buffers once > > converted in arrow format? (example taken from documentation) > > > > >>> table = json.read_json("my_data.json")>>> table > > pyarrow.Table > > a: list > > child 0, item: int64 > > b: struct > > child 0, c: bool > > child 1, d: timestamp[s]>>> table.to_pandas() > >a b0 [1, 2] > > {'c': True, 'd': 1991-02-03 00:00:00}1 [3, 4, 5] {'c': False, 'd': > > 2019-04-01 00:00:00} > > > > > > Thank You > >
Support of more manipulation for Record Batch
Hi all, I am working on a distributed sorting program which runs on multiple computation nodes. In this sorting program, data is represented as pandas DataFrames and key operations are groupby, concat, and sort_values. For shuffling data among the computation nodes, the DataFrames are converted to Arrow Record Batches and communicated via Arrow Flight. What I’ve noticed is that much time was spent on the conversion between DataFrame and Record Batch. The [zero-copy feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy) unfortunately cannot be applied to my case, since the DataFrames contain strings as well. I wanted to try replacing DataFrames with Record Batches, so there would be no need of conversion. However, there seems to be no direct way to do groupby and sort_values on Record Batches, according to [the documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html) Is there a plan to add such methods to the API of Record Batch in the future? Kind Regards Chengxin Sent with [ProtonMail](https://protonmail.com) Secure Email.
Re: Proposal to use Black for automatic formatting of Python code
I have looked at the kind of reformatting used by black and I've become -1 on this. `black` is much too aggressive and actually makes the code less readable. `autopep8` seems much better and less aggressive. Let's use that instead. Regards Antoine. On Thu, 26 Mar 2020 20:37:01 +0100 Joris Van den Bossche wrote: > Hi all, > > I would like to propose adopting Black as code formatter within the python > project. There is an older JIRA issue about this ( > https://issues.apache.org/jira/browse/ARROW-5176), but bringing it to the > mailing list for wider attention. > > Black (https://github.com/ambv/black) is a tool for automatically > formatting python code in ways which flake8 and our other linters approve > of (and fill a similar role to clang-format for C++ and cmake-format for > cmake). It can also be added to the linting checks on CI and to the > pre-commit hooks like we now run flake8. > Using it ensures python code will be formatted consistently, and more > importantly automates this formatting, letting you focus on more important > matters. > > Black makes some specific formatting choices, and not everybody (me > included) will always like those choices (that's how it goes with something > subjective like formatting). But my experience with using it in some other > big python projects (pandas, dask) has been very positive. You very quickly > get used to how it looks, while it is much nicer to not have to worry about > formatting anymore. > > Best, > Joris >
Re: Clarification regarding the `CDataInterface.rst`
Upgrading the pip installer worked perfectly. Thanks! Regards, Anish Biswas On 2020/04/02 09:35:50, Antoine Pitrou wrote: > > Hi Anish, > > It looks like a bug with old pip versions. You can first upgrade pip using: > > $ pip install -U pip > > Then redo the "pip install" command for pyarrow. > > If you can't upgrade pip, you can install Numpy separately first (using > "pip install numpy"). > > Regards > > Antoine. > > > Le 02/04/2020 à 06:07, Anish Biswas a écrit : > > Hey Antoine, > > > > I am getting a few complications by using what you said. It's attempting to > > collect numpy>=1.14.0(from pyarrow) and I cross-checked it and isn't any > > .whl file for numpy hosted there. The same case persists for six. Can you > > please look into it? > > > > Thanks, > > Anish Biswas > > > > On 2020/03/30 16:15:53, Antoine Pitrou wrote: > >> On Mon, 30 Mar 2020 15:17:02 - > >> Anish Biswas wrote: > >>> Thanks! I'll probably build the Arrow Library from source. Thanks again! > >> > >> You should be able to get a nightly build using: > >> > >> $ pip install -U --extra-index-url \ > >> https://pypi.fury.io/arrow-nightlies/ --pre pyarrow > >> > >> Regards > >> > >> Antoine. > >> > >> > >> >
Re: Clarification regarding the `CDataInterface.rst`
Hi Anish, It looks like a bug with old pip versions. You can first upgrade pip using: $ pip install -U pip Then redo the "pip install" command for pyarrow. If you can't upgrade pip, you can install Numpy separately first (using "pip install numpy"). Regards Antoine. Le 02/04/2020 à 06:07, Anish Biswas a écrit : > Hey Antoine, > > I am getting a few complications by using what you said. It's attempting to > collect numpy>=1.14.0(from pyarrow) and I cross-checked it and isn't any .whl > file for numpy hosted there. The same case persists for six. Can you please > look into it? > > Thanks, > Anish Biswas > > On 2020/03/30 16:15:53, Antoine Pitrou wrote: >> On Mon, 30 Mar 2020 15:17:02 - >> Anish Biswas wrote: >>> Thanks! I'll probably build the Arrow Library from source. Thanks again! >> >> You should be able to get a nightly build using: >> >> $ pip install -U --extra-index-url \ >> https://pypi.fury.io/arrow-nightlies/ --pre pyarrow >> >> Regards >> >> Antoine. >> >> >>
[jira] [Created] (ARROW-8313) [Gandiva][UDF] Solutions to register new UDFs dynamically without checking it into arrow repo.
ZMZ91 created ARROW-8313:

Summary: [Gandiva][UDF] Solutions to register new UDFs dynamically without checking it into arrow repo.
Key: ARROW-8313
URL: https://issues.apache.org/jira/browse/ARROW-8313
Project: Apache Arrow
Issue Type: New Feature
Reporter: ZMZ91

Hi there, recently I've been studying Gandiva and trying to add some UDFs. I noted that the UDF implementation needs to be checked into the arrow repo, registered, and then built into the precompiled_bitcode lib, right? I'm just wondering: is it possible to register new UDFs dynamically? Say I have the UDF implementation code locally, not yet built into the Gandiva lib; is there some function or other mechanism officially provided by Gandiva that I can call to register and use it? Thanks in advance.
[jira] [Created] (ARROW-8312) improve IN expression support
Yuan Zhou created ARROW-8312:

Summary: improve IN expression support
Key: ARROW-8312
URL: https://issues.apache.org/jira/browse/ARROW-8312
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Gandiva, Java
Reporter: Yuan Zhou
Assignee: Yuan Zhou

The Gandiva C++ IN API[1] accepts a TreeNode as a parameter, which allows an IN expression to operate on the output of some function. However, the Java API[2] only accepts a Field as a parameter for IN expressions, which limits the API's usage.

[1] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/tree_expr_builder.h#L94-L125
[2] https://github.com/apache/arrow/blob/master/java/gandiva/src/main/java/org/apache/arrow/gandiva/expression/InNode.java#L50-L63