[jira] [Created] (ARROW-8325) [R][CI] Stop including boost in R windows bundle

2020-04-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8325:
--

 Summary: [R][CI] Stop including boost in R windows bundle
 Key: ARROW-8325
 URL: https://issues.apache.org/jira/browse/ARROW-8325
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8324) [R] Add read/write_ipc_file separate from _feather

2020-04-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8324:
--

 Summary: [R] Add read/write_ipc_file separate from _feather
 Key: ARROW-8324
 URL: https://issues.apache.org/jira/browse/ARROW-8324
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson


See [https://github.com/apache/arrow/pull/6771#issuecomment-608133760]
{quote}Let's add read/write_ipc_file also? I'm wary of the "version" option in 
"write_feather" and the Feather version inference capability in "read_feather". 
It's potentially confusing and we may choose to add options to 
write_ipc_file/read_ipc_file that are more developer centric, having to do with 
particulars in the IPC format, that are not relevant or appropriate for the 
Feather APIs.

IMHO it's best for "Feather format" to remain an abstracted higher-level 
concept with its use of the "IPC file format" as an implementation detail, and 
segregated from the other things.
{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Proposal to use Black for automatic formatting of Python code

2020-04-02 Thread Wes McKinney
On Thu, Apr 2, 2020 at 2:19 PM Antoine Pitrou  wrote:
>
>
> On 02/04/2020 at 20:58, Joris Van den Bossche wrote:
> >
> > Yes, both autopep8 and black can fix up linting issues to ensure your code
> > passes the PEP8 checks (although autopep8 can not fix all issues
> > automatically).
> > But with autopep8 you *still* need to think about how to format your code,
> > as there
> > are many different ways you can write code that all satisfy PEP8 / autopep8.
>
> I don't understand why you need to think.
> With black: write your code as it comes and reformat it afterwards.
> With autopep8: write your code as it comes and reformat it afterwards.

With either solution, you don't have to "think" about PEP8 compliance
while programming, because autopep8 will handle it for you. There
might be some stylistic issues around where to put line breaks, but I
think what we are essentially agreeing to in this discussion is to not
make stylistic comments in code reviews so long as the code is
PEP8-compliant (which sounds good to me).

> So you can pretty much avoid thinking if you don't want to... (which
> IMHO is a weird thing to ask for, but hey :-))
>
> Regards
>
> Antoine.


Re: Proposal to use Black for automatic formatting of Python code

2020-04-02 Thread Antoine Pitrou


On 02/04/2020 at 20:58, Joris Van den Bossche wrote:
> 
> Yes, both autopep8 and black can fix up linting issues to ensure your code
> passes the PEP8 checks (although autopep8 can not fix all issues
> automatically).
> But with autopep8 you *still* need to think about how to format your code,
> as there
> are many different ways you can write code that all satisfy PEP8 / autopep8.

I don't understand why you need to think.
With black: write your code as it comes and reformat it afterwards.
With autopep8: write your code as it comes and reformat it afterwards.

So you can pretty much avoid thinking if you don't want to... (which
IMHO is a weird thing to ask for, but hey :-))

Regards

Antoine.


Re: Proposal to use Black for automatic formatting of Python code

2020-04-02 Thread Joris Van den Bossche
Personally, I don't think autopep8 being less aggressive / more conservative
is that relevant. This is only for the single PR that does the reformatting
where black gives a much bigger number of changed lines. But once that
one-time cost is paid, using black will not give larger diffs or make more
invasive
changes.

Yes, both autopep8 and black can fix up linting issues to ensure your code
passes the PEP8 checks (although autopep8 can not fix all issues
automatically).
But with autopep8 you *still* need to think about how to format your code,
as there
are many different ways you can write code that all satisfy PEP8 / autopep8.
That's IMO an advantage of black over autopep8.

Joris

On Thu, 2 Apr 2020 at 17:40, Wes McKinney  wrote:

> I admit that the status quo does not bother me that much, so
> `autopep8` as the more conservative / less aggressive option seems
> fine to me, and also makes it simple for people to fix up common
> linting issues in their PRs.
>
> On Thu, Apr 2, 2020 at 5:16 AM Antoine Pitrou  wrote:
> >
> >
> > I have looked at the kind of reformatting used by black and I've become
> > -1 on this.  `black` is much too aggressive and actually makes the code
> > less readable.
> >
> > `autopep8` seems much better and less aggressive. Let's use that
> > instead.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 26 Mar 2020 20:37:01 +0100
> > Joris Van den Bossche  wrote:
> > > Hi all,
> > >
> > > I would like to propose adopting Black as code formatter within the
> python
> > > project. There is an older JIRA issue about this (
> > > https://issues.apache.org/jira/browse/ARROW-5176), but bringing it to
> the
> > > mailing list for wider attention.
> > >
> > > Black (https://github.com/ambv/black) is a tool for automatically
> > > formatting python code in ways which flake8 and our other linters
> approve
> > > of (and fill a similar role to clang-format for C++ and cmake-format
> for
> > > cmake). It can also be added to the linting checks on CI and to the
> > > pre-commit hooks like we now run flake8.
> > > Using it ensures python code will be formatted consistently, and more
> > > importantly automates this formatting, letting you focus on more
> important
> > > matters.
> > >
> > > Black makes some specific formatting choices, and not everybody (me
> > > included) will always like those choices (that's how it goes with
> something
> > > subjective like formatting). But my experience with using it in some
> other
> > > big python projects (pandas, dask) has been very positive. You very
> quickly
> > > get used to how it looks, while it is much nicer to not have to worry
> about
> > > formatting anymore.
> > >
> > > Best,
> > > Joris
> > >
> >
> >
> >
>


[jira] [Created] (ARROW-8323) [C++] Pin gRPC at v1.27 to avoid compilation error in its headers

2020-04-02 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8323:
---

 Summary: [C++] Pin gRPC at v1.27 to avoid compilation error in its 
headers
 Key: ARROW-8323
 URL: https://issues.apache.org/jira/browse/ARROW-8323
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


[gRPC 1.28|https://github.com/grpc/grpc/releases/tag/v1.28.0] includes a change 
which introduces an implicit size_t->int conversion in proto_utils.h: 
https://github.com/grpc/grpc/commit/2748755a4ff9ed940356e78c105f55f839fdf38b

Conversion warnings are treated as errors for example here: 
https://ci.appveyor.com/project/BenjaminKietzman/arrow/build/job/9cl0vqa8e495knn3#L1126
So IIUC we need to pin gRPC to 1.27 for now.

Upstream PR: https://github.com/grpc/grpc/pull/22557



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8322) [CI] Fix C# workflow file syntax

2020-04-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8322:
--

 Summary: [CI] Fix C# workflow file syntax
 Key: ARROW-8322
 URL: https://issues.apache.org/jira/browse/ARROW-8322
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration
Reporter: Krisztian Szucs


GitHub Actions expressions require the enclosing "${{ }}" syntax.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8321) [CI] Use bundled thrift in Fedora 30 build

2020-04-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8321:
--

 Summary: [CI] Use bundled thrift in Fedora 30 build
 Key: ARROW-8321
 URL: https://issues.apache.org/jira/browse/ARROW-8321
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Affects Versions: 0.17.0
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


After changing Thrift_SOURCE from AUTO, it surfaced that the thrift available 
on Fedora 30 (0.10) is older than the minimum required version (0.11).

Build thrift_ep instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8320) [Documentation][Format] Clarify (lack of) alignment requirements in C data interface

2020-04-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8320:
---

 Summary: [Documentation][Format] Clarify (lack of) alignment 
requirements in C data interface
 Key: ARROW-8320
 URL: https://issues.apache.org/jira/browse/ARROW-8320
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Format
Reporter: Wes McKinney
 Fix For: 0.17.0


This document should clarify that memory buffers need not start on aligned 
pointer offsets. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8319) [CI] Install thrift compiler in the debian build

2020-04-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8319:
--

 Summary: [CI] Install thrift compiler in the debian build
 Key: ARROW-8319
 URL: https://issues.apache.org/jira/browse/ARROW-8319
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.17.0


CMake cannot find the thrift compiler after changing Thrift_SOURCE from AUTO 
to empty,
see build: 
https://github.com/apache/arrow/runs/555631125?check_suite_focus=true#step:6:143



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8318) [C++][Dataset] Dataset should instantiate Fragment

2020-04-02 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8318:
-

 Summary: [C++][Dataset] Dataset should instantiate Fragment
 Key: ARROW-8318
 URL: https://issues.apache.org/jira/browse/ARROW-8318
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Francois Saint-Jacques


Fragments are created on the fly when invoking a Scan. This means that a lot of 
the auxiliary/ancillary data must be stored by the specialised Dataset, e.g. 
the FileSystemDataset must hold the path and partition expression. With the 
advent of more complex Fragments, e.g. ParquetFileFragment, more data must be 
stored. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8317) [C++] grpc-cpp 1.28.0 from conda-forge causing Appveyor build to fail

2020-04-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8317:
---

 Summary: [C++] grpc-cpp 1.28.0 from conda-forge causing Appveyor 
build to fail
 Key: ARROW-8317
 URL: https://issues.apache.org/jira/browse/ARROW-8317
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


This started occurring in the last few hours, since the grpc-cpp 1.28.0 update 
was just merged on conda-forge:

https://ci.appveyor.com/project/wesm/arrow/build/job/8oe0n4epkxegr21x



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: CPP : arrow symbols.map issue

2020-04-02 Thread Wes McKinney
On Thu, Apr 2, 2020 at 12:06 PM Antoine Pitrou  wrote:
>
>
> Hi,
>
> On Thu, 2 Apr 2020 16:56:06 +
> Brian Bowman  wrote:
> > A new high-performance file system we are working with returns an error 
> > while writing a .parquet file.   The following arrow symbol does not 
> > resolve properly and the error is masked.
> >
> > libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev
> >
> >  > nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev
> >  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
> >  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
>
> For clarity, you should use `nm --demangle`.  This will give you the
> actual C++ symbol, i.e. "arrow::Status::ToString[abi:cxx11]() const".
>
> > One of our Linux dev/build experts tracked this down to an issue in arrow 
> > open source.  He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in 
> > the nm command output is incorrect and it should instead be an uppercase 
> > ‘T’.
>
> I have the right output here:
>
> $ nm --demangle --defined-only --dynamic .../libarrow.so | \
> grep Status::ToString
> 012f1ff0 T arrow::Status::ToString[abi:cxx11]() const
>
> Which toolchain (linker etc.) are you using?

My guess is also that you have a mixed-gcc-toolchain problem. What
compiler/linker (and gcc toolchain, if you built with Clang) was used
to produce libparquet.so (or where did you obtain the package), and
which toolchain are you using to build and link your application?

> Regards
>
> Antoine.
>
>


Re: CPP : arrow symbols.map issue

2020-04-02 Thread Antoine Pitrou


Hi,

On Thu, 2 Apr 2020 16:56:06 +
Brian Bowman  wrote:
> A new high-performance file system we are working with returns an error while 
> writing a .parquet file.   The following arrow symbol does not resolve 
> properly and the error is masked.
> 
> libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev
> 
>  > nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev  
>  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
>  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev

For clarity, you should use `nm --demangle`.  This will give you the
actual C++ symbol, i.e. "arrow::Status::ToString[abi:cxx11]() const".

> One of our Linux dev/build experts tracked this down to an issue in arrow 
> open source.  He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in the 
> nm command output is incorrect and it should instead be an uppercase ‘T’.

I have the right output here:

$ nm --demangle --defined-only --dynamic .../libarrow.so | \
grep Status::ToString
012f1ff0 T arrow::Status::ToString[abi:cxx11]() const

Which toolchain (linker etc.) are you using?

Regards

Antoine.




CPP : arrow symbols.map issue

2020-04-02 Thread Brian Bowman
A new high-performance file system we are working with returns an error while 
writing a .parquet file.   The following arrow symbol does not resolve properly 
and the error is masked.

libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev

 > nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev
 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev

One of our Linux dev/build experts tracked this down to an issue in arrow open 
source.  He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in the nm 
command output is incorrect and it should instead be an uppercase ‘T’.

He traced the problem to this file:

../cpp/src/arrow/symbols.map

Here’s an update with his fix.  Lines 27-30 are new.  Nothing else changes.

  1 # Licensed to the Apache Software Foundation (ASF) under one
  2 # or more contributor license agreements.  See the NOTICE file
  3 # distributed with this work for additional information
  4 # regarding copyright ownership.  The ASF licenses this file
  5 # to you under the Apache License, Version 2.0 (the
  6 # "License"); you may not use this file except in compliance
  7 # with the License.  You may obtain a copy of the License at
  8 #
  9 #   http://www.apache.org/licenses/LICENSE-2.0
10 #
11 # Unless required by applicable law or agreed to in writing,
12 # software distributed under the License is distributed on an
13 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14 # KIND, either express or implied.  See the License for the
15 # specific language governing permissions and limitations
16 # under the License.
17
18 {
19   global:
20 extern "C++" {
21   # The leading asterisk is required for symbols such as
22   # "typeinfo for arrow::SomeClass".
23   # Unfortunately this will also catch template specializations
24   # (from e.g. STL or Flatbuffers) involving Arrow types.
25   *arrow::*;
26   *arrow_vendored::*;
27   *ToString*;
28   *key*;
29   *str*;
30   *value*;
31 };
32 # Also export C-level helpers
33 arrow_*;
34 pyarrow_*;
35
36   # Symbols marked as 'local' are not exported by the DSO and thus may not
37   # be used by client applications.  Everything except the above falls here.
38   # This ensures we hide symbols of static dependencies.
39   local:
40 *;
41
42 };

We have made these changes in our local clones of the arrow open source 
repositories. I’m passing this along for the community’s review. Reply with 
a link and I’ll enter a jira ticket if needed.

-Brian






[jira] [Created] (ARROW-8316) [CI] Set docker-compose to use docker-cli instead of docker-py for building images

2020-04-02 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8316:
--

 Summary: [CI] Set docker-compose to use docker-cli instead of 
docker-py for building images
 Key: ARROW-8316
 URL: https://issues.apache.org/jira/browse/ARROW-8316
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs


The images pushed from the master branch were sometimes producing reusable 
layers and sometimes not, so the caching was working non-deterministically. 
The underlying issue is https://github.com/docker/compose/issues/883







--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Support of more manipulation for Record Batch

2020-04-02 Thread Wes McKinney
hi Chengxin,

Yes, if you look at the JIRA tracker and look for past discussions on
the mailing list, there are plans to develop comprehensive data
manipulation and query processing capabilities in this project for use
in Python, R, and any other language that binds to C++, including
C/GLib and Ruby.

The way that this functionality is exposed in the pyarrow API will
almost certainly be different than pandas, though. Rather than have
objects with long lists of instance methods, we would opt instead for
computational functions that "act" on the data structures, producing
one or more data structures as output, more similar to tools like
dplyr (an R library). Developers are welcome to create pandas-like
convenience layers, of course, should they so choose.

References:

* C++ datasets API project
https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
* C++ query engine project
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing
* C++ data frame API project
https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Building these things takes time, especially considering the scope of
maintenance involved with keeping this project running. If anyone
reading is interested in contributing time or money to this effort I'd
be happy to speak with you offline about it. If you would like to
contribute we would be glad to have you aboard.

Thanks
Wes

On Thu, Apr 2, 2020 at 6:50 AM Chengxin Ma  wrote:
>
> Hi all,
>
> I am working on a distributed sorting program which runs on multiple 
> computation nodes.
>
> In this sorting program, data is represented as pandas DataFrames and key 
> operations are groupby, concat, and sort_values. For shuffling data among the 
> computation nodes, the DataFrames are converted to Arrow Record Batches and 
> communicated via Arrow Flight.
>
> What I’ve noticed is that much time was spent on the conversion between 
> DataFrame and Record Batch.
>
> The [zero-copy 
> feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)
>  unfortunately cannot be applied to my case, since the DataFrames contain 
> strings as well.
>
> I wanted to try replacing DataFrames with Record Batches, so there would be 
> no need of conversion. However, there seems to be no direct way to do groupby 
> and sort_values on Record Batches, according to [the 
> documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html)
>
> Is there a plan to add such methods to the API of Record Batch in the future?
>
> Kind Regards
>
> Chengxin
>
> Sent with [ProtonMail](https://protonmail.com) Secure Email.


Re: Proposal to use Black for automatic formatting of Python code

2020-04-02 Thread Wes McKinney
I admit that the status quo does not bother me that much, so
`autopep8` as the more conservative / less aggressive option seems
fine to me, and also makes it simple for people to fix up common
linting issues in their PRs.

On Thu, Apr 2, 2020 at 5:16 AM Antoine Pitrou  wrote:
>
>
> I have looked at the kind of reformatting used by black and I've become
> -1 on this.  `black` is much too aggressive and actually makes the code
> less readable.
>
> `autopep8` seems much better and less aggressive. Let's use that
> instead.
>
> Regards
>
> Antoine.
>
>
> On Thu, 26 Mar 2020 20:37:01 +0100
> Joris Van den Bossche  wrote:
> > Hi all,
> >
> > I would like to propose adopting Black as code formatter within the python
> > project. There is an older JIRA issue about this (
> > https://issues.apache.org/jira/browse/ARROW-5176), but bringing it to the
> > mailing list for wider attention.
> >
> > Black (https://github.com/ambv/black) is a tool for automatically
> > formatting python code in ways which flake8 and our other linters approve
> > of (and fill a similar role to clang-format for C++ and cmake-format for
> > cmake). It can also be added to the linting checks on CI and to the
> > pre-commit hooks like we now run flake8.
> > Using it ensures python code will be formatted consistently, and more
> > importantly automates this formatting, letting you focus on more important
> > matters.
> >
> > Black makes some specific formatting choices, and not everybody (me
> > included) will always like those choices (that's how it goes with something
> > subjective like formatting). But my experience with using it in some other
> > big python projects (pandas, dask) has been very positive. You very quickly
> > get used to how it looks, while it is much nicer to not have to worry about
> > formatting anymore.
> >
> > Best,
> > Joris
> >
>
>
>


Re: [Python] black vs. autopep8

2020-04-02 Thread Wes McKinney
I'm personally fine with the Black changes. After the one-time cost of
reformatting the codebase, it will take any personal preferences out
of code formatting (I admit that I have several myself, but I don't
mind the normalization provided by Black). I hope that Cython support
comes soon since a great deal of our code is Cython

On Thu, Apr 2, 2020 at 9:00 AM Jacek Pliszka  wrote:
>
> Hi!
>
> I believe amount of changes is not that important.
>
> In my opinion, what matters is which format will allow reviewers to be
> more efficient.
>
> The committer can always reformat as they like. It is harder for the reviewer.
>
> BR,
>
> Jacek
>
> On Thu, 2 Apr 2020 at 15:32, Antoine Pitrou wrote:
> >
> >
> > PS: in both cases, Cython files are not processed.  autopep8 is actually
> > able to process them, but the comparison wouldn't be apples-to-apples.
> >
> > (that said, autopep8 gives suboptimal results on Cython files, for
> > example it changes "&c_variable" to "& c_variable" and
> > "void* ptr" to "void * ptr")
> >
> > Regards
> >
> > Antoine.
> >
> > On 02/04/2020 at 15:30, Antoine Pitrou wrote:
> > >
> > > Hello,
> > >
> > > I've put up two PRs to compare the effect of running black vs. autopep8
> > > on the Python codebase.
> > >
> > > * black: https://github.com/apache/arrow/pull/6810
> > >  65 files changed, 7855 insertions(+), 5215 deletions(-)
> > >
> > > * autopep8: https://github.com/apache/arrow/pull/6811
> > >  20 files changed, 137 insertions(+), 118 deletions(-)
> > >
> > > I've configured black to try and minimize changes (for example, avoid
> > > normalizing string quoting style).  Still, the number of changes is
> > > humongous and they add 2600 lines to the codebase (which is a tangible
> > > amount of vertical space).
> > >
> > > Regards
> > >
> > > Antoine.
> > >


[jira] [Created] (ARROW-8315) [Python]

2020-04-02 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8315:
---

 Summary: [Python]
 Key: ARROW-8315
 URL: https://issues.apache.org/jira/browse/ARROW-8315
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ben Kietzman






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [Python] black vs. autopep8

2020-04-02 Thread Jacek Pliszka
Hi!

I believe amount of changes is not that important.

In my opinion, what matters is which format will allow reviewers to be
more efficient.

The committer can always reformat as they like. It is harder for the reviewer.

BR,

Jacek

On Thu, 2 Apr 2020 at 15:32, Antoine Pitrou wrote:
>
>
> PS: in both cases, Cython files are not processed.  autopep8 is actually
> able to process them, but the comparison wouldn't be apples-to-apples.
>
> (that said, autopep8 gives suboptimal results on Cython files, for
> example it changes "&c_variable" to "& c_variable" and
> "void* ptr" to "void * ptr")
>
> Regards
>
> Antoine.
>
> On 02/04/2020 at 15:30, Antoine Pitrou wrote:
> >
> > Hello,
> >
> > I've put up two PRs to compare the effect of running black vs. autopep8
> > on the Python codebase.
> >
> > * black: https://github.com/apache/arrow/pull/6810
> >  65 files changed, 7855 insertions(+), 5215 deletions(-)
> >
> > * autopep8: https://github.com/apache/arrow/pull/6811
> >  20 files changed, 137 insertions(+), 118 deletions(-)
> >
> > I've configured black to try and minimize changes (for example, avoid
> > normalizing string quoting style).  Still, the number of changes is
> > humongous and they add 2600 lines to the codebase (which is a tangible
> > amount of vertical space).
> >
> > Regards
> >
> > Antoine.
> >


Re: [Python] black vs. autopep8

2020-04-02 Thread Antoine Pitrou


PS: in both cases, Cython files are not processed.  autopep8 is actually
able to process them, but the comparison wouldn't be apples-to-apples.

(that said, autopep8 gives suboptimal results on Cython files, for
> example it changes "&c_variable" to "& c_variable" and
"void* ptr" to "void * ptr")

Regards

Antoine.

On 02/04/2020 at 15:30, Antoine Pitrou wrote:
> 
> Hello,
> 
> I've put up two PRs to compare the effect of running black vs. autopep8
> on the Python codebase.
> 
> * black: https://github.com/apache/arrow/pull/6810
>  65 files changed, 7855 insertions(+), 5215 deletions(-)
> 
> * autopep8: https://github.com/apache/arrow/pull/6811
>  20 files changed, 137 insertions(+), 118 deletions(-)
> 
> I've configured black to try and minimize changes (for example, avoid
> normalizing string quoting style).  Still, the number of changes is
> humongous and they add 2600 lines to the codebase (which is a tangible
> amount of vertical space).
> 
> Regards
> 
> Antoine.
> 


[Python] black vs. autopep8

2020-04-02 Thread Antoine Pitrou


Hello,

I've put up two PRs to compare the effect of running black vs. autopep8
on the Python codebase.

* black: https://github.com/apache/arrow/pull/6810
 65 files changed, 7855 insertions(+), 5215 deletions(-)

* autopep8: https://github.com/apache/arrow/pull/6811
 20 files changed, 137 insertions(+), 118 deletions(-)

I've configured black to try and minimize changes (for example, avoid
normalizing string quoting style).  Still, the number of changes is
humongous and they add 2600 lines to the codebase (which is a tangible
amount of vertical space).

Regards

Antoine.
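An illustrative, hypothetical snippet (not taken from the Arrow codebase) of the stylistic gap being measured here: autopep8 only fixes outright PEP8 violations, while black rewrites long calls into an "exploded" one-argument-per-line form, which is where most of the added vertical space comes from.

```python
# A helper whose calls tend to exceed the 79-character limit.
def make_batch(column_names, column_values, metadata=None, validate=True):
    return dict(zip(column_names, column_values))

# black's typical output for a call that is too long for one line:
# each argument on its own line, with a trailing comma.
batch = make_batch(
    ["a", "b", "c"],
    [[1], [2], [3]],
    metadata=None,
    validate=True,
)
print(len(batch))
```

autopep8 would generally leave a manually wrapped version of the same call untouched, which explains the 7855-insertion vs 137-insertion difference in the two PRs.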


[jira] [Created] (ARROW-8314) [Python] Provide a method to select a subset of columns of a Table

2020-04-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8314:


 Summary: [Python] Provide a method to select a subset of columns 
of a Table
 Key: ARROW-8314
 URL: https://issues.apache.org/jira/browse/ARROW-8314
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Joris Van den Bossche


I looked through the open issues and in our API, but didn't directly find 
something about selecting a subset of columns of a table.

Assume you have a table like:

{code}
table = pa.table({'a': [1, 2], 'b': [.1, .2], 'c': ['a', 'b']})
{code}

You can select a single column with {{table.column('a')}} or {{table['a']}} to 
get a chunked array. You can add, append, remove and replace columns (with 
{{add_column}}, {{append_column}}, {{remove_column}}, {{set_column}}). 
But an easy way to get a subset of the columns (without manually removing 
the ones you don't want one by one) doesn't seem possible. 

I would propose something like:

{code}
table.select(['a', 'c'])
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Join operation on attributes from arrow structs

2020-04-02 Thread Francois Saint-Jacques
They're mapped with the StructType/StructArray, which is also columnar
representation, e.g. one buffer per field in the sub-object. If you
have varying/incompatible types, a field will be promoted to a
UnionType.

François

On Thu, Apr 2, 2020 at 12:54 AM Micah Kornfield  wrote:
>
> Hi Hasara,
> There isn't current functionality in C++/Python to do this (
> https://issues.apache.org/jira/browse/ARROW-4630 is the issue tracking
> this).
>
> Also how nested attributes in json format are mapped into buffers once
> > converted in arrow format?
>
> I'm not sure I understand this question?
>
> Thanks,
> Micah
>
> On Sun, Mar 22, 2020 at 10:09 PM Hasara Maithree <
> hasaramaithreedesi...@gmail.com> wrote:
>
> > Hi all,
> >
> > Assume I have a json file named 'my_data.json' as below.
> >
> > *{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
> > {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"**}}*
> >
> > If I need to do a join operation based on attribute d, can I do it
> > directly from arrow structs? ( or are there any efficient alternatives?)
> > Also how nested attributes in json format are mapped into buffers once
> > converted in arrow format? (example taken from documentation)
> >
> > >>> table = json.read_json("my_data.json")
> > >>> table
> > pyarrow.Table
> > a: list<item: int64>
> >   child 0, item: int64
> > b: struct<c: bool, d: timestamp[s]>
> >   child 0, c: bool
> >   child 1, d: timestamp[s]
> > >>> table.to_pandas()
> >            a                                       b
> > 0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
> > 1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}
> >
> >
> > Thank You
> >


Support of more manipulation for Record Batch

2020-04-02 Thread Chengxin Ma
Hi all,

I am working on a distributed sorting program which runs on multiple 
computation nodes.

In this sorting program, data is represented as pandas DataFrames and key 
operations are groupby, concat, and sort_values. For shuffling data among the 
computation nodes, the DataFrames are converted to Arrow Record Batches and 
communicated via Arrow Flight.

What I’ve noticed is that much time was spent on the conversion between 
DataFrame and Record Batch.

The [zero-copy 
feature](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)
 unfortunately cannot be applied to my case, since the DataFrames contain 
strings as well.

I wanted to try replacing DataFrames with Record Batches, so there would be no 
need of conversion. However, there seems to be no direct way to do groupby and 
sort_values on Record Batches, according to [the 
documentation](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html)

Is there a plan to add such methods to the API of Record Batch in the future?

Kind Regards

Chengxin

Sent with [ProtonMail](https://protonmail.com) Secure Email.

Re: Proposal to use Black for automatic formatting of Python code

2020-04-02 Thread Antoine Pitrou


I have looked at the kind of reformatting used by black and I've become
-1 on this.  `black` is much too aggressive and actually makes the code
less readable.

`autopep8` seems much better and less aggressive. Let's use that
instead.
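For context, a hypothetical illustration (not from the Arrow codebase) of why black reads as more aggressive: autopep8 only fixes PEP 8 violations, while black rewrites even compliant code into its own canonical layout:

```python
def some_function(a, b):
    return a + b

# This layout violates no PEP 8 rule, so autopep8 leaves it untouched:
before = some_function(1,
                       2)

# black would rewrite the call onto one line, its canonical form:
after = some_function(1, 2)

assert before == after == 3
```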

Regards

Antoine.


On Thu, 26 Mar 2020 20:37:01 +0100
Joris Van den Bossche  wrote:
> Hi all,
> 
> I would like to propose adopting Black as code formatter within the python
> project. There is an older JIRA issue about this (
> https://issues.apache.org/jira/browse/ARROW-5176), but bringing it to the
> mailing list for wider attention.
> 
> Black (https://github.com/ambv/black) is a tool for automatically
> formatting python code in ways which flake8 and our other linters approve
> of (and it fills a similar role to clang-format for C++ and cmake-format
> for cmake). It can also be added to the linting checks on CI and to the
> pre-commit hooks, like we now run flake8.
> Using it ensures python code will be formatted consistently, and more
> importantly automates this formatting, letting you focus on more important
> matters.
> 
> Black makes some specific formatting choices, and not everybody (me
> included) will always like those choices (that's how it goes with something
> subjective like formatting). But my experience with using it in some other
> big python projects (pandas, dask) has been very positive. You very quickly
> get used to how it looks, while it is much nicer to not have to worry about
> formatting anymore.
> 
> Best,
> Joris
> 





Re: Clarification regarding the `CDataInterface.rst`

2020-04-02 Thread Anish Biswas
Upgrading the pip installer worked perfectly. Thanks!

Regards,
Anish Biswas

On 2020/04/02 09:35:50, Antoine Pitrou  wrote: 
> 
> Hi Anish,
> 
> It looks like a bug with old pip versions.  You can first upgrade pip using:
> 
> $ pip install -U pip
> 
> Then redo the "pip install" command for pyarrow.
> 
> If you can't upgrade pip, you can install Numpy separately first (using
> "pip install numpy").
> 
> Regards
> 
> Antoine.
> 
> 
> Le 02/04/2020 à 06:07, Anish Biswas a écrit :
> > Hey Antoine,
> > 
> > I am running into a few complications with what you said. It's attempting to 
> > collect numpy>=1.14.0 (from pyarrow), but I cross-checked and there isn't any 
> > .whl file for numpy hosted there. The same is true for six. Can you 
> > please look into it?
> > 
> > Thanks,
> > Anish Biswas
> > 
> > On 2020/03/30 16:15:53, Antoine Pitrou  wrote: 
> >> On Mon, 30 Mar 2020 15:17:02 -
> >> Anish Biswas  wrote:
> >>> Thanks! I'll probably build the Arrow Library from source. Thanks again!
> >>
> >> You should be able to get a nightly build using:
> >>
> >> $ pip install -U --extra-index-url \
> >> https://pypi.fury.io/arrow-nightlies/ --pre pyarrow
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>
> 


Re: Clarification regarding the `CDataInterface.rst`

2020-04-02 Thread Antoine Pitrou


Hi Anish,

It looks like a bug with old pip versions.  You can first upgrade pip using:

$ pip install -U pip

Then redo the "pip install" command for pyarrow.

If you can't upgrade pip, you can install Numpy separately first (using
"pip install numpy").

Regards

Antoine.


Le 02/04/2020 à 06:07, Anish Biswas a écrit :
> Hey Antoine,
> 
> I am running into a few complications with what you said. It's attempting to 
> collect numpy>=1.14.0 (from pyarrow), but I cross-checked and there isn't any .whl 
> file for numpy hosted there. The same is true for six. Can you please 
> look into it?
> 
> Thanks,
> Anish Biswas
> 
> On 2020/03/30 16:15:53, Antoine Pitrou  wrote: 
>> On Mon, 30 Mar 2020 15:17:02 -
>> Anish Biswas  wrote:
>>> Thanks! I'll probably build the Arrow Library from source. Thanks again!
>>
>> You should be able to get a nightly build using:
>>
>> $ pip install -U --extra-index-url \
>> https://pypi.fury.io/arrow-nightlies/ --pre pyarrow
>>
>> Regards
>>
>> Antoine.
>>
>>
>>


[jira] [Created] (ARROW-8313) [Gandiva][UDF] Solutions to register new UDFs dynamically without checking it into arrow repo.

2020-04-02 Thread ZMZ91 (Jira)
ZMZ91 created ARROW-8313:


 Summary: [Gandiva][UDF] Solutions to register new UDFs dynamically 
without checking it into arrow repo.
 Key: ARROW-8313
 URL: https://issues.apache.org/jira/browse/ARROW-8313
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: ZMZ91


Hi there,

Recently I've been studying Gandiva and trying to add some UDFs. I noted that 
the UDF implementation needs to be checked into the Arrow repo, registered, and 
then built into the precompiled_bitcode lib, right? I'm just wondering whether 
it is possible to register new UDFs dynamically. Say I have UDF implementation 
code locally that is not yet built into the Gandiva lib; am I able to call some 
function, or use another solution provided officially by Gandiva, to register 
and use it? Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8312) improve IN expression support

2020-04-02 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-8312:


 Summary: improve IN expression support
 Key: ARROW-8312
 URL: https://issues.apache.org/jira/browse/ARROW-8312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva, Java
Reporter: Yuan Zhou
Assignee: Yuan Zhou


The IN API provided by Gandiva C++ [1] accepts a TreeNode as a parameter, which 
allows an IN expression to operate on the output of some function. However, the 
Java API [2] only accepts a Field as a parameter, which limits the API's usage. 

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/gandiva/tree_expr_builder.h#L94-L125
[2] 
https://github.com/apache/arrow/blob/master/java/gandiva/src/main/java/org/apache/arrow/gandiva/expression/InNode.java#L50-L63



--
This message was sent by Atlassian Jira
(v8.3.4#803005)