[DISCUSS] [Rust] Deleting the legacy query execution code

2019-10-04 Thread Andy Grove
I have been working on ARROW-5227 [1] for quite a while now (with help from
some other contributors), and the new trait-based physical query plan has
now reached feature parity with the previous implementation of query
execution. The previous implementation executed the logical plan enum
directly, so it could not be extended by other projects and did not support
parallel execution.

I would now like to delete the original query execution code which would
reduce the code base by approximately 2,500 lines of code and remove a lot
of code duplication. I have a WIP PR for this [2].

I am raising this for discussion here because there is an argument for just
deprecating this code for 1.0.0 and then removing it in a later release, but I
would prefer to delete it prior to the 1.0.0 release to reduce confusion
and maintenance overhead.

Does anybody have any objections to this?

[1] https://issues.apache.org/jira/browse/ARROW-5227

[2] https://github.com/apache/arrow/pull/5583


Re: [DISCUSS] Result vs Status

2019-10-04 Thread Micah Kornfield
>
> >
> >  It was my impression that we had workable solutions for using Result in
> at
> > least Python and Glib/Ruby (I don't know about R).
>
> In Python we do (though it needed a C++-side helper).
>

OK, so could more context be provided on:


> From the discussion in the sync call, it seems reasonable to require that:
> Public APIs which are likely to be directly wrapped in a binding should not
> use Result<> to the exclusion of Status. An equivalent Status API should
> always be provided for ease of binding.
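To make the requirement above concrete, here is a minimal sketch of one function offered in both forms. The `Status` and `Result` classes below are simplified, hypothetical stand-ins written for this example (they are not Arrow's actual classes), and `ParsePositive` is an invented function.

```cpp
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; NOT Arrow's actual Status/Result.
class Status {
 public:
  Status() = default;  // default-constructed Status means "OK"
  static Status OK() { return Status(); }
  static Status Invalid(std::string msg) {
    Status s;
    s.ok_ = false;
    s.msg_ = std::move(msg);
    return s;
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  bool ok_ = true;
  std::string msg_;
};

template <typename T>
class Result {
 public:
  Result(T value) : value_(std::move(value)) {}         // success
  Result(Status status) : status_(std::move(status)) {}  // failure
  bool ok() const { return status_.ok(); }
  const Status& status() const { return status_; }
  const T& value() const { return value_; }

 private:
  Status status_;
  T value_{};
};

// Result-returning form: value and error travel together.
inline Result<int> ParsePositive(const std::string& text) {
  if (text.empty()) return Status::Invalid("empty input");
  if (text.size() > 9) return Status::Invalid("too many digits");
  int value = 0;
  for (char c : text) {
    if (c < '0' || c > '9') return Status::Invalid("not a number: " + text);
    value = value * 10 + (c - '0');
  }
  return value;
}

// Equivalent Status form with an out-parameter, which may be easier to wrap
// in a binding that has no notion of Result<>.
inline Status ParsePositive(const std::string& text, int* out) {
  Result<int> result = ParsePositive(text);
  if (!result.ok()) return result.status();
  *out = result.value();
  return Status::OK();
}
```

The Status overload simply forwards to the Result one, so the two cannot drift apart; whether that is the right layering for a real API is a separate design question.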


Thanks,
Micah

On Thu, Oct 3, 2019 at 3:42 AM Antoine Pitrou  wrote:

>
> Le 03/10/2019 à 06:13, Micah Kornfield a écrit :
> >
> >  It was my impression that we had workable solutions for using Result in
> at
> > least Python and Glib/Ruby (I don't know about R).
>
> In Python we do (though it needed a C++-side helper).
>
> Regards
>
> Antoine.
>


Re: [jira] [Created] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-04 Thread Thomas S
Very recently I had the pleasure of installing arrow on Linux. At this stage
let me first remark that without the help of @xhochy and @kou I certainly
would have failed. I have now managed to install it (though still with quite
a lot of warning messages) in a rocker container. I have published the
Docker image here:

https://hub.docker.com/r/tschm/rocker-arrow

Maybe one of the experts could fix and/or improve it? Many thanks

Thomas



On Fri, 4 Oct 2019 at 20:07, Neal Richardson (Jira)  wrote:

> Neal Richardson created ARROW-6793:
> --
>
>  Summary: [R] Arrow C++ binary packaging for Linux
>  Key: ARROW-6793
>  URL: https://issues.apache.org/jira/browse/ARROW-6793
>  Project: Apache Arrow
>   Issue Type: Improvement
>   Components: R
> Reporter: Neal Richardson
> Assignee: Neal Richardson
>  Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've
> already installed the Arrow C++ library, when you install the R package,
> you get a shell that tells you to install the C++ library. That was a
> useful approach to allow us to get the package on CRAN, which makes it easy
> for macOS and Windows users to install, but it doesn't improve the
> installation experience for Linux users. This is an impediment to adoption
> of arrow not only by users but also by package maintainers who might want
> to depend on arrow.
>
> macOS and Windows have a better experience because at installation time,
> the configure scripts download and statically link a prebuilt C++ library.
> CRAN bundles the whole thing up and delivers that as a binary R package.
>
> Python wheels do a similar thing: they're binaries that contain all
> external dependencies. And there are pyarrow wheels for Linux. This
> suggests that we could do something similar for R: build a generic Linux
> binary of the C++ library and download it in the R package configure script
> at install time.
>
> I experimented with using the Arrow C++ binaries included in the Python
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS
> (not useful for R, but it proved the concept) and almost worked on Linux,
> but it turned out that the "manylinux2010" standard is too archaic to work
> with contemporary Rcpp.
>
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build
> does, just with slightly more modern compiler/settings. Publish that C++
> binary package to bintray. Then download it in the R configure script if a
> local/system package isn't found.
>
> Once we have a basic version working, test against various distros on
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid
> everywhere and/or ensure the current fallback behavior when we encounter a
> distro that this doesn't work for. If necessary, we can make multiple
> flavors of this C++ binary for debian, centos, etc.
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>


-- 
Dr. Thomas Schmelzer


[jira] [Created] (ARROW-6795) [C#] Reading large Arrow files in C# results in an exception

2019-10-04 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-6795:
---

 Summary: [C#] Reading large Arrow files in C# results in an 
exception
 Key: ARROW-6795
 URL: https://issues.apache.org/jira/browse/ARROW-6795
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Eric Erhardt


If you try to read a large Arrow file (2GB+) using the C# reader, you get an 
exception because it is casting the file position (a 64-bit long) to a 32-bit 
integer. This happens when the file is large enough that the position no 
longer fits in 32 bits.

 

See [https://github.com/apache/arrow/pull/5412]
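
The bug itself is in the C# reader; the C++ sketch below only illustrates the arithmetic behind it. The function names are invented for this example. It shows that a position of 2 GiB (2^31 bytes) is the first one that no longer fits in a signed 32-bit integer, and how a checked narrowing can report that instead of truncating.

```cpp
#include <cstdint>
#include <limits>

// True when a 64-bit file position is representable as a signed 32-bit int.
inline bool FitsInInt32(std::int64_t position) {
  return position >= std::numeric_limits<std::int32_t>::min() &&
         position <= std::numeric_limits<std::int32_t>::max();
}

// Checked narrowing: returns false instead of silently truncating.
// (Hypothetical helper; the real fix keeps the position 64-bit throughout.)
inline bool TryNarrow(std::int64_t position, std::int32_t* out) {
  if (!FitsInInt32(position)) return false;
  *out = static_cast<std::int32_t>(position);
  return true;
}
```

Keeping the position as a 64-bit value end to end avoids the problem entirely; the checked helper is only useful where a 32-bit API cannot be avoided.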





Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Wes McKinney
On Fri, Oct 4, 2019 at 3:18 PM Zhuo Peng  wrote:
>
>
>
> On 2019/10/04 19:43:04, Wes McKinney  wrote:
> > On Fri, Oct 4, 2019 at 12:45 PM Zhuo Peng  wrote:
> > >
> > >
> > >
> > > On 2019/10/04 17:05:00, Antoine Pitrou  wrote:
> > > >
> > > > Le 04/10/2019 à 19:01, Zhuo Peng a écrit :
> > > > >
> > > > > backports are cool for internal use, but probably not so if a public 
> > > > > API accepts it? (because you vendor the headers in (i.e. namespace, 
> > > > > symbol names unchanged), they might clash with headers that a client 
> > > > > uses).
> > > >
> > > > This is true unfortunately.
> > > >
> > > > >>> And btw, was -std=gnu++11 an intentional choice? what gnu 
> > > > >>> extensions does the library rely on?
> > > > >>
> > > > >> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 
> > > > >> added?
> > > > > https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33
> > > > >
> > > > > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html
> > > >
> > > > Right, so this is a CMake decision.  I think we require only plain C++11
> > > > (but we may enable additional features on some compilers, provided
> > > > there's a fallback).
> > > Extensions can be disabled through:
> > > set(CMAKE_CXX_EXTENSIONS OFF)
> > >
> > > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_EXTENSIONS.html
> > >
> > > Is that something more desirable than the current state?
> >
> > Yes, I think so, I don't think we need to be relying on GNU gcc
> > extensions, but we should open a JIRA issue about disabling it in case
> > some tests break because of something we didn't realize we were
> > depending on.
> sg. I'll create one then.
> >
> > As far as C++14/17 upgrading, it seems like it will be at least 2
> > years before we could upgrade to C++17 given the state of compiler
> > support across the spectrum. Using C++17 would mean requiring at least
> > VS 2017 on Windows, since at least in the Python world I think
> > everything is on VS 2015.
> >
> > Are there ways we could create defines to switch between backports and
> > STL things (like string_view, optional, etc.) so that developers using
> > the Arrow library in a C++17 application can use the built-in types?
> This is dangerous unless they build the Arrow library from source with C++17. 
> if libarrow takes arrow::string_view but the user gives it a 
> std::string_view, it's UB.
>
> If we are talking about allowing users to build Arrow with C++17 and support 
> transparently the new STL types in the public APIs, the ABSL[1] library could 
> be something to consider.. absl::{string_view,optional,variant} becomes their 
> std:: counterparts when compiled under C++17, e.g. [2].
>

Yes, the presumption would be a monotoolchain environment, and Arrow
would need to have some CMake options to build in C++17 mode

> And inline namespaces are used [3] to make sure different libraries can 
> depend on different version of absl.
>
> [1] https://abseil.io/
> [2] 
> https://github.com/abseil/abseil-cpp/blob/25597bdfc148e91e27678ec30efa52f4fc8c164f/absl/strings/string_view.h#L38
> [3] 
> https://github.com/abseil/abseil-cpp/blob/aa844899c937bde5d2b24f276b59997e5b668bde/absl/strings/string_view.h#L38
> >
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> >


[jira] [Created] (ARROW-6794) [Release] dev/release/post-03-website.sh is out of date in a couple of ways

2019-10-04 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6794:
---

 Summary: [Release] dev/release/post-03-website.sh is out of date 
in a couple of ways
 Key: ARROW-6794
 URL: https://issues.apache.org/jira/browse/ARROW-6794
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 1.0.0


* Need to add APACHE_ prefix to environment variables
* arrow-site repository is now separate





Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng



On 2019/10/04 19:43:04, Wes McKinney  wrote: 
> On Fri, Oct 4, 2019 at 12:45 PM Zhuo Peng  wrote:
> >
> >
> >
> > On 2019/10/04 17:05:00, Antoine Pitrou  wrote:
> > >
> > > Le 04/10/2019 à 19:01, Zhuo Peng a écrit :
> > > >
> > > > backports are cool for internal use, but probably not so if a public 
> > > > API accepts it? (because you vendor the headers in (i.e. namespace, 
> > > > symbol names unchanged), they might clash with headers that a client 
> > > > uses).
> > >
> > > This is true unfortunately.
> > >
> > > >>> And btw, was -std=gnu++11 an intentional choice? what gnu extensions 
> > > >>> does the library rely on?
> > > >>
> > > >> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 
> > > >> added?
> > > > https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33
> > > >
> > > > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html
> > >
> > > Right, so this is a CMake decision.  I think we require only plain C++11
> > > (but we may enable additional features on some compilers, provided
> > > there's a fallback).
> > Extensions can be disabled through:
> > set(CMAKE_CXX_EXTENSIONS OFF)
> >
> > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_EXTENSIONS.html
> >
> > Is that something more desirable than the current state?
> 
> Yes, I think so, I don't think we need to be relying on GNU gcc
> extensions, but we should open a JIRA issue about disabling it in case
> some tests break because of something we didn't realize we were
> depending on.
Sounds good. I'll create one then.
> 
> As far as C++14/17 upgrading, it seems like it will be at least 2
> years before we could upgrade to C++17 given the state of compiler
> support across the spectrum. Using C++17 would mean requiring at least
> VS 2017 on Windows, since at least in the Python world I think
> everything is on VS 2015.
> 
> Are there ways we could create defines to switch between backports and
> STL things (like string_view, optional, etc.) so that developers using
> the Arrow library in a C++17 application can use the built-in types?
This is dangerous unless they build the Arrow library from source with C++17. 
If libarrow takes arrow::string_view but the user gives it a std::string_view, 
it's UB.

If we are talking about allowing users to build Arrow with C++17 and 
transparently support the new STL types in the public APIs, the ABSL[1] library 
could be something to consider. absl::{string_view,optional,variant} become 
their std:: counterparts when compiled under C++17, e.g. [2].
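
As a rough sketch of the "alias to std:: under C++17, otherwise use a backport" idea: the namespace `mylib`, the tiny fallback class, and `CountChar` below are all invented for illustration and are neither Arrow's nor Abseil's code. Note that some compilers (MSVC without /Zc:__cplusplus) do not report the real standard level through `__cplusplus`, so a real project would need a more careful check.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

#if __cplusplus >= 201703L
#include <string_view>
#endif

namespace mylib {

#if __cplusplus >= 201703L
// C++17 and later: the library type IS the standard type.
using string_view = std::string_view;
#else
// Pre-C++17: a deliberately tiny, hypothetical backport.
class string_view {
 public:
  string_view(const char* s) : data_(s), size_(std::strlen(s)) {}
  string_view(const std::string& s) : data_(s.data()), size_(s.size()) {}
  const char* begin() const { return data_; }
  const char* end() const { return data_ + size_; }
  std::size_t size() const { return size_; }

 private:
  const char* data_;
  std::size_t size_;
};
#endif

// A public API written once against mylib::string_view.
inline std::size_t CountChar(string_view text, char wanted) {
  std::size_t n = 0;
  for (char c : text) {
    if (c == wanted) ++n;
  }
  return n;
}

}  // namespace mylib
```

The catch raised above still applies: the library and its caller must be compiled with the same choice, otherwise the two sides disagree on what `mylib::string_view` is.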

And inline namespaces are used [3] to make sure different libraries can depend 
on different versions of absl.
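
For readers less familiar with inline namespaces, here is a small sketch of the mechanism. The namespace and function names are made up for this example and do not come from Abseil.

```cpp
#include <string>

namespace mylib {

// Names in an inline namespace are reachable as mylib::Name, but the
// version tag becomes part of the fully qualified (and mangled) symbol name.
inline namespace v2 {
inline std::string Version() { return "v2"; }
}  // namespace v2

// An older version stays addressable only through its explicit name.
namespace v1 {
inline std::string Version() { return "v1"; }
}  // namespace v1

}  // namespace mylib
```

Callers write `mylib::Version()` and get the `v2` definition, while the symbol is emitted as `mylib::v2::Version`, so two libraries built against differently tagged versions do not silently share one definition.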

[1] https://abseil.io/ 
[2] 
https://github.com/abseil/abseil-cpp/blob/25597bdfc148e91e27678ec30efa52f4fc8c164f/absl/strings/string_view.h#L38
[3] 
https://github.com/abseil/abseil-cpp/blob/aa844899c937bde5d2b24f276b59997e5b668bde/absl/strings/string_view.h#L38
> 
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> 


Re: [Proposal]: Expose Flight gRPC for Dremio use case (Java)

2019-10-04 Thread Rohit Gupta
I think for now we just want to expose the gRPC impl under a different
namespace

- FlightGrpcServer would expose FlightBindingService
- FlightGrpcClient would expose FlightClient

On Fri, Oct 4, 2019 at 11:48 AM David Li  wrote:

> Hi Rohit,
>
> This sounds interesting, and I think we've voiced support for
> something similar before :)
>
> Given that Flight does want to abstract over the exact backends,
> though, how should we approach this? Is the proposal to also refactor
> Flight/Java such that the core classes are just interfaces (or
> delegate to interfaces) that anyone can implement, and have the gRPC
> implementation as the reference one? Or is this just proposing to
> expose the gRPC implementation under a separate namespace, and leave
> that question for later?
>
> Best,
> David
>
> On 10/4/19, Rohit Gupta  wrote:
> > Hi,
> >
> > At dremio we are using gRPC for JobsService. One of the api's relies on
> > Arrow Flight. We want access to the Flight service so we can bind it to
> the
> > same managed channel  as the rest of JobsService (& not have a completely
> > separate server).
> >
> > The approach would be to create a new module within the same package
> > (org.apache.arrow.flight) and have 2 classes FlightGrpcServer &
> > FlightGrpcClient that expose the client & server, and also make
> > FlightClient ctor package-private.
> >
> > Please let us know if you have questions or concerns.
> >
> > Best,
> > Rohit
> >
>


Re: [VOTE] Release Apache Arrow 0.15.0 - RC2

2019-10-04 Thread Wes McKinney
The commits from your local RC branch aren't available, so I cannot
rebase master yet; I'll just wait for you to be available again. If
anyone has some spare time, we should try to complete as many
post-release tasks as possible this weekend so we can announce the
release on Monday or Tuesday next week.

Thanks all for your help getting this release ready!

On Fri, Oct 4, 2019 at 6:40 AM Krisztián Szűcs
 wrote:
>
> We have 5 binding +1 votes and 2 non-binding +1 votes so far.
> The 72 hours have passed, so we can close the release vote.
>
> Sadly I won't be available for the rest of the day, so I will only be able
> to close the vote and start working on the post-release tasks
> from tomorrow.
> @Wes if you have bandwidth feel free to close the vote sooner.
>
>
> On Thu, Oct 3, 2019 at 1:14 AM Bryan Cutler  wrote:
>
> > Accidentally sent too soon. The ORC build error I got was probably just an
> > env issue for me, but here it is in case anyone else had the same issue:
> >
> > In file included from
> > /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc:44:0:
> > /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc:970:13:
> > error: ‘dynamic_init_dummy_orc_5fproto_2eproto’ defined but not used [-Werror=unused-variable]
> >  static bool dynamic_init_dummy_orc_5fproto_2eproto = []() {
> > AddDescriptors_orc_5fproto_2eproto(); return true; }();
> >  ^
> > cc1plus: all warnings being treated as errors
> > make[5]: *** [c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o] Error 1
> > make[5]: *** Waiting for unfinished jobs
> > make[4]: *** [c++/src/CMakeFiles/orc.dir/all] Error 2
> > make[3]: *** [all] Error 2
> >
> > [ 29%] Performing build step for 'orc_ep'
> > CMake Error at
> >
> > /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep-stamp/orc_ep-build-RELEASE.cmake:16
> > (message):
> >   Command failed: 2
> >
> >'make'
> >
> >   See also
> >
> >
> >
> > /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep-stamp/orc_ep-build-*.log
> >
> > CMakeFiles/orc_ep.dir/build.make:111: recipe for target
> > 'orc_ep-prefix/src/orc_ep-stamp/orc_ep-build' failed
> > make[2]: *** [orc_ep-prefix/src/orc_ep-stamp/orc_ep-build] Error 1
> > CMakeFiles/Makefile2:1248: recipe for target 'CMakeFiles/orc_ep.dir/all'
> > failed
> > make[1]: *** [CMakeFiles/orc_ep.dir/all] Error 2
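
For context on the error in the log above, here is a small sketch of the same code shape: a static variable that exists only for the side effect of its initializer. The names and the `EXAMPLE_UNUSED` macro are invented for this example; this is not the generated ORC/protobuf code, and it is only one possible way to keep such a variable from tripping an unused-variable diagnostic on GCC or Clang.

```cpp
// Expands to the GCC/Clang "unused" attribute where available.
#if defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_UNUSED __attribute__((unused))
#else
#define EXAMPLE_UNUSED
#endif

// Records that the initializer below actually ran.
static bool g_init_ran = false;

// Same shape as the variable in the log: never read, initialized by an
// immediately invoked lambda purely for its side effect.
static bool dynamic_init_dummy_example EXAMPLE_UNUSED = []() {
  g_init_ran = true;
  return true;
}();
```

Under C++17 the standard `[[maybe_unused]]` attribute serves the same purpose. For third-party code pulled in as an external project, not building it with warnings-as-errors is usually the less invasive option.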
> >
> > On Wed, Oct 2, 2019 at 4:12 PM Bryan Cutler  wrote:
> >
> > > +1 (non-binding)
> > >
> > > I ran the following on Ubuntu 16.04 4.15.0-64-generic:
> > > > dev/release/verify-release-candidate.sh binaries 0.15.0 2
> > > > ARROW_CUDA=OFF \
> > > TEST_DEFAULT=0 \
> > > TEST_SOURCE=1 \
> > > TEST_CPP=1 \
> > > TEST_PYTHON=1 \
> > > TEST_JAVA=1 \
> > > TEST_INTEGRATION=1 \
> > > dev/release/verify-release-candidate.sh source 0.15.0 2
> > >
> > > For source verification I set INTEGRATION_TEST_ARGS="--enable-js=0
> > > --enable-go=0"
> > >
> > > When attempting source verification with defaults, I got the below error
> > > when building the ORC adapter. It looks like just a warning that is being
> > > treated as error and seems to be only in
> > >
> > > On Wed, Oct 2, 2019 at 7:53 AM Andy Grove  wrote:
> > >
> > >> +1 (binding)
> > >>
> > >> On Mon, Sep 30, 2019 at 11:57 PM Krisztián Szűcs <
> > >> szucs.kriszt...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I would like to propose the following release candidate (RC2) of
> > Apache
> > >> > Arrow version 0.15.0. This is a release consisting of 697
> > >> > resolved JIRA issues[1].
> > >> >
> > >> > This release candidate is based on commit:
> > >> > 40d468e162e88e1761b1e80b3ead060f0be927ee [2]
> > >> >
> > >> > The source release rc2 is hosted at [3].
> > >> > The binary artifacts are hosted at [4][5][6][7].
> > >> > The changelog is located at [8].
> > >> >
> > >> > Please download, verify checksums and signatures, run the unit tests,
> > >> > and vote on the release. See [9] for how to validate a release
> > >> candidate.
> > >> >
> > >> > The vote will be open for at least 72 hours.
> > >> >
> > >> > [ ] +1 Release this as Apache Arrow 0.15.0
> > >> > [ ] +0
> > >> > [ ] -1 Do not release this as Apache Arrow 0.15.0 because...
> > >> >
> > >> > [1]:
> > >> >
> > >> >
> > >>
> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.15.0
> > >> > [2]:
> > >> >
> > >> >
> > >>
> > https://github.com/apache/arrow/tree/40d468e162e88e1761b1e80b3ead060f0be927ee
> > >> > [3]:
> > >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.15.0-rc2
> > >> > [4]: https://bintray.com/apache/arrow/centos-rc/0.15.0-rc2
> > >> > [5]: https://bintray.com/apache/arrow/debian-rc/0.15.0-rc2
> > >>

Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Wes McKinney
On Fri, Oct 4, 2019 at 12:45 PM Zhuo Peng  wrote:
>
>
>
> On 2019/10/04 17:05:00, Antoine Pitrou  wrote:
> >
> > Le 04/10/2019 à 19:01, Zhuo Peng a écrit :
> > >
> > > backports are cool for internal use, but probably not so if a public API 
> > > accepts it? (because you vendor the headers in (i.e. namespace, symbol 
> > > names unchanged), they might clash with headers that a client uses).
> >
> > This is true unfortunately.
> >
> > >>> And btw, was -std=gnu++11 an intentional choice? what gnu extensions 
> > >>> does the library rely on?
> > >>
> > >> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 added?
> > > https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33
> > >
> > > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html
> >
> > Right, so this is a CMake decision.  I think we require only plain C++11
> > (but we may enable additional features on some compilers, provided
> > there's a fallback).
> Extensions can be disabled through:
> set(CMAKE_CXX_EXTENSIONS OFF)
>
> https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_EXTENSIONS.html
>
> Is that something more desirable than the current state?

Yes, I think so, I don't think we need to be relying on GNU gcc
extensions, but we should open a JIRA issue about disabling it in case
some tests break because of something we didn't realize we were
depending on.

As far as C++14/17 upgrading, it seems like it will be at least 2
years before we could upgrade to C++17 given the state of compiler
support across the spectrum. Using C++17 would mean requiring at least
VS 2017 on Windows, since at least in the Python world I think
everything is on VS 2015.

Are there ways we could create defines to switch between backports and
STL things (like string_view, optional, etc.) so that developers using
the Arrow library in a C++17 application can use the built-in types?

> >
> > Regards
> >
> > Antoine.
> >


Re: [Proposal]: Expose Flight gRPC for Dremio use case (Java)

2019-10-04 Thread Wes McKinney
Is it possible for a single gRPC server to expose multiple services
through the same port (it sounds like it is)? It would be a good idea
to do a similar refactoring in C++ so that Flight RPC endpoints can be
provided alongside some other non-Flight endpoints in the same gRPC
server.
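
To illustrate why several services can share one port: gRPC identifies each call by a path of the form /package.Service/Method, so one listener can route to any registered service. The sketch below is a toy dispatcher written for this example only. `ToyServer` is not the gRPC API, and the service and method names are hypothetical.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for an RPC server; NOT the gRPC API.
class ToyServer {
 public:
  using Handler = std::function<std::string(const std::string&)>;

  // Registers a handler under "/<service>/<method>".
  void Register(const std::string& service, const std::string& method,
                Handler handler) {
    handlers_["/" + service + "/" + method] = std::move(handler);
  }

  // Routes a request by its full path; unknown paths are rejected.
  std::string Dispatch(const std::string& path,
                       const std::string& request) const {
    auto it = handlers_.find(path);
    if (it == handlers_.end()) return "UNIMPLEMENTED";
    return it->second(request);
  }

 private:
  std::map<std::string, Handler> handlers_;
};
```

Because routing happens on the full path, a Flight service and an unrelated service can sit side by side as long as their fully qualified names differ.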

On Fri, Oct 4, 2019 at 1:49 PM David Li  wrote:
>
> Hi Rohit,
>
> This sounds interesting, and I think we've voiced support for
> something similar before :)
>
> Given that Flight does want to abstract over the exact backends,
> though, how should we approach this? Is the proposal to also refactor
> Flight/Java such that the core classes are just interfaces (or
> delegate to interfaces) that anyone can implement, and have the gRPC
> implementation as the reference one? Or is this just proposing to
> expose the gRPC implementation under a separate namespace, and leave
> that question for later?
>
> Best,
> David
>
> On 10/4/19, Rohit Gupta  wrote:
> > Hi,
> >
> > At dremio we are using gRPC for JobsService. One of the api's relies on
> > Arrow Flight. We want access to the Flight service so we can bind it to the
> > same managed channel  as the rest of JobsService (& not have a completely
> > separate server).
> >
> > The approach would be to create a new module within the same package
> > (org.apache.arrow.flight) and have 2 classes FlightGrpcServer &
> > FlightGrpcClient that expose the client & server, and also make
> > FlightClient ctor package-private.
> >
> > Please let us know if you have questions or concerns.
> >
> > Best,
> > Rohit
> >


Re: [Proposal]: Expose Flight gRPC for Dremio use case (Java)

2019-10-04 Thread David Li
Hi Rohit,

This sounds interesting, and I think we've voiced support for
something similar before :)

Given that Flight does want to abstract over the exact backends,
though, how should we approach this? Is the proposal to also refactor
Flight/Java such that the core classes are just interfaces (or
delegate to interfaces) that anyone can implement, and have the gRPC
implementation as the reference one? Or is this just proposing to
expose the gRPC implementation under a separate namespace, and leave
that question for later?

Best,
David

On 10/4/19, Rohit Gupta  wrote:
> Hi,
>
> At dremio we are using gRPC for JobsService. One of the api's relies on
> Arrow Flight. We want access to the Flight service so we can bind it to the
> same managed channel  as the rest of JobsService (& not have a completely
> separate server).
>
> The approach would be to create a new module within the same package
> (org.apache.arrow.flight) and have 2 classes FlightGrpcServer &
> FlightGrpcClient that expose the client & server, and also make
> FlightClient ctor package-private.
>
> Please let us know if you have questions or concerns.
>
> Best,
> Rohit
>


[jira] [Created] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-04 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6793:
--

 Summary: [R] Arrow C++ binary packaging for Linux
 Key: ARROW-6793
 URL: https://issues.apache.org/jira/browse/ARROW-6793
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Our current installation experience on Linux isn't ideal. Unless you've already 
installed the Arrow C++ library, when you install the R package, you get a 
shell that tells you to install the C++ library. That was a useful approach to 
allow us to get the package on CRAN, which makes it easy for macOS and Windows 
users to install, but it doesn't improve the installation experience for Linux 
users. This is an impediment to adoption of arrow not only by users but also by 
package maintainers who might want to depend on arrow. 

macOS and Windows have a better experience because at installation time, the 
configure scripts download and statically link a prebuilt C++ library. CRAN 
bundles the whole thing up and delivers that as a binary R package. 

Python wheels do a similar thing: they're binaries that contain all external 
dependencies. And there are pyarrow wheels for Linux. This suggests that we 
could do something similar for R: build a generic Linux binary of the C++ 
library and download it in the R package configure script at install time.

I experimented with using the Arrow C++ binaries included in the Python wheels 
in R. See discussion at the end of ARROW-5956. This worked on macOS (not useful 
for R, but it proved the concept) and almost worked on Linux, but it turned out 
that the "manylinux2010" standard is too archaic to work with contemporary 
Rcpp. 

Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
just with slightly more modern compiler/settings. Publish that C++ binary 
package to bintray. Then download it in the R configure script if a 
local/system package isn't found.

Once we have a basic version working, test against various distros on 
[R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
and/or ensure the current fallback behavior when we encounter a distro that 
this doesn't work for. If necessary, we can make multiple flavors of this C++ 
binary for debian, centos, etc.





[Proposal]: Expose Flight gRPC for Dremio use case (Java)

2019-10-04 Thread Rohit Gupta
Hi,

At Dremio we are using gRPC for JobsService. One of the APIs relies on
Arrow Flight. We want access to the Flight service so we can bind it to the
same managed channel as the rest of JobsService (and not have a completely
separate server).

The approach would be to create a new module within the same package
(org.apache.arrow.flight) with two classes, FlightGrpcServer and
FlightGrpcClient, that expose the server and client, and also to make the
FlightClient constructor package-private.

Please let us know if you have questions or concerns.

Best,
Rohit


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng



On 2019/10/04 17:05:00, Antoine Pitrou  wrote: 
> 
> Le 04/10/2019 à 19:01, Zhuo Peng a écrit :
> > 
> > backports are cool for internal use, but probably not so if a public API 
> > accepts it? (because you vendor the headers in (i.e. namespace, symbol 
> > names unchanged), they might clash with headers that a client uses).
> 
> This is true unfortunately.
> 
> >>> And btw, was -std=gnu++11 an intentional choice? what gnu extensions does 
> >>> the library rely on?
> >>
> >> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 added?
> > https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33
> > 
> > https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html
> 
> Right, so this is a CMake decision.  I think we require only plain C++11
> (but we may enable additional features on some compilers, provided
> there's a fallback).
Extensions can be disabled through:
set(CMAKE_CXX_EXTENSIONS OFF)

https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_EXTENSIONS.html

Is that something more desirable than the current state? 
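
One way to see which mode a translation unit is actually built in: GCC and Clang define `__STRICT_ANSI__` when a strict `-std=c++NN` is used and leave it undefined under `-std=gnu++NN`. The helper below is a sketch written for this example (the function name is made up); on other compilers it reports "unknown".

```cpp
#include <string>

// Reports whether GNU extensions are enabled for this translation unit.
// Returns "strict" (-std=c++NN), "gnu" (-std=gnu++NN), or "unknown"
// for compilers that do not follow the GCC convention.
inline std::string DialectMode() {
#if defined(__GNUC__) || defined(__clang__)
#if defined(__STRICT_ANSI__)
  return "strict";
#else
  return "gnu";
#endif
#else
  return "unknown";
#endif
}
```

A check like this could be dropped into a test to confirm that turning CMAKE_CXX_EXTENSIONS off really changed the flags passed to the compiler.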
> 
> Regards
> 
> Antoine.
> 


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Antoine Pitrou


Le 04/10/2019 à 19:01, Zhuo Peng a écrit :
> 
> backports are cool for internal use, but probably not so if a public API 
> accepts it? (because you vendor the headers in (i.e. namespace, symbol names 
> unchanged), they might clash with headers that a client uses).

This is true unfortunately.

>>> And btw, was -std=gnu++11 an intentional choice? what gnu extensions does 
>>> the library rely on?
>>
>> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 added?
> https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33
> 
> https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html

Right, so this is a CMake decision.  I think we require only plain C++11
(but we may enable additional features on some compilers, provided
there's a fallback).

Regards

Antoine.


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Antoine Pitrou


C++14 isn't very interesting.  C++17 is, but it's probably too young
given the diversity of platform and toolchain requirements that constrain us.

Regards

Antoine.


Le 04/10/2019 à 18:13, Neal Richardson a écrit :
> We do have to care about more than just conda and Python. For R, for
> example, C++14 support is limited:
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Using-C_002b_002b14-code
> 
> That's not to say that we can't try (if C++ devs wanted to, and I
> can't speak for them), but it might be time-consuming (or worse) to
> figure out what features are supported on all of the platforms and
> languages that Arrow C++ intends to work for. That said, hopefully our
> CI coverage is sufficient to let someone try it out and see what
> breaks.
> 
> Neal
> 
> On Fri, Oct 4, 2019 at 9:05 AM Zhuo Peng  wrote:
>>
>> Dear Arrow maintainers,
>>
>> Sorry if this was raised before. I did search the mailing list but "C++" 
>> matched too many results..
>>
>> With manylinux1 (GCC4.8) being sunset, both Conda and Pypa are providing a 
>> modern enough toolchain (Conda Forge - GCC7; Pypa manylinux2010 docker - 
>> devtoolset-8(GCC8)). And full C++17 support has been included in GCC7 [1]. I 
>> wonder what are the concerns of adopting a newer standard?
>>
>> C++14 might not bring a whole lot of interesting features, but C++17 brings:
>>
>> std::string_view
>> std::optional
>> std::variant (the newly added Result class is based on some form of variant 
>> implementation I suppose?)
>>
>> and many syntax sugar.. (like emplace_back() returning back(), so you can do 
>> RETURN_NOT_OK(CreateArray(my_array_sp_vector.emplace_back(
>>
>> And btw, was -std=gnu++11 an intentional choice? what gnu extensions does 
>> the library rely on?
>>
>> [1] https://gcc.gnu.org/projects/cxx-status.html
>>
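
As a small illustration of the emplace_back() point in the quoted message: since C++17, `std::vector::emplace_back` returns a reference to the new element, while before that it returned `void` and a separate `back()` call was needed. `AppendName` is an invented helper for this sketch.

```cpp
#include <string>
#include <vector>

// Appends a name and returns a reference to the stored element.
inline std::string& AppendName(std::vector<std::string>& names,
                               const char* name) {
#if __cplusplus >= 201703L
  // C++17: emplace_back returns a reference to the inserted element.
  return names.emplace_back(name);
#else
  // C++11/14: emplace_back returns void, so fetch the element afterwards.
  names.emplace_back(name);
  return names.back();
#endif
}
```

With the C++17 form the new element can be passed straight into another call, which is the convenience the message alludes to.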


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng



On 2019/10/04 16:53:59, Antoine Pitrou  wrote: 
> 
> Le 04/10/2019 à 18:05, Zhuo Peng a écrit :
> > Dear Arrow maintainers,
> > 
> > Sorry if this was raised before. I did search the mailing list but "C++" 
> > matched too many results..
> > 
> > With manylinux1 (GCC4.8) being sunset, both Conda and Pypa are providing a 
> > modern enough toolchain (Conda Forge - GCC7; Pypa manylinux2010 docker - 
> > devtoolset-8(GCC8)). And full C++17 support has been included in GCC7 [1]. 
> > I wonder what are the concerns of adopting a newer standard?
> >  
> > C++14 might not bring a whole lot of interesting features, but C++17 brings:
> > 
> > std::string_view
> > std::optional
> > std::variant (the newly added Result class is based on some form of variant 
> > implementation I suppose?)
> 
> We already have `string_view` and `variant` backports.  We could
> reasonably add a `optional` backport.
> 
Backports are fine for internal use, but probably less so when a public API
accepts them: because the headers are vendored as-is (namespace and symbol
names unchanged), they might clash with the headers a client uses.
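
To make the concern concrete, here is a minimal sketch of one way a library could keep a backport from colliding with a client's copy: put it in the library's own namespace. Everything named below (`mylib`, `vendored`, `CountBytes`, and the tiny `string_view` class) is hypothetical and only for illustration; it is not how Arrow lays out its vendored headers.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical library; the names below are illustrative, not Arrow's layout.
namespace mylib {
namespace vendored {

// Tiny stand-in for a string_view backport, kept inside the library's own
// namespace so its symbols cannot collide with a client's copy.
class string_view {
 public:
  string_view(const char* data) : data_(data), size_(std::strlen(data)) {}
  const char* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  const char* data_;
  std::size_t size_;
};

}  // namespace vendored

// Public API that accepts the library-qualified backport type.
inline std::size_t CountBytes(vendored::string_view s) { return s.size(); }

}  // namespace mylib

// A client that ships its own, different copy of a backport under the
// upstream namespace; it coexists with mylib::vendored::string_view.
namespace nonstd {
struct string_view {
  const char* data;
  std::size_t size;
};
}  // namespace nonstd
```

If the library instead shipped its copy under `nonstd` unchanged, the two definitions above would share a name in the same program, which is the clash described here.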

> > And btw, was -std=gnu++11 an intentional choice? what gnu extensions does 
> > the library rely on?
> 
> None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 added?
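As far as I can tell it comes from CMake's CXX_STANDARD handling, which picks
the gnu++ variant of the flag unless CXX_EXTENSIONS is turned off:
https://github.com/apache/arrow/blob/3129e3ed90219ecfffe2a25ce5820eec8cc947d0/cpp/cmake_modules/SetupCxxFlags.cmake#L33

https://cmake.org/cmake/help/v3.1/prop_tgt/CXX_STANDARD.html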
> 
> Regards
> 
> Antoine.
> 


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Antoine Pitrou


Le 04/10/2019 à 18:05, Zhuo Peng a écrit :
> Dear Arrow maintainers,
> 
> Sorry if this was raised before. I did search the mailing list but "C++" 
> matched too many results..
> 
> With manylinux1 (GCC4.8) being sunset, both Conda and Pypa are providing a 
> modern enough toolchain (Conda Forge - GCC7; Pypa manylinux2010 docker - 
> devtoolset-8(GCC8)). And full C++17 support has been included in GCC7 [1]. I 
> wonder what are the concerns of adopting a newer standard?
>  
> C++14 might not bring a whole lot of interesting features, but C++17 brings:
> 
> std::string_view
> std::optional
> std::variant (the newly added Result class is based on some form of variant 
> implementation I suppose?)

We already have `string_view` and `variant` backports.  We could
reasonably add an `optional` backport.

> And btw, was -std=gnu++11 an intentional choice? what gnu extensions does the 
> library rely on?

None, AFAIK.  Arrow compiles on MSVC fine.  Where is -std=gnu++11 added?

Regards

Antoine.


Re: Should Arrow adopt C++14 / 17?

2019-10-04 Thread Neal Richardson
We do have to care about more than just conda and Python. For R, for
example, C++14 support is limited:
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Using-C_002b_002b14-code

That's not to say that we can't try (if C++ devs wanted to, and I
can't speak for them), but it might be time-consuming (or worse) to
figure out what features are supported on all of the platforms and
languages that Arrow C++ intends to work for. That said, hopefully our
CI coverage is sufficient to let someone try it out and see what
breaks.

Neal

On Fri, Oct 4, 2019 at 9:05 AM Zhuo Peng  wrote:
>
> Dear Arrow maintainers,
>
> Sorry if this was raised before. I did search the mailing list but "C++" 
> matched too many results..
>
> With manylinux1 (GCC4.8) being sunset, both Conda and Pypa are providing a 
> modern enough toolchain (Conda Forge - GCC7; Pypa manylinux2010 docker - 
> devtoolset-8(GCC8)). And full C++17 support has been included in GCC7 [1]. I 
> wonder what are the concerns of adopting a newer standard?
>
> C++14 might not bring a whole lot of interesting features, but C++17 brings:
>
> std::string_view
> std::optional
> std::variant (the newly added Result class is based on some form of variant 
> implementation I suppose?)
>
> and many syntax sugar.. (like emplace_back() returning back(), so you can do 
> RETURN_NOT_OK(CreateArray(my_array_sp_vector.emplace_back(
>
> And btw, was -std=gnu++11 an intentional choice? what gnu extensions does the 
> library rely on?
>
> [1] https://gcc.gnu.org/projects/cxx-status.html
>


[jira] [Created] (ARROW-6792) [R] Explore roxygen2 R6 class documentation

2019-10-04 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6792:
--

 Summary: [R] Explore roxygen2 R6 class documentation
 Key: ARROW-6792
 URL: https://issues.apache.org/jira/browse/ARROW-6792
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


roxygen2 version 7.0 adds support for documenting R6 classes, rather than the 
ad hoc approach we've had to take without it: 
[https://github.com/r-lib/roxygen2/blob/master/vignettes/rd.Rmd#L203]

Try it out and see how we like it, and consider refactoring the docs to use it 
everywhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Should Arrow adopt C++14 / 17?

2019-10-04 Thread Zhuo Peng
Dear Arrow maintainers,

Sorry if this was raised before. I did search the mailing list, but "C++"
matched too many results.

With manylinux1 (GCC 4.8) being sunset, both Conda and PyPA are providing a
modern enough toolchain (conda-forge: GCC 7; PyPA manylinux2010 docker image:
devtoolset-8 (GCC 8)), and full C++17 support has been included in GCC 7 [1].
I wonder what the concerns are about adopting a newer standard?

C++14 might not bring a whole lot of interesting features, but C++17 brings:

std::string_view
std::optional
std::variant (the newly added Result class is based on some form of variant
implementation, I suppose?)

and a lot of syntactic sugar (like emplace_back() returning back(), so you can do
RETURN_NOT_OK(CreateArray(my_array_sp_vector.emplace_back(

And by the way, was -std=gnu++11 an intentional choice? Which GNU extensions
does the library rely on?

[1] https://gcc.gnu.org/projects/cxx-status.html
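
For readers unfamiliar with the emplace_back() change mentioned above, the following is a minimal sketch of the difference. `CreateValue`, `AppendCxx11` and `AppendCxx17` are hypothetical names, and a plain `bool` stands in for an Arrow-style status; none of this is Arrow's actual API.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical factory that fills an out-parameter, in the style of an
// Arrow-like API. Returns true on success (a stand-in for a status object).
bool CreateValue(std::shared_ptr<std::string>* out) {
  *out = std::make_shared<std::string>("hello");
  return true;
}

// C++11/14: emplace_back() returns void, so the new slot has to be fetched
// with back() in a second statement.
bool AppendCxx11(std::vector<std::shared_ptr<std::string>>* values) {
  values->emplace_back();
  return CreateValue(&values->back());
}

#if __cplusplus >= 201703L
// C++17: emplace_back() returns a reference to the new element, so the
// append and the fill collapse into one expression.
bool AppendCxx17(std::vector<std::shared_ptr<std::string>>* values) {
  return CreateValue(&values->emplace_back());
}
#endif
```

The C++17 variant is guarded by `__cplusplus` so the sketch still compiles under an older standard.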



Re: [VOTE] Release Apache Arrow 0.15.0 - RC2

2019-10-04 Thread Krisztián Szűcs
We have 5 binding +1 votes and 2 non-binding +1 votes so far.
The 72 hours have passed, so we can close the release vote.

Sadly I won't be available for the rest of the day, so I will only be able
to close the vote and start working on the post-release tasks
tomorrow.
@Wes if you have bandwidth feel free to close the vote sooner.


On Thu, Oct 3, 2019 at 1:14 AM Bryan Cutler  wrote:

> Accidentally sent too soon. The ORC build error I got was probably just an
> env issue for me, but here it is in case anyone else had the same issue:
>
> In file included from
> /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc:44:0:
> /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc:970:13:
> error: ‘dynamic_init_dummy_orc_5fproto_2eproto’ defined but not used [-Werror=unused-variable]
>  static bool dynamic_init_dummy_orc_5fproto_2eproto = []() {
> AddDescriptors_orc_5fproto_2eproto(); return true; }();
>  ^
> cc1plus: all warnings being treated as errors
> make[5]: *** [c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o] Error 1
> make[5]: *** Waiting for unfinished jobs
> make[4]: *** [c++/src/CMakeFiles/orc.dir/all] Error 2
> make[3]: *** [all] Error 2
>
> [ 29%] Performing build step for 'orc_ep'
> CMake Error at
>
> /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep-stamp/orc_ep-build-RELEASE.cmake:16
> (message):
>   Command failed: 2
>
>'make'
>
>   See also
>
>
>
> /tmp/arrow-0.15.0.KxYbA/apache-arrow-0.15.0/cpp/build/orc_ep-prefix/src/orc_ep-stamp/orc_ep-build-*.log
>
> CMakeFiles/orc_ep.dir/build.make:111: recipe for target
> 'orc_ep-prefix/src/orc_ep-stamp/orc_ep-build' failed
> make[2]: *** [orc_ep-prefix/src/orc_ep-stamp/orc_ep-build] Error 1
> CMakeFiles/Makefile2:1248: recipe for target 'CMakeFiles/orc_ep.dir/all'
> failed
> make[1]: *** [CMakeFiles/orc_ep.dir/all] Error 2
>
> On Wed, Oct 2, 2019 at 4:12 PM Bryan Cutler  wrote:
>
> > +1 (non-binding)
> >
> > I ran the following on Ubuntu 16.04 4.15.0-64-generic:
> > > dev/release/verify-release-candidate.sh binaries 0.15.0 2
> > > ARROW_CUDA=OFF \
> > TEST_DEFAULT=0 \
> > TEST_SOURCE=1 \
> > TEST_CPP=1 \
> > TEST_PYTHON=1 \
> > TEST_JAVA=1 \
> > TEST_INTEGRATION=1 \
> > dev/release/verify-release-candidate.sh source 0.15.0 2
> >
> > For source verification I set INTEGRATION_TEST_ARGS="--enable-js=0
> > --enable-go=0"
> >
> > When attempting source verification with defaults, I got the below error
> > when building the ORC adapter. It looks like just a warning that is being
> > treated as error and seems to be only in
> >
> > On Wed, Oct 2, 2019 at 7:53 AM Andy Grove  wrote:
> >
> >> +1 (binding)
> >>
> >> On Mon, Sep 30, 2019 at 11:57 PM Krisztián Szűcs <
> >> szucs.kriszt...@gmail.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I would like to propose the following release candidate (RC2) of Apache
> >> > Arrow version 0.15.0. This is a release consisting of 697
> >> > resolved JIRA issues[1].
> >> >
> >> > This release candidate is based on commit:
> >> > 40d468e162e88e1761b1e80b3ead060f0be927ee [2]
> >> >
> >> > The source release rc2 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7].
> >> > The changelog is located at [8].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> >> > and vote on the release. See [9] for how to validate a release
> >> > candidate.
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow 0.15.0
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow 0.15.0 because...
> >> >
> >> > [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.15.0
> >> > [2]: https://github.com/apache/arrow/tree/40d468e162e88e1761b1e80b3ead060f0be927ee
> >> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.15.0-rc2
> >> > [4]: https://bintray.com/apache/arrow/centos-rc/0.15.0-rc2
> >> > [5]: https://bintray.com/apache/arrow/debian-rc/0.15.0-rc2
> >> > [6]: https://bintray.com/apache/arrow/python-rc/0.15.0-rc2
> >> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.15.0-rc2
> >> > [8]: https://github.com/apache/arrow/blob/40d468e162e88e1761b1e80b3ead060f0be927ee/CHANGELOG.md
> >> > [9]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >> >
> >>
> >
>


Re: [DISCUSS][Java] Design of the algorithm module

2019-10-04 Thread Praveen Kumar
Hi Micah,

I agree with point 1. I think what end users would really want is a
query/data processing engine. I am not sure how easy to use or how relevant
the algorithms will be in the absence of an engine. For example, most of these
operators would need to be pipelined and to handle memory, distribution, etc.
So bundling them with an engine makes a lot more sense; the interfaces
required for that might be a bit different too.

Thx.



On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield 
wrote:

> Hi Liya Fan,
> Thanks again for writing this up.  I think it provides a road-map for
> intended features.  I commented on the document but I wanted to raise a few
> high-level concerns here as well to get more feedback from the community.
>
> 1.  It isn't clear to me who the users of this will be.  My perception
> is that in the Java ecosystem there aren't use-cases for the algorithms
> outside of specific compute engines.  I'm not super involved in open-source
> Java these days so I would love to hear others' opinions. For instance, I'm
> not sure if Dremio would switch to using these algorithms instead of the
> ones they've already open-sourced [1], and Apache Spark, I believe, is only
> using Arrow for interfacing with Python (they similarly have their own
> compute pipeline).  I think you mentioned in the past that these are being
> used internally on an engine that your company is working on, but if that
> is the only consumer it makes me wonder if the algorithm development might
> be better served as part of that engine.
>
> 2.  If we do move forward with this, we also need a plan for how to
> optimize the algorithms to avoid virtual calls.  There are two high-level
> approaches: template-based and (byte)code-generation-based.  Neither is
> applicable in all situations, but it would be good to come to consensus on
> when (and when not) to use each.
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>
> On Tue, Sep 24, 2019 at 6:48 AM Fan Liya  wrote:
>
> > Hi Micah,
> >
> > Thanks for your effort and precious time.
> > Looking forward to receiving more valuable feedback from you.
> >
> > Best,
> > Liya Fan
> >
> > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield 
> > wrote:
> >
> >> Hi Liya Fan,
> >> I started reviewing but haven't gotten all the way through it. I will try
> >> to leave more comments over the next few days.
> >>
> >> Thanks again for the write-up; I think it will help frame a productive
> >> conversation.
> >>
> >> -Micah
> >>
> >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya  wrote:
> >>
> >>> Hi Micah,
> >>>
> >>> Thanks for your kind reminder. Comments are enabled now.
> >>>
> >>> Best,
> >>> Liya Fan
> >>>
> >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <emkornfi...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Liya Fan,
> >>>> Thank you for this writeup, it doesn't look like comments are enabled on
> >>>> the document.  Could you allow for them?
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya wrote:
> >>>>
> >>>> > Dear all,
> >>>> >
> >>>> > We have prepared a document for discussing the requirements, design and
> >>>> > implementation issues for the algorithm module of Java:
> >>>> >
> >>>> > https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
> >>>> >
> >>>> > So far, we have finished the initial draft for sort, search and dictionary
> >>>> > encoding algorithms. Discussions for more algorithms may be added in the
> >>>> > future. This document will keep evolving to reflect the latest discussion
> >>>> > results in the community and the latest code changes.
> >>>> >
> >>>> > Please give your valuable feedback.
> >>>> >
> >>>> > Best,
> >>>> > Liya Fan
> >>>> >
> >>>>
> >>>
>


[jira] [Created] (ARROW-6791) Memory Leak

2019-10-04 Thread George Prichard (Jira)
George Prichard created ARROW-6791:
--

 Summary: Memory Leak 
 Key: ARROW-6791
 URL: https://issues.apache.org/jira/browse/ARROW-6791
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1, 0.14.0
 Environment: Ubuntu 18.04, 32GB ram, conda-forge installation
Reporter: George Prichard


Memory leak with large string columns crashes the program. This only seems to 
affect 0.14.x  - it works fine for me in 0.13.0. It might be related to earlier 
similar issues? e.g. [https://github.com/apache/arrow/issues/2624]

Below is a reprex which works in earlier versions, but crashes on read (writing 
is fine) in this one. The real-life version of the data is full of URLs as the 
strings. 

Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the read) 
on my 16GB Macbook. 

Thanks so much for the excellent tools! 

 

 
{code:python}
import pandas as pd

# Build a DataFrame of random fixed-length strings (about 1 GB of text).
n_rows = int(1e6)
n_cols = 10
col_length = 100
df = pd.DataFrame()
for i in range(n_cols):
    df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
print('Generated df', df.shape)

# Round-trip through Parquet; writing succeeds, the read is what crashes.
filename = 'tmp.parquet'
print('Writing parquet')
df.to_parquet(filename)
print('Reading parquet')
pd.read_parquet(filename)
{code}
 

 

 

 

 


