[jira] [Created] (ARROW-2786) [JS] Read Parquet files in JavaScript

2018-07-02 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2786:
---

 Summary: [JS] Read Parquet files in JavaScript
 Key: ARROW-2786
 URL: https://issues.apache.org/jira/browse/ARROW-2786
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: Wes McKinney


See question in https://github.com/apache/arrow/issues/2209



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2785) [C++] Crash in json-integration-test

2018-07-02 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2785:
-

 Summary: [C++] Crash in json-integration-test
 Key: ARROW-2785
 URL: https://issues.apache.org/jira/browse/ARROW-2785
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


This is probably something I keep getting wrong when creating a new 
environment, but after creating a Python 3.7 conda environment and installing 
the tool chain, I get the following crash (apparently boost-related):

{code}
$ ./build-test/debug/json-integration-test 
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from TestJSONIntegration
[ RUN      ] TestJSONIntegration.ConvertAndValidate
*** Error in `./build-test/debug/json-integration-test': munmap_chunk(): invalid pointer: 0x7ffc22542578 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f4762f257e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7f4762f32698]
/home/antoine/miniconda3/envs/pyarrow37/lib/libstdc++.so.6(_ZNSsD1Ev+0x15)[0x7f476384cca5]
./build-test/debug/json-integration-test(_ZN5boost10filesystem4pathD1Ev+0x18)[0x694f4a]
./build-test/debug/json-integration-test[0x69205a]
./build-test/debug/json-integration-test(_ZN5arrow3ipc19TestJSONIntegration7mkstempEv+0x2c)[0x69599e]
./build-test/debug/json-integration-test(_ZN5arrow3ipc43TestJSONIntegration_ConvertAndValidate_Test8TestBodyEv+0x3b)[0x69210f]
./build-test/debug/json-integration-test(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x8759da]
./build-test/debug/json-integration-test(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x5a)[0x86f65d]
./build-test/debug/json-integration-test(_ZN7testing4Test3RunEv+0xd5)[0x853697]
./build-test/debug/json-integration-test(_ZN7testing8TestInfo3RunEv+0x105)[0x853fef]
./build-test/debug/json-integration-test(_ZN7testing8TestCase3RunEv+0xf4)[0x8546f8]
./build-test/debug/json-integration-test(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x2ac)[0x85b666]
./build-test/debug/json-integration-test(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x876eb7]
./build-test/debug/json-integration-test(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x5a)[0x870327]
./build-test/debug/json-integration-test(_ZN7testing8UnitTest3RunEv+0xc6)[0x85a128]
./build-test/debug/json-integration-test(_Z13RUN_ALL_TESTSv+0x11)[0x6945e6]
./build-test/debug/json-integration-test(main+0xfb)[0x693a2b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4762ece830]
./build-test/debug/json-integration-test(_start+0x29)[0x68b4a9]
{code}





[jira] [Created] (ARROW-2784) [C++] MemoryMappedFile::WriteAt allow writing past the end

2018-07-02 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-2784:
-

 Summary: [C++] MemoryMappedFile::WriteAt allow writing past the end
 Key: ARROW-2784
 URL: https://issues.apache.org/jira/browse/ARROW-2784
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.9.0
Reporter: Dimitri Vorona


WriteAt is missing a check that the write stays within the file; this PR adds it.
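
As an illustration only, the kind of check described above can be sketched with Python's standard `mmap` module rather than Arrow's C++ `MemoryMappedFile`. The names `write_at` and `demo` are hypothetical and are not part of Arrow or of the PR; the sketch assumes the intended behaviour is to reject any write whose end offset exceeds the size of the mapping.

```python
import mmap
import tempfile


def write_at(mm: mmap.mmap, position: int, data: bytes) -> None:
    """Write data at position, refusing to write past the end of the map."""
    if position < 0:
        raise ValueError("position must be non-negative")
    # The check in question: the whole write must fit inside the mapping.
    if position + len(data) > len(mm):
        raise ValueError(
            f"cannot write {len(data)} bytes at offset {position}: "
            f"mapping is only {len(mm)} bytes long"
        )
    mm[position:position + len(data)] = data


def demo() -> bytes:
    # Back the mapping with a 16-byte temporary file.
    with tempfile.TemporaryFile() as f:
        f.write(b"\x00" * 16)
        f.flush()
        with mmap.mmap(f.fileno(), 16) as mm:
            write_at(mm, 12, b"abcd")      # ends exactly at the last byte
            try:
                write_at(mm, 14, b"abcd")  # would run 2 bytes past the end
            except ValueError:
                pass                       # rejected, mapping left untouched
            return bytes(mm[:])
```

Calling `demo()` returns the contents of the mapping after one accepted and one rejected write; the rejected write leaves the mapped bytes unchanged.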





[jira] [Created] (ARROW-2783) Importing conda-forge pyarrow fails

2018-07-02 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2783:


 Summary: Importing conda-forge pyarrow fails
 Key: ARROW-2783
 URL: https://issues.apache.org/jira/browse/ARROW-2783
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Affects Versions: 0.9.0
Reporter: Phillip Cloud


Possibly related to: 
https://issues.apache.org/jira/projects/ARROW/issues/ARROW-2770

Steps to reproduce:

{code}
$ conda create -n test python=3 pyarrow -c conda-forge -y
$ conda activate test
$ python -c 'import pyarrow'
{code}

This gives:

{code}
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/phillip/miniconda3/envs/py36/lib/python3.6/site-packages/pyarrow/__init__.py", line 47, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libboost_system.so.1.65.1: cannot open shared object file: No such file or directory
{code}

Downgrading boost to {{1.65.1}} gives a symbol lookup error:

{code}
$ conda install boost-cpp=1.65.1 -y -c conda-forge
$ python -c 'import pyarrow'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/phillip/miniconda3/envs/py36/lib/python3.6/site-packages/pyarrow/__init__.py", line 47, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: /home/phillip/miniconda3/envs/py36/lib/python3.6/site-packages/pyarrow/../../../libarrow.so.0: undefined symbol: _ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISB_12maybe_assignERKSF_
{code}

Installing {{pyarrow}} from {{defaults}} and importing it works fine.

cc [~kszucs] [~xhochy]





Re: Recruiting more maintainers for Apache Arrow

2018-07-02 Thread Antoine Pitrou


Hi,

Le 02/07/2018 à 15:58, Wes McKinney a écrit :
> * http://ivory.idyll.org/blog/2018-how-open-is-too-open.html
> * http://ivory.idyll.org/blog/2018-oss-framework-cpr.html

Very good articles, but I would stress that some of the mechanisms
proposed lack metrics in their favour.  Two particular examples that I
know about:

1)

""" I seem to recall Martin van Loewis offering to review one externally
contributed patch for every ten other patches reviewed by the submitter.
(I can’t find the link, sorry!) This imposes work requirements on
would-be contributors that obligate them to contribute substantively to
the project maintenance, before their pet feature gets implemented. """

Martin's offer was almost never taken up, although he repeated it many
times over many years.  I think there are two factors to it:

a) Cost.  As an occasional contributor, I could understand having to do
a review before contributing a patch of mine, but not having to do 5 or
more reviews for each patch I contribute.  The effort asked is much too
high, and you're probably discouraging people who are discovering the
project, even before they could get hooked on it.

b) Difficulty.  It's much more difficult and intimidating to review
someone else's PR than to propose your own changes knowing that they will
be reviewed by (you are assuming) competent people.  So this mechanism
excludes first-time contributors, which is probably *not* what you want.

2)

""" Some projects have excellent incubators, like the Python Core
Mentorship Program, where people who are interested in applying their
effort to recruiting new contributors can do so. """

Actually, it doesn't seem to me that a significant proportion of
frequent Python contributors have gone through the core mentorship
process.  It probably got us a handful of one-time contributions.
Pointing to the Python core mentorship program as an "excellent
incubator" sounds rather far-fetched to me.

Generally speaking, there's a limit to the usefulness of hand-holding
contributors, especially if your project is rather complex (as Python
is), because the blocking point for contributors is *not* that the
development mailing-list is a bit intimidating (as was claimed by the
people who founded the Python core mentorship program).


PS: as a matter of fact, the general rate of contributions to Python
has been *decreasing* for years.

Regards

Antoine.


Re: Recruiting more maintainers for Apache Arrow

2018-07-02 Thread Wes McKinney
Hi folks,

I would like to highlight that the challenges we are having are
endemic to many parts of the open source world right now. A colleague
of mine in the Python world wrote some pieces about this recently:

* http://ivory.idyll.org/blog/2018-how-open-is-too-open.html
* http://ivory.idyll.org/blog/2018-oss-framework-cpr.html

Here are some quotes from those pieces:

"This need for constant attention to projects, the sprawling ecosystem
of amazing scientific software packages, and the relatively small
community of actual maintainers, when combined, lead to the open
source sustainability problem in science: we do not have the person
power to keep it all running without heroic efforts. And when you
couple this with the lack of clear career paths for software
maintenance in science, it is clear that we cannot ethically and
sustainably recruit more people into open source maintainership."

I would say that "heroics" does describe some of the occasional
behavior of Arrow maintainers. The trouble with "heroics" (which
translates practically speaking to "overwork") is that if sustained
for a long period of time, it surely leads to burnout and depression.
I can speak from personal experience.

On a later point in this quote about "lack of clear career paths for
software maintenance", rather than griping about the problem, I
decided to do something about it. I have recently created a new
organization so that I can

a) enable organizations to directly fund Arrow maintenance and
b) provide secure full-time employment to Arrow maintainers

"Second, the cost of the constant maintenance needs (code,
documentation, installation, etc.) on the pool of available effort
needs to be taken into account. Contributions of new features that do
not come with effort applied to maintenance should be carefully
considered - is this new contributor likely to stick around? Can they
and will they devote some effort to maintenance? If not, maybe those
contributions should be deferred in favor of contributions that add
maintenance effort to the project, e.g. via partnerships."

I see both sides of this argument. I think we need to be more
proactive about requesting maintenance help from "extractive"
contributors who are mostly "taking" from the project and giving
relatively little to support the overall health of the project.

"Fourth, there are some interesting governance implications around
allowing all or most of the resource appropriators to participate in
decision making. I need to dig more into this, but, briefly, I think
projects should formally lay out what level of investment and
contribution is rewarded with what kind of operational, policy making,
and constitutional decision making authority."

Apache governance already provides a framework for obtaining decision
making authority in a project. Suffice it to say, I would be hesitant to
support a new PMC member who has not engaged in project maintenance.

- Wes

On Mon, Jul 2, 2018 at 7:03 AM, Antoine Pitrou wrote:
>
> Hi Dimitri,
>
> Le 02/07/2018 à 12:46, Dimitri Vorona a écrit :
>> Hi Wes,
>>
>> to contribute an outsider's POV: while it is clear what's expected if you'd
>> like to make a PR, it's not at all clear to me where I would start if I
>> wanted to help with PR reviews without being heavily involved with the
>> community/being a full maintainer. Should I just grab a PR, test it,
>> comment on changes? I wouldn't be sure if I were stepping on someone's
>> toes, tbh.
>
> You don't have to manually test a PR, unless you want to be sure about
> semantics that are not part of the tests added in the PR (but then it
> would be a good idea to mention that the tests don't exercise the
> semantics enough :-)).
>
> From my point of view (generally as an open source developer and
> maintainer, this isn't specific to Arrow), reviewing is:
>
> * checking for soundness of concepts (if the PR adds any of them)
> * checking for maintainability and readability of code
> * checking for smelly coding patterns, possible sources of bugs etc.
> * depending on the context, checking for possible performance issues
> * any potential problem that your personal expertise may help you detect
>
> If you're not sure about a comment and hesitate posting it, a good
> solution is to phrase it as a question.
>
> Regards
>
> Antoine.


[jira] [Created] (ARROW-2782) [Python] Ongoing Travis CI failures in Plasma unit tests

2018-07-02 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2782:
---

 Summary: [Python] Ongoing Travis CI failures in Plasma unit tests
 Key: ARROW-2782
 URL: https://issues.apache.org/jira/browse/ARROW-2782
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.10.0


e.g.

{code}
_____________________________ test_use_huge_pages ______________________________

    @pytest.mark.skipif(not os.path.exists("/mnt/hugepages"),
                        reason="requires hugepage support")
    def test_use_huge_pages():
        import pyarrow.plasma as plasma
        with plasma.start_plasma_store(
                plasma_store_memory=2*10**9,
                plasma_directory="/mnt/hugepages",
                use_hugepages=True) as (plasma_store_name, p):
            plasma_client = plasma.connect(plasma_store_name, "", 64)
>           create_object(plasma_client, 10**8)

pyarrow/tests/test_plasma.py:773: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/tests/test_plasma.py:79: in create_object
seal=seal)
pyarrow/tests/test_plasma.py:68: in create_object_with_id
memory_buffer = client.create(object_id, data_size, metadata)
pyarrow/_plasma.pyx:300: in pyarrow._plasma.PlasmaClient.create
check_status(self.client.get().Create(object_id.data, data_size,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   raise PlasmaStoreFull(message)
E   PlasmaStoreFull: /home/travis/build/apache/arrow/cpp/src/plasma/client.cc:375 code: ReadCreateReply(buffer.data(), buffer.size(), &id, &object, &store_fd, &mmap_size)
E   object does not fit in the plasma store

pyarrow/error.pxi:99: PlasmaStoreFull
{code}





Re: Recruiting more maintainers for Apache Arrow

2018-07-02 Thread Antoine Pitrou


Hi Dimitri,

Le 02/07/2018 à 12:46, Dimitri Vorona a écrit :
> Hi Wes,
> 
> to contribute an outsider's POV: while it is clear what's expected if you'd
> like to make a PR, it's not at all clear to me where I would start if I
> wanted to help with PR reviews without being heavily involved with the
> community/being a full maintainer. Should I just grab a PR, test it,
> comment on changes? I wouldn't be sure if I were stepping on someone's
> toes, tbh.

You don't have to manually test a PR, unless you want to be sure about
semantics that are not part of the tests added in the PR (but then it
would be a good idea to mention that the tests don't exercise the
semantics enough :-)).

From my point of view (generally as an open source developer and
maintainer, this isn't specific to Arrow), reviewing is:

* checking for soundness of concepts (if the PR adds any of them)
* checking for maintainability and readability of code
* checking for smelly coding patterns, possible sources of bugs etc.
* depending on the context, checking for possible performance issues
* any potential problem that your personal expertise may help you detect

If you're not sure about a comment and hesitate posting it, a good
solution is to phrase it as a question.

Regards

Antoine.


Re: Recruiting more maintainers for Apache Arrow

2018-07-02 Thread Dimitri Vorona
Hi Wes,

to contribute an outsider's POV: while it is clear what's expected if you'd
like to make a PR, it's not at all clear to me where I would start if I
wanted to help with PR reviews without being heavily involved with the
community/being a full maintainer. Should I just grab a PR, test it,
comment on changes? I wouldn't be sure if I were stepping on someone's
toes, tbh. So, in my view it would help if:

* there were some kind of informal reviewer assignment system, i.e. I say
"I'd like to review this PR", Wes/Uwe/Antoine reply: "sure, give it a
shot". This would be mentioned prominently in the contributor guide

* afterwards there were some kind of feedback-to-feedback arrangement,
although it would increase the work load for the existing maintainers in
the short term, of course

Cheers,
Dimitri.

On Sun, Jul 1, 2018 at 1:09 AM Donald E. Foss wrote:

> For what it's worth, this email thread and your summary writeup, Wes, are
> a significant call to action on their own.
>
> I've been passive, not by choice, but by policy. Given the significance
> and need of this project, I'll see what I can do on my side. It will be at
> least a week given the US holiday.
>
> Donald E. Foss
>
> > On Jun 30, 2018, at 2:15 PM, Marco Neumann wrote:
> >
> > Hey,
> >
> > first of all, thanks a lot for your, Uwe's, the mergers' and contributors'
> > work. Now, to the maintainer problem:
> >
> > # Arrow as "a library"
> > One thing that makes Arrow special is that it is not a single, but many
> > libraries (one for each language) and many of them are not only a
> > binding to a C/C++ lib, but partly a complete re-implementation of the
> > protocol, e.g.:
> >
> > - C++: one core, but also contains Python specialties
> > - Java: another core
> > - Rust: yet another core
> > - Python: a binding to C++ but also a lot more stuff because of Pandas
> > ...
> >
> > And you two are maintaining all of them and I doubt that you have the
> > capacities and knowledge to do this at the desired level of quality
> > (which is natural, not a personal issue or offense). So this I would
> > call "pseudo-maintenance", since you're solely the gatekeeper that does
> > some shallow reviewing and has the burden to do the housekeeping and
> > the merging. So why accept these language bindings in the first
> > place without a core maintainer in place? For example, let's
> > say someone proposes a binding to Haskell now. That should not be
> > accepted as part of the official Apache implementation without a
> > dedicated maintainer (ideally the PR author would be that person, but
> > there may be others who step up).
> >
> > Right now, it might be too late to remove some of the incomplete / WIP
> > implementations that don't have a core maintainer though.
> >
> > # GitHub
> > Another special thing to consider is that Arrow is (ab)using GitHub as
> > a code hosting platform. Even as a contributor, this has obvious bad
> > uncool consequences:
> >
> > - you have yet another issue hosting system to log in
> > - there is yet another information channel to keep track of (this ML
> >  for example, which has a semi-informative web interface telling you
> >  that you can only log in using Google but not how to subscribe to
> >  the list)
> > - links to issues don't work in the known magic way
> > - you're merging the PRs by closing them, which is by all means a not
> >  very nice way because it does not reflect the contributors' work in
> >  the project overview and personal profiles, but exactly this is a
> >  large part of the GitHub community (btw: merging PRs without using
> >  GitHub's merge button IS possible, as bors/bors-ng prove)
> >
> > So as a potential maintainer, this is already a bummer, since I know
> > that there are things less comfortable than the system I would get from
> > any normal GitHub or GitLab project.
> >
> > I'm not really sure how to solve this or if it should be solved (read
> > about the laziness aspect in "Contribution VS Maintenance" below)
> >
> > # Time / Payment
> > Yes, this is indeed a big issue. From what I can tell from the open
> > source projects I was involved in, for large contributor crowds
> > you normally have full/half-time positions in place for the core
> > maintainers (look at the Mozilla projects, the Blender Foundation, Gnome
> > / Red Hat). So at some point I think maintaining isn't a part-time /
> > hobby thing anymore (without downgrading the hard work of hobby
> > contributors, on the contrary). I don't have a link at hand, but I recall
> > some discussion about GitHub and its importance for hiring (since
> > it acts as a CV) after MS bought it, and some of the responses were
> > "doing all this work in your free time is a privilege of wealthy,
> > mostly-white men", which, without endorsing this statement in this really
> > bare form, already shows a problem of the open source world.
> >
> > # Contribution VS Maintenance
> > The very "nice" thing about patch/PR contribution is that you do your

[jira] [Created] (ARROW-2781) [Python] Download boost using curl in manylinux1 image

2018-07-02 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2781:
--

 Summary: [Python] Download boost using curl in manylinux1 image
 Key: ARROW-2781
 URL: https://issues.apache.org/jira/browse/ARROW-2781
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging, Python
Affects Versions: 0.9.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.10.0


This is the only artifact for which we use {{wget}}, which lacks the 
level of TLS support needed to talk to the Bintray servers.





[jira] [Created] (ARROW-2780) [Go] Run code coverage analysis

2018-07-02 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-2780:
--

 Summary: [Go] Run code coverage analysis
 Key: ARROW-2780
 URL: https://issues.apache.org/jira/browse/ARROW-2780
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Sebastien Binet





