[jira] [Created] (ARROW-6020) [Java] Refactor ByteFunctionHelper#hash with new added ArrowBufHasher

2019-07-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6020:
-

 Summary: [Java] Refactor ByteFunctionHelper#hash with new added 
ArrowBufHasher
 Key: ARROW-6020
 URL: https://issues.apache.org/jira/browse/ARROW-6020
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Some logic in these two classes is similar; we should replace the 
ByteFunctionHelper#hash logic with ArrowBufHasher, since the latter uses a 
murmur hash algorithm that reduces hash collisions.
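
A rough Java sketch of the intended refactor — assuming an ArrowBufHasher interface with a hashCode(buffer, offset, length) method and a murmur-based implementation; exact package names and signatures may differ from the Arrow Java code:

```
// Sketch only: delegate the hand-rolled hash loop to a murmur-based hasher.
// Class/package names and the hashCode signature are assumptions.
import io.netty.buffer.ArrowBuf;
import org.apache.arrow.memory.util.hash.ArrowBufHasher;
import org.apache.arrow.memory.util.hash.MurmurHasher;

public final class ArrowBufHashExample {
  // Assumed: a stateless murmur-based hasher implementing ArrowBufHasher.
  private static final ArrowBufHasher HASHER = new MurmurHasher();

  /** Hash the bytes of buf in [start, end), replacing the hand-rolled loop. */
  public static int hash(ArrowBuf buf, int start, int end) {
    return HASHER.hashCode(buf, start, end - start);
  }

  private ArrowBufHashExample() {}
}
```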



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6019) [Java] Port Jdbc and Avro adapter to new directory

2019-07-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6019:
-

 Summary: [Java] Port Jdbc and Avro adapter to new directory 
 Key: ARROW-6019
 URL: https://issues.apache.org/jira/browse/ARROW-6019
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As discussed on the mailing list, adapters are different from native readers.

This issue tracks the following tasks:

i. Create a new "contrib" directory and move the JDBC/Avro adapters into it.

ii. Provide more description.

iii. Change the ORC reader structure to "converter".

cc [~emkornfi...@gmail.com]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Micah Kornfield
>
> Could we detect the 4-byte length, incur a penalty copying the memory to
> an aligned buffer, then continue consuming the stream?

I think that is the plan (or at least would be my plan) if we go ahead with
the change.



> (It's probably
> fine if we only write the 8-byte length, since consumers on older
> versions of Arrow could slice from the 4th byte before passing a buffer
> to the reader).

I'm not sure I understand this suggestion:
1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
(and provide meaningless bytes at the beginning)?
2.  The current proposal on the other thread is to have the pattern be
<0xFFFFFFFF continuation bytes><int32 metadata length>

Thanks,
Micah
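
A rough Java sketch of the detection idea discussed above — assuming the proposed 8-byte prefix is the 0xFFFFFFFF continuation marker followed by a little-endian int32 metadata length, while legacy streams begin directly with the int32 length; names here are illustrative, not Arrow's actual reader API:

```
// Sketch only: distinguish the legacy 4-byte prefix from the proposed
// 8-byte prefix when consuming an IPC stream.
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MessagePrefix {
  static final int CONTINUATION_MARKER = 0xFFFFFFFF;

  /** Returns the metadata length, handling both old and new prefixes. */
  static int readMetadataLength(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    int first = Integer.reverseBytes(data.readInt()); // prefix is little-endian
    if (first == CONTINUATION_MARKER) {
      // New 8-byte prefix: the next 4 bytes hold the metadata length.
      return Integer.reverseBytes(data.readInt());
    }
    // Legacy 4-byte prefix: the first word already is the length. A reader
    // could copy the now-unaligned payload into an aligned buffer here
    // before parsing, as suggested above.
    return first;
  }
}
```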

On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor 
wrote:

> +1 for a 0.15.0 before 1.0 if we go ahead with this.
>
> I'm curious to hear others' thoughts about compatibility. I think we
> should avoid breaking backwards compatibility if possible. It's common
> for apps/libs to be pinned on specific Arrow versions, and I worry it'd
> cause a lot of work for downstream devs to audit their tool suite for
> full Arrow binary compatibility (and/or require their customers to do
> the same).
>
> Could we detect the 4-byte length, incur a penalty copying the memory to
> an aligned buffer, then continue consuming the stream? (It's probably
> fine if we only write the 8-byte length, since consumers on older
> versions of Arrow could slice from the 4th byte before passing a buffer
> to the reader).
>
> I've always understood the metadata to be a few dozen/hundred KB, a
> small percentage of the total message size. I could be underestimating
> the ratios though -- is it common to have tables w/ 1000+ columns? I've
> seen a few reports like that in cuDF, but I'm curious to hear
> Jacques'/Dremio's experience too.
>
> If copying is feasible, it doesn't seem so bad a trade-off to maintain
> backwards-compatibility. As libraries and consumers upgrade their Arrow
> dependencies, the 4-byte length will be less and less common, and
> they'll be less likely to pay the cost.
>
>
>
> On 7/23/19 2:22 AM, Uwe L. Korn wrote:
> > It is also a good way to test the change in public. We don't want to
> adjust something like this anymore in a 1.0.0 release. Already doing this
> in 0.15.0 and then maybe doing adjustments due to issues that appear "in
> the wild" is psychologically the easier way. There is a lot of thinking of
> users bound with the magic 1.0, thus I would plan to minimize what is
> changed between 1.0 and pre-1.0. This also should save us maintainers some
> time as I would expect different behaviour in bug reports between 1.0 and
> pre-1.0 issues.
> >
> > Uwe
> >
> > On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> >> I think the main reason to do a release before 1.0.0 is if we want to
> make
> >> the change that would give a good error message for forward
> incompatibility
> >> (I think this could be done as 0.14.2 since it would just be clarifying
> an
> >> error message).  Otherwise, I think including it in 1.0.0 would be fine
> >> (it's still not clear to me if there is consensus to fix the issue).
> >>
> >> Thanks,
> >> Micah
> >>
> >>
> >> On Monday, July 22, 2019, Wes McKinney  wrote:
> >>
> >>> I'd be satisfied with fixing the Flatbuffer alignment issue either in
> >>> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> >>> 0.15.0 with this change sooner rather than later might be prudent.
> >>>
> >>> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
> >>> wrote:
> 
>  Hello,
> 
>  Recently we've discussed breaking the IPC format to fix a
> long-standing
>  alignment issue.  See this discussion:
> 
> >>>
> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>  Should we first do a 0.15.0 in order to get those format fixes right?
>  Once that is fine and settled we can move to the 1.0.0 release?
> 
>  Regards
> 
>  Antoine.
>
>
>


Arrow sync call July 24 at 12:00 US/Eastern, 16:00 UTC

2019-07-23 Thread Neal Richardson
Hi everyone,
Reminder that the biweekly Arrow call is tomorrow (well, already today for
some of you) at https://meet.google.com/vtm-teks-phx. All are welcome to
join. Notes will be sent out to the mailing list afterwards.

Neal


Re: Error building cuDF on new Arrow with std::variant backport

2019-07-23 Thread Keith Kraus
Just following up, in case anyone was still following this: it turned out to be an 
NVCC bug that we've reported to the relevant team internally. We moved the 
`ipc.cu` file to `ipc.cpp` and it works as expected with gcc. Thanks everyone!

-Keith

On 7/22/19, 12:52 PM, "Keith Kraus"  wrote:

We're working on that now, will report back once we have something more 
concrete to act on. Thanks!

-Keith

On 7/22/19, 12:51 PM, "Antoine Pitrou"  wrote:


Hi Keith,

Can you try to further reduce your reproducer until you find
the offending construct?

Regards

Antoine.


Le 22/07/2019 à 18:46, Keith Kraus a écrit :
> I temporarily removed the csr related code that has the namespace 
clash and confirmed that the same compilation warnings and errors still occur.
> 
> On 7/20/19, 1:03 AM, "Micah Kornfield"  wrote:
> 
> The namespace collision is a definite possibility, especially if 
you are
> using g++ which seems to be less smart about inferring types vs 
methods
> than clang is.
> 
> On Fri, Jul 19, 2019 at 9:28 PM Paul Taylor 

> wrote:
> 
> > Hi Micah,
> >
> > We were able to build Arrow standalone with both c++ 11 and 14, 
but cuDF
> > needs c++ 14.
> >
> > I found this line[1] in one of our cuda files after sending and 
realized
> > we may have a collision/polluted namespace. Does that sound 
like a
> > possibility?
> >
> > Thanks,
> > Paul
> >
> > 1.
> > 
https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30
> >
> > On 7/19/19 8:41 PM, Micah Kornfield wrote:
> >
> > Hi Paul,
> > This actually looks like it might be a problem with arrow-4800. 
  Did the
> > build of arrow use c++14 or c++11?
> >
> > Thanks,
> > Micah
> >
> > On Friday, July 19, 2019, Paul Taylor 
 wrote:
> >
> >> We're updating cuDF to Arrow 0.14 but encountering errors 
building that
> >> look related to PR #4259 
. We
> >> can build Arrow itself, but we can't build cuDF when we 
include Arrow
> >> headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.
> >>
> >> Has anyone seen these before or know of a fix?
> >>
> >> Thanks,
> >>
> >> Paul
> >>
> >> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>>
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In 
member function
> >>> 'void arrow::Result::AssignVariant(mpark::variant >>> const char*>&&)':
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: 
error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant();
> >>> ^
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: 
error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant();
> >>> ^
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: 
error:
> >>> expected primary-expression before 'const'
> >>>  variant_.~variant();
> >>>   ^
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: 
error:
> >>> expected ')' before 'const'
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In 
member function
> >>> 'void arrow::Result::AssignVariant(const mpark::variant >>> arrow::Status, const char*>&)':
   

Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-23 Thread Wes McKinney
Useful read on this topic today from the Julia language

https://julialang.org/blog/2019/07/multithreading

On Tue, Jul 23, 2019, 12:22 AM Jacques Nadeau  wrote:

> There are two main things that have been important to us in Dremio around
> threading:
>
> Separate threading model from algorithms. We chose to do parallelization at
> the engine level instead of the operation level. This allows us to
> substantially increase parallelization while still maintaining a strong
> thread prioritization model. This contrasts to some systems like Apache
> Impala which chose to implement threading at the operation level. This has
> ultimately hurt their ability for individual workloads to scale out within
> a node. See the experimental features around MT_DOP when they tried to
> retreat from this model and struggled to do so. It serves as an example of
> the challenges if you don't separate data algorithms from threading early
> on in design [1]. This intention was core to how we designed Gandiva, where
> an external driver makes decisions around threading and the actual
> algorithm only does small amounts of work before yielding to the driver.
> This allows a driver to make parallelization and scheduling decisions
> without having to know the internals of the algorithm. (In Dremio, these
> are all covered under the interfaces described in Operator [2] and its
> subclasses, which together provide a very simple set of operation states
> for the driver to understand.)
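
A minimal Java sketch of the driver/operator separation described above — not Dremio's actual Operator interface; the states and method names are illustrative:

```
// Sketch only: the operator reports a coarse state and does a small unit of
// work per call, so an external driver owns all threading/scheduling choices.
enum OperatorState { NEEDS_INPUT, CAN_PRODUCE, BLOCKED, DONE }

interface Operator {
  OperatorState state();
  void consume(Object batch);   // driver calls this when input is available
  Object produce();             // driver calls this when output is wanted
}

final class Driver {
  /** Runs one operator to completion; a real driver interleaves many. */
  static void run(Operator op, java.util.Iterator<Object> input) {
    while (op.state() != OperatorState.DONE) {
      switch (op.state()) {
        case NEEDS_INPUT:
          if (!input.hasNext()) { return; }  // no more input to feed
          op.consume(input.next());
          break;
        case CAN_PRODUCE:
          op.produce();                      // hand the result downstream
          break;
        case BLOCKED:
          Thread.yield();                    // driver may run other operators
          break;
      }
    }
  }
}
```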
>
> The second is that the majority of the data we work with these days is
> primarily in high latency cloud storage. While we may stage data locally, a
> huge amount of reads are impacted by the performance of cloud stores. To
> cover these performance behaviors we did two things: the first was to
> introduce a very simple-to-use async reading interface for data, seen at
> [3], and the second was to introduce a collaborative way for individual
> tasks to declare their blocking state to a central coordinator [4]. Happy
> to cover these in more detail if people are interested. In general, using
> these techniques has allowed us to tune many systems to a situation where the (highly)
> variable latency of cloud stores like S3 and ADLS can be mostly cloaked by
> aggressive read ahead and what we call predictive pipelining (where reading
> is guided based on latency performance characteristics along with knowledge
> of columnar formats like Parquet).
>
> [1]
>
> https://www.cloudera.com/documentation/enterprise/latest/topics/impala_mt_dop.html#mt_dop
> [2]
>
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/spi/Operator.java
> [3]
>
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/store/dfs/async/AsyncByteReader.java
> [4]
>
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/threads/sharedres/SharedResourceManager.java
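
A minimal Java sketch in the spirit of the async reading interface referenced in [3] — method names and types are illustrative assumptions, not the Dremio or Arrow API:

```
// Sketch only: an async byte-reader plus a read-ahead helper, so high-latency
// (e.g. cloud store) reads can be issued early and overlapped.
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

interface AsyncByteReader extends AutoCloseable {
  /** Start a read of len bytes at offset; completes when the bytes arrive. */
  CompletableFuture<ByteBuffer> readAt(long offset, int len);
}

final class ReadAhead {
  /** Issue several reads up front so I/O latency overlaps with other work. */
  static CompletableFuture<Void> prefetch(AsyncByteReader reader,
                                          long[] offsets, int len) {
    CompletableFuture<?>[] pending = new CompletableFuture<?>[offsets.length];
    for (int i = 0; i < offsets.length; i++) {
      pending[i] = reader.readAt(offsets[i], len);
    }
    return CompletableFuture.allOf(pending);
  }
}
```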
>
> On Mon, Jul 22, 2019 at 9:56 AM Antoine Pitrou  wrote:
>
> >
> > Le 22/07/2019 à 18:52, Wes McKinney a écrit :
> > >
> > > Probably the way is to introduce async-capable read APIs into the file
> > > interfaces. For example:
> > >
> > > file->ReadAsyncBlock(thread_ctx, ...);
> > >
> > > That way the file implementation can decide whether asynchronous logic
> > > is actually needed.
> > > I doubt very much that a one-size-fits-all
> > > concurrency solution can be developed -- in some applications
> > > coarse-grained IO and CPU task scheduling may be warranted, but we
> > > need to have a solution for finer-grained scenarios where
> > >
> > > * In the memory-mapped case, there is no overhead and
> > > * The programming model is not too burdensome to the library developer
> >
> > Well, the asynchronous I/O programming model *will* be burdensome at
> > least until C++ gets coroutines (which may happen in C++20, and
> > therefore be usable somewhere around 2024 for Arrow?).
> >
> > Regards
> >
> > Antoine.
> >
>


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Paul Taylor

+1 for a 0.15.0 before 1.0 if we go ahead with this.

I'm curious to hear others' thoughts about compatibility. I think we 
should avoid breaking backwards compatibility if possible. It's common 
for apps/libs to be pinned on specific Arrow versions, and I worry it'd 
cause a lot of work for downstream devs to audit their tool suite for 
full Arrow binary compatibility (and/or require their customers to do 
the same).


Could we detect the 4-byte length, incur a penalty copying the memory to 
an aligned buffer, then continue consuming the stream? (It's probably 
fine if we only write the 8-byte length, since consumers on older  
versions of Arrow could slice from the 4th byte before passing a buffer 
to the reader).


I've always understood the metadata to be a few dozen/hundred KB, a 
small percentage of the total message size. I could be underestimating 
the ratios though -- is it common to have tables w/ 1000+ columns? I've 
seen a few reports like that in cuDF, but I'm curious to hear 
Jacques'/Dremio's experience too.


If copying is feasible, it doesn't seem so bad a trade-off to maintain 
backwards-compatibility. As libraries and consumers upgrade their Arrow 
dependencies, the 4-byte length will be less and less common, and 
they'll be less likely to pay the cost.




On 7/23/19 2:22 AM, Uwe L. Korn wrote:

It is also a good way to test the change in public. We don't want to adjust something 
like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing 
adjustments due to issues that appear "in the wild" is psychologically the 
easier way. There is a lot of thinking of users bound with the magic 1.0, thus I would 
plan to minimize what is changed between 1.0 and pre-1.0. This also should save us 
maintainers some time as I would expect different behaviour in bug reports between 1.0 
and pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:

I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney  wrote:


I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
wrote:


Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:


https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right?
Once that is fine and settled we can move to the 1.0.0 release?

Regards

Antoine.





[ANNOUNCE] Apache Arrow 0.14.1 released

2019-07-23 Thread Krisztián Szűcs
The Apache Arrow community is pleased to announce the 0.14.1 release.
This is a patch release including 46 resolved issues ([1]) since the 0.14.0
release.

The release is available now from our website, [2] and [3]:
http://arrow.apache.org/install/

Release notes are available at:
https://arrow.apache.org/release/0.14.1.html

What is Apache Arrow?
-

Apache Arrow is a cross-language development platform for in-memory data. It
specifies a standardized language-independent columnar memory format for
flat
and hierarchical data, organized for efficient analytic operations on modern
hardware. It also provides computational libraries and zero-copy streaming
messaging and interprocess communication. Languages currently supported
include
C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

Please report any feedback to the mailing lists ([4])

Regards,
The Apache Arrow community

[1]: https://issues.apache.org/jira/projects/ARROW/versions/12345727
[2]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.14.1/
[3]: https://bintray.com/apache/arrow
[4]: https://lists.apache.org/list.html?dev@arrow.apache.org


[jira] [Created] (ARROW-6018) [Release] Gen apidocs step fails with multiple issues

2019-07-23 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-6018:
--

 Summary: [Release] Gen apidocs step fails with multiple issues
 Key: ARROW-6018
 URL: https://issues.apache.org/jira/browse/ARROW-6018
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Krisztian Szucs


JAVA_HOME is improperly set to /opt/conda; this is resolvable by removing maven 
from the conda install step.

Node installed by apt is outdated; use conda to install it instead.

Generating the apidocs for Java takes a lot of time.

`npm run doc` still fails with missing modules.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-23 Thread Neal Richardson
I'll handle R.

On Mon, Jul 22, 2019 at 5:42 PM Krisztián Szűcs 
wrote:

> On Tue, Jul 23, 2019 at 12:31 AM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> wrote:
>
> > The remaining tasks are:
> > - Updating website (after https://github.com/apache/arrow/pull/4922 is
> > merged)
> >
> I'm generating the apidocs and updating the changelog.
> I can send the ANNOUNCEMENT once the site gets updated.
>
> > - Update JavaScript packages
> >
> Paul has published the JS packages.
>
> > - Update R packages
> >
> Anyone would like to help with it?
>
> >
> > On Mon, Jul 22, 2019 at 9:52 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> > wrote:
> >
> >> Added a warning about that.
> >>
> >> On Mon, Jul 22, 2019 at 9:38 PM Wes McKinney 
> wrote:
> >>
> >>> hi folks -- we had a small snafu with the post-release tasks because
> >>> this patch release did not follow our normal release procedure where
> >>> the release candidate is usually based off of master.
> >>>
> >>> When we prepare a patch release that is based on backported commits
> >>> into a maintenance branch, we DO NOT need to rebase master or any PRs.
> >>> So we need to update the release management instructions to indicate
> >>> that these steps should be skipped for future patch releases (or any
> >>> release that isn't based on master at some point in time).
> >>>
> >>> - Wes
> >>>
> >>> On Mon, Jul 22, 2019 at 10:46 AM Krisztián Szűcs
> >>>  wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > The 0.14.1 RC0 vote carries with 4 binding +1 (and 1 non-binding +1)
> >>> votes.
> >>> > Thanks for helping verify the RC!
> >>> > I'm moving on to the post-release tasks [1] once github resolves its
> >>> > partially
> >>> > degraded service issues [2]. Any help is appreciated.
> >>> >
> >>> > - Krisztian
> >>> >
> >>> > [1]:
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
> >>> > [2]: https://www.githubstatus.com/
> >>> >
> >>> > On Mon, Jul 22, 2019 at 4:23 PM Krisztián Szűcs <
> >>> szucs.kriszt...@gmail.com>
> >>> > wrote:
> >>> >
> >>> > > +1 (binding)
> >>> > >
> >>> > > Ran both the source and binary verification scripts on macOS
> Mojave.
> >>> > > Also tested the wheels in python docker containers and on OSX.
> >>> > >
> >>> > > On Thu, Jul 18, 2019 at 11:48 PM Sutou Kouhei 
> >>> wrote:
> >>> > >
> >>> > >> +1 (binding)
> >>> > >>
> >>> > >> I ran the followings on Debian GNU/Linux sid:
> >>> > >>
> >>> > >>   * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
> >>> > >> CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh
> >>> source
> >>> > >> 0.14.1 0
> >>> > >>   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
> >>> > >>
> >>> > >> with:
> >>> > >>
> >>> > >>   * gcc (Debian 8.3.0-7) 8.3.0
> >>> > >>   * openjdk version "1.8.0_212"
> >>> > >>   * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741)
> >>> [x86_64-linux]
> >>> > >>   * Node.JS v12.1.0
> >>> > >>   * go version go1.11.6 linux/amd64
> >>> > >>   * nvidia-cuda-dev 9.2.148-7
> >>> > >>
> >>> > >> I re-run C# tests by the following command line sometimes:
> >>> > >>
> >>> > >>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1
> >>> > >> dev/release/verify-release-candidate.sh source 0.14.1 0
> >>> > >>
> >>> > >> But "sourcelink test" is always failed:
> >>> > >>
> >>> > >>   + sourcelink test
> >>> > >> artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
> >>> > >>   The operation was canceled.
> >>> > >>
> >>> > >> I don't think that this is a blocker.
> >>> > >>
> >>> > >>
> >>> > >> Thanks,
> >>> > >> --
> >>> > >> kou
> >>> > >>
> >>> > >> In <
> >>> cahm19a5jpetwjj4uj-1zoqjzqdcejj-ky673uv83jtfcoyp...@mail.gmail.com>
> >>> > >>   "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019
> >>> 04:54:33
> >>> > >> +0200,
> >>> > >>   Krisztián Szűcs  wrote:
> >>> > >>
> >>> > >> > Hi,
> >>> > >> >
> >>> > >> > I would like to propose the following release candidate (RC0) of
> >>> Apache
> >>> > >> > Arrow version 0.14.1. This is a patch release consisting of 47
> >>> resolved
> >>> > >> > JIRA issues[1].
> >>> > >> >
> >>> > >> > This release candidate is based on commit:
> >>> > >> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> >>> > >> >
> >>> > >> > The source release rc0 is hosted at [3].
> >>> > >> > The binary artifacts are hosted at [4][5][6][7].
> >>> > >> > The changelog is located at [8].
> >>> > >> >
> >>> > >> > Please download, verify checksums and signatures, run the unit
> >>> tests,
> >>> > >> > and vote on the release. See [9] for how to validate a release
> >>> > >> candidate.
> >>> > >> >
> >>> > >> > The vote will be open for at least 72 hours.
> >>> > >> >
> >>> > >> > [ ] +1 Release this as Apache Arrow 0.14.1
> >>> > >> > [ ] +0
> >>> > >> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> >>> > >> >
> >>> > >> > [1]:
> >>> > >> >
> >>> > >>
> >>>
> 

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

2019-07-23 Thread Wes McKinney
Yes I think text files are OK but I want to make sure that committers are
reviewing patches for binary files because there have been a number of
incidents in the past where I had to roll back patches to remove such
files.

On Tue, Jul 23, 2019, 10:37 AM Micah Kornfield 
wrote:

> Hi Wes,
> I haven't checked locally but that file at least for me renders as text
> file in GitHub (with an Apache header).  If we want all test data in the
> testing package I can make sure to move it but I thought text files might
> be ok in the main repo?
>
> Thanks,
> Micah
>
> On Tuesday, July 23, 2019, Wes McKinney  wrote:
>
>> I noticed that test data-related files are beginning to be checked in
>>
>>
>> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/resources/schema/test.avsc
>>
>> I wanted to make sure this doesn't turn into a slippery slope where we
>> end up with several megabytes or more of test data files
>>
>> On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield 
>> wrote:
>> >
>> > Hi Wes,
>> > Are there currently files that need to be moved?
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Monday, July 22, 2019, Wes McKinney  wrote:
>> >>
>> >> Sort of tangentially related, but while we are on the topic:
>> >>
>> >> Please, if you would, avoid checking binary test data files into the
>> >> main repository. Use https://github.com/apache/arrow-testing if you
>> >> truly need to check in binary data -- something to look out for in
>> >> code reviews
>> >>
>> >> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <
>> emkornfi...@gmail.com> wrote:
>> >> >
>> >> > Hi Jacques,
>> >> > Thanks for the clarifications. I think the distinction is useful.
>> >> >
>> >> > If people want to write adapters for Arrow, I see that as useful but
>> very
>> >> > > different than writing native implementations and we should try to
>> create a
>> >> > > clear delineation between the two.
>> >> >
>> >> >
>> >> > What do you think about creating a "contrib" directory and moving
>> the JDBC
>> >> > and AVRO adapters into it? We should also probably provide more
>> description
>> >> > in pom.xml to make it clear for downstream consumers.
>> >> >
>> >> > We should probably come up with a name other than adapters for
>> >> > readers/writer ("converters"?) and use it in the directory structure
>> for
>> >> > the existing Orc implementation?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> >
>> >> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau 
>> wrote:
>> >> >
>> >> > > As I read through your responses, I think it might be useful to
>> talk about
>> >> > > adapters versus native Arrow readers/writers. Adapters are
>> something that
>> >> > > adapt an existing API to produce and/or consume Arrow data. A
>> native
>> >> > > reader/writer is something that understand the format directly and
>> does not
>> >> > > have intermediate representations or APIs the data moves through
>> beyond
>> >> > > those that needs to be used to complete work.
>> >> > >
>> >> > > If people want to write adapters for Arrow, I see that as useful
>> but very
>> >> > > different than writing native implementations and we should try to
>> create a
>> >> > > clear delineation between the two.
>> >> > >
>> >> > > Further comments inline.
>> >> > >
>> >> > >
>> >> > >> Could you expand on what level of detail you would like to see a
>> design
>> >> > >> document?
>> >> > >>
>> >> > >
>> >> > > A couple paragraphs seems sufficient. This is the goals of the
>> >> > > implementation. We target existing functionality X. It is an
>> adapter. Or it
>> >> > > is a native impl. This is the expected memory and processing
>> >> > > characteristics, etc.  I've never been one for huge amount of
>> design but
>> >> > > I've seen a number of recent patches appear where this is no
>> upfront
>> >> > > discussion. Making sure that multiple buy into a design is the
>> best way to
>> >> > > ensure long-term maintenance and use.
>> >> > >
>> >> > >
>> >> > >> I think this should be optional (the same argument below about
>> predicates
>> >> > >> apply so I won't repeat them).
>> >> > >>
>> >> > >
>> >> > > Per my comments above, maybe adapter versus native reader clarifies
>> >> > > things. For example, I've been working on a native avro read
>> >> > > implementation. It is little more than chicken scratch at this
>> point but
>> >> > > its goals, vision and design are very different than the adapter
>> that is
>> >> > > being produced atm.
>> >> > >
>> >> > >
>> >> > >> Can you clarify the intent of this objective.  Is it mainly to
>> tie in with
>> >> > >> the existing Java arrow memory book keeping?  Performance?
>> Something
>> >> > >> else?
>> >> > >>
>> >> > >
>> >> > > Arrow is designed to be off-heap. If you have large variable
>> amounts of
>> >> > > on-heap memory in an application, it starts to make it very hard
>> to make
>> >> > > decisions about off-heap versus on-heap memory since those
>> divisions are by
>> >> > > and large static in nature. It's fine for 

Re: [DISCUSS][JAVA] Designs & goals for readers/writers

2019-07-23 Thread Micah Kornfield
Hi Wes,
I haven't checked locally, but that file at least for me renders as a text
file in GitHub (with an Apache header).  If we want all test data in the
testing package I can make sure to move it, but I thought text files might
be OK in the main repo?

Thanks,
Micah

On Tuesday, July 23, 2019, Wes McKinney  wrote:

> I noticed that test data-related files are beginning to be checked in
>
> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/
> resources/schema/test.avsc
>
> I wanted to make sure this doesn't turn into a slippery slope where we
> end up with several megabytes or more of test data files
>
> On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield 
> wrote:
> >
> > Hi Wes,
> > Are there currently files that need to be moved?
> >
> > Thanks,
> > Micah
> >
> > On Monday, July 22, 2019, Wes McKinney  wrote:
> >>
> >> Sort of tangentially related, but while we are on the topic:
> >>
> >> Please, if you would, avoid checking binary test data files into the
> >> main repository. Use https://github.com/apache/arrow-testing if you
> >> truly need to check in binary data -- something to look out for in
> >> code reviews
> >>
> >> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield 
> wrote:
> >> >
> >> > Hi Jacques,
> >> > Thanks for the clarifications. I think the distinction is useful.
> >> >
> >> > If people want to write adapters for Arrow, I see that as useful but
> very
> >> > > different than writing native implementations and we should try to
> create a
> >> > > clear delineation between the two.
> >> >
> >> >
> >> > What do you think about creating a "contrib" directory and moving the
> JDBC
> >> > and AVRO adapters into it? We should also probably provide more
> description
> >> > in pom.xml to make it clear for downstream consumers.
> >> >
> >> > We should probably come up with a name other than adapters for
> >> > readers/writer ("converters"?) and use it in the directory structure
> for
> >> > the existing Orc implementation?
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> >
> >> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau 
> wrote:
> >> >
> >> > > As I read through your responses, I think it might be useful to
> talk about
> >> > > adapters versus native Arrow readers/writers. Adapters are
> something that
> >> > > adapt an existing API to produce and/or consume Arrow data. A native
> >> > > reader/writer is something that understand the format directly and
> does not
> >> > > have intermediate representations or APIs the data moves through
> beyond
> >> > > those that needs to be used to complete work.
> >> > >
> >> > > If people want to write adapters for Arrow, I see that as useful
> but very
> >> > > different than writing native implementations and we should try to
> create a
> >> > > clear delineation between the two.
> >> > >
> >> > > Further comments inline.
> >> > >
> >> > >
> >> > >> Could you expand on what level of detail you would like to see a
> design
> >> > >> document?
> >> > >>
> >> > >
> >> > > A couple paragraphs seems sufficient. This is the goals of the
> >> > > implementation. We target existing functionality X. It is an
> adapter. Or it
> >> > > is a native impl. This is the expected memory and processing
> >> > > characteristics, etc.  I've never been one for huge amount of
> design but
> >> > > I've seen a number of recent patches appear where this is no upfront
> >> > > discussion. Making sure that multiple buy into a design is the best
> way to
> >> > > ensure long-term maintenance and use.
> >> > >
> >> > >
> >> > >> I think this should be optional (the same argument below about
> predicates
> >> > >> apply so I won't repeat them).
> >> > >>
> >> > >
> >> > > Per my comments above, maybe adapter versus native reader clarifies
> >> > > things. For example, I've been working on a native avro read
> >> > > implementation. It is little more than chicken scratch at this
> point but
> >> > > its goals, vision and design are very different than the adapter
> that is
> >> > > being produced atm.
> >> > >
> >> > >
> >> > >> Can you clarify the intent of this objective.  Is it mainly to tie
> in with
> >> > >> the existing Java arrow memory book keeping?  Performance?
> Something
> >> > >> else?
> >> > >>
> >> > >
> >> > > Arrow is designed to be off-heap. If you have large variable
> amounts of
> >> > > on-heap memory in an application, it starts to make it very hard to
> make
> >> > > decisions about off-heap versus on-heap memory since those
> divisions are by
> >> > > and large static in nature. It's fine for short lived applications
> but for
> >> > > long lived applications, if you're working with a large amount of
> data, you
> >> > > want to keep most of your memory in one pool. In the context of
> Arrow, this
> >> > > is going to naturally be off-heap memory.
> >> > >
> >> > >
> >> > >> I'm afraid this might lead to a "perfect is the enemy of the good"
> >> > >> situation.  Starting off with a known good implementation of
> conversion to
> >> > >> 

Re: [DISCUSS] Passing the torch on Python wheel (binary) maintenance

2019-07-23 Thread Krisztián Szűcs
The pyarrow windows wheels for version 0.14.1 are no longer available.

On Tue, Jul 23, 2019 at 4:19 PM Krisztián Szűcs 
wrote:

> Ok, I'm deleting the 0.14.1 windows wheels then.
>
>
> On Tue, Jul 23, 2019 at 3:40 PM Wes McKinney  wrote:
>
>> I agree that we should not let people install broken wheels.
>>
>> On Tue, Jul 23, 2019 at 8:38 AM Krisztián Szűcs
>>  wrote:
>> >
>> > Although we have a quick fix for that [1] and the fixed wheels will be
>> > available soon [2] but sadly pypi doesn't support the update of already
>> > uploaded packages.
>> >
>> > We have three options:
>> > 1. delete the 0.14.1 windows wheels
>> > 2. draft a post release [3] only for the windows wheels, last time we
>> did it
>> > it broke a lot of users' workflows
>> > 3. create a 0.14.2 release
>> >
>> > In my opinion we should stick with option 1.
>> >
>> > [1]:
>> >
>> https://github.com/kszucs/arrow/commit/3b3f12c97be3436bc78374cac199a909b8f5edfe
>> > [2]:
>> >
>> https://issues.apache.org/jira/browse/ARROW-6015?focusedCommentId=16890990=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16890990
>> > [3]: https://www.python.org/dev/peps/pep-0440/#post-releases
>> >
>> > On Tue, Jul 23, 2019 at 3:27 PM Wes McKinney 
>> wrote:
>> >
>> > > As we just found in https://issues.apache.org/jira/browse/ARROW-6015,
>> > > our 0.14.1 wheels have more problems (this time on Windows), so more
>> > > evidence that we don't have the bandwidth to properly support these
>> > > packages.
>> > >
>> > > On Tue, Jul 16, 2019 at 3:08 PM Jacques Nadeau 
>> wrote:
>> > > >
>> > > > I think what you suggest is highly dependent on who does the work.
>> > > >
>> > > > The first question is who is willing to do the work. Given that
>> they are
>> > > > volunteers, they'd probably need to propose something like this
>> (but with
>> > > > there own flavors/choices) and then we'd have to figure out how this
>> > > > communicated to users (especially in the context that the same
>> package
>> > > > would potentially have different capabilities if used pip vs conda).
>> > > >
>> > > > On Mon, Jul 15, 2019 at 8:52 PM Suvayu Ali <
>> fatkasuvayu+li...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi Wes, others,
>> > > > >
>> > > > > A few thoughts from a user.  Firstly, I completely understand your
>> > > > > frustration.  I myself have delved into a bit of packaging for
>> many
>> > > > > scientific computing packages, like ROOT from CERN, although not
>> at the
>> > > > > scale of users that you face here.
>> > > > >
>> > > > > AIU, wheels are a Python-first spec, whereas Arrow is a C++ first
>> > > library,
>> > > > > with python bindings.  I feel this is what causes the friction in
>> the
>> > > build
>> > > > > chain for wheels.  That said, I would like to propose the
>> following.
>> > > > >
>> > > > > On Mon, Jul 15, 2019 at 10:06:41PM -0500, Wes McKinney wrote:
>> > > > > >
>> > > > > > * Our wheel become much more complex due to Flight (requiring
>> gRPC,
>> > > > > > OpenSSL, and other dependencies) and Gandiva (requiring LLVM and
>> > > more)
>> > > > >
>> > > > > Disable the more advanced features and release reduced feature set
>> > > wheels,
>> > > > > say, only with:
>> > > > > 1. core data structures, Table, etc,
>> > > > > 2. various serialisation support (parquet, orc, etc), and
>> > > > > 3. plasma.
>> > > > >
>> > > > > My justification being, it covers a significant proportion of the
>> > > > > relatively non-expert usecases. (1) covers the interaction with
>> other
>> > > > > Python libraries, particularly pandas, (2) covers most I/O
>> > > requirements,
>> > > > > and plasma along with providing a way to manage Arrow objects
>> > > in-memory for
>> > > > > more advanced architectures, it also serves as a relatively simple
>> > > bridge
>> > > > > to other languages.  Any users requiring Gandiva or Flight on
>> Python
>> > > could
>> > > > > easily "upgrade" to the conda-forge releases.
>> > > > >
>> > > > > What do you think?
>> > > > >
>> > > > > Cheers,
>> > > > >
>> > > > > --
>> > > > > Suvayu
>> > > > >
>> > > > > Open source is the future. It sets us free.
>> > > > >
>> > >
>>
>


Re: [DISCUSS] Passing the torch on Python wheel (binary) maintenance

2019-07-23 Thread Krisztián Szűcs
Ok, I'm deleting the 0.14.1 windows wheels then.


On Tue, Jul 23, 2019 at 3:40 PM Wes McKinney  wrote:

> I agree that we should not let people install broken wheels.
>
> On Tue, Jul 23, 2019 at 8:38 AM Krisztián Szűcs
>  wrote:
> >
> > Although we have a quick fix for that [1] and the fixed wheels will be
> > available soon [2] but sadly pypi doesn't support the update of already
> > uploaded packages.
> >
> > We have three options:
> > 1. delete the 0.14.1 windows wheels
> > 2. draft a post release [3] only for the windows wheels, last time we
> did it
> > it broke a lot of users' workflows
> > 3. create a 0.14.2 release
> >
> > In my opinion we should stick with option 1.
> >
> > [1]:
> >
> https://github.com/kszucs/arrow/commit/3b3f12c97be3436bc78374cac199a909b8f5edfe
> > [2]:
> >
> https://issues.apache.org/jira/browse/ARROW-6015?focusedCommentId=16890990=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16890990
> > [3]: https://www.python.org/dev/peps/pep-0440/#post-releases
> >
> > On Tue, Jul 23, 2019 at 3:27 PM Wes McKinney 
> wrote:
> >
> > > As we just found in https://issues.apache.org/jira/browse/ARROW-6015,
> > > our 0.14.1 wheels have more problems (this time on Windows), so more
> > > evidence that we don't have the bandwidth to properly support these
> > > packages.
> > >
> > > On Tue, Jul 16, 2019 at 3:08 PM Jacques Nadeau 
> wrote:
> > > >
> > > > I think what you suggest is highly dependent on who does the work.
> > > >
> > > > The first question is who is willing to do the work. Given that they
> are
> > > > volunteers, they'd probably need to propose something like this (but
> with
> > > > there own flavors/choices) and then we'd have to figure out how this
> > > > communicated to users (especially in the context that the same
> package
> > > > would potentially have different capabilities if used pip vs conda).
> > > >
> > > > On Mon, Jul 15, 2019 at 8:52 PM Suvayu Ali <
> fatkasuvayu+li...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Wes, others,
> > > > >
> > > > > A few thoughts from a user.  Firstly, I completely understand your
> > > > > frustration.  I myself have delved into a bit of packaging for many
> > > > > scientific computing packages, like ROOT from CERN, although not
> at the
> > > > > scale of users that you face here.
> > > > >
> > > > > AIU, wheels are a Python-first spec, whereas Arrow is a C++ first
> > > library,
> > > > > with python bindings.  I feel this is what causes the friction in
> the
> > > build
> > > > > chain for wheels.  That said, I would like to propose the
> following.
> > > > >
> > > > > On Mon, Jul 15, 2019 at 10:06:41PM -0500, Wes McKinney wrote:
> > > > > >
> > > > > > * Our wheel become much more complex due to Flight (requiring
> gRPC,
> > > > > > OpenSSL, and other dependencies) and Gandiva (requiring LLVM and
> > > more)
> > > > >
> > > > > Disable the more advanced features and release reduced feature set
> > > wheels,
> > > > > say, only with:
> > > > > 1. core data structures, Table, etc,
> > > > > 2. various serialisation support (parquet, orc, etc), and
> > > > > 3. plasma.
> > > > >
> > > > > My justification being, it covers a significant proportion of the
> > > > > relatively non-expert usecases. (1) covers the interaction with
> other
> > > > > Python libraries, particularly pandas, (2) covers most I/O
> > > requirements,
> > > > > and plasma along with providing a way to manage Arrow objects
> > > in-memory for
> > > > > more advanced architectures, it also serves as a relatively simple
> > > bridge
> > > > > to other languages.  Any users requiring Gandiva or Flight on
> Python
> > > could
> > > > > easily "upgrade" to the conda-forge releases.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > --
> > > > > Suvayu
> > > > >
> > > > > Open source is the future. It sets us free.
> > > > >
> > >
>


Re: [DISCUSS][JAVA] Designs & goals for readers/writers

2019-07-23 Thread Wes McKinney
I noticed that test data-related files are beginning to be checked in

https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/resources/schema/test.avsc

I wanted to make sure this doesn't turn into a slippery slope where we
end up with several megabytes or more of test data files

On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield  wrote:
>
> Hi Wes,
> Are there currently files that need to be moved?
>
> Thanks,
> Micah
>
> On Monday, July 22, 2019, Wes McKinney  wrote:
>>
>> Sort of tangentially related, but while we are on the topic:
>>
>> Please, if you would, avoid checking binary test data files into the
>> main repository. Use https://github.com/apache/arrow-testing if you
>> truly need to check in binary data -- something to look out for in
>> code reviews
>>
>> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield  
>> wrote:
>> >
>> > Hi Jacques,
>> > Thanks for the clarifications. I think the distinction is useful.
>> >
>> > If people want to write adapters for Arrow, I see that as useful but very
>> > > different than writing native implementations and we should try to 
>> > > create a
>> > > clear delineation between the two.
>> >
>> >
>> > What do you think about creating a "contrib" directory and moving the JDBC
>> > and AVRO adapters into it? We should also probably provide more description
>> > in pom.xml to make it clear for downstream consumers.
>> >
>> > We should probably come up with a name other than adapters for
>> > readers/writer ("converters"?) and use it in the directory structure for
>> > the existing Orc implementation?
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau  wrote:
>> >
>> > > As I read through your responses, I think it might be useful to talk 
>> > > about
>> > > adapters versus native Arrow readers/writers. Adapters are something that
>> > > adapt an existing API to produce and/or consume Arrow data. A native
>> > > reader/writer is something that understand the format directly and does 
>> > > not
>> > > have intermediate representations or APIs the data moves through beyond
>> > > those that needs to be used to complete work.
>> > >
>> > > If people want to write adapters for Arrow, I see that as useful but very
>> > > different than writing native implementations and we should try to 
>> > > create a
>> > > clear delineation between the two.
>> > >
>> > > Further comments inline.
>> > >
>> > >
>> > >> Could you expand on what level of detail you would like to see a design
>> > >> document?
>> > >>
>> > >
>> > > A couple paragraphs seems sufficient. This is the goals of the
>> > > implementation. We target existing functionality X. It is an adapter. Or 
>> > > it
>> > > is a native impl. This is the expected memory and processing
>> > > characteristics, etc.  I've never been one for huge amount of design but
>> > > I've seen a number of recent patches appear where this is no upfront
>> > > discussion. Making sure that multiple buy into a design is the best way 
>> > > to
>> > > ensure long-term maintenance and use.
>> > >
>> > >
>> > >> I think this should be optional (the same argument below about 
>> > >> predicates
>> > >> apply so I won't repeat them).
>> > >>
>> > >
>> > > Per my comments above, maybe adapter versus native reader clarifies
>> > > things. For example, I've been working on a native avro read
>> > > implementation. It is little more than chicken scratch at this point but
>> > > its goals, vision and design are very different than the adapter that is
>> > > being produced atm.
>> > >
>> > >
>> > >> Can you clarify the intent of this objective.  Is it mainly to tie in 
>> > >> with
>> > >> the existing Java arrow memory book keeping?  Performance?  Something
>> > >> else?
>> > >>
>> > >
>> > > Arrow is designed to be off-heap. If you have large variable amounts of
>> > > on-heap memory in an application, it starts to make it very hard to make
>> > > decisions about off-heap versus on-heap memory since those divisions are 
>> > > by
>> > > and large static in nature. It's fine for short lived applications but 
>> > > for
>> > > long lived applications, if you're working with a large amount of data, 
>> > > you
>> > > want to keep most of your memory in one pool. In the context of Arrow, 
>> > > this
>> > > is going to naturally be off-heap memory.
>> > >
>> > >
>> > >> I'm afraid this might lead to a "perfect is the enemy of the good"
>> > >> situation.  Starting off with a known good implementation of conversion 
>> > >> to
>> > >> Arrow can allow us to both to profile hot-spots and provide a comparison
>> > >> of
>> > >> implementations to verify correctness.
>> > >>
>> > >
>> > > I'm not clear what message we're sending as a community if we produce low
>> > > performance components. The whole of Arrow is to increase performance, 
>> > > not
>> > > decrease it. I'm targeting good, not perfect. At the same time, from my
>> > > perspective, Arrow development should not be approached in the same 

Re: [DISCUSS] Passing the torch on Python wheel (binary) maintenance

2019-07-23 Thread Wes McKinney
I agree that we should not let people install broken wheels.

On Tue, Jul 23, 2019 at 8:38 AM Krisztián Szűcs
 wrote:
>
> Although we have a quick fix for that [1] and the fixed wheels will be
> available soon [2] but sadly pypi doesn't support the update of already
> uploaded packages.
>
> We have three options:
> 1. delete the 0.14.1 windows wheels
> 2. draft a post release [3] only for the windows wheels, last time we did it
> it broke a lot of users' workflows
> 3. create a 0.14.2 release
>
> In my opinion we should stick with option 1.
>
> [1]:
> https://github.com/kszucs/arrow/commit/3b3f12c97be3436bc78374cac199a909b8f5edfe
> [2]:
> https://issues.apache.org/jira/browse/ARROW-6015?focusedCommentId=16890990=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16890990
> [3]: https://www.python.org/dev/peps/pep-0440/#post-releases
>
> On Tue, Jul 23, 2019 at 3:27 PM Wes McKinney  wrote:
>
> > As we just found in https://issues.apache.org/jira/browse/ARROW-6015,
> > our 0.14.1 wheels have more problems (this time on Windows), so more
> > evidence that we don't have the bandwidth to properly support these
> > packages.
> >
> > On Tue, Jul 16, 2019 at 3:08 PM Jacques Nadeau  wrote:
> > >
> > > I think what you suggest is highly dependent on who does the work.
> > >
> > > The first question is who is willing to do the work. Given that they are
> > > volunteers, they'd probably need to propose something like this (but with
> > > there own flavors/choices) and then we'd have to figure out how this
> > > communicated to users (especially in the context that the same package
> > > would potentially have different capabilities if used pip vs conda).
> > >
> > > On Mon, Jul 15, 2019 at 8:52 PM Suvayu Ali 
> > > wrote:
> > >
> > > > Hi Wes, others,
> > > >
> > > > A few thoughts from a user.  Firstly, I completely understand your
> > > > frustration.  I myself have delved into a bit of packaging for many
> > > > scientific computing packages, like ROOT from CERN, although not at the
> > > > scale of users that you face here.
> > > >
> > > > AIU, wheels are a Python-first spec, whereas Arrow is a C++ first
> > library,
> > > > with python bindings.  I feel this is what causes the friction in the
> > build
> > > > chain for wheels.  That said, I would like to propose the following.
> > > >
> > > > On Mon, Jul 15, 2019 at 10:06:41PM -0500, Wes McKinney wrote:
> > > > >
> > > > > * Our wheel become much more complex due to Flight (requiring gRPC,
> > > > > OpenSSL, and other dependencies) and Gandiva (requiring LLVM and
> > more)
> > > >
> > > > Disable the more advanced features and release reduced feature set
> > wheels,
> > > > say, only with:
> > > > 1. core data structures, Table, etc,
> > > > 2. various serialisation support (parquet, orc, etc), and
> > > > 3. plasma.
> > > >
> > > > My justification being, it covers a significant proportion of the
> > > > relatively non-expert usecases. (1) covers the interaction with other
> > > > Python libraries, particularly pandas, (2) covers most I/O
> > requirements,
> > > > and plasma along with providing a way to manage Arrow objects
> > in-memory for
> > > > more advanced architectures, it also serves as a relatively simple
> > bridge
> > > > to other languages.  Any users requiring Gandiva or Flight on Python
> > could
> > > > easily "upgrade" to the conda-forge releases.
> > > >
> > > > What do you think?
> > > >
> > > > Cheers,
> > > >
> > > > --
> > > > Suvayu
> > > >
> > > > Open source is the future. It sets us free.
> > > >
> >


Re: [DISCUSS] Passing the torch on Python wheel (binary) maintenance

2019-07-23 Thread Krisztián Szűcs
Although we have a quick fix for that [1] and the fixed wheels will be
available soon [2], PyPI sadly doesn't support updating already
uploaded packages.

We have three options:
1. delete the 0.14.1 windows wheels
2. draft a post release [3] only for the windows wheels; last time we did it,
it broke a lot of users' workflows
3. create a 0.14.2 release

In my opinion we should stick with option 1.

[1]:
https://github.com/kszucs/arrow/commit/3b3f12c97be3436bc78374cac199a909b8f5edfe
[2]:
https://issues.apache.org/jira/browse/ARROW-6015?focusedCommentId=16890990=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16890990
[3]: https://www.python.org/dev/peps/pep-0440/#post-releases

On Tue, Jul 23, 2019 at 3:27 PM Wes McKinney  wrote:

> As we just found in https://issues.apache.org/jira/browse/ARROW-6015,
> our 0.14.1 wheels have more problems (this time on Windows), so more
> evidence that we don't have the bandwidth to properly support these
> packages.
>
> On Tue, Jul 16, 2019 at 3:08 PM Jacques Nadeau  wrote:
> >
> > I think what you suggest is highly dependent on who does the work.
> >
> > The first question is who is willing to do the work. Given that they are
> > volunteers, they'd probably need to propose something like this (but with
> > there own flavors/choices) and then we'd have to figure out how this
> > communicated to users (especially in the context that the same package
> > would potentially have different capabilities if used pip vs conda).
> >
> > On Mon, Jul 15, 2019 at 8:52 PM Suvayu Ali 
> > wrote:
> >
> > > Hi Wes, others,
> > >
> > > A few thoughts from a user.  Firstly, I completely understand your
> > > frustration.  I myself have delved into a bit of packaging for many
> > > scientific computing packages, like ROOT from CERN, although not at the
> > > scale of users that you face here.
> > >
> > > AIU, wheels are a Python-first spec, whereas Arrow is a C++ first
> library,
> > > with python bindings.  I feel this is what causes the friction in the
> build
> > > chain for wheels.  That said, I would like to propose the following.
> > >
> > > On Mon, Jul 15, 2019 at 10:06:41PM -0500, Wes McKinney wrote:
> > > >
> > > > * Our wheel become much more complex due to Flight (requiring gRPC,
> > > > OpenSSL, and other dependencies) and Gandiva (requiring LLVM and
> more)
> > >
> > > Disable the more advanced features and release reduced feature set
> wheels,
> > > say, only with:
> > > 1. core data structures, Table, etc,
> > > 2. various serialisation support (parquet, orc, etc), and
> > > 3. plasma.
> > >
> > > My justification being, it covers a significant proportion of the
> > > relatively non-expert usecases. (1) covers the interaction with other
> > > Python libraries, particularly pandas, (2) covers most I/O
> requirements,
> > > and plasma along with providing a way to manage Arrow objects
> in-memory for
> > > more advanced architectures, it also serves as a relatively simple
> bridge
> > > to other languages.  Any users requiring Gandiva or Flight on Python
> could
> > > easily "upgrade" to the conda-forge releases.
> > >
> > > What do you think?
> > >
> > > Cheers,
> > >
> > > --
> > > Suvayu
> > >
> > > Open source is the future. It sets us free.
> > >
>


Re: [DISCUSS] Passing the torch on Python wheel (binary) maintenance

2019-07-23 Thread Wes McKinney
As we just found in https://issues.apache.org/jira/browse/ARROW-6015,
our 0.14.1 wheels have more problems (this time on Windows), so more
evidence that we don't have the bandwidth to properly support these
packages.

On Tue, Jul 16, 2019 at 3:08 PM Jacques Nadeau  wrote:
>
> I think what you suggest is highly dependent on who does the work.
>
> The first question is who is willing to do the work. Given that they are
> volunteers, they'd probably need to propose something like this (but with
> there own flavors/choices) and then we'd have to figure out how this
> communicated to users (especially in the context that the same package
> would potentially have different capabilities if used pip vs conda).
>
> On Mon, Jul 15, 2019 at 8:52 PM Suvayu Ali 
> wrote:
>
> > Hi Wes, others,
> >
> > A few thoughts from a user.  Firstly, I completely understand your
> > frustration.  I myself have delved into a bit of packaging for many
> > scientific computing packages, like ROOT from CERN, although not at the
> > scale of users that you face here.
> >
> > AIU, wheels are a Python-first spec, whereas Arrow is a C++ first library,
> > with python bindings.  I feel this is what causes the friction in the build
> > chain for wheels.  That said, I would like to propose the following.
> >
> > On Mon, Jul 15, 2019 at 10:06:41PM -0500, Wes McKinney wrote:
> > >
> > > * Our wheel become much more complex due to Flight (requiring gRPC,
> > > OpenSSL, and other dependencies) and Gandiva (requiring LLVM and more)
> >
> > Disable the more advanced features and release reduced feature set wheels,
> > say, only with:
> > 1. core data structures, Table, etc,
> > 2. various serialisation support (parquet, orc, etc), and
> > 3. plasma.
> >
> > My justification being, it covers a significant proportion of the
> > relatively non-expert usecases. (1) covers the interaction with other
> > Python libraries, particularly pandas, (2) covers most I/O requirements,
> > and plasma along with providing a way to manage Arrow objects in-memory for
> > more advanced architectures, it also serves as a relatively simple bridge
> > to other languages.  Any users requiring Gandiva or Flight on Python could
> > easily "upgrade" to the conda-forge releases.
> >
> > What do you think?
> >
> > Cheers,
> >
> > --
> > Suvayu
> >
> > Open source is the future. It sets us free.
> >


[jira] [Created] (ARROW-6016) [Python] pyarrow get_library_dirs assertion error

2019-07-23 Thread Matthijs Brobbel (JIRA)
Matthijs Brobbel created ARROW-6016:
---

 Summary: [Python] pyarrow get_library_dirs assertion error
 Key: ARROW-6016
 URL: https://issues.apache.org/jira/browse/ARROW-6016
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: Matthijs Brobbel


The code added here: 
[https://github.com/apache/arrow/blob/apache-arrow-0.14.1/python/pyarrow/__init__.py#L257-L265]
 causes an `AssertionError` on Ubuntu:

{{... line 265, in get_library_dirs}}
 {{assert library_dir.startswith("-L")}}
 {{AssertionError}}

I've installed libarrow-dev from the bintray repositories.

Output from pkg-config:

{{pkg-config --debug --libs-only-L arrow}}

{{Package arrow has -L /usr/lib/x86_64-linux-gnu in Libs}}
 {{Removing -L /usr/lib/x86_64-linux-gnu from libs for arrow}}
 {{  pre-remove: arrow}}
 {{ post-remove: arrow}}
 {{ original: arrow}}
 {{   sorted: arrow}}
 {{adding LIBS_L string ""}}
 {{returning flags string ""}}

 Workaround: set the PKG_CONFIG_ALLOW_SYSTEM_LIBS env var.

{{PKG_CONFIG_ALLOW_SYSTEM_LIBS=1 pkg-config --debug --libs-only-L arrow}}

{{Adding 'arrow' to list of known packages}}
{{Package arrow has -I/usr/include in Cflags}}
{{Removing -I/usr/include from cflags for arrow}}
{{Package arrow has -L /usr/lib/x86_64-linux-gnu in Libs}}
{{  pre-remove: arrow}}
{{ post-remove: arrow}}
{{ original: arrow}}
{{   sorted: arrow}}
{{adding LIBS_L string "-L/usr/lib/x86_64-linux-gnu "}}
{{returning flags string "-L/usr/lib/x86_64-linux-gnu"}}
{{-L/usr/lib/x86_64-linux-gnu}}
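For illustration, a minimal Python sketch of the same workaround (assumption: pkg-config is run in a subprocess and therefore inherits the environment, so the variable must be set before pyarrow is used):

import os

# Assumption: pkg-config inherits os.environ, so setting this before importing
# or calling pyarrow keeps the system lib dir in the "-L..." output and avoids
# the empty-string assertion failure.
os.environ["PKG_CONFIG_ALLOW_SYSTEM_LIBS"] = "1"

import pyarrow as pa
print(pa.get_library_dirs())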

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6015) [Python] pyarrow: `DLL load failed` when importing on windows

2019-07-23 Thread Ruslan Kuprieiev (JIRA)
Ruslan Kuprieiev created ARROW-6015:
---

 Summary: [Python] pyarrow:  `DLL load failed` when importing on 
windows
 Key: ARROW-6015
 URL: https://issues.apache.org/jira/browse/ARROW-6015
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.14.1
Reporter: Ruslan Kuprieiev


When installing pyarrow 0.14.1 on Windows 10 x64 with Python 3.7, you get:
```
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: DLL load failed: The specified module could not be found.
```
On 0.14.0 everything works fine.
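
A minimal diagnostic sketch (an assumption for troubleshooting, not a fix): locate the installed package without importing it and list the DLLs that actually shipped in the wheel.

import glob
import importlib.util
import os

# find_spec locates the package directory without executing pyarrow/__init__.py,
# which is what fails here.
spec = importlib.util.find_spec("pyarrow")
package_dir = os.path.dirname(spec.origin)
print(sorted(glob.glob(os.path.join(package_dir, "*.dll"))))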



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6014) [Release] Dockerize post-03-website.sh script

2019-07-23 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-6014:
--

 Summary: [Release] Dockerize post-03-website.sh script
 Key: ARROW-6014
 URL: https://issues.apache.org/jira/browse/ARROW-6014
 Project: Apache Arrow
  Issue Type: Task
Reporter: Krisztian Szucs


The script fails on OSX because of the BSD date function. It also requires a 
recent git that supports the -c flag of git shortlog, as well as the jira Python 
client to be installed.
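
For reference, a minimal sketch of the kind of query the website step relies on the jira Python client for (the server URL and JQL below are illustrative assumptions, not the script's actual code):

from jira import JIRA

# Illustrative example: list issues resolved in a given Arrow release.
client = JIRA(options={"server": "https://issues.apache.org/jira"})
for issue in client.search_issues('project = ARROW AND fixVersion = "0.14.1"', maxResults=10):
    print(issue.key, issue.fields.summary)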



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6012) [C++] Fall back on known Apache mirror for Thrift downloads

2019-07-23 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6012:
-

 Summary: [C++] Fall back on known Apache mirror for Thrift 
downloads
 Key: ARROW-6012
 URL: https://issues.apache.org/jira/browse/ARROW-6012
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


AppVeyor builds have started failing with SSL certificate errors on 
www.apache.org.
See e.g. 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26166313

Underlying cause is https://github.com/conda-forge/python-feedstock/issues/267
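
The real change belongs in the C++ thirdparty download logic, but a minimal sketch of the fallback idea (function name and error handling are assumptions for illustration):

import urllib.request

def download_with_fallback(urls, dest):
    # Try each candidate URL in order and keep the first one that succeeds.
    last_error = None
    for url in urls:
        try:
            urllib.request.urlretrieve(url, dest)
            return url
        except OSError as exc:
            last_error = exc
    raise RuntimeError("all download locations failed") from last_error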




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Uwe L. Korn
It is also a good way to test the change in public. We don't want to adjust 
something like this again in a 1.0.0 release. Doing it in 0.15.0 first, and then 
making adjustments for issues that appear "in the wild", is psychologically the 
easier path. Users attach a lot of meaning to the magic 1.0, so I would plan to 
minimize what changes between pre-1.0 and 1.0. This should also save us 
maintainers some time, as I would expect bug reports to come in differently for 
1.0 than for pre-1.0 releases.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> I think the main reason to do a release before 1.0.0 is if we want to make
> the change that would give a good error message for forward incompatibility
> (I think this could be done as 0.14.2 since it would just be clarifying an
> error message).  Otherwise, I think including it in 1.0.0 would be fine
> (it's still not clear to me if there is consensus to fix the issue).
> 
> Thanks,
> Micah
> 
> 
> On Monday, July 22, 2019, Wes McKinney  wrote:
> 
> > I'd be satisfied with fixing the Flatbuffer alignment issue either in
> > a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> > 0.15.0 with this change sooner rather than later might be prudent.
> >
> > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
> > wrote:
> > >
> > >
> > > Hello,
> > >
> > > Recently we've discussed breaking the IPC format to fix a long-standing
> > > alignment issue.  See this discussion:
> > >
> > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > >
> > > Should we first do a 0.15.0 in order to get those format fixes right?
> > > Once that is fine and settled we can move to the 1.0.0 release?
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


[jira] [Created] (ARROW-6011) Data incomplete when using pyarrow in pyspark in python 3.x

2019-07-23 Thread jiangyu (JIRA)
jiangyu created ARROW-6011:
--

 Summary: Data incomplete when using pyarrow in pyspark in python 
3.x
 Key: ARROW-6011
 URL: https://issues.apache.org/jira/browse/ARROW-6011
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java, Python
Affects Versions: 0.14.0, 0.10.0
 Environment: CentOS 7.4, pyarrow 0.10.0 / 0.14.0, Python 2.7 / 3.5 / 3.6
Reporter: jiangyu
 Attachments: image-2019-07-23-16-06-49-889.png

Hi,
 
In Spark 2.3, pandas UDFs were added to PySpark, with pyarrow as the default 
serialization and deserialization method. It is a great feature and we use it a lot.
But when we change the default Python version from 2.7 to 3.5 or 3.6 (using conda 
as the Python environment manager), we encounter a fatal problem.
We use pandas UDFs to process batches of data, but we find that the data is 
incomplete. At first I thought our processing logic might be wrong, so I reduced 
the code to a very simple case, and it showed the same problem. After 
investigating for a week, I found it is related to pyarrow.
 
Below is how to reproduce it:

1. Generate data
First generate a very simple dataset with seven integer columns, a, b, c, d, e, f 
and g, where every row is the same:
a,b,c,d,e,f,g
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
Produce 100,000 such rows, name the file test.csv and upload it to HDFS (see the 
sketch below), then load it and repartition it to 1 partition.
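A minimal sketch for producing such a file (assumption: plain pandas on the driver, before copying the file to HDFS):

import pandas as pd

# Hypothetical helper: write 100,000 identical rows with columns a..g to test.csv.
n_rows = 100000
data = {name: [value] * n_rows for name, value in zip("abcdefg", range(1, 8))}
pd.DataFrame(data).to_csv("test.csv", index=False)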
 
# In the pyspark shell, `spark` and `SparkContext` are already available.
from pyspark.sql.functions import col

df = spark.read.format('csv').option("header", "true").load('/test.csv')
df = df.select(*(col(c).cast("int").alias(c) for c in df.columns))
df = df.repartition(1)
spark_context = SparkContext.getOrCreate()
 
2. Register a pandas UDF
Define a very simple pandas UDF and register it. The function just prints 
"iterator one time" and returns its first argument unchanged.
 
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import IntegerType

def add_func(a, b, c, d, e, f, g):
    print('iterator one time')
    return a

add = pandas_udf(add_func, returnType=IntegerType())
df_result = df.select(add(col("a"), col("b"), col("c"), col("d"), col("e"), col("f"), col("g")))
 
3. Trigger a Spark action
 
def trigger_func(iterator):
    yield iterator

df_result.rdd.foreachPartition(trigger_func)
 
4. Execute it in pyspark (local or YARN)
We set spark.sql.execution.arrow.maxRecordsPerBatch=10, and the row count is 
1,000,000, so it should print "iterator one time" 10 times.
(1) Here is the result in a Python 2.7 environment:
 
PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
spark.executor.pyspark.memory=2g --conf spark.sql.execution.arrow.enabled=true 
--executor-cores 1
 
!image-2019-07-23-16-06-49-889.png!  
The result is correct: "iterator one time" is printed 10 times.

(2) Then switch to a Python 3.6 environment, with the same code:
PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
spark.executor.pyspark.memory=2g --conf spark.sql.execution.arrow.enabled=true 
--executor-cores 1
[screenshot: output showing incomplete data]
The data is incomplete. The exception is printed by Spark code that we added 
ourselves; I will explain it later.
 
 
h3. Investigation
So I added some logging to trace it. The "process done" log line is added in 
worker.py.
[screenshot: worker.py log output]
To capture the exception, we also changed the Spark code under 
core/src/main/scala/org/apache/spark/util/Utils.scala, adding the following to 
print the exception.
 
@@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
 case t: Throwable =>
 // Purposefully not using NonFatal, because even fatal exceptions
 // we don't want to have our finallyBlock suppress
+ logInfo(t.getLocalizedMessage)
+ t.printStackTrace()
 originalThrowable = t
 throw originalThrowable
 } finally {
 
It seems PySpark gets the data from the JVM, but pyarrow receives the data 
incomplete. The pyarrow side thinks the stream is finished and shuts down the 
socket. At the same time, the JVM side is still writing to the same socket, so it 
gets a socket-closed exception.
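For illustration, a minimal sketch (an assumption for clarity, not the actual Spark worker code) of reading an Arrow IPC stream the way the worker does; once the reader sees an end-of-stream marker, any batches the JVM writes afterwards are never read:

import pyarrow as pa

def count_rows(stream):
    # `stream` is any readable file-like object carrying an Arrow IPC stream,
    # e.g. the socket file the Python worker reads from.
    reader = pa.ipc.open_stream(stream)
    total = 0
    for batch in reader:  # iteration stops as soon as end-of-stream is signalled
        total += batch.num_rows
    return total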
The pyarrow part is in ipc.pxi:
 
cdef class _RecordBatchReader:
    cdef:
        shared_ptr[CRecordBatchReader] reader
        shared_ptr[InputStream] in_stream

    cdef readonly:
        Schema schema

    def __cinit__(self):
        pass

    def _open(self, source):
        get_input_stream(source, &self.in_stream)
        with nogil:
            check_status(CRecordBatchStreamReader.Open(
                self.in_stream.get(), &self.reader))

        self.schema = pyarrow_wrap_schema(self.reader.get().schema())

    def __iter__(self):
        while True:
            yield self.read_next_batch()

    def get_next_batch(self):
        import warnings
        warnings.warn('Please use read_next_batch instead of '
                      'get_next_batch', FutureWarning)
        return self.read_next_batch()

    def