Re: [Discuss] Array Cast Kernels Support Matrix

2019-03-04 Thread Micah Kornfield
Hi Neville,
In case it helps you do some digging most of the allowed casts in C++ can
be found at [1].

* It does support Utf8 to boolean, but I don't believe it supports boolean
to utf8
* It looks like it does support casting List to List.
* It doesn't support Struct to struct
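
To make the List to List case concrete, here is a minimal sketch in plain
Python. It is not the C++ kernel or the Rust implementation: nested Python
lists stand in for list arrays, and the names cast_list_array and u32_to_i32
are hypothetical. The idea is only that list boundaries and nulls are kept
while each child value goes through the child-type cast.

```python
from typing import Callable, List, Optional

MAX_I32 = 2**31 - 1


def u32_to_i32(value: int) -> int:
    """Hypothetical child cast: reject u32 values that do not fit in i32."""
    if not 0 <= value <= MAX_I32:
        raise OverflowError(f"{value} does not fit in i32")
    return value


def cast_list_array(
    lists: List[Optional[List[Optional[int]]]],
    cast_value: Callable[[int], int],
) -> List[Optional[List[Optional[int]]]]:
    """Cast a list array by casting only its child values."""
    out: List[Optional[List[Optional[int]]]] = []
    for item in lists:
        if item is None:
            # A null list slot stays null.
            out.append(None)
        else:
            # Null child values stay null; the rest go through the child cast.
            out.append([None if v is None else cast_value(v) for v in item])
    return out
```

Whether an out-of-range child value should raise or become null is an open
choice; the sketch raises.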

In general, I'm not sure consistency is important between different
implementations of compute engines (unless they all have the goal of
implementing some standard like one of the SQL-XX). So it might be nice,
but I don't think we should be rigorous about it. I'd like to hear other
opinions on this though.

In C++ at least, I think some of the missing casts could likely be handled
by other kernels (if not added directly to casts).

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/generated/cast-codegen-internal.h

On Mon, Mar 4, 2019 at 10:08 AM Neville Dipale 
wrote:

> Hi Arrow devs,
>
> I'm currently adding support for casting arrays in Rust, and I'm wondering
> what casting operations should be supported, and how. Most operations are
> simple, but I have a few questions below.
>
> * Struct to Struct: I am not supporting in Rust as it might not make
> sense/be easy to support. Is that fine?
>
> * List of type A to List of type B: does it make sense to support casting
> as long as the underlying types can be cast to each other? I'm thinking of
> casting a list of u32 to list of i32
>
> * Boolean to Decimal: Apache Impala doesn't support this (
> https://impala.apache.org/docs/build/html/topics/impala_boolean.html). Should
> we follow suit here?
>
> * Boolean to Utf8: Impala casts 'true' to '1', what are Arrow
> implementations doing? I'll also have a look at the CPP codebase.
>
> * Utf8 to Boolean: Impala disallows this, but a case could be made for
> supporting this, with the cast operation being limited to (true/false),
> (T/F) and what CSV readers infer to be true or false. This could be useful
> when reading CSV files (in Rust)
>
> * Primitive to List: I was thinking of creating a list with 1 value for the
> primitive (provided the list type is compatible with the from primitive).
> Is this too extreme? We could perhaps leave this out and support it someday
> in array operations
>
> With regards to temporal arrays, should casting date and time to primitive
> types be supported? The inverse makes sense as I might have an Int32Array
> with millisecond values that I want to cast to a Timestamp or a Date32.
>
> If there's interest/benefit in documenting the above for future consistency
> among the various languages, I don't mind documenting something in the
> coming days/weeks.
>
> Thanks and Regards
>
> Neville
>


Depending on non-released Apache projects (C++ Avro)

2019-03-04 Thread Micah Kornfield
I'm looking at incorporating Avro in Arrow C++ [1]. It seems that the Avro
C++ library APIs have improved since the last release. However, it is not
clear when a new release will be available (I asked on the JIRA item for
the next release [2] and received no response).

I was wondering if there is a policy governing using other Apache projects
or how people felt about the following options:
1.  Depend on a specific git commit through the third-party library system.
2.  Copy the necessary source code temporarily to our project, and change
to using the next release when it is available.
3.  Fork the code we need (the main benefit I see here is being able to
refactor it to avoid having to deal with exceptions, easier integration
with our IO system and one less 3rd party dependency to deal with).
4.  Wait on the 1.9 release before proceeding.

Thanks,
Micah

[1] https://issues.apache.org/jira/browse/ARROW-1209
[2] https://issues.apache.org/jira/browse/AVRO-2250


[jira] [Created] (ARROW-4772) Provide new ORC adapter interface that allow user to specify row number

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4772:
-

 Summary: Provide new ORC adapter interface that allow user to 
specify row number
 Key: ARROW-4772
 URL: https://issues.apache.org/jira/browse/ARROW-4772
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4773) Enable copy free conversion for dictionary encoded string column

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4773:
-

 Summary: Enable copy free conversion for dictionary encoded string 
column
 Key: ARROW-4773
 URL: https://issues.apache.org/jira/browse/ARROW-4773
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou








[jira] [Created] (ARROW-4771) Enable copy free conversion for Composite type

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4771:
-

 Summary: Enable copy free conversion for Composite type
 Key: ARROW-4771
 URL: https://issues.apache.org/jira/browse/ARROW-4771
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou








[jira] [Created] (ARROW-4770) Enable copy free conversion for primitive types

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4770:
-

 Summary: Enable copy free conversion for primitive types
 Key: ARROW-4770
 URL: https://issues.apache.org/jira/browse/ARROW-4770
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou








Re: [Discuss][C++] Hashing floating point numbers

2019-03-04 Thread Micah Kornfield
OK to summarize my understanding of the thoughts expressed:
1.  People really shouldn't be trying to do things like grouping and
joining on double valued columns (but they do).
2.  The consensus (but not 100% agreement):
   * Canonicalize NaNs and assume NaN == NaN, for group by/unique kernels.
   * Assume -0.0 == 0.0.

I can update the JIRA with these conclusions unless someone strongly
disagrees.
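
A minimal sketch of the two rules, in plain Python rather than the C++
kernels. The helper names canonical_bits and unique_floats are hypothetical,
and the particular quiet-NaN bit pattern is an arbitrary choice; the point is
that the hash key is derived after mapping all NaNs to one pattern and -0.0
to +0.0, so hashing and equality stay aligned.

```python
import math
import struct
from typing import Dict, Iterable, List

# Assumption: any single NaN bit pattern works as the canonical one.
CANONICAL_NAN_BITS = 0x7FF8000000000000


def canonical_bits(x: float) -> int:
    """Return the 64-bit pattern used as the hash key for x."""
    if math.isnan(x):
        # All NaNs, whatever their payload, share one key.
        return CANONICAL_NAN_BITS
    if x == 0.0:
        # -0.0 == 0.0 is true, so both zeros get the bits of +0.0.
        return 0
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def unique_floats(values: Iterable[float]) -> List[float]:
    """Unique-style kernel: keep the first value seen for each key."""
    seen: Dict[int, float] = {}
    for v in values:
        seen.setdefault(canonical_bits(v), v)
    return list(seen.values())
```

With this, a Unique over [0.0, -0.0, NaN, NaN, 1.5] yields three values
instead of four or five.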

Thanks,
Micah

On Tue, Feb 26, 2019 at 11:54 AM Wes McKinney  wrote:

> In an analytics setting my prior is that -0/+0 and all types of NaNs
> should respectively be considered semantically to all be "the same
> value". It would be confusing (and likely "wrong" in a practical
> setting) to obtain two kinds of zeros as the output of an algorithm
> involving a hash table, like Unique or ValueCounts. However: hashing
> of floats should not be encouraged in general, but sometimes people
> will hash the results of some operation that happens to yield floats.
>
> On Tue, Feb 26, 2019 at 1:49 PM Antoine Pitrou 
> wrote:
> >
> > On Tue, 26 Feb 2019 09:59:54 -0800
> > Tim Armstrong  wrote:
> > > It's not a database thing, it's a floating point
> > > number thing. If you're doing floating point arithmetic you can end up
> > > with -0/+0 from expressions that should be equivalent.
> >
> > But we are not exactly dealing with arithmetic here...  I'm not sure
> > the IEEE FP standard was designed with database joins in mind.
> >
> > Granted, float hashing and float equality may be of dubious utility.
> > I'm curious about the use cases.
> >
> > > You end up in a world of pain if your equality relation and your hash
> > > function implementation are not aligned.
> >
> > This is not what I am suggesting.
> >
> > > So it's really a question of how you want to define equality (and
> whether
> > > you want to have multiple definitions of equality for different
> purposes).
> >
> > I think this is the goal of this discussion.
> >
> > Regards
> >
> > Antoine.
> >
> >
>


Re: [C++] BUILD_WARNING_LEVEL=EVERYTHING?

2019-03-04 Thread Micah Kornfield
After spending a few hours trying to fix the warnings I think I've come
around to your points of view :)

What got me in the end is DCHECK macros aren't allowed in headers (and I
think if we want to clean these up, we should DCHECK before doing casts).

Cheers,
Micah

On Sun, Mar 3, 2019 at 8:11 PM Wes McKinney  wrote:

> No opposition from me
>
> On Sun, Mar 3, 2019 at 10:02 PM Micah Kornfield 
> wrote:
> >
> > I'm ok with that.  I think some of the conversion warnings might be
> useful
> > (I know I've had bugs in other code that would have been caught with
> > them).  Would people be opposed if I tried to go through and cleanup the
> > EVERYTHING warnings even if more might creep in?
> >
> > Thanks,
> > Micah
> >
> > On Sun, Mar 3, 2019 at 3:27 PM Wes McKinney  wrote:
> >
> > > I'm of the same mind as Antoine on this. I think it's useful to look
> > > at the EVERYTHING warnings periodically, but it is enough effort to
> > > keep things simultaneously building cleanly with gcc, clang, and MSVC,
> > > that I would prefer to maintain the status quo until it can be
> > > demonstrated to be a problem (and even then, it just might be that we
> > > add more specific warnings that we care about to the CHECKIN warning
> > > level). The clang CHECKIN warnings catch some definitely bad things
> > > like missing virtual dtors etc.
> > >
> > > - Wes
> > >
> > > On Sun, Mar 3, 2019 at 3:38 AM Antoine Pitrou 
> wrote:
> > > >
> > > >
> > > > Hmm... There are enough warnings that need pampering in the default
> > > > settings that I don't think we want to go the full length of enabling
> > > > all warnings.  Sometimes it's a PITA to get code to compile cleanly
> on
> > > > all platforms.
> > > >
> > > > If compiler writers had a more reasonable judgement when it comes to
> > > > designing and enabling warnings, I would perhaps revise my position
> ;-)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 03/03/2019 à 04:47, Micah Kornfield a écrit :
> > > > > As part of trying to fix the mingw C++ build [1], I tried compiling
> > > with
> > > > > BUILD_WARNING_LEVEL=EVERYTHING and it seems like it highlights a
> lot of
> > > > > possible warnings that aren't in CHECKIN.   Have we not turned on
> the
> > > > > additional warnings because there was too much to fix at the time
> this
> > > was
> > > > > added?  Or is a conscious decision to ignore some warnings?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/3793
> > > > >
> > >
>


[jira] [Created] (ARROW-4769) [Rust] Improve array limit function where max records > len

2019-03-04 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-4769:
-

 Summary: [Rust] Improve array limit function where max records > 
len
 Key: ARROW-4769
 URL: https://issues.apache.org/jira/browse/ARROW-4769
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.12.0
Reporter: Neville Dipale


When we have an array of n records, and we want to take a limit that's higher 
or equal to n, we still iterate through the array values and create a new array.

We could improve this by returning a copy of the array as-is.
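
A minimal sketch of the proposed short-circuit, in Python rather than Rust.
A tuple stands in for an immutable array and the function name limit is only
illustrative; in Arrow the "copy" would presumably be a cheap clone of a
reference-counted array rather than a new buffer.

```python
from typing import Tuple


def limit(values: Tuple[int, ...], max_records: int) -> Tuple[int, ...]:
    """Return at most max_records leading values."""
    if max_records < 0:
        raise ValueError("max_records must be non-negative")
    if max_records >= len(values):
        # Nothing to cut: hand back the input as-is instead of rebuilding it.
        return values
    # Only here do we need to build a shorter array.
    return values[:max_records]
```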





[jira] [Created] (ARROW-4768) [C++][CI] arrow-test-array sometimes gets stuck in MinGW build

2019-03-04 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4768:
---

 Summary: [C++][CI] arrow-test-array sometimes gets stuck in MinGW 
build
 Key: ARROW-4768
 URL: https://issues.apache.org/jira/browse/ARROW-4768
 Project: Apache Arrow
  Issue Type: Test
  Components: C++, Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-4767) [C#] ArrowStreamReader crashes while reading the end of a stream

2019-03-04 Thread Prashanth Govindarajan (JIRA)
Prashanth Govindarajan created ARROW-4767:
-

 Summary: [C#] ArrowStreamReader crashes while reading the end of a 
stream
 Key: ARROW-4767
 URL: https://issues.apache.org/jira/browse/ARROW-4767
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Prashanth Govindarajan


ReadRecordBatchAsync crashes at the end of a stream when messageLength is 0. 
A length of 0 indicates the end of the stream, so we should just return null. The 
call to Flatbuf.Message.GetRootAsMessage seems to be crashing. The fix is simple 
and safe. I'll have a PR up soon. 
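
A minimal sketch of the intended behaviour, in Python rather than C#. It is
not the ArrowStreamReader code: read_message is a hypothetical helper, and the
4-byte little-endian length prefix is a simplifying assumption about the
framing. The point is only that a length of 0 is checked before any attempt
to parse a message body.

```python
import io
import struct
from typing import BinaryIO, Optional


def read_message(stream: BinaryIO) -> Optional[bytes]:
    """Read one length-prefixed message, or None at the end of the stream."""
    header = stream.read(4)
    if len(header) < 4:
        # The stream ended without an explicit marker.
        return None
    (length,) = struct.unpack("<i", header)
    if length == 0:
        # 0 marks the end of the stream: return before parsing anything.
        return None
    body = stream.read(length)
    if len(body) < length:
        raise EOFError("stream ended in the middle of a message")
    return body
```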





[jira] [Created] (ARROW-4766) Casting empty boolean array causes segfault

2019-03-04 Thread Keith Kraus (JIRA)
Keith Kraus created ARROW-4766:
--

 Summary: Casting empty boolean array causes segfault
 Key: ARROW-4766
 URL: https://issues.apache.org/jira/browse/ARROW-4766
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Keith Kraus


Reproducer:

{code:python}
import pyarrow as pa

test = pa.array([], type=pa.bool_())
test2 = test.cast(pa.int8())
{code}






Re: Flaky Travis CI builds on master

2019-03-04 Thread Wes McKinney
hi Francois,

I updated the CI failure page
https://cwiki.apache.org/confluence/display/ARROW/Continuous+Integration+failures
to split the resolved issues out into a separate filter. This makes
things clearer, and we can now see there are 7 open issues in the
filter.

I also changed the language a bit since describing these issues as
"false positive" could be misleading -- some of these things are
legitimate bugs and may simply present as a flaky test in practice.

- Wes

On Mon, Mar 4, 2019 at 9:11 AM Francois Saint-Jacques
 wrote:
>
> Hello,
>
> I created a new label named `ci-failure`, which was retroactively applied
> to most issues triggering a CI failure in other PRs/master (I searched for
> travis-ci.org/apache/arrow and tagged them). The goal here is to track
> issues which generate false-positive failures in PRs and ideally minimize
> them and/or give them a very high priority to resolve.
>
> The following dashboard
> https://cwiki.apache.org/confluence/display/ARROW/Continuous+Integration+failures
> tracks all tasks with the `ci-failure` label.
>
> I also took the opportunity to tidy the top level navigation page tree,
> notably
> - created a Releases page and moved all release pages
> - created a Dashboards page and moved all related pages
> - created a Developers page and moved all related pages
> - ordered pages by importance for new users, e.g. contribute & release at
> top, moved all committer/maintainer stuff to the lower part
>
> François
>
> On Sat, Mar 2, 2019 at 5:53 PM Wes McKinney  wrote:
>
> > I just gave you edit access.
> >
> > If any PMC member would like to be an admin on the Confluence space
> > (and you are not already), please let me know and I'll add you so you
> > can help with the wiki admin requests
> >
> > On Fri, Mar 1, 2019 at 8:09 PM Francois Saint-Jacques
> >  wrote:
> > >
> > > Could someone give me write/edit access to confluence?
> > >
> > > Thank you,
> > > François
> > >
> > > On Fri, Mar 1, 2019 at 3:55 PM Francois Saint-Jacques <
> > > fsaintjacq...@gmail.com> wrote:
> > >
> > > > I'll take this.
> > > >
> > > > On Fri, Mar 1, 2019 at 3:55 PM Wes McKinney 
> > wrote:
> > > >
> > > >> We could create a page on the wiki that shows all open and resolved
> > > >> issues relating to unexpected CI / build failures. Would someone like
> > > >> to give this a go? There are probably many historical issues that can
> > > >> be tagged with the label
> > > >>
> > > >> On Fri, Mar 1, 2019 at 12:45 PM Francois Saint-Jacques
> > > >>  wrote:
> > > >> >
> > > >> > I agree with adding a tag/label for this and even marking the
> > failure as
> > > >> > critical.
> > > >> >
> > > >> >
> > > >> > On Fri, Mar 1, 2019 at 12:18 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Moving away from the tactical for a minute, I think being able to
> > > >> track
> > > >> > > these over time would be useful.  I can think of a couple of high
> > > >> level
> > > >> > > approaches and I was wondering what others think.
> > > >> > >
> > > >> > > 1.  Use tags appropriately in JIRA and try to generate a report
> > from
> > > >> that.
> > > >> > > 2.  Create a new confluence page to try to log each time these
> > occur
> > > >> (and
> > > > >> > > root cause).
> > > >> > > 3.  A separate spreadsheet someplace (e.g. Google Sheet).
> > > >> > >
> > > >> > > Thoughts?
> > > >> > >
> > > >> > > -Micah
> > > >> > >
> > > >> > >
> > > >> > > On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques <
> > > >> > > fsaintjacq...@gmail.com> wrote:
> > > >> > >
> > > >> > > > Also just created
> > https://issues.apache.org/jira/browse/ARROW-4728
> > > >> > > >
> > > >> > > > On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura <
> > > >> ravin...@dremio.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou <
> > anto...@python.org
> > > >> >
> > > >> > > > wrote:
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit :
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou <
> > > >> solip...@pitrou.net>
> > > >> > > > > wrote:
> > > >> > > > > >>>
> > > >> > > > > >>> On Tue, 26 Feb 2019 13:39:08 -0600
> > > >> > > > > >>> Wes McKinney  wrote:
> > > >> > > > >  hi folks,
> > > >> > > > > 
> > > >> > > > >  We haven't had a green build on master for about 5 days
> > now
> > > >> (the
> > > >> > > > last
> > > >> > > > >  one was February 21). Has anyone else been paying
> > attention
> > > >> to
> > > >> > > this?
> > > >> > > > >  It seems we should start cataloging which tests and build
> > > >> > > > environments
> > > >> > > > >  are the most flaky and see if there's anything we can do
> > to
> > > >> reduce
> > > >> > > > the
> > > >> > > > >  flakiness. Since we are dependent on anaconda.org for
> > build
> > > >> > > > toolchain
> > > >> > > > >  packages, it'

[jira] [Created] (ARROW-4765) [JAVA][Flight] Memory leak

2019-03-04 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4765:
-

 Summary: [JAVA][Flight] Memory leak
 Key: ARROW-4765
 URL: https://issues.apache.org/jira/browse/ARROW-4765
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Java
Affects Versions: 0.12.1
Reporter: Francois Saint-Jacques


There is a potential race issue when reclaiming the FlightServer.
{code:java}
[ERROR] ensureIndependentSteams(org.apache.arrow.flight.TestBackPressure) Time 
elapsed: 1.394 s <<< ERROR!
java.lang.IllegalStateException: 
Memory was leaked by query. Memory leaked: (131072)
Allocator(perf-server) 0/131072/589824/9223372036854775807 
(res/actual/peak/limit)
at 
org.apache.arrow.flight.TestBackPressure.ensureIndependentSteams(TestBackPressure.java:76)
{code}






[jira] [Created] (ARROW-4764) [C++/Java] conda-built libplasma_java doesn't work with system Java on Ubuntu Xenial

2019-03-04 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4764:
--

 Summary: [C++/Java] conda-built libplasma_java doesn't work with 
system Java on Ubuntu Xenial
 Key: ARROW-4764
 URL: https://issues.apache.org/jira/browse/ARROW-4764
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, C++ - Plasma, Packaging, Python
Reporter: Uwe L. Korn


System Java will also load the system {{libstdc++}}, which is older than the 
conda-provided one. Thus a later loading of the plasma library will raise 
missing symbol errors:
{code:java}
TODO{code}





[Discuss] Array Cast Kernels Support Matrix

2019-03-04 Thread Neville Dipale
Hi Arrow devs,

I'm currently adding support for casting arrays in Rust, and I'm wondering
what casting operations should be supported, and how. Most operations are
simple, but I have a few questions below.

* Struct to Struct: I am not supporting in Rust as it might not make
sense/be easy to support. Is that fine?

* List of type A to List of type B: does it make sense to support casting
as long as the underlying types can be cast to each other? I'm thinking of
casting a list of u32 to list of i32

* Boolean to Decimal: Apache Impala doesn't support this (
https://impala.apache.org/docs/build/html/topics/impala_boolean.html). Should
we follow suit here?

* Boolean to Utf8: Impala casts 'true' to '1', what are Arrow
implementations doing? I'll also have a look at the CPP codebase.

* Utf8 to Boolean: Impala disallows this, but a case could be made for
supporting this, with the cast operation being limited to (true/false),
(T/F) and what CSV readers infer to be true or false. This could be useful
when reading CSV files (in Rust)

* Primitive to List: I was thinking of creating a list with 1 value for the
primitive (provided the list type is compatible with the from primitive).
Is this too extreme? We could perhaps leave this out and support it someday
in array operations
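
To illustrate the restricted Utf8 to Boolean cast, here is a minimal sketch in
Python rather than Rust. The function name cast_utf8_to_boolean is
hypothetical, and two behaviours are assumptions that the thread does not
settle: matching is case-insensitive, and any string outside the accepted set
becomes null instead of raising an error.

```python
from typing import List, Optional

# Accepted spellings, limited to (true/false) and (T/F) as proposed above.
TRUE_STRINGS = {"true", "t"}
FALSE_STRINGS = {"false", "f"}


def cast_utf8_to_boolean(values: List[Optional[str]]) -> List[Optional[bool]]:
    """Cast strings to booleans; nulls and unknown strings map to null."""
    out: List[Optional[bool]] = []
    for value in values:
        if value is None:
            out.append(None)
            continue
        key = value.strip().lower()
        if key in TRUE_STRINGS:
            out.append(True)
        elif key in FALSE_STRINGS:
            out.append(False)
        else:
            # Assumption: unparseable input yields null, not an error.
            out.append(None)
    return out
```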

With regards to temporal arrays, should casting date and time to primitive
types be supported? The inverse makes sense as I might have an Int32Array
with millisecond values that I want to cast to a Timestamp or a Date32.
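
As a rough sketch of what the integer to temporal direction involves, in
Python rather than Rust: the helper names are hypothetical, and the sketch
assumes the integers count milliseconds since the Unix epoch in UTC and that
Date32 counts whole days since the same epoch.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MS_PER_DAY = 86_400_000


def millis_to_timestamp(ms: int) -> datetime:
    """Interpret an integer as milliseconds since the epoch."""
    return EPOCH + timedelta(milliseconds=ms)


def millis_to_date32(ms: int) -> int:
    """Days since the epoch; floor division keeps pre-1970 values correct."""
    return ms // MS_PER_DAY


def date32_to_date(days: int) -> date:
    """Turn a days-since-epoch value back into a calendar date."""
    return date(1970, 1, 1) + timedelta(days=days)
```

The reverse cast (temporal to primitive) would just expose the stored
integer, which is why it may be cheap to support as well.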

If there's interest/benefit in documenting the above for future consistency
among the various languages, I don't mind documenting something in the
coming days/weeks.

Thanks and Regards

Neville


[jira] [Created] (ARROW-4762) [C++] Support RapidJSON<1.1.0

2019-03-04 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4762:
--

 Summary: [C++] Support RapidJSON<1.1.0
 Key: ARROW-4762
 URL: https://issues.apache.org/jira/browse/ARROW-4762
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Reporter: Uwe L. Korn


To support building with as many system packages as possible, we should support 
RapidJSON<1.1.0. For example, the distribution-provided RapidJSON on Ubuntu 
Xenial is {{0.12~git20141031-3}}.





[jira] [Created] (ARROW-4763) [C++/Python] Cannot build Gandiva in conda on OSX due to package conflicts

2019-03-04 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4763:
--

 Summary: [C++/Python] Cannot build Gandiva in conda on OSX due to 
package conflicts
 Key: ARROW-4763
 URL: https://issues.apache.org/jira/browse/ARROW-4763
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva, Python
Reporter: Uwe L. Korn


It is currently not reliably possible to build Gandiva with one of the conda 
toolchains on OSX as the packages {{llvm==4.0.1}} (pulled in by the compilers) 
and {{llvmdev==7.0.1}} conflict in some files: 
https://github.com/conda-forge/llvmdev-feedstock/issues/60





Re: [Rust] [DataFusion] Preferences on futures / threading crates?

2019-03-04 Thread Neville Dipale
I'm a fan of using Rayon. Perhaps if it's not too much work, we could
compare the two

On Mon, 4 Mar 2019 at 15:04, Krisztián Szűcs 
wrote:

> On Mon, Mar 4, 2019 at 5:55 AM Andy Grove  wrote:
>
> > I have been working on a PoC of parallel query execution and it is
> working
> > well, and I am now starting to create PRs for the various refactors
> > necessary for this in DataFusion.
> >
> > I haven't been following the async/await and futures/tokio developments
> > lately but for the PoC I used tokio-threadpool which seems simple to use.
>
>
> > I just wanted to give everyone a chance to give their thoughts on this
> > before I get too far with my batch of PRs. Is anyone opposed to using
> > tokio-threadpool?
> >
> DataFusion's tasks should be CPU bound and according to tokio-threadpool's
> documentation [1], it is more suitable for event loops:
> "It is optimized for the primary Tokio use case of many independent tasks
>  with limited computation and with most tasks waiting on I/O."
>
> Rayon seems to follow different semantics, but depending on futures-rs is
> worth considering, especially because it is maintained by the rust lang nursery.
>
> [1] https://docs.rs/tokio-threadpool/0.1.12/tokio_threadpool/
>
> Cheers, Krisztian
>


[jira] [Created] (ARROW-4760) [C++] protobuf 3.7 defines EXPECT_OK that clashes with Arrow's macro

2019-03-04 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4760:
--

 Summary: [C++] protobuf 3.7 defines EXPECT_OK that clashes with 
Arrow's macro
 Key: ARROW-4760
 URL: https://issues.apache.org/jira/browse/ARROW-4760
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


This fails for me with the following error:
{code:java}
/home/travis/build/xhochy/arrow/cpp/src/arrow/flight/test-util.cc
In file included from 
/home/travis/build/xhochy/arrow/cpp-toolchain/include/google/protobuf/util/type_resolver.h:39:0,
 from 
/home/travis/build/xhochy/arrow/cpp-toolchain/include/google/protobuf/util/json_util.h:37,
 from 
/home/travis/build/xhochy/arrow/cpp-toolchain/include/grpcpp/impl/codegen/config_protobuf.h:70,
 from 
/home/travis/build/xhochy/arrow/cpp/src/arrow/flight/customize_protobuf.h:23,
 from 
/home/travis/build/xhochy/arrow/cpp/src/arrow/flight/protocol-internal.h:20,
 from 
/home/travis/build/xhochy/arrow/cpp/src/arrow/flight/internal.h:23,
 from 
/home/travis/build/xhochy/arrow/cpp/src/arrow/flight/test-util.cc:35:
/home/travis/build/xhochy/arrow/cpp-toolchain/include/google/protobuf/stubs/status.h:111:0:
 error: "EXPECT_OK" redefined [-Werror]
 #define EXPECT_OK(value) EXPECT_TRUE((value).ok())
 
In file included from 
/home/travis/build/xhochy/arrow/cpp/src/arrow/ipc/test-common.h:35:0,
 from 
/home/travis/build/xhochy/arrow/cpp/src/arrow/flight/test-util.cc:30:
/home/travis/build/xhochy/arrow/cpp/src/arrow/testing/gtest_util.h:80:0: note: 
this is the location of the previous definition
 #define EXPECT_OK(expr) \
 
cc1plus: all warnings being treated as errors{code}
I would work around this by renaming our {{EXPECT_OK}} to {{ARROW_EXPECT_OK}}.





[jira] [Created] (ARROW-4761) [C++] Support zstandard<1

2019-03-04 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4761:
--

 Summary: [C++] Support zstandard<1
 Key: ARROW-4761
 URL: https://issues.apache.org/jira/browse/ARROW-4761
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


To support building with as many system packages as possible on Ubuntu, we 
should support building with zstandard 0.5.1 which is the one available on 
Ubuntu Xenial. Given the size of our current code for Zstandard, this seems 
feasible.





Re: Flaky Travis CI builds on master

2019-03-04 Thread Francois Saint-Jacques
Hello,

I created a new label named `ci-failure`, which was retroactively applied
to most issues triggering a CI failure in other PRs/master (I searched for
travis-ci.org/apache/arrow and tagged them). The goal here is to track
issues which generate false-positive failures in PRs and ideally minimize
them and/or give them a very high priority to resolve.

The following dashboard
https://cwiki.apache.org/confluence/display/ARROW/Continuous+Integration+failures
tracks all tasks with the `ci-failure` label.

I also took the opportunity to tidy the top level navigation page tree,
notably
- created a Releases page and moved all release pages
- created a Dashboards page and moved all related pages
- created a Developers page and moved all related pages
- ordered pages by importance for new users, e.g. contribute & release at
top, moved all committer/maintainer stuff to the lower part

François

On Sat, Mar 2, 2019 at 5:53 PM Wes McKinney  wrote:

> I just gave you edit access.
>
> If any PMC member would like to be an admin on the Confluence space
> (and you are not already), please let me know and I'll add you so you
> can help with the wiki admin requests
>
> On Fri, Mar 1, 2019 at 8:09 PM Francois Saint-Jacques
>  wrote:
> >
> > Could someone give me write/edit access to confluence?
> >
> > Thank you,
> > François
> >
> > On Fri, Mar 1, 2019 at 3:55 PM Francois Saint-Jacques <
> > fsaintjacq...@gmail.com> wrote:
> >
> > > I'll take this.
> > >
> > > On Fri, Mar 1, 2019 at 3:55 PM Wes McKinney 
> wrote:
> > >
> > >> We could create a page on the wiki that shows all open and resolved
> > >> issues relating to unexpected CI / build failures. Would someone like
> > >> to give this a go? There are probably many historical issues that can
> > >> be tagged with the label
> > >>
> > >> On Fri, Mar 1, 2019 at 12:45 PM Francois Saint-Jacques
> > >>  wrote:
> > >> >
> > >> > I agree with adding a tag/label for this and even marking the
> failure as
> > >> > critical.
> > >> >
> > >> >
> > >> > On Fri, Mar 1, 2019 at 12:18 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Moving away from the tactical for a minute, I think being able to
> > >> track
> > >> > > these over time would be useful.  I can think of a couple of high
> > >> level
> > >> > > approaches and I was wondering what others think.
> > >> > >
> > >> > > 1.  Use tags appropriately in JIRA and try to generate a report
> from
> > >> that.
> > >> > > 2.  Create a new confluence page to try to log each time these
> occur
> > >> (and
> > >> > > root cause).
> > >> > > 3.  A separate spreadsheet someplace (e.g. Google Sheet).
> > >> > >
> > >> > > Thoughts?
> > >> > >
> > >> > > -Micah
> > >> > >
> > >> > >
> > >> > > On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques <
> > >> > > fsaintjacq...@gmail.com> wrote:
> > >> > >
> > >> > > > Also just created
> https://issues.apache.org/jira/browse/ARROW-4728
> > >> > > >
> > >> > > > On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura <
> > >> ravin...@dremio.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou <
> anto...@python.org
> > >> >
> > >> > > > wrote:
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit :
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou <
> > >> solip...@pitrou.net>
> > >> > > > > wrote:
> > >> > > > > >>>
> > >> > > > > >>> On Tue, 26 Feb 2019 13:39:08 -0600
> > >> > > > > >>> Wes McKinney  wrote:
> > >> > > > >  hi folks,
> > >> > > > > 
> > >> > > > >  We haven't had a green build on master for about 5 days
> now
> > >> (the
> > >> > > > last
> > >> > > > >  one was February 21). Has anyone else been paying
> attention
> > >> to
> > >> > > this?
> > >> > > > >  It seems we should start cataloging which tests and build
> > >> > > > environments
> > >> > > > >  are the most flaky and see if there's anything we can do
> to
> > >> reduce
> > >> > > > the
> > >> > > > >  flakiness. Since we are dependent on anaconda.org for
> build
> > >> > > > toolchain
> > >> > > > >  packages, it's hard to control for the 500 timeouts that
> > >> occur
> > >> > > > there,
> > >> > > > >  but I'm seeing other kinds of routine flakiness.
> > >> > > > > >>>
> > >> > > > > >>> Isn't it https://issues.apache.org/jira/browse/ARROW-4684
> ?
> > >> > > > > >>
> > >> > > > > >> ARROW-4684 seems to be failing consistently in travis CI.
> > >> > > > > >>
> > >> > > > > >> Can I merge a change if this is the only CI failure ?
> > >> > > > > >
> > >> > > > > > Yes, you can.
> > >> > > > >
> > >> > > > > Thanks !
> > >> > > > >
> > >> > > > > >
> > >> > > > > > Regards
> > >> > > > > >
> > >> > > > > > Antoine.
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> > >
>


[jira] [Created] (ARROW-4759) [Rust] [DataFusion] It should be possible to share an execution context between threads

2019-03-04 Thread Andy Grove (JIRA)
Andy Grove created ARROW-4759:
-

 Summary: [Rust] [DataFusion] It should be possible to share an 
execution context between threads
 Key: ARROW-4759
 URL: https://issues.apache.org/jira/browse/ARROW-4759
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust, Rust - DataFusion
Affects Versions: 0.12.0
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.13.0


I am working on a PR for this now.
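
For illustration only, here is a minimal sketch of what sharing a context between threads could look like using just the standard library. `ExecutionContext`, `register_table`, `row_count` and `register_in_parallel` are hypothetical names invented for this sketch, not DataFusion's actual API, and wrapping the context in `Arc<Mutex<...>>` is an assumption rather than a description of the PR.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a query execution context; not DataFusion's type.
#[derive(Default)]
pub struct ExecutionContext {
    /// Registered table names mapped to row counts (illustrative state only).
    tables: HashMap<String, usize>,
}

impl ExecutionContext {
    /// Records a table under `name` with the given number of rows.
    pub fn register_table(&mut self, name: &str, rows: usize) {
        self.tables.insert(name.to_string(), rows);
    }

    /// Returns the row count registered for `name`, if any.
    pub fn row_count(&self, name: &str) -> Option<usize> {
        self.tables.get(name).copied()
    }
}

/// Spawns `n` threads that each register one table in the same shared
/// context, then returns the shared context once every thread has finished.
pub fn register_in_parallel(n: usize) -> Arc<Mutex<ExecutionContext>> {
    let ctx = Arc::new(Mutex::new(ExecutionContext::default()));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            // Each thread gets its own reference-counted handle to the context.
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || {
                let mut guard = ctx.lock().expect("context mutex poisoned");
                guard.register_table(&format!("t{}", i), i * 10);
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    ctx
}
```

A caller would lock the returned context to read it, for example `register_in_parallel(4).lock().unwrap().row_count("t3")`. A single mutex serializes all access, so a read-mostly context might prefer `RwLock`; that trade-off is outside what the issue states.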



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Rust] [DataFusion] Preferences on futures / threading crates?

2019-03-04 Thread Krisztián Szűcs
On Mon, Mar 4, 2019 at 5:55 AM Andy Grove  wrote:

> I have been working on a PoC of parallel query execution and it is working
> well, and I am now starting to create PRs for the various refactors
> necessary for this in DataFusion.
>
> I haven't been following the async/await and futures/tokio developments
> lately but for the PoC I used tokio-threadpool which seems simple to use.


> I just wanted to give everyone a chance to give their thoughts on this
> before I get too far with my batch of PRs. Is anyone opposed to using
> tokio-threadpool?
>
DataFusion's tasks should be CPU bound and according to tokio-threadpool's
documentation [1], it is more suitable for event loops:
"It is optimized for the primary Tokio use case of many independent tasks
 with limited computation and with most tasks waiting on I/O."

Rayon seems to follow different semantics, but depending on futures-rs is
worth considering, especially because it is maintained by the rust-lang nursery.

[1] https://docs.rs/tokio-threadpool/0.1.12/tokio_threadpool/

Cheers, Krisztian
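
To make the CPU-bound point concrete, here is a minimal sketch that uses only the standard library (neither tokio-threadpool nor Rayon) and assumes Rust 1.63 or later for `std::thread::scope`. The function name `parallel_sum_of_squares` and the workload are invented for illustration; the idea is simply that each worker gets one contiguous chunk of pure computation and never waits on I/O.

```rust
use std::thread;

/// Sums the squares of `values` by splitting the slice across up to
/// `workers` threads. Each thread processes one contiguous chunk.
pub fn parallel_sum_of_squares(values: &[u64], workers: usize) -> u64 {
    if values.is_empty() {
        return 0;
    }
    // Treat a request for zero workers as a single worker.
    let workers = workers.max(1);
    // Ceiling division so every element lands in some chunk.
    let chunk_len = (values.len() + workers - 1) / workers;
    thread::scope(|scope| {
        // Scoped threads may borrow `values` without needing an Arc.
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|v| v * v).sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

This spawns fresh threads per call, which a real engine would likely avoid by reusing a pool; it is only meant to show the shape of the work, not to recommend a crate.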


Re: [Rust] [DataFusion] Preferences on futures / threading crates?

2019-03-04 Thread paddy horan
No opposition here.

P


From: Andy Grove 
Sent: Sunday, March 3, 2019 11:55 PM
To: dev@arrow.apache.org
Subject: [Rust] [DataFusion] Preferences on futures / threading crates?

I have been working on a PoC of parallel query execution and it is working
well, and I am now starting to create PRs for the various refactors
necessary for this in DataFusion.

I haven't been following the async/await and futures/tokio developments
lately but for the PoC I used tokio-threadpool which seems simple to use.

I just wanted to give everyone a chance to give their thoughts on this
before I get too far with my batch of PRs. Is anyone opposed to using
tokio-threadpool?

Thanks,

Andy.


Re: Parquet Shared Library Versioning

2019-03-04 Thread Hatem Helal
Thank you Wes, keeping the same version sounds good to me.  


On 2/27/19, 9:05 PM, "Wes McKinney"  wrote:

hi Hatem,

Until the Parquet community begins to make C++ releases out of the new
monorepo structure, I think we should continue to use the same SO
version for all libraries produced by the build. Otherwise the ABI
version from a libparquet.so coming from an Arrow release artifact
could cause conflict with a different libparquet.so having the same
ABI version but originating from a different released version.

If this is not what other people want, that is OK, but I don't think
we should be building libparquet.so with a snapshot Parquet version
from an Arrow release artifact. So we can make the SO version
distinct, so long as there is not the possibility of conflict if and
when the Parquet community makes a C++ release.

- Wes

On Mon, Feb 25, 2019 at 1:35 PM Hatem Helal  wrote:
>
> Hi all,
>
> I’d like to discuss the versioning of the parquet shared libs that are
> built when you use -DARROW_PARQUET=ON.  My observation is that back when
> parquet-cpp was a separate project the shared libs were versioned using the
> parquet-cpp version number (e.g. 1.4.0).  Since moving to a single repo, the
> parquet shared libraries are now versioned with the arrow version number
> (e.g. 0.12.0).
>
> I assumed this wasn't carried over to the mono-repo and opened a JIRA [1]
> and a PR [2] to version the parquet shared libraries separately from Arrow.
> I've read through the thread discussing the mono-repo [3] and otherwise can't
> find mention of Wes' comment that:
>
> "I had thought we had discussed using the same SO version for all shared
> libraries produced by a particular build. Let's discuss this some more."
>
> I see some value in maintaining the parquet library version but equally
> see value in matching the arrow version.  I'm not sure this is correct, but I
> still consider parquet-cpp somewhat separate (it has its own JIRA).  An
> additional proposal is that we could modify the CREATED_BY_VERSION [4] to
> reference the Arrow version number for additional traceability with Parquet
> files written using the mono-repo.
>
> Any additional thoughts or a link to prior discussion on shared lib
> versioning would be much appreciated.
>
> Thanks!
>
> Hatem
>
> [1] https://issues.apache.org/jira/browse/PARQUET-1540
> [2] https://github.com/apache/arrow/pull/3743
> [3] https://lists.apache.org/thread.html/efdb7de9fd5f3e7d345caa85639ca65fa2c41f50a977b3eca959e9f9@%3Cdev.arrow.apache.org%3E
> [4] https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet_version.h.in
>
>
>
>
>




[jira] [Created] (ARROW-4758) [Flight] Build fails on Mac due to missing Schema_generated.h

2019-03-04 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-4758:
-

 Summary: [Flight] Build fails on Mac due to missing 
Schema_generated.h
 Key: ARROW-4758
 URL: https://issues.apache.org/jira/browse/ARROW-4758
 Project: Apache Arrow
  Issue Type: Task
  Components: FlightRPC
Reporter: Pindikura Ravindra


I saw this on CI. Retriggering the build fixed the issue, and I am not able to 
get the link to the previous build failure.

The error happened for the file flight/client.cc, which includes 
ipc/metadata-internal.h, which includes arrow/ipc/Schema_generated.h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Boost and manylinux CI builds

2019-03-04 Thread Ravindra Pindikura



> On Mar 4, 2019, at 4:45 AM, Wes McKinney  wrote:
> 
> hi Ravindra,
> 
> Can we document this (just by copy-pasting what you wrote) on the wiki
> or someplace for future work that may touch the manylinux package
> builds? This might be a bit more discoverable than going through the
> email logs

Sure Wes. Will do via ARROW-4756

> 
> Thanks!
> Wes
> 
> On Fri, Mar 1, 2019 at 9:58 PM Ravindra Pindikura  wrote:
>> 
>> Thanks Uwe.
>> 
>> For the record (in case someone needs to do it again), these are the steps:
>> 
>> 1. Make the change in build_boost.sh
>> 
>> 2. Set up an account on quay.io and link it to your GitHub account
>> 
>> 3. In quay.io, add a new repository using:
>> 
>> A. Link to GitHub repository push
>> B. Trigger build on changes to a specific branch (e.g. myquay) of the repo 
>> (e.g. pravindra/arrow)
>> C. Set Dockerfile location to "/python/manylinux1/Dockerfile-x86_64_base"
>> D. Set Context location to "/python/manylinux1"
>> 
>> 4. Push the change (from step 1) to the branch specified in step 3B
>> 
>> This should trigger a build in quay.io; the build takes about 2 hrs to finish.
>> 
>> 5. Add a tag "latest" to the build after step 4 finishes, and save the URL of 
>> the build (e.g. quay.io/pravindra/arrow_manylinux1_x86_64_base:latest)
>> 
>> 6. In your arrow PR,
>> 
>> - include the change from 1.
>> - update travis_script_manylinux.sh to point to the location from step 5.
>> 
>> Thanks & regards,
>> Ravindra.
>> 
>>> On Feb 27, 2019, at 3:02 PM, Uwe L. Korn  wrote:
>>> 
>>> Hello Ravindra,
>>> 
>>> simplest thing would be when you open a pull request and I can then pick 
>>> this up and push it to my personal fork. Then a new image is built on 
>>> quay.io. Otherwise, you can also activate quay.io on your fork to get the 
>>> docker image to build.
>>> 
>>> Uwe
>>> 
>>> On Wed, Feb 27, 2019, at 8:41 AM, Krisztián Szűcs wrote:
 Hi Ravindra!
 
 You'll need to rebuild the docker image and change this line accordingly:
 https://github.com/apache/arrow/blob/master/ci/travis_script_manylinux.sh#L57
 
 On Wed, Feb 27, 2019 at 8:29 AM Ravindra Pindikura 
 wrote:
 
> Hi,
> 
> I added an include for boost header file in gandiva. This compiles on
> ubuntu/Mac/windows, but fails with the manylinux CI entry.
> 
> I’m getting a compilation failure :
> 
> https://travis-ci.org/apache/arrow/jobs/498718755
> /arrow/cpp/src/gandiva/decimal_xlarge.cc:29:44: fatal error:
> boost/multiprecision/cpp_int.hpp: No such file or directory
> #include "boost/multiprecision/cpp_int.hpp"
> ^
> compilation terminated.
> 
> 
> 
> @xhochy and @kszucs pointed out that the manylinux1 image has a very minimal
> boost, and doesn’t include the multiprecision files. So, the script that
> builds boost for manylinux1 needs to be updated for this.
> 
> 
> 
> https://github.com/apache/arrow/blob/master/python/manylinux1/scripts/build_boost.sh#L38
> 
> After making the change, the manylinux1 build still fails with the same
> error :(.
> 
> https://travis-ci.org/apache/arrow/jobs/498847622
> 
> Looks like the CI run downloads a prebuilt docker image. Do I need to
> update the docker image? If yes, can you please point out the
> instructions for this?
> 
> Thanks & regards,
> Ravindra.
 
>> 



[jira] [Created] (ARROW-4757) Nested chunked array support

2019-03-04 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4757:
-

 Summary: Nested chunked array support
 Key: ARROW-4757
 URL: https://issues.apache.org/jira/browse/ARROW-4757
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Dear all,

I'm currently trying to lift the 2GB limit on the python serialization. For 
this, I implemented a chunked union builder to split the array into smaller 
arrays.

However, some of the children of the union array can be ListArrays, which can 
themselves contain UnionArrays which can contain ListArrays etc. I'm at a bit 
of a loss how to handle this. In principle I'd like to chunk the children too. 
However, currently UnionArrays can only have children of type Array, and there 
is no way to treat a chunked array (which is a vector of Arrays) as an Array to 
store it as a child of a UnionArray. Any ideas how to best support this use 
case?

-- Philipp.
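
The difficulty described above can be sketched with a toy model. The code below is not Arrow code: `Value`, `byte_size` and `chunk_top_level` are hypothetical names, and the 8-bytes-per-integer size is an assumption made for illustration. It chunks only at the top level, which shows why a large nested child still ends up whole inside a single chunk.

```rust
/// Hypothetical nested value; a stand-in for union/list arrays, not an Arrow type.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    List(Vec<Value>),
}

impl Value {
    /// Approximate size in bytes: 8 per integer, summed recursively for lists.
    pub fn byte_size(&self) -> usize {
        match self {
            Value::Int(_) => 8,
            Value::List(items) => items.iter().map(Value::byte_size).sum(),
        }
    }
}

/// Splits top-level values into chunks of at most `max_bytes` each.
/// A single value larger than `max_bytes` still gets a chunk of its own,
/// because this sketch never splits inside a nested list.
pub fn chunk_top_level(values: Vec<Value>, max_bytes: usize) -> Vec<Vec<Value>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0;
    for value in values {
        let size = value.byte_size();
        // Close the current chunk if adding this value would exceed the budget.
        if !current.is_empty() && current_bytes + size > max_bytes {
            chunks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current_bytes += size;
        current.push(value);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With a 16-byte budget, a 24-byte nested list is placed in its own chunk but is not split, so the limit is still exceeded. Lifting the limit fully would require chunking the children as well, which is the open question in this issue.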



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)