Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-09 Thread Micah Kornfield
Hi Jacques,

> That's quite interesting. Can you share more about the use case?

Sorry, I realized I missed answering this.  We are still investigating, so
the initial diagnosis might be off.  The use-case is a data transfer
application, reading data at rest, translating it to arrow and sending it
out to clients.

I look forward to hearing your thoughts on the rest of the proposal.

Thanks,
Micah



On Sat, Jul 6, 2019 at 2:53 PM Jacques Nadeau  wrote:

> What is the driving force for transport compression? Are you seeing that
>>> as a major bottleneck in particular circumstances? (I'm not disagreeing,
>>> just want to clearly define the particular problem you're worried about.)
>>
>>
>> I've been working on a 20% project where we appear to be IO bound for
>> transporting record batches.   Also, I believe Ji Liu (tianchen92) has been
>> seeing some of the same bottlenecks with the query engine they are
>> working on.  Trading off some CPU here would allow us to lower the overall
>> latency in the system.
>>
>
> That's quite interesting. Can you share more about the use case? With the
> exception of broadcast and round-robin type distribution patterns, we find
> that there are typically more cycles focused on partitioning the data being
> sent, such that IO bounding is less of a problem. In most of our operations,
> almost all the largest workloads are done via partitioning, so it isn't
> typically a problem. (We also have clients with 10gbps and 100gbps network
> interconnects...) Are you partitioning the data pre-send?
>
>
>
>> Random thought: what do you think of defining this at the transport level
>>> rather than the record batch level? (e.g. in Arrow Flight). This is one way
>>> to avoid extending the core record batch concept with something that isn't
>>> related to processing (at least in your initial proposal)
>>
>>
>> Per above, this seems like a reasonable approach to me if we want to hold
>> off on buffer level compression.  Another use-case for buffer/record-batch
>> level compression would be the Feather file format, to allow decompressing
>> only a subset of columns/rows.  If this use-case isn't compelling, I'd be happy to
>> hold off adding compression to sparse batches until we have benchmarks
>> showing the trade-off between channel level and buffer level compression.
>>
>
> I was proposing that type specific buffer encodings be done at the Flight
> level, not message level encodings. Just want to make sure the formats
> don't leak into the core spec until we're ready.
>
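As a rough illustration of the channel-level option discussed above (compressing the
whole serialized payload rather than individual buffers inside a record batch), here
is a minimal Python sketch; the lz4 package is an assumed third-party codec used
purely to show where the CPU-vs-IO trade-off appears:

import pyarrow as pa
import lz4.frame  # assumed third-party codec, for illustration only

# Build a toy record batch and serialize it with the IPC stream format
batch = pa.RecordBatch.from_arrays([pa.array(list(range(100000)))], ['x'])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

raw = sink.getvalue().to_pybytes()
# Channel-level compression: the whole serialized payload is compressed,
# as opposed to compressing individual buffers inside the record batch.
compressed = lz4.frame.compress(raw)
print(len(raw), len(compressed))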


Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

2019-07-09 Thread Micah Kornfield
Hi Eric,
Short answer: I think your understanding matches what I was proposing.
Longer answer below.

So, for example, we release library v1.0.0 in a few months and then library
> v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java didn't
> make any breaking API changes from 1.0.0. But C# made 3 API breaking
> changes. This would be acceptable?

Yes.  I think all language bindings are undergoing rapid enough iteration
that we are making at least a few small breaking API changes on each
release even though we try to avoid it.  I think it will be worth having
further discussions on the release process once at least a few languages
get to a more stable point.

Thanks,
Micah

On Tue, Jul 9, 2019 at 2:26 PM Eric Erhardt 
wrote:

> Just to be sure I fully understand the proposal:
>
> For the Library Version, we are going to increment the MAJOR version on
> every normal release, and increment the MINOR version if we need to release
> a patch/bug fix type of release.
>
> Since SemVer allows for API breaking changes on MAJOR versions, this
> basically means, each library (C++, Python, C#, Java, etc) _can_ introduce
> API breaking changes on every normal release (like we have been with the
> 0.x.0 releases).
>
> So, for example, we release library v1.0.0 in a few months and then
> library v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java
> didn't make any breaking API changes from 1.0.0. But C# made 3 API breaking
> changes. This would be acceptable?
>
> If my understanding above is correct, then I think this is a good plan.
> Initially I was concerned that the C# library wouldn't be free to make API
> breaking changes after making the version `1.0.0`. The C# library is still
> pretty inadequate, and I have a feeling there are a few things that will
> need to change about it in the future. But with the above plan, this
> concern won't be a problem.
>
> Eric
>
> -Original Message-
> From: Micah Kornfield 
> Sent: Monday, July 1, 2019 10:02 PM
> To: Wes McKinney 
> Cc: dev@arrow.apache.org
> Subject: Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"
>
> Hi Wes,
> Thanks for your response.  In regards to protocol negotiation, your
> description of feature reporting (snipped below) is along the lines of what
> I was thinking.  It might not be necessary for 1.0.0, but at some point
> might become useful.
>
>
> >  Note that we don't really have a mechanism for clients and servers to
> > report to each other what features they support, so this could help
> > with that for applications where it might matter.
>
>
> Thanks,
> Micah
>
>
> On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney  wrote:
>
> > hi Micah,
> >
> > Sorry for the delay in feedback. I looked at the document and it seems
> > like a reasonable perspective about forward- and
> > backward-compatibility.
> >
> > It seems like the main thing you are proposing is to apply Semantic
> > Versioning to Format and Library versions separately. That's an
> > interesting idea; my thought had been to have a version number that is
> > FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is
> > more flexible in some ways, so let me clarify for others reading
> >
> > In what you are proposing, the next release would be:
> >
> > Format version: 1.0.0
> > Library version: 1.0.0
> >
> > Suppose that 20 major versions down the road we stand at
> >
> > Format version: 1.5.0
> > Library version: 20.0.0
> >
> > The minor version of the Format would indicate that there are
> > additions, like new elements in the Type union, but otherwise backward
> > and forward compatible. So the Minor version means "new things, but
> > old clients will not be disrupted if those new things are not used".
> > We've already been doing this since the V4 Format iteration but we
> > have not had a way to signal that there may be new features. As a
> > corollary to this, I wonder if we should create a dual version in the
> > metadata
> >
> > PROTOCOL VERSION: (what is currently MetadataVersion, V2)
> > FEATURE VERSION: not tracked at all
> >
> > So Minor version bumps in the format would trigger a bump in the
> > FeatureVersion. Note that we don't really have a mechanism for clients
> > and servers to report to each other what features they support, so
> > this could help with that for applications where it might matter.
> >
> > Should backward/forward compatibility be disrupted in the future, then
> > a change to the major version would be required. So in year 2025, say,
> > we might decide that we want to do:
> >
> > Format version: 2.0.0
> > Library version: 21.0.0
> >
> > The Format version would live in the project's Documentation, so the
> > Apache releases are only the library version.
> >
> > Regarding your open questions:
> >
> > 1. Should we clean up "warts" on the specification, like redundant
> > information
> >
> > I don't think it's necessary. So if Metadata V5 is Format Version
> > 1.0.0 (currently we are V4, but 
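To make the dual protocol/feature version idea above a bit more concrete, here is a
hypothetical Python sketch of how a client could check what a server reports; the
version numbers and feature names are invented purely for illustration and are not
part of any proposal:

# Hypothetical feature negotiation sketch; names and numbers are made up.
REQUIRED_PROTOCOL = 4                    # analogous to MetadataVersion
REQUIRED_FEATURES = {"new_type_union_members"}

def server_is_compatible(server_protocol, server_features):
    # A protocol bump breaks compatibility, so it must match exactly;
    # feature additions are backward compatible, so a client only needs
    # the features it actually uses to be present.
    return (server_protocol == REQUIRED_PROTOCOL
            and REQUIRED_FEATURES.issubset(set(server_features)))

print(server_is_compatible(4, ["new_type_union_members", "other"]))  # True
print(server_is_compatible(4, []))                                   # False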

[jira] [Created] (ARROW-5897) [Java] Remove duplicated logic in MapVector

2019-07-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5897:
---

 Summary: [Java] Remove duplicated logic in MapVector
 Key: ARROW-5897
 URL: https://issues.apache.org/jira/browse/ARROW-5897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation of MapVector contains much logic duplicated from the 
super class. We remove the duplication by:
 # Making the default data vector name configurable
 # Extracting a method for creating the reader



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Wes McKinney
Thanks for the feedback.

I just posted a PR that removes the class from the C++ and Python
libraries; hopefully this will help with the discussion. That I was
able to do it in less than a day should be good evidence that the
abstraction may be superfluous.

https://github.com/apache/arrow/pull/4841

On Tue, Jul 9, 2019 at 4:26 PM Tim Swast  wrote:
>
> FWIW, I found the Column class to be confusing in Python. It felt redundant
> / unneeded for actually creating Tables.
>
> On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney  wrote:
>
> > On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou  wrote:
> > >
> > >
> > > Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> > > >
> > > > I'm concerned about continuing to maintain the Column class as it's
> > > > spilling complexity into computational libraries and bindings alike.
> > > >
> > > > The Python Column class for example mostly forwards method calls to
> > > > the underlying ChunkedArray
> > > >
> > > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > >
> > > > If the developer wants to construct a Table or insert a new "column",
> > > > Column objects must generally be constructed, leading to boilerplate
> > > > without clear benefit.
> > >
> > > We could simply add the desired ChunkedArray-based convenience methods
> > > without removing the Column-based APIs.
> > >
> > > I don't know if it's really cumbersome to maintain the Column class.
> > > It's generally a very stable part of the API, and the Column class is
> > > just a thin wrapper over a ChunkedArray + a field.
> > >
> >
> > The indirection that it produces in public APIs I have found to be a
> > nuisance, though (for example, doing things with the result of
> > table[i] in Python).
> >
> > I'm about halfway through a patch to remove it, I'll let people review
> > the work to assess the before-and-after.
> >
> > > Regards
> > >
> > > Antoine.
> >
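As a small illustration of what the wrapper holds (a sketch, independent of any
particular pyarrow version): a Column is essentially a Field plus a ChunkedArray,
and both pieces are already usable as standalone objects.

import pyarrow as pa

# A Column is conceptually just a field (name + type) paired with chunked data;
# both pieces already exist on their own.
field = pa.field("x", pa.int64())
chunked = pa.chunked_array([pa.array([1, 2]), pa.array([3])])

print(field.name, field.type)            # x int64
print(chunked.num_chunks, len(chunked))  # 2 3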


Re: Spark and Arrow Flight

2019-07-09 Thread Wes McKinney
Hi Ryan, have you thought about developing this inside Apache Arrow?

On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler  wrote:

> Great, thanks Ryan! I'll take a look
>
> On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray  wrote:
>
> > Hi Bryan,
> >
> > I have an implementation of option #3 nearly ready for a PR. I will
> mention
> > you when I publish it.
> >
> > The working prototype for the Spark connector is here:
> > https://github.com/rymurr/flight-spark-source. It technically works (and
> > is very fast!); however, the implementation is pretty dodgy and needs to be
> > cleaned up before it's ready for prime time. I plan to have it ready to go
> > for the Arrow 1.0.0 release as an Apache 2.0-licensed project. Please shout if
> > you have any comments or are interested in contributing!
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:
> >
> > > I'm in favor of option #3 also, but not sure what the best thing to do
> > with
> > > the existing FlightInfo response is. I'm definitely interested in
> > > connecting Spark with Flight, can you share more details of your work
> or
> > is
> > > it planned to be open sourced?
> > >
> > > Thanks,
> > > Bryan
> > >
> > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Either #3 or #4 for me.  If #3, the default GetSchema implementation
> > can
> > > > rely on calling GetFlightInfo.
> > > >
> > > >
> > > > Le 01/07/2019 à 22:50, David Li a écrit :
> > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > >
> > > > > We've been thinking about a similar issue, where sometimes we want
> > > > > just the schema, but the service can't necessarily return the
> schema
> > > > > without fetching data - right now we return a sentinel value in
> > > > > GetFlightInfo, but a separate RPC would let us explicitly indicate
> an
> > > > > error.
> > > > >
> > > > > I might be missing something though - what happens between step 1
> and
> > > > > 2 that makes the endpoints available? Would it make sense to use
> > > > > DoAction to cause the backend to "prepare" the endpoints, and have
> > the
> > > > > result of that be an encoded schema? So then the flow would be
> > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On 7/1/19, Wes McKinney  wrote:
> > > > >> My inclination is either #2 or #3. #4 is an option of course, but
> I
> > > > >> like the more structured solution of explicitly requesting the
> > schema
> > > > >> given a descriptor.
> > > > >>
> > > > >> In both cases, it's possible that schemas are sent twice, e.g. if
> > you
> > > > >> call GetSchema and then later call GetFlightInfo and so you
> receive
> > > > >> the schema again. The schema is optional, so if it became a
> > > > >> performance problem then a particular server might return the
> schema
> > > > >> as null from GetFlightInfo.
> > > > >>
> > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > request
> > > > >> that returns _both_ the schema and the query plan.
> > > > >>
> > > > >> Thoughts from others?
> > > > >>
> > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <
> jacq...@apache.org>
> > > > wrote:
> > > > >>>
> > > > >>> My initial inclination is towards #3 but I'd be curious what
> others
> > > > >>> think.
> > > > >>> In the case of #3, I wonder if it makes sense to then pull the
> > Schema
> > > > off
> > > > >>> the GetFlightInfo response...
> > > > >>>
> > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray 
> > > > wrote:
> > > > >>>
> > > >  Hi All,
> > > > 
> > > >  I have been working on building an arrow flight source for
> spark.
> > > The
> > > >  goal
> > > >  here is for Spark to be able to use a group of arrow flight
> > > endpoints
> > > >  to
> > > >  get a dataset pulled over to spark in parallel.
> > > > 
> > > >  I am unsure of the best model for the spark <-> flight
> > conversation
> > > > and
> > > >  wanted to get your opinion on the best way to go.
> > > > 
> > > >  I am breaking up the query to flight from spark into 3 parts:
> > > >  1) get the schema using GetFlightInfo. This is needed to do
> > further
> > > >  lazy
> > > >  operations in Spark
> > > >  2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > > >  different
> > >  argument. This returns the list of endpoints on the parallel flight
> > > >  server.
> > > >  The endpoints are not available till data is ready to be
> fetched,
> > > > which
> > > >  is
> > > >  done after the schema but is needed before DoGet is called.
> > > >  3) call get stream on all endpoints from 2
> > > > 
> > > >  I think I have to do each step however I don't like having to
> call
> > > >  getInfo
> > > >  twice, it doesn't seem very elegant. I see a few options:
> > > >  1) live with calling GetFlightInfo twice and with a custom bytes
> > cmd
> > > > to
> > 

[jira] [Created] (ARROW-5896) [C#] Array Builders should take an initial capacity in their constructors

2019-07-09 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-5896:
---

 Summary: [C#] Array Builders should take an initial capacity in 
their constructors
 Key: ARROW-5896
 URL: https://issues.apache.org/jira/browse/ARROW-5896
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Eric Erhardt


When using the Fluent Array Builder API, we should take in an initial capacity 
in the constructor, so we can avoid allocating unnecessary memory.

Today, if you create a builder and then call .Reserve(length) on it, the initial 
byte[] that was created in the constructor is wasted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Spark and Arrow Flight

2019-07-09 Thread Bryan Cutler
Great, thanks Ryan! I'll take a look

On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray  wrote:

> Hi Bryan,
>
> I have an implementation of option #3 nearly ready for a PR. I will mention
> you when I publish it.
>
> The working prototype for the Spark connector is here:
> https://github.com/rymurr/flight-spark-source. It technically works (and
> is very fast!); however, the implementation is pretty dodgy and needs to be
> cleaned up before it's ready for prime time. I plan to have it ready to go for
> the Arrow 1.0.0 release as an Apache 2.0-licensed project. Please shout if
> you have any comments or are interested in contributing!
>
> Best,
> Ryan
>
> On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:
>
> > I'm in favor of option #3 also, but not sure what the best thing to do
> with
> > the existing FlightInfo response is. I'm definitely interested in
> > connecting Spark with Flight, can you share more details of your work or
> is
> > it planned to be open sourced?
> >
> > Thanks,
> > Bryan
> >
> > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou 
> wrote:
> >
> > >
> > > Either #3 or #4 for me.  If #3, the default GetSchema implementation
> can
> > > rely on calling GetFlightInfo.
> > >
> > >
> > > Le 01/07/2019 à 22:50, David Li a écrit :
> > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > >
> > > > We've been thinking about a similar issue, where sometimes we want
> > > > just the schema, but the service can't necessarily return the schema
> > > > without fetching data - right now we return a sentinel value in
> > > > GetFlightInfo, but a separate RPC would let us explicitly indicate an
> > > > error.
> > > >
> > > > I might be missing something though - what happens between step 1 and
> > > > 2 that makes the endpoints available? Would it make sense to use
> > > > DoAction to cause the backend to "prepare" the endpoints, and have
> the
> > > > result of that be an encoded schema? So then the flow would be
> > > > DoAction -> GetFlightInfo -> DoGet.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 7/1/19, Wes McKinney  wrote:
> > > >> My inclination is either #2 or #3. #4 is an option of course, but I
> > > >> like the more structured solution of explicitly requesting the
> schema
> > > >> given a descriptor.
> > > >>
> > > >> In both cases, it's possible that schemas are sent twice, e.g. if
> you
> > > >> call GetSchema and then later call GetFlightInfo and so you receive
> > > >> the schema again. The schema is optional, so if it became a
> > > >> performance problem then a particular server might return the schema
> > > >> as null from GetFlightInfo.
> > > >>
> > > >> I think it's valid to want to make a single GetFlightInfo RPC
> request
> > > >> that returns _both_ the schema and the query plan.
> > > >>
> > > >> Thoughts from others?
> > > >>
> > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau 
> > > wrote:
> > > >>>
> > > >>> My initial inclination is towards #3 but I'd be curious what others
> > > >>> think.
> > > >>> In the case of #3, I wonder if it makes sense to then pull the
> Schema
> > > off
> > > >>> the GetFlightInfo response...
> > > >>>
> > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray 
> > > wrote:
> > > >>>
> > >  Hi All,
> > > 
> > >  I have been working on building an arrow flight source for spark.
> > The
> > >  goal
> > >  here is for Spark to be able to use a group of arrow flight
> > endpoints
> > >  to
> > >  get a dataset pulled over to spark in parallel.
> > > 
> > >  I am unsure of the best model for the spark <-> flight
> conversation
> > > and
> > >  wanted to get your opinion on the best way to go.
> > > 
> > >  I am breaking up the query to flight from spark into 3 parts:
> > >  1) get the schema using GetFlightInfo. This is needed to do
> further
> > >  lazy
> > >  operations in Spark
> > >  2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > >  different
> > >  argument. This returns the list of endpoints on the parallel flight
> > >  server.
> > >  The endpoints are not available till data is ready to be fetched,
> > > which
> > >  is
> > >  done after the schema but is needed before DoGet is called.
> > >  3) call get stream on all endpoints from 2
> > > 
> > >  I think I have to do each step however I don't like having to call
> > >  getInfo
> > >  twice, it doesn't seem very elegant. I see a few options:
> > >  1) live with calling GetFlightInfo twice and with a custom bytes
> cmd
> > > to
> > >  differentiate the purpose of each call
> > >  2) add an argument to GetFlightInfo to tell it it's being called
> only
> > >  for
> > >  the schema
> > >  3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to
> > return
> > >  just
> > >  the Schema in question
> > >  4) use DoAction and wrap the expected FlightInfo in a Result
> > > 
> > >  I am aware that 4 is probably 

Re: Spark and Arrow Flight

2019-07-09 Thread Ryan Murray
Hi Bryan,

I have an implementation of option #3 nearly ready for a PR. I will mention
you when I publish it.

The working prototype for the Spark connector is here:
https://github.com/rymurr/flight-spark-source. It technically works (and is
very fast!); however, the implementation is pretty dodgy and needs to be
cleaned up before it's ready for prime time. I plan to have it ready to go for
the Arrow 1.0.0 release as an Apache 2.0-licensed project. Please shout if
you have any comments or are interested in contributing!

Best,
Ryan

On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:

> I'm in favor of option #3 also, but not sure what the best thing to do with
> the existing FlightInfo response is. I'm definitely interested in
> connecting Spark with Flight, can you share more details of your work or is
> it planned to be open sourced?
>
> Thanks,
> Bryan
>
> On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou  wrote:
>
> >
> > Either #3 or #4 for me.  If #3, the default GetSchema implementation can
> > rely on calling GetFlightInfo.
> >
> >
> > Le 01/07/2019 à 22:50, David Li a écrit :
> > > I think I'd prefer #3 over overloading an existing call (#2).
> > >
> > > We've been thinking about a similar issue, where sometimes we want
> > > just the schema, but the service can't necessarily return the schema
> > > without fetching data - right now we return a sentinel value in
> > > GetFlightInfo, but a separate RPC would let us explicitly indicate an
> > > error.
> > >
> > > I might be missing something though - what happens between step 1 and
> > > 2 that makes the endpoints available? Would it make sense to use
> > > DoAction to cause the backend to "prepare" the endpoints, and have the
> > > result of that be an encoded schema? So then the flow would be
> > > DoAction -> GetFlightInfo -> DoGet.
> > >
> > > Best,
> > > David
> > >
> > > On 7/1/19, Wes McKinney  wrote:
> > >> My inclination is either #2 or #3. #4 is an option of course, but I
> > >> like the more structured solution of explicitly requesting the schema
> > >> given a descriptor.
> > >>
> > >> In both cases, it's possible that schemas are sent twice, e.g. if you
> > >> call GetSchema and then later call GetFlightInfo and so you receive
> > >> the schema again. The schema is optional, so if it became a
> > >> performance problem then a particular server might return the schema
> > >> as null from GetFlightInfo.
> > >>
> > >> I think it's valid to want to make a single GetFlightInfo RPC request
> > >> that returns _both_ the schema and the query plan.
> > >>
> > >> Thoughts from others?
> > >>
> > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau 
> > wrote:
> > >>>
> > >>> My initial inclination is towards #3 but I'd be curious what others
> > >>> think.
> > >>> In the case of #3, I wonder if it makes sense to then pull the Schema
> > off
> > >>> the GetFlightInfo response...
> > >>>
> > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray 
> > wrote:
> > >>>
> >  Hi All,
> > 
> >  I have been working on building an arrow flight source for spark.
> The
> >  goal
> >  here is for Spark to be able to use a group of arrow flight
> endpoints
> >  to
> >  get a dataset pulled over to spark in parallel.
> > 
> >  I am unsure of the best model for the spark <-> flight conversation
> > and
> >  wanted to get your opinion on the best way to go.
> > 
> >  I am breaking up the query to flight from spark into 3 parts:
> >  1) get the schema using GetFlightInfo. This is needed to do further
> >  lazy
> >  operations in Spark
> >  2) get the endpoints by calling GetFlightInfo a 2nd time with a
> >  different
> >  argument. This returns the list of endpoints on the parallel flight
> >  server.
> >  The endpoints are not available till data is ready to be fetched,
> > which
> >  is
> >  done after the schema but is needed before DoGet is called.
> >  3) call get stream on all endpoints from 2
> > 
> >  I think I have to do each step however I don't like having to call
> >  getInfo
> >  twice, it doesn't seem very elegant. I see a few options:
> >  1) live with calling GetFlightInfo twice and with a custom bytes cmd
> > to
> >  differentiate the purpose of each call
> >  2) add an argument to GetFlightInfo to tell it it's being called only
> >  for
> >  the schema
> >  3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to
> return
> >  just
> >  the Schema in question
> >  4) use DoAction and wrap the expected FlightInfo in a Result
> > 
> >  I am aware that 4 is probably the least disruptive but I'm also not
> a
> >  fan
> >  as (to me) it implies performing an action on the server side.
> >  Suggestions
> >  2 & 3 are larger changes and I am reluctant to do that unless there
> is
> >  a
> >  consensus here. None of them are great options and I am wondering
> what
> >  everyone thinks the best 

Re: Spark and Arrow Flight

2019-07-09 Thread Bryan Cutler
I'm in favor of option #3 also, but not sure what the best thing to do with
the existing FlightInfo response is. I'm definitely interested in
connecting Spark with Flight, can you share more details of your work or is
it planned to be open sourced?

Thanks,
Bryan

On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou  wrote:

>
> Either #3 or #4 for me.  If #3, the default GetSchema implementation can
> rely on calling GetFlightInfo.
>
>
> Le 01/07/2019 à 22:50, David Li a écrit :
> > I think I'd prefer #3 over overloading an existing call (#2).
> >
> > We've been thinking about a similar issue, where sometimes we want
> > just the schema, but the service can't necessarily return the schema
> > without fetching data - right now we return a sentinel value in
> > GetFlightInfo, but a separate RPC would let us explicitly indicate an
> > error.
> >
> > I might be missing something though - what happens between step 1 and
> > 2 that makes the endpoints available? Would it make sense to use
> > DoAction to cause the backend to "prepare" the endpoints, and have the
> > result of that be an encoded schema? So then the flow would be
> > DoAction -> GetFlightInfo -> DoGet.
> >
> > Best,
> > David
> >
> > On 7/1/19, Wes McKinney  wrote:
> >> My inclination is either #2 or #3. #4 is an option of course, but I
> >> like the more structured solution of explicitly requesting the schema
> >> given a descriptor.
> >>
> >> In both cases, it's possible that schemas are sent twice, e.g. if you
> >> call GetSchema and then later call GetFlightInfo and so you receive
> >> the schema again. The schema is optional, so if it became a
> >> performance problem then a particular server might return the schema
> >> as null from GetFlightInfo.
> >>
> >> I think it's valid to want to make a single GetFlightInfo RPC request
> >> that returns _both_ the schema and the query plan.
> >>
> >> Thoughts from others?
> >>
> >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau 
> wrote:
> >>>
> >>> My initial inclination is towards #3 but I'd be curious what others
> >>> think.
> >>> In the case of #3, I wonder if it makes sense to then pull the Schema
> off
> >>> the GetFlightInfo response...
> >>>
> >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray 
> wrote:
> >>>
>  Hi All,
> 
>  I have been working on building an arrow flight source for spark. The
>  goal
>  here is for Spark to be able to use a group of arrow flight endpoints
>  to
>  get a dataset pulled over to spark in parallel.
> 
>  I am unsure of the best model for the spark <-> flight conversation
> and
>  wanted to get your opinion on the best way to go.
> 
>  I am breaking up the query to flight from spark into 3 parts:
>  1) get the schema using GetFlightInfo. This is needed to do further
>  lazy
>  operations in Spark
>  2) get the endpoints by calling GetFlightInfo a 2nd time with a
>  different
>  argument. This returns the list of endpoints on the parallel flight
>  server.
>  The endpoints are not available till data is ready to be fetched,
> which
>  is
>  done after the schema but is needed before DoGet is called.
>  3) call get stream on all endpoints from 2
> 
>  I think I have to do each step however I don't like having to call
>  getInfo
>  twice, it doesn't seem very elegant. I see a few options:
>  1) live with calling GetFlightInfo twice and with a custom bytes cmd
> to
>  differentiate the purpose of each call
>  2) add an argument to GetFlightInfo to tell it it's being called only
>  for
>  the schema
>  3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to return
>  just
>  the Schema in question
>  4) use DoAction and wrap the expected FlightInfo in a Result
> 
>  I am aware that 4 is probably the least disruptive but I'm also not a
>  fan
>  as (to me) it implies performing an action on the server side.
>  Suggestions
>  2 & 3 are larger changes and I am reluctant to do that unless there is
>  a
>  consensus here. None of them are great options and I am wondering what
>  everyone thinks the best approach might be? Particularly as I think
> this
>  is
>  likely to come up in more applications than just spark.
> 
>  Best,
>  Ryan
> 
> >>
>
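For reference, a rough client-side sketch of the baseline flow that options 1-4 above
are variations on, assuming the pyarrow.flight Python client API; the server address
and command bytes are placeholders, and the single GetFlightInfo call here returns
both schema and endpoints before DoGet is issued per endpoint:

import pyarrow.flight as flight

# Placeholder location and command, purely illustrative
client = flight.connect("grpc://localhost:8815")
descriptor = flight.FlightDescriptor.for_command(b"select * from t")

info = client.get_flight_info(descriptor)  # schema + endpoints in one response
print(info.schema)

tables = []
for endpoint in info.endpoints:
    # In the Spark case, each endpoint would be read by a separate task
    reader = client.do_get(endpoint.ticket)
    tables.append(reader.read_all())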


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Tim Swast
FWIW, I found the Column class to be confusing in Python. It felt redundant
/ unneeded for actually creating Tables.

On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney  wrote:

> On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou  wrote:
> >
> >
> > Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> > >
> > > I'm concerned about continuing to maintain the Column class as it's
> > > spilling complexity into computational libraries and bindings alike.
> > >
> > > The Python Column class for example mostly forwards method calls to
> > > the underlying ChunkedArray
> > >
> > >
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > >
> > > If the developer wants to construct a Table or insert a new "column",
> > > Column objects must generally be constructed, leading to boilerplate
> > > without clear benefit.
> >
> > We could simply add the desired ChunkedArray-based convenience methods
> > without removing the Column-based APIs.
> >
> > I don't know if it's really cumbersome to maintain the Column class.
> > It's generally a very stable part of the API, and the Column class is
> > just a thin wrapper over a ChunkedArray + a field.
> >
>
> The indirection that it produces in public APIs I have found to be a
> nuisance, though (for example, doing things with the result of
> table[i] in Python).
>
> I'm about halfway through a patch to remove it, I'll let people review
> the work to assess the before-and-after.
>
> > Regards
> >
> > Antoine.
>


RE: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

2019-07-09 Thread Eric Erhardt
Just to be sure I fully understand the proposal:

For the Library Version, we are going to increment the MAJOR version on every 
normal release, and increment the MINOR version if we need to release a 
patch/bug fix type of release.

Since SemVer allows for API breaking changes on MAJOR versions, this basically 
means, each library (C++, Python, C#, Java, etc) _can_ introduce API breaking 
changes on every normal release (like we have been with the 0.x.0 releases).

So, for example, we release library v1.0.0 in a few months and then library 
v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java didn't make 
any breaking API changes from 1.0.0. But C# made 3 API breaking changes. This 
would be acceptable?

If my understanding above is correct, then I think this is a good plan. 
Initially I was concerned that the C# library wouldn't be free to make API 
breaking changes after making the version `1.0.0`. The C# library is still 
pretty inadequate, and I have a feeling there are a few things that will need 
to change about it in the future. But with the above plan, this concern won't 
be a problem.

Eric

-Original Message-
From: Micah Kornfield  
Sent: Monday, July 1, 2019 10:02 PM
To: Wes McKinney 
Cc: dev@arrow.apache.org
Subject: Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

Hi Wes,
Thanks for your response.  In regards to protocol negotiation, your 
description of feature reporting (snipped below) is along the lines of what I 
was thinking.  It might not be necessary for 1.0.0, but at some point might 
become useful.


>  Note that we don't really have a mechanism for clients and servers to 
> report to each other what features they support, so this could help 
> with that for applications where it might matter.


Thanks,
Micah


On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney  wrote:

> hi Micah,
>
> Sorry for the delay in feedback. I looked at the document and it seems 
> like a reasonable perspective about forward- and 
> backward-compatibility.
>
> It seems like the main thing you are proposing is to apply Semantic 
> Versioning to Format and Library versions separately. That's an 
> interesting idea; my thought had been to have a version number that is 
> FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is 
> more flexible in some ways, so let me clarify for others reading
>
> In what you are proposing, the next release would be:
>
> Format version: 1.0.0
> Library version: 1.0.0
>
> Suppose that 20 major versions down the road we stand at
>
> Format version: 1.5.0
> Library version: 20.0.0
>
> The minor version of the Format would indicate that there are 
> additions, like new elements in the Type union, but otherwise backward 
> and forward compatible. So the Minor version means "new things, but 
> old clients will not be disrupted if those new things are not used".
> We've already been doing this since the V4 Format iteration but we 
> have not had a way to signal that there may be new features. As a 
> corollary to this, I wonder if we should create a dual version in the 
> metadata
>
> PROTOCOL VERSION: (what is currently MetadataVersion, V2)
> FEATURE VERSION: not tracked at all
>
> So Minor version bumps in the format would trigger a bump in the 
> FeatureVersion. Note that we don't really have a mechanism for clients 
> and servers to report to each other what features they support, so 
> this could help with that for applications where it might matter.
>
> Should backward/forward compatibility be disrupted in the future, then 
> a change to the major version would be required. So in year 2025, say, 
> we might decide that we want to do:
>
> Format version: 2.0.0
> Library version: 21.0.0
>
> The Format version would live in the project's Documentation, so the 
> Apache releases are only the library version.
>
> Regarding your open questions:
>
> 1. Should we clean up "warts" on the specification, like redundant 
> information
>
> I don't think it's necessary. So if Metadata V5 is Format Version
> 1.0.0 (currently we are V4, but we're discussing some possible 
> non-forward compatible changes...) I think that's OK. None of these 
> things are "hurting" anything
>
> 2. Do we need additional mechanisms for marking some features as 
> experimental?
>
> Not sure, but I think this can be mostly addressed through 
> documentation. Flight will still be experimental in 1.0.0, for 
> example.
>
> 3. Do we need protocol negotiation mechanisms in Flight
>
> Could you explain what you mean? Are you thinking if there is some 
> major revamp of the protocol and you need to switch between a "V1 
> Flight Protocol" and a "V2 Flight Protocol"?
>
> - Wes
>
> On Thu, Jun 13, 2019 at 2:17 AM Micah Kornfield 
> 
> wrote:
> >
> > Hi Everyone,
> > I think there might be some ideas that we still need to reach 
> > consensus
> on
> > for how the format and libraries evolve in a post-1.0.0 release world.
> >  Specifically, I think we need to agree on 

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-09 Thread Wes McKinney
Hi Eric -- of course!

On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt
 wrote:

> Can we propose getting changes other than Python- or Parquet-related ones into
> this release?
>
> For example, I found a critical issue in the C# implementation that, if
> possible, I'd like to get included in a patch release.
> https://github.com/apache/arrow/pull/4836
>
> Eric
>
> -Original Message-
> From: Wes McKinney 
> Sent: Tuesday, July 9, 2019 7:59 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package
> problems, Parquet forward compatibility problems
>
> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > If the problems can be resolved quickly, I should think we could cut
> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
> > > from a maintenance branch or out of master -- any thoughts about
> > > this (cutting from master is definitely easier)?
> >
> > How about just releasing 0.15.0 from master?
> > It'll be simpler than creating a patch release.
> >
>
> I'd be fine with that, too.
>
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[DISCUSS] Need for 0.14.1 release due to Python package problems,
> Parquet forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500,
> >   Wes McKinney  wrote:
> >
> > > hi folks,
> > >
> > > Perhaps unsurprisingly due to the expansion of our Python packages,
> > > a number of things are broken in 0.14.0 that we should fix sooner
> > > than the next major release. I'll try to send a complete list to
> > > this thread to give a status within a day or two. Other problems may
> > > arise in the next 48 hours as more people install the package.
> > >
> > > If the problems can be resolved quickly, I should think we could cut
> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
> > > from a maintenance branch or out of master -- any thoughts about
> > > this (cutting from master is definitely easier)?
> > >
> > > Would someone (who is not Kou) be able to assist with creating the RC?
> > >
> > > Thanks,
> > > Wes
>


RE: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-09 Thread Eric Erhardt
Can we propose getting changes other than Python- or Parquet-related ones into this 
release?

For example, I found a critical issue in the C# implementation that, if 
possible, I'd like to get included in a patch release.  
https://github.com/apache/arrow/pull/4836

Eric

-Original Message-
From: Wes McKinney  
Sent: Tuesday, July 9, 2019 7:59 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, 
Parquet forward compatibility problems

On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei  wrote:
>
> Hi,
>
> > If the problems can be resolved quickly, I should think we could cut 
> > an RC for 0.14.1 by the end of this week. The RC could either be cut 
> > from a maintenance branch or out of master -- any thoughts about 
> > this (cutting from master is definitely easier)?
>
> How about just releasing 0.15.0 from master?
> It'll be simpler than creating a patch release.
>

I'd be fine with that, too.

>
> Thanks,
> --
> kou
>
> In 
>   "[DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet 
> forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500,
>   Wes McKinney  wrote:
>
> > hi folks,
> >
> > Perhaps unsurprisingly due to the expansion of our Python packages, 
> > a number of things are broken in 0.14.0 that we should fix sooner 
> > than the next major release. I'll try to send a complete list to 
> > this thread to give a status within a day or two. Other problems may 
> > arise in the next 48 hours as more people install the package.
> >
> > If the problems can be resolved quickly, I should think we could cut 
> > an RC for 0.14.1 by the end of this week. The RC could either be cut 
> > from a maintenance branch or out of master -- any thoughts about 
> > this (cutting from master is definitely easier)?
> >
> > Would someone (who is not Kou) be able to assist with creating the RC?
> >
> > Thanks,
> > Wes


[jira] [Created] (ARROW-5895) [Python] New version stores timestamps as epoch ms instead of ISO timestamp string

2019-07-09 Thread John Wilson (JIRA)
John Wilson created ARROW-5895:
--

 Summary: [Python] New version stores timestamps as epoch ms 
instead of ISO timestamp string
 Key: ARROW-5895
 URL: https://issues.apache.org/jira/browse/ARROW-5895
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
 Environment: Linux dev.office.whoop.com 3.10.0-957.21.3.el7.x86_64 #1 
SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Reporter: John Wilson


Just upgraded from pyarrow 0.13 to 0.14.

Columns of type TimestampType(timestamp[ns]) now get written as epoch ms 
values: 
1561939200507
whereas 0.13 wrote TimestampType(timestamp[ns]) as an ISO string:
2019-07-01T00:00:00.507Z
This broke my implementation.  How do I get pyarrow to write ISO strings again 
in 0.14?

 

Here is my table write:

{code:python}
pyarrow.parquet.write_to_dataset(table=tbl, root_path=local_path,
                                 partition_cols=['env', 'dt'],
                                 coerce_timestamps='ms',
                                 allow_truncated_timestamps=True,
                                 version='2.0',
                                 compression='SNAPPY')
{code}

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5894) libgandiva.so.14 is exporting libstdc++ symbols

2019-07-09 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5894:


 Summary: libgandiva.so.14 is exporting libstdc++ symbols
 Key: ARROW-5894
 URL: https://issues.apache.org/jira/browse/ARROW-5894
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Affects Versions: 0.14.0
Reporter: Zhuo Peng


For example:

$ nm libgandiva.so.14 | grep "once_proxy"
018c0a10 T __once_proxy

 

Many other symbols are also exported which I guess shouldn't be (e.g. LLVM 
symbols).

 

There seems to be no linker script for libgandiva.so (there was, but was never 
used and got deleted? 
[https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]).

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Wes McKinney
On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou  wrote:
>
>
> Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> >
> > I'm concerned about continuing to maintain the Column class as it's
> > spilling complexity into computational libraries and bindings alike.
> >
> > The Python Column class for example mostly forwards method calls to
> > the underlying ChunkedArray
> >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> >
> > If the developer wants to construct a Table or insert a new "column",
> > Column objects must generally be constructed, leading to boilerplate
> > without clear benefit.
>
> We could simply add the desired ChunkedArray-based convenience methods
> without removing the Column-based APIs.
>
> I don't know if it's really cumbersome to maintain the Column class.
> It's generally a very stable part of the API, and the Column class is
> just a thin wrapper over a ChunkedArray + a field.
>

The indirection that it produces in public APIs I have found to be a
nuisance, though (for example, doing things with the result of
table[i] in Python).

I'm about halfway through a patch to remove it, I'll let people review
the work to assess the before-and-after.

> Regards
>
> Antoine.


[jira] [Created] (ARROW-5893) [C++] Remove arrow::Column class from C++ library

2019-07-09 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5893:
---

 Summary: [C++] Remove arrow::Column class from C++ library
 Key: ARROW-5893
 URL: https://issues.apache.org/jira/browse/ARROW-5893
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, GLib, MATLAB, Python, R
Reporter: Wes McKinney
 Fix For: 1.0.0


Opening JIRA per ongoing discussion on mailing list.

This class unfortunately touches a lot of places, so I'm going to start by 
removing it from the C++ and Python libraries to assist with discussion about 
its fate. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Antoine Pitrou


Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> 
> I'm concerned about continuing to maintain the Column class as it's
> spilling complexity into computational libraries and bindings alike.
> 
> The Python Column class for example mostly forwards method calls to
> the underlying ChunkedArray
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> 
> If the developer wants to construct a Table or insert a new "column",
> Column objects must generally be constructed, leading to boilerplate
> without clear benefit.

We could simply add the desired ChunkedArray-based convenience methods
without removing the Column-based APIs.

I don't know if it's really cumbersome to maintain the Column class.
It's generally a very stable part of the API, and the Column class is
just a thin wrapper over a ChunkedArray + a field.

Regards

Antoine.


[jira] [Created] (ARROW-5892) [C++][Gandiva] Support function aliases

2019-07-09 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-5892:
---

 Summary: [C++][Gandiva] Support function aliases
 Key: ARROW-5892
 URL: https://issues.apache.org/jira/browse/ARROW-5892
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


This allows linking of several external names to the same precompiled function.
For example, 'mod' and 'modulo' can both be used to access the mod function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5891) [C++][Gandiva] Remove duplicates in function registries

2019-07-09 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-5891:
---

 Summary: [C++][Gandiva] Remove duplicates in function registries
 Key: ARROW-5891
 URL: https://issues.apache.org/jira/browse/ARROW-5891
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Prudhvi Porandla


Each precompiled function should have at most one "NativeFunction" entry in the 
registry. Also add a unit test which checks that there are no duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5890) [C++][Python] Support ExtensionType arrays in more kernels

2019-07-09 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5890:


 Summary: [C++][Python] Support ExtensionType arrays in more kernels
 Key: ARROW-5890
 URL: https://issues.apache.org/jira/browse/ARROW-5890
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


From a quick test (through Python), it seems that {{slice}} and {{take}} work, 
but the following do not:

- {{cast}}: it could rely on the casting rules for the storage type. Or do we 
want to require explicitly taking the storage array before casting?
- {{dictionary_encode}} / {{unique}}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5889) [Python][C++] Parquet backwards compat for timestamps without timezone broken

2019-07-09 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-5889:
-

 Summary: [Python][C++] Parquet backwards compat for timestamps 
without timezone broken
 Key: ARROW-5889
 URL: https://issues.apache.org/jira/browse/ARROW-5889
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Florian Jetter
 Attachments: 0.12.1.parquet, 0.13.0.parquet

When reading a parquet file which has timestamp fields, they are read as 
timestamps with timezone UTC if the parquet file was written by pyarrow 0.13.0 
and/or 0.12.1.

Expected behavior would be that they are loaded as timestamps without any 
timezone information.

The attached files contain one row for all basic types and a few nested types; 
the timestamp fields are called datetime64 and datetime64_tz.

see also 
[https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat]

[https://github.com/JDASoftwareGroup/kartothek/blob/c47e52116e2dc726a74d7d6b97922a0252722ed0/tests/serialization/test_arrow_compat.py#L31]
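A minimal reproduction sketch along the lines of the report, assuming one of the 
attached files is available locally:

{code:python}
import pyarrow.parquet as pq

# One of the attached files, written by an older pyarrow
table = pq.read_table("0.13.0.parquet")

# On 0.14.0 this reports a timezone-aware type (tz=UTC) for the naive column;
# the expected result is a timestamp type without a timezone.
print(table.schema.field_by_name("datetime64").type)
{code}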

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5888) [Python][C++] Parquet write metadata not roundtrip safe for timezone timestamps

2019-07-09 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-5888:
-

 Summary: [Python][C++] Parquet write metadata not roundtrip safe 
for timezone timestamps
 Key: ARROW-5888
 URL: https://issues.apache.org/jira/browse/ARROW-5888
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Florian Jetter


The timezone is not roundtrip safe for timezones other than UTC when storing to 
parquet. Expected behavior would be that the timezone is properly reconstructed.

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema(
    [
        pa.field("no_tz", pa.timestamp('us')),
        pa.field("utc", pa.timestamp('us', tz="UTC")),
        pa.field("europe", pa.timestamp('us', tz="Europe/Berlin")),
    ]
)
buf = pa.BufferOutputStream()
pq.write_metadata(
    schema,
    buf,
    coerce_timestamps="us"
)

pq_bytes = buf.getvalue().to_pybytes()
reader = pa.BufferReader(pq_bytes)
parquet_file = pq.ParquetFile(reader)
parquet_file.schema.to_arrow_schema()
# Output:
# no_tz: timestamp[us]
# utc: timestamp[us, tz=UTC]
# europe: timestamp[us, tz=UTC]
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Francois Saint-Jacques
I'm also +1 on removing this class.

François

On Tue, Jul 9, 2019 at 10:57 AM Uwe L. Korn  wrote:
>
> This sounds fine to me, thus I'm +1 on removing this class.
>
> On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote:
> > Yes, the schema would be the point of truth for the Field. The ChunkedArray
> > type would have to be validated against the schema types as with RecordBatch
> >
> > On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn  wrote:
> >
> > > Hello Wes,
> > >
> > > where do you intend the Field object to live, then? Would this be part of
> > > the schema of the Table object?
> > >
> > > Uwe
> > >
> > > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > For some time now I have been uncertain about the utility provided by
> > > > the arrow::Column C++ class. Fundamentally, it is a container for two
> > > > things:
> > > >
> > > > * An arrow::Field object (name and data type)
> > > > * An arrow::ChunkedArray object for the data
> > > >
> > > > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > > > for the arrow::Table class which represents a collection of
> > > > ChunkedArray objects coming usually from multiple RecordBatches.
> > > > Sometimes a Table will have mostly columns with a single chunk while
> > > > some columns will have many chunks.
> > > >
> > > > I'm concerned about continuing to maintain the Column class as it's
> > > > spilling complexity into computational libraries and bindings alike.
> > > >
> > > > The Python Column class for example mostly forwards method calls to
> > > > the underlying ChunkedArray
> > > >
> > > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > >
> > > > If the developer wants to construct a Table or insert a new "column",
> > > > Column objects must generally be constructed, leading to boilerplate
> > > > without clear benefit.
> > > >
> > > > Since we're discussing building a more significant higher-level
> > > > DataFrame interface per past mailing list discussions, my preference
> > > > would be to consider removing the Column class to make the user- and
> > > > developer-facing data structures simpler. I hate to propose breaking
> > > > API changes, so it may not be practical at this point, but I wanted to
> > > > at least bring up the issue to see if others have opinions after
> > > > working with the library for a few years.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > >
> >


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Wes McKinney
I'll try to spend a little time soon refactoring to see how disruptive
the change would be, and also to help persuade others about the
benefits.

On Tue, Jul 9, 2019 at 9:57 AM Uwe L. Korn  wrote:
>
> This sounds fine to me, thus I'm +1 on removing this class.
>
> On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote:
> > Yes, the schema would be the point of truth for the Field. The ChunkedArray
> > type would have to be validated against the schema types as with RecordBatch
> >
> > On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn  wrote:
> >
> > > Hello Wes,
> > >
> > > where do you intend the Field object to live, then? Would this be part of
> > > the schema of the Table object?
> > >
> > > Uwe
> > >
> > > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > For some time now I have been uncertain about the utility provided by
> > > > the arrow::Column C++ class. Fundamentally, it is a container for two
> > > > things:
> > > >
> > > > * An arrow::Field object (name and data type)
> > > > * An arrow::ChunkedArray object for the data
> > > >
> > > > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > > > for the arrow::Table class which represents a collection of
> > > > ChunkedArray objects coming usually from multiple RecordBatches.
> > > > Sometimes a Table will have mostly columns with a single chunk while
> > > > some columns will have many chunks.
> > > >
> > > > I'm concerned about continuing to maintain the Column class as it's
> > > > spilling complexity into computational libraries and bindings alike.
> > > >
> > > > The Python Column class for example mostly forwards method calls to
> > > > the underlying ChunkedArray
> > > >
> > > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > >
> > > > If the developer wants to construct a Table or insert a new "column",
> > > > Column objects must generally be constructed, leading to boilerplate
> > > > without clear benefit.
> > > >
> > > > Since we're discussing building a more significant higher-level
> > > > DataFrame interface per past mailing list discussions, my preference
> > > > would be to consider removing the Column class to make the user- and
> > > > developer-facing data structures simpler. I hate to propose breaking
> > > > API changes, so it may not be practical at this point, but I wanted to
> > > > at least bring up the issue to see if others have opinions after
> > > > working with the library for a few years.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > >
> >


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Uwe L. Korn
This sounds fine to me, thus I'm +1 on removing this class.

On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote:
> Yes, the schema would be the point of truth for the Field. The ChunkedArray
> type would have to be validated against the schema types as with RecordBatch
> 
> On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn  wrote:
> 
> > Hello Wes,
> >
> > where do you intend the Field object to live, then? Would this be part of
> > the schema of the Table object?
> >
> > Uwe
> >
> > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > For some time now I have been uncertain about the utility provided by
> > > the arrow::Column C++ class. Fundamentally, it is a container for two
> > > things:
> > >
> > > * An arrow::Field object (name and data type)
> > > * An arrow::ChunkedArray object for the data
> > >
> > > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > > for the arrow::Table class which represents a collection of
> > > ChunkedArray objects coming usually from multiple RecordBatches.
> > > Sometimes a Table will have mostly columns with a single chunk while
> > > some columns will have many chunks.
> > >
> > > I'm concerned about continuing to maintain the Column class as it's
> > > spilling complexity into computational libraries and bindings alike.
> > >
> > > The Python Column class for example mostly forwards method calls to
> > > the underlying ChunkedArray
> > >
> > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > >
> > > If the developer wants to construct a Table or insert a new "column",
> > > Column objects must generally be constructed, leading to boilerplate
> > > without clear benefit.
> > >
> > > Since we're discussing building a more significant higher-level
> > > DataFrame interface per past mailing list discussions, my preference
> > > would be to consider removing the Column class to make the user- and
> > > developer-facing data structures simpler. I hate to propose breaking
> > > API changes, so it may not be practical at this point, but I wanted to
> > > at least bring up the issue to see if others have opinions after
> > > working with the library for a few years.
> > >
> > > Thanks
> > > Wes
> > >
> >
>


[jira] [Created] (ARROW-5887) [C#] ArrowStreamWriter writes FieldNodes in wrong order

2019-07-09 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-5887:
---

 Summary: [C#] ArrowStreamWriter writes FieldNodes in wrong order
 Key: ARROW-5887
 URL: https://issues.apache.org/jira/browse/ARROW-5887
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt


When ArrowStreamWriter writes a {{RecordBatch}} containing {{null}}s, it mixes up 
the columns' {{NullCount}} values.

You can see here:

[https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L195-L200]

It writes the fields in {{0}} -> {{fieldCount}} order there, but 
[lower|https://github.com/apache/arrow/blob/90affbd2c41e80aa8c3fac1e4dbff60aafb415d3/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L216-L220]
 it writes them in {{fieldCount}} -> {{0}} order.

Looking at the [Java 
implementation|https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/FBSerializables.java#L36-L44]
 it says
{quote}// struct vectors have to be created in reverse order
{quote}
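For reference, roughly the reverse-order pattern on the Java side (a sketch; the 
helper shape below is assumed for illustration rather than copied from the actual 
writer):

{code:java}
import java.util.List;
import java.util.ListIterator;

import com.google.flatbuffers.FlatBufferBuilder;
import org.apache.arrow.vector.ipc.message.ArrowFieldNode;

// The nodes are collected one per column in schema order, but appended in
// reverse because FlatBuffers builds its vectors back to front; readers then
// see them in schema order again.
static void writeFieldNodes(FlatBufferBuilder builder, List<ArrowFieldNode> nodes) {
  ListIterator<ArrowFieldNode> it = nodes.listIterator(nodes.size());
  while (it.hasPrevious()) {
    it.previous().writeTo(builder);
  }
}
{code}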
 

A simple test of roundtripping the following RecordBatch shows the issue:

 
{code:java}
var result = new RecordBatch(
    new Schema.Builder()
        .Field(f => f.Name("age").DataType(Int32Type.Default))
        .Field(f => f.Name("CharCount").DataType(Int32Type.Default))
        .Build(),
    new IArrowArray[]
    {
        new Int32Array(
            new ArrowBuffer.Builder<int>().Append(0).Build(),
            new ArrowBuffer.Builder<byte>().Append(0).Build(),
            length: 1,
            nullCount: 1,
            offset: 0),
        new Int32Array(
            new ArrowBuffer.Builder<int>().Append(7).Build(),
            ArrowBuffer.Empty,
            length: 1,
            nullCount: 0,
            offset: 0)
    },
    length: 1);
{code}
Here, the "age" column should contain a `null`. However, when you write this 
RecordBatch and read it back, the "CharCount" column has `NullCount` == 1 and 
the "age" column has `NullCount` == 0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5886) [Python][Packaging] Manylinux1/2010 compliance issue with libz

2019-07-09 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5886:
--

 Summary: [Python][Packaging] Manylinux1/2010 compliance issue with libz
 Key: ARROW-5886
 URL: https://issues.apache.org/jira/browse/ARROW-5886
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging, Python
Affects Versions: 0.14.0
Reporter: Krisztian Szucs


So we statically link liblz4 in the manylinux1 wheels
{code}
# ldd pyarrow-manylinux1/libarrow.so.14 | grep z
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7fc28cef4000)
{code}
but dynamically in the manylinux2010 wheels
{code}
# ldd pyarrow-manylinux2010/libarrow.so.14 | grep z
liblz4.so.1 => not found  (already deleted to reproduce the issue)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f56f744)
{code}
This is what this PR resolves.

What I find strange is that auditwheel seems to bundle libz for manylinux1:
{code}
# ls -lah pyarrow-manylinux1/*z*so.*
-rwxr-xr-x 1 root root 115K Jun 29 00:14 
pyarrow-manylinux1/libz-7f57503f.so.1.2.11
{code}
while ldd still uses the system libz:
{code}
# ldd pyarrow-manylinux1/libarrow.so.14 | grep z
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f91fcf3f000)
{code}
For manylinux2010 we also have liblz4:
{code}
#  ls -lah pyarrow-manylinux2010/*z*so.*
-rwxr-xr-x 1 root root 191K Jun 28 23:38 
pyarrow-manylinux2010/liblz4-8cb8bdde.so.1.8.3
-rwxr-xr-x 1 root root 115K Jun 28 23:38 
pyarrow-manylinux2010/libz-c69b9943.so.1.2.11
{code}
and ldd similarly tries to load the system libs:
{code}
# ldd pyarrow-manylinux2010/libarrow.so.14 | grep z
liblz4.so.1 => not found
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7fd72764e000)
{code}

Inspecting manylinux1 with `LD_DEBUG=files,libs ldd libarrow.so.14`, it seems 
to search the right path but cannot find the hashed version of libz, 
`libz-7f57503f.so.1.2.11`:
{code}
   463: file=libz.so.1 [0];  needed by ./libarrow.so.14 [0]
   463: find library=libz.so.1 [0]; searching
   463:  search path=/tmp/pyarrow-manylinux1/.  (RPATH from 
file ./libarrow.so.14)
   463:   trying file=/tmp/pyarrow-manylinux1/./libz.so.1
   463:  search cache=/etc/ld.so.cache
   463:   trying file=/lib/x86_64-linux-gnu/libz.so.1
{code}
There is no `libz.so.1` just `libz-7f57503f.so.1.2.11`.

Similarly for manylinux2010 and libz:
{code}
   470: file=libz.so.1 [0];  needed by ./libarrow.so.14 [0]
   470: find library=libz.so.1 [0]; searching
   470:  search path=/tmp/pyarrow-manylinux2010/.   (RPATH 
from file ./libarrow.so.14)
   470:   trying file=/tmp/pyarrow-manylinux2010/./libz.so.1
   470:  search cache=/etc/ld.so.cache
   470:   trying file=/lib/x86_64-linux-gnu/libz.so.1
{code}
for liblz4 (again, I've deleted the system one):
{code}
   470: file=liblz4.so.1 [0];  needed by ./libarrow.so.14 [0]
   470: find library=liblz4.so.1 [0]; searching
   470:  search path=/tmp/pyarrow-manylinux2010/.   (RPATH 
from file ./libarrow.so.14)
   470:   trying file=/tmp/pyarrow-manylinux2010/./liblz4.so.1
   470:  search cache=/etc/ld.so.cache
   470:  search 
path=/lib/x86_64-linux-gnu/tls/x86_64:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu/x86_64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/tls/x86_64:/usr/lib/x86_64-linux-gnu/tls:/usr/lib/x86_64-linux-gnu/x86_6$
:/usr/lib/x86_64-linux-gnu:/lib/tls/x86_64:/lib/tls:/lib/x86_64:/lib:/usr/lib/tls/x86_64:/usr/lib/tls:/usr/lib/x86_64:/usr/lib
  (system search path)
{code}
There is no `libz.so.1` or `liblz4.so.1`, just `libz-c69b9943.so.1.2.11` and 
`liblz4-8cb8bdde.so.1.8.3`.

According to https://www.python.org/dev/peps/pep-0571/, neither `liblz4` nor `libz` is 
part of the whitelist, and while they are bundled with the wheel they seemingly 
cannot be found - perhaps because of the hash in the library name?

I've tried inspecting the wheels with `auditwheel show` using versions `2` and 
`1.10`; both say the following:

{code}
# auditwheel show pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl

pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl is consistent with
the following platform tag: "linux_x86_64".

The wheel references external versioned symbols in these system-
provided shared libraries: libgcc_s.so.1 with versions {'GCC_3.3',
'GCC_3.4', 'GCC_3.0'}, libpthread.so.0 with versions {'GLIBC_2.3.3',
'GLIBC_2.12', 'GLIBC_2.2.5', 'GLIBC_2.3.2'}, libc.so.6 with versions
{'GLIBC_2.4', 'GLIBC_2.6', 'GLIBC_2.2.5', 'GLIBC_2.7', 'GLIBC_2.3.4',
'GLIBC_2.3.2', 'GLIBC_2.3'}, libstdc++.so.6 with versions
{'CXXABI_1.3', 'GLIBCXX_3.4.10', 'GLIBCXX_3.4.9', 'GLIBCXX_3.4.11',
'GLIBCXX_3.4.5', 'GLIBCXX_3.4', 'CXXABI_1.3.2', 'CXXABI_1.3.3'},
librt.so.1 with versions {'GLIBC_2.2.5'}, libm.so.6 with versions
{'GLIBC_2.2.5'}, libdl.so.2 with versions 

[jira] [Created] (ARROW-5885) Support optional arrow components via extras_require

2019-07-09 Thread George Sakkis (JIRA)
George Sakkis created ARROW-5885:


 Summary: Support optional arrow components via extras_require
 Key: ARROW-5885
 URL: https://issues.apache.org/jira/browse/ARROW-5885
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Reporter: George Sakkis


Since Arrow (and pyarrow) have several independent optional components, instead 
of installing all of them it would be convenient if they could be opted in from 
pip like 

{{pip install pyarrow[gandiva,flight,plasma]}}

or opted out like

{{pip install pyarrow[no-gandiva,no-flight,no-plasma]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-09 Thread Wes McKinney
On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei  wrote:
>
> Hi,
>
> > If the problems can be resolved quickly, I should think we could cut
> > an RC for 0.14.1 by the end of this week. The RC could either be cut
> > from a maintenance branch or out of master -- any thoughts about this
> > (cutting from master is definitely easier)?
>
> How about just releasing 0.15.0 from master?
> It'll be simpler than creating a patch release.
>

I'd be fine with that, too.

>
> Thanks,
> --
> kou
>
> In 
>   "[DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet 
> forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500,
>   Wes McKinney  wrote:
>
> > hi folks,
> >
> > Perhaps unsurprisingly due to the expansion of our Python packages, a
> > number of things are broken in 0.14.0 that we should fix sooner than
> > the next major release. I'll try to send a complete list to this
> > thread to give a status within a day or two. Other problems may arise
> > in the next 48 hours as more people install the package.
> >
> > If the problems can be resolved quickly, I should think we could cut
> > an RC for 0.14.1 by the end of this week. The RC could either be cut
> > from a maintenance branch or out of master -- any thoughts about this
> > (cutting from master is definitely easier)?
> >
> > Would someone (who is not Kou) be able to assist with creating the RC?
> >
> > Thanks,
> > Wes


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Wes McKinney
Yes, the schema would be the point of truth for the Field. The ChunkedArray
type would have to be validated against the schema types as with RecordBatch

On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn  wrote:

> Hello Wes,
>
> where do you intend the Field object to live then? Would this be part of
> the schema of the Table object?
>
> Uwe
>
> On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > hi folks,
> >
> > For some time now I have been uncertain about the utility provided by
> > the arrow::Column C++ class. Fundamentally, it is a container for two
> > things:
> >
> > * An arrow::Field object (name and data type)
> > * An arrow::ChunkedArray object for the data
> >
> > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > for the arrow::Table class which represents a collection of
> > ChunkedArray objects coming usually from multiple RecordBatches.
> > Sometimes a Table will have mostly columns with a single chunk while
> > some columns will have many chunks.
> >
> > I'm concerned about continuing to maintain the Column class as it's
> > spilling complexity into computational libraries and bindings alike.
> >
> > The Python Column class for example mostly forwards method calls to
> > the underlying ChunkedArray
> >
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> >
> > If the developer wants to construct a Table or insert a new "column",
> > Column objects must generally be constructed, leading to boilerplate
> > without clear benefit.
> >
> > Since we're discussing building a more significant higher-level
> > DataFrame interface per past mailing list discussions, my preference
> > would be to consider removing the Column class to make the user- and
> > developer-facing data structures simpler. I hate to propose breaking
> > API changes, so it may not be practical at this point, but I wanted to
> > at least bring up the issue to see if others have opinions after
> > working with the library for a few years.
> >
> > Thanks
> > Wes
> >
>


[jira] [Created] (ARROW-5884) [Java] Fix the get method of StructVector

2019-07-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5884:
---

 Summary: [Java] Fix the get method of StructVector
 Key: ARROW-5884
 URL: https://issues.apache.org/jira/browse/ARROW-5884
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When the data at the specified location is null, there is no need to call the 
super method to set the reader:

holder.isSet = isSet(index);
super.get(index, holder);
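
A minimal sketch of the proposed change (the method shape below is assumed from 
the snippet above, not copied from the actual patch):

{code:java}
import org.apache.arrow.vector.holders.ComplexHolder;

// in StructVector
@Override
public void get(int index, ComplexHolder holder) {
  holder.isSet = isSet(index);
  if (holder.isSet == 0) {
    // null slot: nothing to read, so don't ask super to position a reader
    holder.reader = null;
    return;
  }
  super.get(index, holder);
}
{code}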



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5883) [Java] Support Dictionary Encoding for List type

2019-07-09 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5883:
-

 Summary: [Java] Support Dictionary Encoding for List type
 Key: ARROW-5883
 URL: https://issues.apache.org/jira/browse/ARROW-5883
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As described in 
[http://arrow.apache.org/docs/format/Layout.html#dictionary-encoding], List 
type encoding should be supported.

Currently ListVector#getObject returns an ArrayList implementation whose equals and 
hashCode are already overridden, so it can be used directly as a HashMap key in 
DictionaryEncoder. Since we won't change the dictionary data, using a mutable key 
doesn't seem to matter.
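
A hypothetical usage sketch once List support lands (the method below is an 
assumption about how it could be called, not the final API):

{code:java}
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryEncoder;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;

// Encode a ListVector against a dictionary ListVector; the result is a vector
// of integer indices into the dictionary. A null indexType lets the encoder
// pick the default index width.
static ValueVector encodeList(ListVector values, ListVector dictionaryValues) {
  Dictionary dictionary =
      new Dictionary(dictionaryValues, new DictionaryEncoding(1L, false, null));
  return DictionaryEncoder.encode(values, dictionary);
}
{code}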



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5882) [C++][Gandiva] Throw error if divisor is 0 in integer mod functions

2019-07-09 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-5882:
---

 Summary: [C++][Gandiva] Throw error if divisor is 0 in integer mod 
functions 
 Key: ARROW-5882
 URL: https://issues.apache.org/jira/browse/ARROW-5882
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Prudhvi Porandla


mod_int64_int32 and mod_int64_int64 should throw an error when the divisor is 0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Uwe L. Korn
Hello Wes,

where do you intend the Field object to live then? Would this be part of the 
schema of the Table object?

Uwe

On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> hi folks,
> 
> For some time now I have been uncertain about the utility provided by
> the arrow::Column C++ class. Fundamentally, it is a container for two
> things:
> 
> * An arrow::Field object (name and data type)
> * An arrow::ChunkedArray object for the data
> 
> It was added to the C++ library in ARROW-23 in March 2016 as the basis
> for the arrow::Table class which represents a collection of
> ChunkedArray objects coming usually from multiple RecordBatches.
> Sometimes a Table will have mostly columns with a single chunk while
> some columns will have many chunks.
> 
> I'm concerned about continuing to maintain the Column class as it's
> spilling complexity into computational libraries and bindings alike.
> 
> The Python Column class for example mostly forwards method calls to
> the underlying ChunkedArray
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> 
> If the developer wants to construct a Table or insert a new "column",
> Column objects must generally be constructed, leading to boilerplate
> without clear benefit.
> 
> Since we're discussing building a more significant higher-level
> DataFrame interface per past mailing list discussions, my preference
> would be to consider removing the Column class to make the user- and
> developer-facing data structures simpler. I hate to propose breaking
> API changes, so it may not be practical at this point, but I wanted to
> at least bring up the issue to see if others have opinions after
> working with the library for a few years.
> 
> Thanks
> Wes
>


[jira] [Created] (ARROW-5881) [Java] Provide functionality to efficiently determine whether a validity buffer is all 1 bits or all 0 bits

2019-07-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5881:
---

 Summary: [Java] Provide functionality to efficiently determine 
whether a validity buffer is all 1 bits or all 0 bits
 Key: ARROW-5881
 URL: https://issues.apache.org/jira/browse/ARROW-5881
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


These utilities can be used to efficiently determine, for example (a sketch follows the list below): 
* If all values in a vector are null
* If a vector contains no null
* If a vector contains any valid element
* If a vector contains any invalid element
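
A rough sketch of the intended check (names and signatures below are illustrative 
only, not the final API): compare whole bytes of the validity buffer against 0x00 
(or 0xFF for the all-set case) and only inspect the trailing bits individually.

{code:java}
import io.netty.buffer.ArrowBuf;

/** Returns true if none of the first valueCount validity bits are set, i.e. all values are null. */
static boolean allBitsClear(ArrowBuf validityBuffer, int valueCount) {
  int fullBytes = valueCount / 8;
  // whole bytes can be compared against 0x00 in one shot
  for (int i = 0; i < fullBytes; i++) {
    if (validityBuffer.getByte(i) != 0) {
      return false;
    }
  }
  // check the trailing partial byte bit by bit
  for (int bit = fullBytes * 8; bit < valueCount; bit++) {
    if ((validityBuffer.getByte(bit / 8) & (1 << (bit % 8))) != 0) {
      return false;
    }
  }
  return true;
}
{code}

The all-1-bits variant is symmetric, comparing whole bytes against 0xFF instead of 0x00.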



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)