[jira] [Created] (ARROW-5635) Support "compacting" a table

2019-06-17 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5635:


 Summary: Support "compacting" a table
 Key: ARROW-5635
 URL: https://issues.apache.org/jira/browse/ARROW-5635
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Zhuo Peng


A column in a table might consist of multiple chunks. I'm proposing a 
Table.Compact() method that returns a table whose columns each consist of just 
one chunk, which is the concatenation of the corresponding column's chunks.
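
A minimal sketch of the intended semantics, in plain Python rather than the Arrow C++ API. The dict-of-chunk-lists representation and the name compact() are assumptions made only for illustration; they are not part of Arrow.

```python
from itertools import chain


def compact(table):
    """Return a new table in which every column holds exactly one chunk.

    `table` is a hypothetical stand-in for an Arrow table: a dict that maps
    a column name to a list of chunks, where each chunk is a list of values.
    """
    return {
        name: [list(chain.from_iterable(chunks))]
        for name, chunks in table.items()
    }


table = {"a": [[1, 2], [3]], "b": [["x"], ["y", "z"]]}
compacted = compact(table)
print(compacted)
```

The input table is left unchanged; each output column is a single newly built chunk.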

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss][Format][Java] Finalizing Union Types

2019-06-17 Thread Wes McKinney
example sparse union:

types: (int64, utf8)
type_ids: [0, 4]

type buffer: [0, 0, 0, 4, 4, 4]

child 0: [1, 2, 3, --, --, --]
child 1: [--, --, --, 'foo', 'bar', 'baz']

example dense union:

types: (int64, utf8)
type_ids: [0, 4]

type buffer: [0, 0, 0, 4, 4, 4]
offsets buffer: [0, 1, 2, 0, 1, 2]

child 0: [1, 2, 3]
child 1: ['foo', 'bar', 'baz']
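
The two layouts above can be decoded with a short plain-Python sketch. It assumes the reading in which type_ids[physical child index] is the id that appears in the type buffer; that reading is consistent with the examples here but is still an assumption. decode_union is a hypothetical helper, not an Arrow function.

```python
def decode_union(type_ids, type_buffer, children, offsets=None):
    """Resolve every slot of a union array to its logical value.

    type_ids[i] is assumed to be the id used in the type buffer for child i.
    Pass `offsets` for a dense union and leave it as None for a sparse union.
    """
    # Invert the indirection: type id -> physical child index.
    child_index = {type_id: i for i, type_id in enumerate(type_ids)}
    values = []
    for slot, type_id in enumerate(type_buffer):
        child = children[child_index[type_id]]
        # Sparse: every child is as long as the union, so index by slot.
        # Dense: children are packed, so index by the per-slot offset.
        position = slot if offsets is None else offsets[slot]
        values.append(child[position])
    return values


sparse = decode_union(
    type_ids=[0, 4],
    type_buffer=[0, 0, 0, 4, 4, 4],
    children=[[1, 2, 3, None, None, None],
              [None, None, None, "foo", "bar", "baz"]],
)
dense = decode_union(
    type_ids=[0, 4],
    type_buffer=[0, 0, 0, 4, 4, 4],
    children=[[1, 2, 3], ["foo", "bar", "baz"]],
    offsets=[0, 1, 2, 0, 1, 2],
)
print(sparse)
print(dense)
```

Both calls resolve to the same logical sequence; only the child storage differs.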

On Mon, Jun 17, 2019 at 3:50 PM Antoine Pitrou  wrote:
>
>
> On 17/06/2019 at 22:46, Wes McKinney wrote:
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88
> >
> > "optionally typeIds provides an indirection between the child offset
> > and the type id for each child typeIds[offset] is the id used in the
> > type vector"
>
> Does this mean typeIds[logical_id] gives the physical child index?
> or typeIds[physical_index] gives the logical type id?
>
> I don't understand what "offset" means in this context, usually it's a
> displacement from the beginning of an array.

Re: [Discuss][Format][Java] Finalizing Union Types

2019-06-17 Thread Antoine Pitrou


On 17/06/2019 at 22:46, Wes McKinney wrote:
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88
> 
> "optionally typeIds provides an indirection between the child offset
> and the type id for each child typeIds[offset] is the id used in the
> type vector"

Does this mean typeIds[logical_id] gives the physical child index?
or typeIds[physical_index] gives the logical type id?

I don't understand what "offset" means in this context, usually it's a
displacement from the beginning of an array.



Re: New CI system: Ursabot

2019-06-17 Thread Krisztián Szűcs
That's right, OWNER, MEMBER and CONTRIBUTOR roles are allowed:

CONTRIBUTOR: Author has previously committed to the repository.
MEMBER: Author is a member of the organization that owns the repository.
OWNER: Author is the owner of the repository.

See https://developer.github.com/v4/enum/commentauthorassociation/
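
As a rough illustration of that rule, here is a sketch of an allow-list check on the comment author's association. The function and constant names are hypothetical, and this is not how Ursabot itself is necessarily implemented.

```python
# Associations allowed to trigger builds, per the list above.
ALLOWED_ASSOCIATIONS = {"OWNER", "MEMBER", "CONTRIBUTOR"}


def may_trigger_build(author_association):
    """Return True if a comment author with this association may trigger a build."""
    return author_association in ALLOWED_ASSOCIATIONS


print(may_trigger_build("CONTRIBUTOR"))
print(may_trigger_build("NONE"))
```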

On Mon, Jun 17, 2019 at 3:16 PM Wes McKinney  wrote:

> On Mon, Jun 17, 2019 at 7:25 AM Krisztián Szűcs
>  wrote:
> >
> > On Sun, Jun 16, 2019 at 6:17 AM Micah Kornfield 
> > wrote:
> >
> > > Hi Krisztian,
> > > This is really cool, thank you for doing this.   Two questions:
> > > 1.  How reliable is the build setup? Is it reliable enough at this point to
> > > be considered a merge blocker if a build fails?
> > >
> >  IMO yes.
> >
> > > 2.  What is the permission model for triggering runs?  Is it open to
> > > anybody on github?  Only Ursalab members?  Committers?
> > >
> > Most of the builders are automatically triggered on each commit.
> > Specific control buttons are available for ursalabs members at the moment,
> > but I can grant access to other organizations (e.g. apache) and individual
> > members.
> >
>
> You're talking about the Buildbot UI here? Suffice to say if any CI
> system is going to be depended on for decision-making, then any
> _contributor_ needs to be able to trigger runs. It seems that
> presently any contributor can trigger builds from GitHub comments, is
> that right?
>
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Fri, Jun 14, 2019 at 2:30 PM Antoine Pitrou 
> wrote:
> > >
> > > >
> > > > On 14/06/2019 at 23:22, Krisztián Szűcs wrote:
> > > > >>
> > > > >> * Do machines have to be co-located on the same physical network as
> > > > >> the master, or can they reside in other locations?
> > > > >>
> > > > > It is preferable to have a master in the same network where the workers
> > > > > are, because the build steps are rpc calls made by the master.
> > > >
> > > > I'm unaware that this is a problem.
> > > > CPython has build workers all over the world (contributed by
> volunteers)
> > > > connected to a single build master.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > >
>


Re: [Discuss][Format][Java] Finalizing Union Types

2019-06-17 Thread Wes McKinney
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88

"optionally typeIds provides an indirection between the child offset
and the type id for each child typeIds[offset] is the id used in the
type vector"

On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman  wrote:
>
> Somewhat related:
>
> Could we clarify the expected content of the type_ids buffer of union
> arrays? Layout.rst seems to indicate these should be indices of the
> corresponding child array, but the C++ implementation allows them to be any
> positive int8 and maintains a mapping to child indices. (For example, see what
> is generated for integration testing: it has two child arrays and type_ids 5, 10.)

[jira] [Created] (ARROW-5634) [C#] ArrayData.NullCount should be a property

2019-06-17 Thread Prashanth Govindarajan (JIRA)
Prashanth Govindarajan created ARROW-5634:
-

 Summary: [C#] ArrayData.NullCount should be a property 
 Key: ARROW-5634
 URL: https://issues.apache.org/jira/browse/ARROW-5634
 Project: Apache Arrow
  Issue Type: Task
  Components: C#
Reporter: Prashanth Govindarajan


ArrayData.NullCount should be a property so that it can be computed when 
necessary. For example, after Slice(), NullCount is -1 and needs to be computed.
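
The idea can be sketched in Python (the issue itself concerns the C# library). The class below is a hypothetical stand-in for ArrayData that keeps validity as a list of booleans; it only illustrates computing the count lazily when the stored value is the -1 sentinel.

```python
class ArrayData:
    """Hypothetical stand-in for ArrayData with a lazily computed null count."""

    UNKNOWN = -1  # sentinel meaning "not computed yet"

    def __init__(self, validity, offset=0, length=None, null_count=UNKNOWN):
        self._validity = validity  # list of bools, True means the slot is valid
        self.offset = offset
        self.length = len(validity) - offset if length is None else length
        self._null_count = null_count

    @property
    def null_count(self):
        # Compute on first access and cache the result.
        if self._null_count == self.UNKNOWN:
            window = self._validity[self.offset:self.offset + self.length]
            self._null_count = sum(1 for valid in window if not valid)
        return self._null_count

    def slice(self, offset, length):
        # The slice shares the validity data and leaves its count unknown.
        return ArrayData(self._validity, self.offset + offset, length)


data = ArrayData([True, False, True, False, False])
sliced = data.slice(1, 3)
print(data.null_count, sliced.null_count)
```

With a property, callers never observe the -1 sentinel; the count is computed on first access.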





Re: [Discuss][Format][Java] Finalizing Union Types

2019-06-17 Thread Ben Kietzman
Somewhat related:

Could we clarify the expected content of the type_ids buffer of union
arrays? Layout.rst seems to indicate these should be indices of the
corresponding child array, but the C++ implementation allows them to be any
positive int8 and maintains a mapping to child indices. (For example, see what
is generated for integration testing: it has two child arrays and type_ids 5, 10.)


[jira] [Created] (ARROW-5633) [Python] Enable bz2 in wheels

2019-06-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5633:
-

 Summary: [Python] Enable bz2 in wheels
 Key: ARROW-5633
 URL: https://issues.apache.org/jira/browse/ARROW-5633
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Antoine Pitrou


This would allow e.g. reading bz2-compressed CSV or JSON files.
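
For context, this is what reading a bz2-compressed CSV looks like with only the Python standard library. It is a sketch that does not use pyarrow; the sample data is made up for illustration.

```python
import bz2
import csv
import io

# Made-up sample data, compressed in memory.
raw = "a,b\n1,2\n3,4\n".encode("utf-8")
compressed = bz2.compress(raw)

# bz2.open accepts a file object; text mode lets csv.reader consume it directly.
with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```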





[jira] [Created] (ARROW-5632) [Doc] Add some documentation describing compile/debug workflow on macOS with Xcode IDE

2019-06-17 Thread Hatem Helal (JIRA)
Hatem Helal created ARROW-5632:
--

 Summary: [Doc] Add some documentation describing compile/debug 
workflow on macOS with Xcode IDE
 Key: ARROW-5632
 URL: https://issues.apache.org/jira/browse/ARROW-5632
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Hatem Helal
Assignee: Hatem Helal








[jira] [Created] (ARROW-5631) [C++] CMake 3.2 build is broken

2019-06-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5631:
---

 Summary: [C++] CMake 3.2 build is broken
 Key: ARROW-5631
 URL: https://issues.apache.org/jira/browse/ARROW-5631
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.14.0


See log

https://gist.github.com/wesm/53147d8d672aa850f49eaeb3c3f24630





[jira] [Created] (ARROW-5630) [Python] Table of nested arrays doesn't round trip

2019-06-17 Thread Philip Felton (JIRA)
Philip Felton created ARROW-5630:


 Summary: [Python] Table of nested arrays doesn't round trip
 Key: ARROW-5630
 URL: https://issues.apache.org/jira/browse/ARROW-5630
 Project: Apache Arrow
  Issue Type: Bug
 Environment: pyarrow 0.13, Windows 10
Reporter: Philip Felton


This is pyarrow 0.13 on Windows.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def make_table(num_rows):
    typ = pa.list_(pa.field("item", pa.float32(), False))
    return pa.Table.from_arrays([
        pa.array([[0] * (i%10) for i in range(0, num_rows)], type=typ),
        pa.array([[0] * ((i+5)%10) for i in range(0, num_rows)], type=typ)
    ], ['a', 'b'])

pq.write_table(make_table(100), 'test.parquet')

pq.read_table('test.parquet')
{code}

The last line throws the following exception:


{noformat}
---
ArrowInvalid  Traceback (most recent call last)
 in 
> 1 pq.read_table('full.parquet')

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, columns, 
use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
   1150 return fs.read_parquet(path, columns=columns,
   1151use_threads=use_threads, 
metadata=metadata,
-> 1152use_pandas_metadata=use_pandas_metadata)
   1153 
   1154 pf = ParquetFile(source, metadata=metadata)

~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, path, 
columns, metadata, schema, use_threads, use_pandas_metadata)
179  filesystem=self)
180 return dataset.read(columns=columns, use_threads=use_threads,
--> 181 use_pandas_metadata=use_pandas_metadata)
182 
183 def open(self, path, mode='rb'):

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
use_threads, use_pandas_metadata)
   1012 table = piece.read(columns=columns, use_threads=use_threads,
   1013partitions=self.partitions,
-> 1014use_pandas_metadata=use_pandas_metadata)
   1015 tables.append(table)
   1016 

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
use_threads, partitions, open_file_func, file, use_pandas_metadata)
562 table = reader.read_row_group(self.row_group, **options)
563 else:
--> 564 table = reader.read(**options)
565 
566 if len(self.partition_keys) > 0:

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, 
use_threads, use_pandas_metadata)
212 columns, use_pandas_metadata=use_pandas_metadata)
213 return self.reader.read_all(column_indices=column_indices,
--> 214 use_threads=use_threads)
215 
216 def scan_contents(self, columns=None, batch_size=65536):

~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in 
pyarrow._parquet.ParquetReader.read_all()

~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
{noformat}






Re: [Discuss][Format][Java] Finalizing Union Types

2019-06-17 Thread Micah Kornfield
Sounds good.  Sorry I got distracted with some other stuff but should be
getting back to this soonish

On Monday, June 17, 2019, Wes McKinney  wrote:

> I'd already moved the Union issues to 1.0.0 so we are all good there
>
> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney  wrote:
> >
> > I'm also +1 for generalized unions as we currently have specified. The
> > objections from the Java users seems to be mostly on the basis of
> > performance in the union-of-primitives case -- that's an
> > implementation specific issue, so if Java needs to have a
> > "GeneralizedDenseUnionVector" or something to handle the
> > union-of-anything case, then that seems reasonable. The important
> > thing is that the binary protocol itself and serialized metadata is
> > something that we are happy with and won't need to change going
> > forward.
> >
> > It seems we're getting a bit long in the tooth to get this into 0.14.0
> > so I'm going to move the Union-related issues to the 1.0.0 milestone
> > so we can get this resolved for the 1.0.0 release
> >
> > Thanks
> > Wes
> >
> > On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura 
> wrote:
> > >
> > > On Sat, May 25, 2019 at 12:29 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the responses, I've clipped the questions and provided responses
> > > > inline.
> > > >
> > > > > is the proposal that both cpp & java will support only option 2?
> > > > > I guess 1 is a subset of 2 anyway.
> > > >
> > > > CPP already supports option 2.  I would like to make CPP and java
> > > > compatible, in a way that is acceptable for Java maintainers.  Yes, 1 is
> > > > a subset of 2.
> > > >
> > > > > The metadata on java side uses the minor type id as the type identifier in
> > > > > the union (and the field name is expected to be the same as the type name
> > > > > in a union). If you were to support a generalized union, this wouldn't
> > > > > work. How will the type identifiers be generated?
> > > > > I'm trying to see if we can make the change backward compatible with
> > > > > existing unions in java.
> > > >
> > > >
> > > > Looking at the code, I don't think the existing Union class is
> > > > generalizable because of this assumption (it caches a single type of each
> > > > locally) and based on the Javadoc this seems to be for performance reasons,
> > > > so I would like to try to avoid touching it if possible.
> > > >
> > >
> > > @Micah Kornfield  sorry, I wasn't clear. I meant to
> > > ask if the format would be backward compatible, which I think it will be
> > > (since 1 is a subset of 2, and your proposal isn't making a change in the
> > > wire format).
> > >
> > > I'm fine if the APIs are not backward compatible. Or, once we have 2, we
> > > can add wrappers for 1, if required.
> > >
> > >
> > > >
> > > > My original thinking was to try to factor out a common base class from
> > > > UnionVector, then create a new GeneralizedUnionVector class that has
> > > > slightly different method signatures (it's possible I will need additional
> > > > supporting classes like a new GeneralizedUnionWriter, but I haven't gotten
> > > > that far yet).  The main challenge I see is a way to let users switch
> > > > between the two implementations.  Some thoughts off the top of my head
> > > > (none of them seem good):
> > > > 1.  Create a flag like:
> > > >
> > > > https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2ef465d331fc/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> > > > so it is statically decided beforehand, and have the new class implement the
> > > > same signatures as UnionVector but throw an exception if a method that
> > > > isn't compatible is called.
> > > > 2. Possibly try to augment ArrowType to pass through information about its
> > > > children vectors when reading vectors, but use the flag in option 1 if it
> > > > can't be determined.
> > > >
> > > > I'm open to suggestions.  I'm also happy to try to prototype something and
> > > > get feedback once there is concrete code to evaluate.
> > > >
> > > > > I don't understand the limitation to different types, so +1 for
> > > > > generalized unions.  That said, I don't think it's high-priority either.
> > > >
> > > >
> > > > Antoine, the fact that it isn't high-priority probably is why it has taken
> > > > so long to resolve.  I'm excited to get to more interesting higher priority
> > > > work, but I would like to see some of the basics finished off first.  BTW,
> > > > if you have suggestions on priorities, I'd be happy to hear them.  After
> > > > this, handling empty record batches, and the UBSan work I'm doing, I was
> > > > thinking of either trying to get Avro support in, or work on fit and finish
> > > > items for the C++/Python CSV reader.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > > > [2]
> 

Re: [Discuss][Format][Java] Finalizing Union Types

2019-06-17 Thread Wes McKinney
I'd already moved the Union issues to 1.0.0 so we are all good there

On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney  wrote:
>
> I'm also +1 for generalized unions as we currently have specified. The
> objections from the Java users seem to be mostly on the basis of
> performance in the union-of-primitives case -- that's an
> implementation specific issue, so if Java needs to have a
> "GeneralizedDenseUnionVector" or something to handle the
> union-of-anything case, then that seems reasonable. The important
> thing is that the binary protocol itself and serialized metadata are
> something that we are happy with and won't need to change going
> forward.
>
> It seems we're getting a bit long in the tooth to get this into 0.14.0
> so I'm going to move the Union-related issues to the 1.0.0 milestone
> so we can get this resolved for the 1.0.0 release.
>
> Thanks
> Wes
>
> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura  
> wrote:
> >
> > On Sat, May 25, 2019 at 12:29 PM Micah Kornfield 
> > wrote:
> >
> > > Thanks for the responses, I've clipped the questions and provided responses
> > > inline.
> > >
> > > Is the proposal that both cpp & java will support only option 2?
> > > > I guess 1 is a subset of 2 anyway.
> > >
> > > CPP already supports option 2.  I would like to make CPP and java
> > > compatible, in a way that is acceptable for Java maintainers.  Yes, 1 is
> > > a subset of 2.
> > >
> > > The metadata on java side uses the minor type id as the type identifier in
> > > > the union (and the field name is expected to be the same as the type name
> > > > in a union). If you were to support a generalized union, this wouldn't
> > > > work. How will the type identifiers be generated?
> > > > I'm trying to see if we can make the change backward compatible, with
> > > > existing unions in java.
> > >
> > >
> > > Looking at the code, I don't think the existing Union class is
> > > generalizable because of this assumption (it caches a single type of each
> > > locally) and based on the Javadoc this seems to be for performance reasons,
> > > so I would like to try to avoid touching it if possible.
> > >
> >
> > @Micah Kornfield  sorry, I wasn't clear. I meant to
> > ask if the format would be backward compatible, which I think it will be
> > (since 1 is a subset of 2, and your proposal isn't making a change in the
> > wire format).
> >
> > I'm fine if the APIs are not backward compatible. Or, once we have 2, we
> > can add wrappers for 1, if required.
> >
> >
> > >
> > > My original thinking was to try to factor out a common base class from
> > > UnionVector, then create a new GeneralizedUnionVector class that has
> > > slightly different method signatures (it's possible I will need additional
> > > supporting classes like a new GeneralizedUnionWriter, but I haven't gotten
> > > that far yet).  The main challenge I see is a way to let users switch
> > > between the two implementations.  Some thoughts off the top of my head
> > > (none of them seem good):
> > > 1.  Create a flag like:
> > >
> > > https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2ef465d331fc/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> > > so it is statically decided beforehand, and have the new class implement the
> > > same signatures as UnionVector but throw an exception if a method that
> > > isn't compatible is called.
> > > 2. Possibly try to augment ArrowType to pass through information about its
> > > children vectors when reading vectors, but use the flag in option 1 if it
> > > can't be determined.
> > >
> > > I'm open to suggestions.  I'm also happy to try to prototype something and
> > > get feedback once there is concrete code to evaluate.
> > >
> > > I don't understand the limitation to different types, so +1 for
> > > > generalized unions.  That said, I don't think it's high-priority either.
> > >
> > >
> > > Antoine, the fact that it isn't high-priority probably is why it has taken
> > > so long to resolve.  I'm excited to get to more interesting higher-priority
> > > work, but I would like to see some of the basics finished off first.  BTW,
> > > if you have suggestions on priorities, I'd be happy to hear them.  After
> > > this, handling empty record batches, and the UBSan work I'm doing, I was
> > > thinking of either trying to get Avro support in, or working on fit and finish
> > > items for the C++/Python CSV reader.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > > [2]
> > >
> > > https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L104
> > >
> > > On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou  wrote:
> > >
> > > >
> > > > I don't understand the limitation to different types, so +1 for
> > > > generalized unions.  That said, I don't think it's high-priority 
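
For readers following the thread: option 1 in the quoted message (a flag that is decided statically, with the new class keeping the same signatures but throwing when an incompatible method is called) could look roughly like the sketch below. Everything in it is hypothetical and only illustrates the pattern: the property name arrow.union.generalized, the UnionLike interface, and both implementations are stand-ins invented here, not Arrow's UnionVector API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical common surface shared by both union flavours. */
interface UnionLike {
  /** Registers a child of the given type and returns its child index. */
  int addChild(String typeName);

  /** Looks a child up by type name; only meaningful when types are unique. */
  int childByTypeName(String typeName);
}

/** Stand-in for the existing union: at most one child per type. */
final class LegacyUnion implements UnionLike {
  private final Map<String, Integer> childByType = new LinkedHashMap<>();

  @Override
  public int addChild(String typeName) {
    // Re-adding a type returns the existing child instead of creating a second one.
    Integer existing = childByType.get(typeName);
    if (existing != null) {
      return existing;
    }
    int index = childByType.size();
    childByType.put(typeName, index);
    return index;
  }

  @Override
  public int childByTypeName(String typeName) {
    Integer index = childByType.get(typeName);
    if (index == null) {
      throw new IllegalArgumentException("no child of type " + typeName);
    }
    return index;
  }
}

/** Stand-in for a generalized union: several children may share a type. */
final class GeneralizedUnion implements UnionLike {
  private final List<String> childTypes = new ArrayList<>();

  @Override
  public int addChild(String typeName) {
    // Every call appends a new child, even if the type is already present.
    childTypes.add(typeName);
    return childTypes.size() - 1;
  }

  @Override
  public int childByTypeName(String typeName) {
    // Same signature as the legacy class, but the lookup is ambiguous here.
    throw new UnsupportedOperationException(
        "lookup by type name is ambiguous in a generalized union");
  }
}

/** Chooses the implementation once, at class-load time. */
final class UnionImplementation {
  /** Hypothetical system property; not an existing Arrow setting. */
  static final String PROPERTY = "arrow.union.generalized";

  static final boolean GENERALIZED =
      Boolean.parseBoolean(System.getProperty(PROPERTY, "false"));

  private UnionImplementation() {}

  /** Uses the statically decided flag. */
  static UnionLike create() {
    return create(GENERALIZED);
  }

  /** Explicit overload, mainly so both paths can be exercised in tests. */
  static UnionLike create(boolean generalized) {
    return generalized ? new GeneralizedUnion() : new LegacyUnion();
  }
}
```

The sketch also shows the cost the message hints at: callers written against the legacy behaviour only discover the incompatibility at run time, when childByTypeName throws.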

[jira] [Created] (ARROW-5627) [R] split code generation into mock and implementation

2019-06-17 Thread JIRA
Romain François created ARROW-5627:
--

 Summary: [R] split code generation into mock and implementation
 Key: ARROW-5627
 URL: https://issues.apache.org/jira/browse/ARROW-5627
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Romain François


https://github.com/apache/arrow/pull/4575#discussion_r294304984



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: New CI system: Ursabot

2019-06-17 Thread Wes McKinney
On Mon, Jun 17, 2019 at 7:25 AM Krisztián Szűcs
 wrote:
>
> On Sun, Jun 16, 2019 at 6:17 AM Micah Kornfield 
> wrote:
>
> > Hi Krisztian,
> > This is really cool, thank you for doing this.   Two questions:
> > 1.  How reliable is the build setup? Is it reliable enough at this point to
> > be considered a merge blocker if a build fails?
> >
>  IMO yes.
>
> > 2.  What is the permission model for triggering runs?  Is it open to
> > anybody on github?  Only Ursalab members?  Committers?
> >
> Most of the builders are automatically triggered on each commit.
> Specific control buttons are available for ursalabs members at the moment,
> but I can grant access to other organizations (e.g. apache) and individual
> members.
>

You're talking about the Buildbot UI here? Suffice to say if any CI
system is going to be depended on for decision-making, then any
_contributor_ needs to be able to trigger runs. It seems that
presently any contributor can trigger builds from GitHub comments, is
that right?

> >
> > Thanks,
> > Micah
> >
> > On Fri, Jun 14, 2019 at 2:30 PM Antoine Pitrou  wrote:
> >
> > >
> > > Le 14/06/2019 à 23:22, Krisztián Szűcs a écrit :
> > > >>
> > > >> * Do machines have to be co-located on the same physical network as
> > > >> the master, or can they reside in other locations?
> > > >>
> > > > It is preferable to have a master in the same network where the workers are,
> > > > because the build steps are rpc calls made by the master.
> > >
> > > I'm unaware that this is a problem.
> > > CPython has build workers all over the world (contributed by volunteers)
> > > connected to a single build master.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >


[jira] [Created] (ARROW-5626) [C

2019-06-17 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5626:
-

 Summary: [C
 Key: ARROW-5626
 URL: https://issues.apache.org/jira/browse/ARROW-5626
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Pindikura Ravindra






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5625) [R] convert Array of struct type to data frame columns

2019-06-17 Thread JIRA
Romain François created ARROW-5625:
--

 Summary: [R] convert Array of struct type to data frame columns
 Key: ARROW-5625
 URL: https://issues.apache.org/jira/browse/ARROW-5625
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Romain François






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)