[jira] [Created] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility

2018-07-12 Thread Yuliya Feldman (JIRA)
Yuliya Feldman created ARROW-2843:
-

 Summary: [Format/C++/Java] Removing field layout from Schema.fbs 
breaks backward compatibility
 Key: ARROW-2843
 URL: https://issues.apache.org/jira/browse/ARROW-2843
 Project: Apache Arrow
  Issue Type: Bug
  Components: Format
Affects Versions: 0.9.0, 0.8.0
Reporter: Yuliya Feldman
Assignee: Yuliya Feldman


The change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785]
breaks backward compatibility with schemas generated by prior versions of
Arrow, because the VectorLayout field was removed from the Field table.

Removing a field outright is definitely not supported behavior in FlatBuffers;
fields can only be deprecated.

I suggest putting the layout field back into the Field table and marking it
"deprecated":
{code:java}
table Field {
  // ... existing fields unchanged ...
  layout: [ VectorLayout ] (deprecated);
}
{code}
This will also require putting back the
{code:java}
table VectorLayout
{code}
definition, which can have all of its fields marked deprecated as well.

The code generation footprint will be minimal in this case, since FlatBuffers
won't generate getters for deprecated fields, but it will genuinely solve the
backward compatibility issues.

It may also be a good idea to introduce a "version" field into Schema.

 

 

 





[jira] [Created] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-12 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2842:
-

 Summary: [Python] Cannot read parquet files with row group size of 
1 From HDFS
 Key: ARROW-2842
 URL: https://issues.apache.org/jira/browse/ARROW-2842
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Gruener
 Attachments: single-row.parquet

This might be a bug in parquet-cpp; I need to spend a bit more time tracking
it down. Basically, given a file with a single row on HDFS, reading it with
pyarrow yields this error:

```
TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from "10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&)
 @ parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr const&, std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect()  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```
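For context, `parquet::SerializedFile::ParseMetaData()` starts by reading the
8-byte tail of the file (a 4-byte little-endian footer length followed by the
`PAR1` magic), which lines up with the "Read 8 bytes failed" frame in the
trace above. A minimal sketch of that tail read in Python (illustration only,
not the parquet-cpp code):

```
import struct

def read_footer_length(f, file_size):
    # Parquet files end with: <footer bytes><4-byte LE footer length><b"PAR1">
    f.seek(file_size - 8)
    footer_len, magic = struct.unpack('<i4s', f.read(8))
    if magic != b'PAR1':
        raise ValueError('not a parquet file')
    return footer_len
```

If the HDFS stream reports end-of-stream while serving that 8-byte read on a
very small file, the failure would surface exactly in this code path.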

 

I am working on writing a unit test for this.





Re: [Go] Go failures on Travis-CI

2018-07-12 Thread Wes McKinney
I opened https://issues.apache.org/jira/browse/ARROW-2841. Thanks!

On Thu, Jul 12, 2018 at 2:36 PM, Sebastien Binet  wrote:
> I'll have a look tomorrow (Paris time).
> It looks like a GOPATH issue.
>
> sent from my droid
>
> On Thu, Jul 12, 2018, 20:52 Antoine Pitrou  wrote:
>
>>
>> Hello,
>>
>> I'm getting persistent failures in the Go job on Travis-CI:
>> https://travis-ci.org/pitrou/arrow/jobs/403221354
>>
>> Is this expected?  Excerpt:
>>
>> """
>> $ go get -t -v ./...
>> github.com/apache/arrow (download)
>> github.com/stretchr/testify (download)
>> go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed
>> go/arrow/array/array.go:23:2: use of internal package not allowed
>> go/arrow/array/array.go:24:2: use of internal package not allowed
>> go/arrow/math/math_amd64.go:22:2: use of internal package not allowed
>> go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed
>> go/arrow/memory/buffer.go:22:2: use of internal package not allowed
>> The command "eval go get -t -v ./... " failed. Retrying, 2 of 3.
>> """
>>
>> Regards
>>
>> Antoine.
>>


[jira] [Created] (ARROW-2841) [Go] Fix recent Go build failures in Travis CI

2018-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2841:
---

 Summary: [Go] Fix recent Go build failures in Travis CI
 Key: ARROW-2841
 URL: https://issues.apache.org/jira/browse/ARROW-2841
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Antoine Pitrou
 Fix For: 0.10.0


I'm getting persistent failures in the Go job on Travis-CI:
https://travis-ci.org/pitrou/arrow/jobs/403221354

Is this expected?  Excerpt:

{code}
$ go get -t -v ./...
github.com/apache/arrow (download)
github.com/stretchr/testify (download)
go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed
go/arrow/array/array.go:23:2: use of internal package not allowed
go/arrow/array/array.go:24:2: use of internal package not allowed
go/arrow/math/math_amd64.go:22:2: use of internal package not allowed
go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed
go/arrow/memory/buffer.go:22:2: use of internal package not allowed
The command "eval go get -t -v ./... " failed. Retrying, 2 of 3.
{code}





Re: Help understanding IPC Message/Buffer structure

2018-07-12 Thread Randy Zwitch
Thanks Paul, all your detail is super-helpful. I became paralyzed by
looking at too many codebases, but now I think I get it.

Wes, you’re absolutely right that this is complicated by the lack of
Flatbuffers support. I’m working this example because it’s a good use of work
time, and hopefully success with my work example will help me fill in the
remaining pieces for the Julia libraries. Option B is definitely how I’ll
go to start, and then Option A is probably better for the Julia community
longer term.

On July 12, 2018 at 4:39:35 PM, Wes McKinney (wesmck...@gmail.com) wrote:

hi Randy,

In Julia I think this is complicated by the lack of a Flatbuffers
compiler for the language. In the case of Feather files, in Feather.jl
they have implemented the Flatbuffers schema in Julia code:

https://github.com/JuliaData/Feather.jl/blob/master/src/metadata.jl#L3

So you need to do one of:

a) make a Julia compiler for Flatbuffers files
b) Write a native implementation of the Arrow schemas by hand or
c) Wrap a C or C++ version of the compiled Flatbuffers schema

Here is some C++ code where we read a generic Message

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/message.cc#L139

Here's where we read the message protocol from a generic InputStream
(and then call Message::ReadFrom):

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/message.cc#L236

In the case of a Schema, the body length will be 0.

- Wes

On Thu, Jul 12, 2018 at 3:58 PM, Paul Taylor wrote:
> Hi Randy,
>
> The first four bytes are the int32 length of the flatbuffers Message
> metadata
> <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/format/Message.fbs#L93>
> plus 4 bytes of padding between the length and the Message metadata itself.
> The Message metadata starts on the 8th byte.
>
> So to read an entire Message, read and store the first four bytes (the
> metadata length). Then advance past the 4 padding bytes, and use the
> flatbuffers API to read the Message table.
>
> The Message table has a bodyLength field, which is the byte length of all
> the buffers (data, validity, offsets, and typeIds) for all the Arrays in
> the Message (since Schema messages don't contain any data, their bodyLength
> is always 0).
>
> Once you've read the Message table via flatbuffers, advance `metadata
> length` number of bytes to position yourself to read the Array buffers.
>
> After reading the buffers, advance another `bodyLength` number of bytes to
> read the next message. Repeat this process to read all Messages from an
> Arrow stream.
>
> If you're familiar with JavaScript/TypeScript, you can reference the
> implementation here
> <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/js/src/ipc/reader/binary.ts#L145>.
>
> Hope this clears things up,
>
> Paul
>
>
>
> On 07/12/2018 11:30 AM, Randy Zwitch wrote:
>>
>> I’m trying to understand how to parse a Buffer into a Schema, but using
>> pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
>> really cleared much up for me. Nor has studying
>> https://arrow.apache.org/docs/ipc.html
>>
>>
>> Here are the steps of what I’ve tried (the code is Julia, but only
>> because I’m trying to do this natively, rather than wrap the Arrow C
>> code):
>>
>>
>> # Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size)
>>  (works as expected)
>> julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit
>> 1000", 0, 0, 1000)
>>
>> MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7,
>> 0x7e, 0x50], 10)
>>
>> # Wrap shared memory into julia array, based on handle and size (works as
>> expected)
>> julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) #wrapper using
>> shmget/shmat
>> 93856-element Array{UInt8,1}:
>>  0x2c
>>  0x16
>>  0x00
>>  0x00
>>  0x14
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>> ⋮
>>  0x20
>>  0x74
>>  0x6f
>>  0x20
>>  0x4d
>>  0x66
>>  0x72
>>  0x00
>>  0x00
>>
>> At this point, walking through a similar Python process, I know that
>> sm_buf represents
>> - type: Schema
>> - metadata length: 5676
>> - body_length: 0
>>
>> Where I’m confused is how to proceed.
>>
>> I am getting metadata_length by reinterpreting the first 4 bytes as Int32.
>>
>> julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
>> 5676
>>
>> I then assumed that I could start at byte 5 and take the next `mlen-1`
>> bytes:
>>
>> julia> metadata = sm_buf[5:5+mlen-1]
>> 5676-element Array{UInt8,1}:
>>  0x14
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>>  0x0c
>>  0x00
>> ⋮
>>  0x79
>>  0x65
>>  0x61
>>  0x72
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>>  0x00
>>
>>
>> Am I on the right track here? I *think* that my `metadata` variable above
>> is a FlatBuffer, but how do I know what its structure is? Additionally,
>> what am I supposed to do with all of the bytes that haven’t been read from
>> `sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
>> + metadata length, leaving some 88,000 bytes not processed yet.

[jira] [Created] (ARROW-2840) [C++] See if stream alignment logic can be simplified

2018-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2840:
---

 Summary: [C++] See if stream alignment logic can be simplified
 Key: ARROW-2840
 URL: https://issues.apache.org/jira/browse/ARROW-2840
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.10.0


There were a couple patches related to 64-byte alignment for tensor messages at 
the beginning of April that I've been concerned about: 

ARROW-2308: 
https://github.com/apache/arrow/commit/26bc4ab5a31e3430e1d545068e5a5a5ba5bc7a22#diff-bf4eb56f3cb806c32f60ae67afbb1bf0

ARROW-2437: 
https://github.com/apache/arrow/commit/0f87c12d45250ee763ac8c43b7e57e8f06a0b9f3

In my opinion, the alignment issue should be encapsulated in the protocol if
possible; having an argument that toggles whether the file is to be aligned
after reading the metadata seems wrong to me. The bytes written for alignment
purposes should be accounted for in the metadata size prefix in the IPC
message.
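As an illustration of that last point (my sketch, not the current
implementation): the writer would fold the alignment padding into the length
it writes, so a reader never needs an out-of-band "aligned" toggle:

{code}
def write_aligned_message(stream, metadata, alignment=8):
    # Sketch: fold alignment padding into the int32 size prefix, so that a
    # reader which skips "prefix" bytes after the length always lands aligned.
    prefix_len = 4
    padding = (alignment - (prefix_len + len(metadata)) % alignment) % alignment
    stream.write((len(metadata) + padding).to_bytes(4, 'little'))
    stream.write(metadata)
    stream.write(b'\x00' * padding)
{code}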

Since there are unit tests for this, I'd like to take a look and see if I can 
improve this without breaking the tests.





Re: Help understanding IPC Message/Buffer structure

2018-07-12 Thread Wes McKinney
hi Randy,

In Julia I think this is complicated by the lack of a Flatbuffers
compiler for the language. In the case of Feather files, in Feather.jl
they have implemented the Flatbuffers schema in Julia code:

https://github.com/JuliaData/Feather.jl/blob/master/src/metadata.jl#L3

So you need to do one of:

a) make a Julia compiler for Flatbuffers files
b) Write a native implementation of the Arrow schemas by hand or
c) Wrap a C or C++ version of the compiled Flatbuffers schema

Here is some C++ code where we read a generic Message

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/message.cc#L139

Here's where we read the message protocol from a generic InputStream
(and then call Message::ReadFrom):

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/message.cc#L236

In the case of a Schema, the body length will be 0.

- Wes

On Thu, Jul 12, 2018 at 3:58 PM, Paul Taylor  wrote:
> Hi Randy,
>
> The first four bytes are the int32 length of the flatbuffers Message
> metadata
> <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/format/Message.fbs#L93>
> plus 4 bytes of padding between the length and the Message metadata itself.
> The Message metadata starts on the 8th byte.
>
> So to read an entire Message, read and store the first four bytes (the
> metadata length). Then advance past the 4 padding bytes, and use the
> flatbuffers API to read the Message table.
>
> The Message table has a bodyLength field, which is the byte length of all
> the buffers (data, validity, offsets, and typeIds) for all the Arrays in
> the Message (since Schema messages don't contain any data, their bodyLength
> is always 0).
>
> Once you've read the Message table via flatbuffers, advance `metadata
> length` number of bytes to position yourself to read the Array buffers.
>
> After reading the buffers, advance another `bodyLength` number of bytes to
> read the next message. Repeat this process to read all Messages from an
> Arrow stream.
>
> If you're familiar with JavaScript/TypeScript, you can reference the
> implementation here
> <https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/js/src/ipc/reader/binary.ts#L145>.
>
> Hope this clears things up,
>
> Paul
>
>
>
> On 07/12/2018 11:30 AM, Randy Zwitch wrote:
>>
>> I’m trying to understand how to parse a Buffer into a Schema, but using
>> pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
>> really cleared much up for me. Nor has studying
>> https://arrow.apache.org/docs/ipc.html
>>
>>
>> Here are the steps of what I’ve tried (the code is Julia, but only
>> because I’m trying to do this natively, rather than wrap the Arrow C
>> code):
>>
>>
>> # Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size)
>>   (works as expected)
>> julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit
>> 1000", 0, 0, 1000)
>>
>> MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7,
>> 0x7e, 0x50], 10)
>>
>> # Wrap shared memory into julia array, based on handle and size (works as
>> expected)
>> julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) #wrapper
>> using
>> shmget/shmat
>> 93856-element Array{UInt8,1}:
>>   0x2c
>>   0x16
>>   0x00
>>   0x00
>>   0x14
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>  ⋮
>>   0x20
>>   0x74
>>   0x6f
>>   0x20
>>   0x4d
>>   0x66
>>   0x72
>>   0x00
>>   0x00
>>
>> At this point, walking through a similar Python process, I know that
>> sm_buf represents
>> - type: Schema
>>  - metadata length: 5676
>> - body_length: 0
>>
>> Where I’m confused is how to proceed.
>>
>> I am getting metadata_length by reinterpreting the first 4 bytes as Int32.
>>
>> julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
>> 5676
>>
>> I then assumed that I could start at byte 5 and take the next `mlen-1`
>> bytes:
>>
>> julia> metadata = sm_buf[5:5+mlen-1]
>> 5676-element Array{UInt8,1}:
>>   0x14
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x0c
>>   0x00
>>  ⋮
>>   0x79
>>   0x65
>>   0x61
>>   0x72
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>   0x00
>>
>>
>> Am I on the right track here? I *think* that my `metadata` variable above
>> is a FlatBuffer, but how do I know what its structure is? Additionally,
>> what am I supposed to do with all of the bytes that haven’t been read from
>> `sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
>> + metadata length, leaving some 88,000 bytes not processed yet.
>>
>> Any help would be greatly appreciated here. Please note that I’m not
>> asking
>> for julia coding help, but rather what the Arrow bytes actually mean/their
>> structure and how to process them further.
>>
>> Thanks,
>> Randy Zwitch
>>
>


Re: Help understanding IPC Message/Buffer structure

2018-07-12 Thread Paul Taylor

Hi Randy,

The first four bytes are the int32 length of the flatbuffers Message
metadata
<https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/format/Message.fbs#L93>
plus 4 bytes of padding between the length and the Message metadata
itself. The Message metadata starts on the 8th byte.

So to read an entire Message, read and store the first four bytes (the
metadata length). Then advance past the 4 padding bytes, and use the
flatbuffers API to read the Message table.

The Message table has a bodyLength field, which is the byte length of all
the buffers (data, validity, offsets, and typeIds) for all the Arrays in
the Message (since Schema messages don't contain any data, their
bodyLength is always 0).

Once you've read the Message table via flatbuffers, advance `metadata
length` number of bytes to position yourself to read the Array buffers.

After reading the buffers, advance another `bodyLength` number of bytes
to read the next message. Repeat this process to read all Messages from
an Arrow stream.
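In rough Python pseudocode, the whole walk looks like this (a sketch that
follows the offsets described above; read_body_length is a hypothetical
stand-in for the flatbuffers-generated Message accessors):

```
import struct

def read_body_length(metadata):
    # Stand-in: a real implementation parses the Message flatbuffer here
    # and returns its bodyLength field.
    raise NotImplementedError

def walk_messages(buf):
    """Yield (metadata, body) slices for each Message in an Arrow stream."""
    offset = 0
    while offset + 4 <= len(buf):
        # First four bytes: little-endian int32 length of the Message metadata
        (metadata_len,) = struct.unpack_from('<i', buf, offset)
        if metadata_len == 0:
            break  # end-of-stream marker
        start = offset + 8  # metadata starts on the 8th byte
        metadata = buf[start : start + metadata_len]
        body_len = read_body_length(metadata)
        yield metadata, buf[start + metadata_len : start + metadata_len + body_len]
        offset = start + metadata_len + body_len
```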


If you're familiar with JavaScript/TypeScript, you can reference the
implementation here
<https://github.com/apache/arrow/blob/e14705745bb8d625b3c7dda2857e93cdfe848178/js/src/ipc/reader/binary.ts#L145>.


Hope this clears things up,

Paul


On 07/12/2018 11:30 AM, Randy Zwitch wrote:

I’m trying to understand how to parse a Buffer into a Schema, but using
pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
really cleared much up for me. Nor has studying
https://arrow.apache.org/docs/ipc.html


Here are the steps of what I’ve tried (the code is Julia, but only
because I’m trying to do this natively, rather than wrap the Arrow C code):


# Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size)
  (works as expected)
julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit
1000", 0, 0, 1000)

MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7,
0x7e, 0x50], 10)

# Wrap shared memory into julia array, based on handle and size (works as
expected)
julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) #wrapper using
shmget/shmat
93856-element Array{UInt8,1}:
  0x2c
  0x16
  0x00
  0x00
  0x14
  0x00
  0x00
  0x00
  0x00
  0x00
 ⋮
  0x20
  0x74
  0x6f
  0x20
  0x4d
  0x66
  0x72
  0x00
  0x00

At this point, walking through a similar Python process, I know that
sm_buf represents
- type: Schema
 - metadata length: 5676
- body_length: 0

Where I’m confused is how to proceed.

I am getting metadata_length by reinterpreting the first 4 bytes as Int32.

julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
5676

I then assumed that I could start at byte 5 and take the next `mlen-1`
bytes:

julia> metadata = sm_buf[5:5+mlen-1]
5676-element Array{UInt8,1}:
  0x14
  0x00
  0x00
  0x00
  0x00
  0x00
  0x00
  0x00
  0x0c
  0x00
 ⋮
  0x79
  0x65
  0x61
  0x72
  0x00
  0x00
  0x00
  0x00
  0x00


Am I on the right track here? I *think* that my `metadata` variable above
is a FlatBuffer, but how do I know what its structure is? Additionally,
what am I supposed to do with all of the bytes that haven’t been read from
`sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
+ metadata length, leaving some 88,000 bytes not processed yet.

Any help would be greatly appreciated here. Please note that I’m not asking
for julia coding help, but rather what the Arrow bytes actually mean/their
structure and how to process them further.

Thanks,
Randy Zwitch





Re: Housing longer-term Arrow development, design, and roadmap documents

2018-07-12 Thread Bryan Cutler
Yup I see it now, thanks!

On Thu, Jul 12, 2018 at 10:37 AM, Wes McKinney  wrote:

> hey Bryan -- I just enabled "Add/Edit" permission for all registered
> Confluence users. Let me know if it works now
>
> On Thu, Jul 12, 2018 at 1:34 PM, Bryan Cutler  wrote:
> > I haven't used confluence before, but do I need permission to edit a
> page?
> > I believe there should be a button in the menu bar, but I don't see
> > anything except "Spaces" and "People." If so, could someone please add
> me,
> > my user id is "cutlerb", thanks!
> >
> > On Fri, Jul 6, 2018 at 10:35 AM, Wes McKinney 
> wrote:
> >
> >> I've started building out some organization on the Arrow wiki landing
> >> page. I think something we can do to help keep organized is to use a
> >> combination of Component and Label tags in JIRA, then add JIRA filters
> >> to pages related to each subproject. We can see how that goes
> >>
> >> As an example, I just created a page to track work on Parquet support in
> >> Python:
> >>
> >> https://cwiki.apache.org/confluence/display/ARROW/
> >> Python+Parquet+Format+Support
> >>
> >> As we add more issues labels, they'll show up in the filter.
> >>
> >> - Wes
> >>
> >> On Fri, Jun 29, 2018 at 6:38 PM, Kouhei Sutou 
> wrote:
> >> > Hi,
> >> >
> >> >> https://cwiki.apache.org/confluence/display/ARROW
> >> >>
> >> >> If any PMC members would like to be administrators of the space,
> >> >> please let me know your Confluence username. You have to create a
> >> >> separate account (it does not appear to be linked to JIRA accounts)
> >> >
> >> > Can you add me? I've created "kou" account on Confluence.
> >> >
> >> >
> >> > Thanks,
> >> > --
> >> > kou
> >> >
> >> > In  gmail.com>
> >> >   "Re: Housing longer-term Arrow development, design, and roadmap
> >> documents" on Tue, 26 Jun 2018 11:27:50 -0400,
> >> >   Wes McKinney  wrote:
> >> >
> >> >> GitHub wiki pages lack collaboration features like commenting. It
> will
> >> >> be interesting to see what we can work up with JIRA integration, e.g.
> >> >> burndown charts for release management.
> >> >>
> >> >> I asked INFRA to create a Confluence space for us so we can give it a
> >> >> try to see if it works for us. Confluence seems to have gotten a lot
> >> >> nicer since I last used it:
> >> >>
> >> >> https://cwiki.apache.org/confluence/display/ARROW
> >> >>
> >> >> If any PMC members would like to be administrators of the space,
> >> >> please let me know your Confluence username. You have to create a
> >> >> separate account (it does not appear to be linked to JIRA accounts)
> >> >>
> >> >> Thanks
> >> >>
> >> >> On Sun, Jun 24, 2018 at 1:14 PM, Uwe L. Korn 
> wrote:
> >> >>> Hello,
> >> >>>
> >> >>> I would prefer Confluence over GitHub pages because I would hope
> that
> >> one can integrate the ASF JIRA via widgets into the wiki pages. The vast
> >> amount of issues should all be categorizable into some topic. Once these
> >> are triaged, they should pop up in the respective wiki pages that could
> >> form a roadmap. That way, newcomers should get a better start to find
> the
> >> things to work on for a certain topic.
> >> >>>
> >> >>> Cheers
> >> >>> Uwe
> >> >>>
> >> >>> On Sun, Jun 24, 2018, at 7:02 PM, Antoine Pitrou wrote:
> >> 
> >>  Hi Wes,
> >> 
> >>  I wonder if GitHub wiki pages would be an easier-to-approach
> >> alternative?
> >> 
> >>  Regards
> >> 
> >>  Antoine.
> >> 
> >> 
> >>  Le 24/06/2018 à 08:42, Wes McKinney a écrit :
> >>  > hi folks,
> >>  >
> >>  > Since the scope of Apache Arrow has grown significantly in the
> last
> >>  > 2.5 years to encompass many programming languages and new areas
> of
> >>  > functionality, I'd like to discuss how we could better
> accommodate
> >>  > longer-term asynchronous discussions and stay organized about the
> >>  > development roadmap.
> >>  >
> >>  > At any given time, there could be 10 or more initiatives ongoing,
> >> and
> >>  > the number of concurrent initiatives is likely to continue
> >> increasing
> >>  > over time as the community grows larger. Just off the top of my
> head
> >>  > here's some stuff that's ongoing / up in the air:
> >>  >
> >>  > * Remaining columnar format design questions (interval types,
> >> unions, etc.)
> >>  > * Arrow RPC client/server design (aka "Arrow Flight")
> >>  > * Packaging / deployment / release management
> >>  > * Rust language build out
> >>  > * Go language build out
> >>  > * Code generation / LLVM (Gandiva)
> >>  > * ML/AI framework integration (e.g. with TensorFlow, PyTorch)
> >>  > * Plasma roadmap
> >>  > * Record data types (thread I just opened)
> >>  >
> >>  > With ~500 open issues on JIRA, I have found that newcomers feel a
> >> bit
> >>  > overwhelmed when they're trying to find a part of the project to
> get
> >>  > involved with. Eventually one must sink one's teeth into the JIRA
> >>  > backlog, but I think it would be helpful to have some centralized
> >>  > project organization and roadmap documents to help navigate all of
> >>  > the efforts going on in the project.

[jira] [Created] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer

2018-07-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2839:
--

 Summary: [JS] Support whatwg/streams in IPC reader/writer
 Key: ARROW-2839
 URL: https://issues.apache.org/jira/browse/ARROW-2839
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.0


We should make it easy to stream Arrow in the browser via 
[whatwg/streams|https://github.com/whatwg/streams]. I already have this working 
at Graphistry, but I had to use some of the IPC internal methods. Creating this 
issue to track back-porting that work and the few minor refactors to the IPC 
internals that we'll need to do.





Re: [Go] Go failures on Travis-CI

2018-07-12 Thread Sebastien Binet
I'll have a look tomorrow (Paris time).
It looks like a GOPATH issue.

sent from my droid

On Thu, Jul 12, 2018, 20:52 Antoine Pitrou  wrote:

>
> Hello,
>
> I'm getting persistent failures in the Go job on Travis-CI:
> https://travis-ci.org/pitrou/arrow/jobs/403221354
>
> Is this expected?  Excerpt:
>
> """
> $ go get -t -v ./...
> github.com/apache/arrow (download)
> github.com/stretchr/testify (download)
> go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed
> go/arrow/array/array.go:23:2: use of internal package not allowed
> go/arrow/array/array.go:24:2: use of internal package not allowed
> go/arrow/math/math_amd64.go:22:2: use of internal package not allowed
> go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed
> go/arrow/memory/buffer.go:22:2: use of internal package not allowed
> The command "eval go get -t -v ./... " failed. Retrying, 2 of 3.
> """
>
> Regards
>
> Antoine.
>


Help understanding IPC Message/Buffer structure

2018-07-12 Thread Randy Zwitch
I’m trying to understand how to parse a Buffer into a Schema, but using
pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
really cleared much up for me. Nor has studying
https://arrow.apache.org/docs/ipc.html


Here are the steps of what I’ve tried (the code is Julia, but only
because I’m trying to do this natively, rather than wrap the Arrow C code):


# Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size)
 (works as expected)
julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit
1000", 0, 0, 1000)

MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7,
0x7e, 0x50], 10)

# Wrap shared memory into julia array, based on handle and size (works as
expected)
julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) #wrapper using
shmget/shmat
93856-element Array{UInt8,1}:
 0x2c
 0x16
 0x00
 0x00
 0x14
 0x00
 0x00
 0x00
 0x00
 0x00
⋮
 0x20
 0x74
 0x6f
 0x20
 0x4d
 0x66
 0x72
 0x00
 0x00

At this point, walking through a similar Python process, I know that
sm_buf represents
- type: Schema
- metadata length: 5676
- body_length: 0

Where I’m confused is how to proceed.

I am getting metadata_length by reinterpreting the first 4 bytes as Int32.

julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
5676

I then assumed that I could start at byte 5 and take the next `mlen-1`
bytes:

julia> metadata = sm_buf[5:5+mlen-1]
5676-element Array{UInt8,1}:
 0x14
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x0c
 0x00
⋮
 0x79
 0x65
 0x61
 0x72
 0x00
 0x00
 0x00
 0x00
 0x00


Am I on the right track here? I *think* that my `metadata` variable above
is a FlatBuffer, but how do I know what its structure is? Additionally,
what am I supposed to do with all of the bytes that haven’t been read from
`sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
+ metadata length, leaving some 88,000 bytes not processed yet.

Any help would be greatly appreciated here. Please note that I’m not asking
for julia coding help, but rather what the Arrow bytes actually mean/their
structure and how to process them further.

Thanks,
Randy Zwitch


[Go] Go failures on Travis-CI

2018-07-12 Thread Antoine Pitrou


Hello,

I'm getting persistent failures in the Go job on Travis-CI:
https://travis-ci.org/pitrou/arrow/jobs/403221354

Is this expected?  Excerpt:

"""
$ go get -t -v ./...
github.com/apache/arrow (download)
github.com/stretchr/testify (download)
go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed
go/arrow/array/array.go:23:2: use of internal package not allowed
go/arrow/array/array.go:24:2: use of internal package not allowed
go/arrow/math/math_amd64.go:22:2: use of internal package not allowed
go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed
go/arrow/memory/buffer.go:22:2: use of internal package not allowed
The command "eval go get -t -v ./... " failed. Retrying, 2 of 3.
"""

Regards

Antoine.


Re: Housing longer-term Arrow development, design, and roadmap documents

2018-07-12 Thread Wes McKinney
hey Bryan -- I just enabled "Add/Edit" permission for all registered
Confluence users. Let me know if it works now

On Thu, Jul 12, 2018 at 1:34 PM, Bryan Cutler  wrote:
> I haven't used confluence before, but do I need permission to edit a page?
> I believe there should be a button in the menu bar, but I don't see
> anything except "Spaces" and "People." If so, could someone please add me,
> my user id is "cutlerb", thanks!
>
> On Fri, Jul 6, 2018 at 10:35 AM, Wes McKinney  wrote:
>
>> I've started building out some organization on the Arrow wiki landing
>> page. I think something we can do to help keep organized is to use a
>> combination of Component and Label tags in JIRA, then add JIRA filters
>> to pages related to each subproject. We can see how that goes
>>
>> As an example, I just created a page to track work on Parquet support in
>> Python:
>>
>> https://cwiki.apache.org/confluence/display/ARROW/
>> Python+Parquet+Format+Support
>>
>> As we add more issues labels, they'll show up in the filter.
>>
>> - Wes
>>
>> On Fri, Jun 29, 2018 at 6:38 PM, Kouhei Sutou  wrote:
>> > Hi,
>> >
>> >> https://cwiki.apache.org/confluence/display/ARROW
>> >>
>> >> If any PMC members would like to be administrators of the space,
>> >> please let me know your Confluence username. You have to create a
>> >> separate account (it does not appear to be linked to JIRA accounts)
>> >
>> > Can you add me? I've created "kou" account on Confluence.
>> >
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>> > In 
>> >   "Re: Housing longer-term Arrow development, design, and roadmap
>> documents" on Tue, 26 Jun 2018 11:27:50 -0400,
>> >   Wes McKinney  wrote:
>> >
>> >> GitHub wiki pages lack collaboration features like commenting. It will
>> >> be interesting to see what we can work up with JIRA integration, e.g.
>> >> burndown charts for release management.
>> >>
>> >> I asked INFRA to create a Confluence space for us so we can give it a
>> >> try to see if it works for us. Confluence seems to have gotten a lot
>> >> nicer since I last used it:
>> >>
>> >> https://cwiki.apache.org/confluence/display/ARROW
>> >>
>> >> If any PMC members would like to be administrators of the space,
>> >> please let me know your Confluence username. You have to create a
>> >> separate account (it does not appear to be linked to JIRA accounts)
>> >>
>> >> Thanks
>> >>
>> >> On Sun, Jun 24, 2018 at 1:14 PM, Uwe L. Korn  wrote:
>> >>> Hello,
>> >>>
>> >>> I would prefer Confluence over GitHub pages because I would hope that
>> one can integrate the ASF JIRA via widgets into the wiki pages. The vast
>> amount of issues should all be categorizable into some topic. Once these
>> are triaged, they should pop up in the respective wiki pages that could
>> form a roadmap. That way, newcomers should get a better start to find the
>> things to work on for a certain topic.
>> >>>
>> >>> Cheers
>> >>> Uwe
>> >>>
>> >>> On Sun, Jun 24, 2018, at 7:02 PM, Antoine Pitrou wrote:
>> 
>>  Hi Wes,
>> 
>>  I wonder if GitHub wiki pages would be an easier-to-approach
>> alternative?
>> 
>>  Regards
>> 
>>  Antoine.
>> 
>> 
>>  Le 24/06/2018 à 08:42, Wes McKinney a écrit :
>>  > hi folks,
>>  >
>>  > Since the scope of Apache Arrow has grown significantly in the last
>>  > 2.5 years to encompass many programming languages and new areas of
>>  > functionality, I'd like to discuss how we could better accommodate
>>  > longer-term asynchronous discussions and stay organized about the
>>  > development roadmap.
>>  >
>>  > At any given time, there could be 10 or more initiatives ongoing,
>> and
>>  > the number of concurrent initiatives is likely to continue
>> increasing
>>  > over time as the community grows larger. Just off the top of my head
>>  > here's some stuff that's ongoing / up in the air:
>>  >
>>  > * Remaining columnar format design questions (interval types,
>> unions, etc.)
>>  > * Arrow RPC client/server design (aka "Arrow Flight")
>>  > * Packaging / deployment / release management
>>  > * Rust language build out
>>  > * Go language build out
>>  > * Code generation / LLVM (Gandiva)
>>  > * ML/AI framework integration (e.g. with TensorFlow, PyTorch)
>>  > * Plasma roadmap
>>  > * Record data types (thread I just opened)
>>  >
>>  > With ~500 open issues on JIRA, I have found that newcomers feel a
>> bit
>>  > overwhelmed when they're trying to find a part of the project to get
>>  > involved with. Eventually one must sink one's teeth into the JIRA
>>  > backlog, but I think it would be helpful to have some centralized
>>  > project organization and roadmap documents to help navigate all of
>> the
>>  > efforts going on in the project.
>>  >
>>  > I don't think documents in the repository are a great solution for
>>  > this, as they don't facilitate discussions very easily --
>>  > documentation or Markdown documents (like the columnar format
>>  > specification) are good to write there when some decisions have been
>>  > made.

Re: Housing longer-term Arrow development, design, and roadmap documents

2018-07-12 Thread Bryan Cutler
I haven't used confluence before, but do I need permission to edit a page?
I believe there should be a button in the menu bar, but I don't see
anything except "Spaces" and "People." If so, could someone please add me,
my user id is "cutlerb", thanks!

On Fri, Jul 6, 2018 at 10:35 AM, Wes McKinney  wrote:

> I've started building out some organization on the Arrow wiki landing
> page. I think something we can do to help keep organized is to use a
> combination of Component and Label tags in JIRA, then add JIRA filters
> to pages related to each subproject. We can see how that goes
>
> As an example, I just created a page to track work on Parquet support in
> Python:
>
> https://cwiki.apache.org/confluence/display/ARROW/
> Python+Parquet+Format+Support
>
> As we add more issues labels, they'll show up in the filter.
>
> - Wes
>
> On Fri, Jun 29, 2018 at 6:38 PM, Kouhei Sutou  wrote:
> > Hi,
> >
> >> https://cwiki.apache.org/confluence/display/ARROW
> >>
> >> If any PMC members would like to be administrators of the space,
> >> please let me know your Confluence username. You have to create a
> >> separate account (it does not appear to be linked to JIRA accounts)
> >
> > Can you add me? I've created "kou" account on Confluence.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: Housing longer-term Arrow development, design, and roadmap
> documents" on Tue, 26 Jun 2018 11:27:50 -0400,
> >   Wes McKinney  wrote:
> >
> >> GitHub wiki pages lack collaboration features like commenting. It will
> >> be interesting to see what we can work up with JIRA integration, e.g.
> >> burndown charts for release management.
> >>
> >> I asked INFRA to create a Confluence space for us so we can give it a
> >> try to see if it works for us. Confluence seems to have gotten a lot
> >> nicer since I last used it:
> >>
> >> https://cwiki.apache.org/confluence/display/ARROW
> >>
> >> If any PMC members would like to be administrators of the space,
> >> please let me know your Confluence username. You have to create a
> >> separate account (it does not appear to be linked to JIRA accounts)
> >>
> >> Thanks
> >>
> >> On Sun, Jun 24, 2018 at 1:14 PM, Uwe L. Korn  wrote:
> >>> Hello,
> >>>
> >>> I would prefer Confluence over GitHub pages because I would hope that
> one can integrate the ASF JIRA via widgets into the wiki pages. The vast
> amount of issues should all be categorizable into some topic. Once these
> are triaged, they should pop up in the respective wiki pages that could
> form a roadmap. That way, newcomers should get a better start to find the
> things to work on for a certain topic.
> >>>
> >>> Cheers
> >>> Uwe
> >>>
> >>> On Sun, Jun 24, 2018, at 7:02 PM, Antoine Pitrou wrote:
> 
>  Hi Wes,
> 
>  I wonder if GitHub wiki pages would be an easier-to-approach
> alternative?
> 
>  Regards
> 
>  Antoine.
> 
> 
>  Le 24/06/2018 à 08:42, Wes McKinney a écrit :
>  > hi folks,
>  >
>  > Since the scope of Apache Arrow has grown significantly in the last
>  > 2.5 years to encompass many programming languages and new areas of
>  > functionality, I'd like to discuss how we could better accommodate
>  > longer-term asynchronous discussions and stay organized about the
>  > development roadmap.
>  >
>  > At any given time, there could be 10 or more initiatives ongoing,
> and
>  > the number of concurrent initiatives is likely to continue
> increasing
>  > over time as the community grows larger. Just off the top of my head
>  > here's some stuff that's ongoing / up in the air:
>  >
>  > * Remaining columnar format design questions (interval types,
> unions, etc.)
>  > * Arrow RPC client/server design (aka "Arrow Flight")
>  > * Packaging / deployment / release management
>  > * Rust language build out
>  > * Go language build out
>  > * Code generation / LLVM (Gandiva)
>  > * ML/AI framework integration (e.g. with TensorFlow, PyTorch)
>  > * Plasma roadmap
>  > * Record data types (thread I just opened)
>  >
>  > With ~500 open issues on JIRA, I have found that newcomers feel a
> bit
>  > overwhelmed when they're trying to find a part of the project to get
>  > involved with. Eventually one must sink one's teeth into the JIRA
>  > backlog, but I think it would be helpful to have some centralized
>  > project organization and roadmap documents to help navigate all of
> the
>  > efforts going on in the project.
>  >
>  > I don't think documents in the repository are a great solution for
>  > this, as they don't facilitate discussions very easily --
>  > documentation or Markdown documents (like the columnar format
>  > specification) are good to write there when some decisions have been
>  > made. Google Documents are great, but they are somewhat ephemeral.
>  >
>  > I would suggest using the ASF's Confluence wiki for these purposes.
>  

[jira] [Created] (ARROW-2838) [Python] Speed up null testing with Pandas semantics

2018-07-12 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2838:
-

 Summary: [Python] Speed up null testing with Pandas semantics
 Key: ARROW-2838
 URL: https://issues.apache.org/jira/browse/ARROW-2838
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


The {{PandasObjectIsNull}} helper function can be a significant contributor to
conversion time when converting a Pandas dataframe to Arrow format (e.g. when
writing a dataframe to Feather format). We can try to speed up the type checks
in that function.
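For reference, "Pandas semantics" amounts to roughly the following per-element
checks (a Python sketch of the semantics, not of the C++ implementation):

{code}
import math

import numpy as np
import pandas as pd

def is_null_like(obj):
    # Roughly the values a Pandas-semantics null test must treat as null:
    # None, float NaN (including numpy floats), and NaT.
    if obj is None:
        return True
    if isinstance(obj, (float, np.floating)) and math.isnan(obj):
        return True
    if obj is pd.NaT:
        return True
    return False
{code}

Each of these checks has a different cost, so ordering the cheap and common
cases first (and short-circuiting) is one obvious avenue for the speedup.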





Need a release manager for Arrow 0.10.0 [was Re: Packaging status update]

2018-07-12 Thread Wes McKinney
Can a member of the Arrow PMC volunteer to be the release manager for
0.10.0? There is some prep work to be done to make sure that you can
produce the binary packages as part of the release vote; if this is
done at the last minute it will probably delay the release.

Thank you,
Wes

On Wed, Jul 11, 2018 at 12:10 PM, Wes McKinney  wrote:
> Thanks Krisztian for the update and all your work on this.
>
> We need a PMC member to step up to be the release manager for 0.10.
> Ideally they should undertake one or more dry runs of simulating the
> 0.10 release so we aren't faced with any surprises when we go to cut
> the release on ~July 23.
>
> I moved the release management guide to Confluence:
>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
>
> so we should ensure that there aren't any snowflake-y / undocumented
> aspects to the release build workflow.
>
> - Wes
>
> On Wed, Jul 11, 2018 at 11:54 AM, Krisztián Szűcs
>  wrote:
>> Hi All!
>>
>> Just want to send a quick summary about the packaging.
>>
>> What's working:
>> - python wheel builds
>> - conda package builds (the recent conda forge changes caused some
>>   unexpected problems)
>> - linux package builds (enriched with ubuntu bionic)
>> - scheduled daily builds (triggered by travis)
>> - query build statuses from the CLI (see output https://bit.ly/2ztnSqV)
>> - download and gpg sign the binary artifacts
>>
>> What's partially working:
>> - binary artifact uploading (most of them are uploaded though)
>>
>> We have package name collisions due to uploading multiple tasks' artifacts
>> to a single github release (see for example:
>> https://github.com/kszucs/crossbow/releases/tag/build-160). We used to
>> postfix these assets with the particular architecture, but we need a more
>> robust solution without renaming anything (either by creating tarballs or
>> uploading to multiple release tags).
>>
>> All-in-all it's in quite good shape; I'll work on refactoring the artifact
>> uploading and finishing the remaining jira tasks. IMHO we should start to
>> test the whole release procedure from next Monday and ship a couple of
>> test releases before we reach July 20th.
>>
>> - Krisztian


[jira] [Created] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer

2018-07-12 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-2837:
-

 Summary: [C++] ArrayBuilder::null_bitmap returns PoolBuffer
 Key: ARROW-2837
 URL: https://issues.apache.org/jira/browse/ARROW-2837
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.9.0
Reporter: Dimitri Vorona


A simple buffer (like in the case of ArrayBuilder::Data) seems to be enough to
me, and the change doesn't break anything.





[jira] [Created] (ARROW-2836) [Packaging] Expanding build matrices to multiple tasks

2018-07-12 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2836:
--

 Summary: [Packaging] Expanding build matrices to multiple tasks
 Key: ARROW-2836
 URL: https://issues.apache.org/jira/browse/ARROW-2836
 Project: Apache Arrow
  Issue Type: Task
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.10.0


Create release tags for each task instead of a single one for the job, and
create additional tasks per build variant.
This will resolve the package naming conflicts, but it increases the config
verbosity (which is fine for now).





[jira] [Created] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position

2018-07-12 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-2835:
-

 Summary: [C++] ReadAt/WriteAt are inconsistent with moving the 
files position
 Key: ARROW-2835
 URL: https://issues.apache.org/jira/browse/ARROW-2835
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Dimitri Vorona


Right now, there is inconsistent behaviour regarding moving the file's
position pointer after calling ReadAt or WriteAt. For example, the default
implementation of ReadAt seeks to the desired offset and calls Read, which
moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change
the position. WriteableFile::WriteAt seems to move the position in the current
implementation, but there is no docstring which prescribes this behaviour.

Antoine suggested that the *At methods shouldn't touch the position, which
makes more sense, IMHO. The change isn't huge and doesn't seem to break
anything internally, but it might break existing user code.
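To illustrate the two behaviours over a generic seekable file (a Python
sketch, not Arrow's API):

{code}
def read_at_moving(f, offset, nbytes):
    # Default-implementation style: Seek then Read, so the file position
    # ends up at offset + nbytes afterwards.
    f.seek(offset)
    return f.read(nbytes)

def read_at_stateless(f, offset, nbytes):
    # The behaviour Antoine suggests: *At never touches the position.
    saved = f.tell()
    try:
        f.seek(offset)
        return f.read(nbytes)
    finally:
        f.seek(saved)
{code}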


