[jira] [Updated] (ARROW-2841) [Go] Fix recent Go build failures in Travis CI

2018-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2841:
--
Labels: pull-request-available  (was: )

> [Go] Fix recent Go build failures in Travis CI
> --
>
> Key: ARROW-2841
> URL: https://issues.apache.org/jira/browse/ARROW-2841
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> I'm getting persistent failures in the Go job on Travis-CI:
> https://travis-ci.org/pitrou/arrow/jobs/403221354
> Is this expected?  Excerpt:
> {code}
> $ go get -t -v ./...
> github.com/apache/arrow (download)
> github.com/stretchr/testify (download)
> go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed
> go/arrow/array/array.go:23:2: use of internal package not allowed
> go/arrow/array/array.go:24:2: use of internal package not allowed
> go/arrow/math/math_amd64.go:22:2: use of internal package not allowed
> go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed
> go/arrow/memory/buffer.go:22:2: use of internal package not allowed
> The command "eval go get -t -v ./... " failed. Retrying, 2 of 3.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility

2018-07-12 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542495#comment-16542495
 ] 

Wes McKinney commented on ARROW-2843:
-

In my opinion, we haven't worked hard enough as a community to stabilize the 
metadata. We have failed to make progress since mid-2017. Why is that? I have 
been trying consistently to advance to a 1.0 release; can we not work together 
to finalize the binary protocol? 

> [Format/C++/Java] Removing field layout from Schema.fbs breaks backward 
> compatibility
> -
>
> Key: ARROW-2843
> URL: https://issues.apache.org/jira/browse/ARROW-2843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Yuliya Feldman
>Assignee: Yuliya Feldman
>Priority: Major
>
> Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785]
> Breaks backward compatibility with schema generated in prior versions of 
> Arrow, as VectorLayout field was removed from Field table.
> This is definitely not supported behavior in FlatBuffers.
> I suggest to put back following field
> {code:java}
> layout: [ VectorLayout ] (deprecated);
> {code}
>  into 
> {code:java}
> table Field 
> {code}
> and mark it "deprecated" - as shown above
> It will also require putting back 
> {code:java}
> table VectorLayout
> {code}
> which can have all the deprecated fields as well.
> code generation footprint will be minimal in this case, as FlatBuffers won't 
> generate getters for deprecated fields, but it will really solve backward 
> compatibility issues.
> Also may be a good idea to introduce "version" into Schema.
>  
>  
>  





[jira] [Comment Edited] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility

2018-07-12 Thread Yuliya Feldman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542383#comment-16542383
 ] 

Yuliya Feldman edited comment on ARROW-2843 at 7/13/18 12:55 AM:
-

Thank you [~wesmckinn] for the very quick reply and thoughts.

I do understand the "no guarantees" part, but why not do our best to maintain 
backwards compatibility, especially when it is possible?

 

 


was (Author: yufeldman):
Thank you @Wes for very quick reply and thoughts.

I do understand about "no guarantees" part, but why not do our best to maintain 
backwards compatibility, especially when it is possible.

 

 

> [Format/C++/Java] Removing field layout from Schema.fbs breaks backward 
> compatibility
> -
>
> Key: ARROW-2843
> URL: https://issues.apache.org/jira/browse/ARROW-2843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Yuliya Feldman
>Assignee: Yuliya Feldman
>Priority: Major
>
> Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785]
> Breaks backward compatibility with schema generated in prior versions of 
> Arrow, as VectorLayout field was removed from Field table.
> This is definitely not supported behavior in FlatBuffers.
> I suggest to put back following field
> {code:java}
> layout: [ VectorLayout ] (deprecated);
> {code}
>  into 
> {code:java}
> table Field 
> {code}
> and mark it "deprecated" - as shown above
> It will also require putting back 
> {code:java}
> table VectorLayout
> {code}
> which can have all the deprecated fields as well.
> code generation footprint will be minimal in this case, as FlatBuffers won't 
> generate getters for deprecated fields, but it will really solve backward 
> compatibility issues.
> Also may be a good idea to introduce "version" into Schema.
>  
>  
>  





[jira] [Comment Edited] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility

2018-07-12 Thread Yuliya Feldman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542383#comment-16542383
 ] 

Yuliya Feldman edited comment on ARROW-2843 at 7/13/18 12:55 AM:
-

Thank you @Wes for the very quick reply and thoughts.

I do understand the "no guarantees" part, but why not do our best to maintain 
backwards compatibility, especially when it is possible?

 

 


was (Author: yufeldman):
Thank you Wes for very quick reply and thoughts.

I do understand about "no guarantees" part, but why not do our best to maintain 
backwards compatibility, especially when it is possible.

 

 

> [Format/C++/Java] Removing field layout from Schema.fbs breaks backward 
> compatibility
> -
>
> Key: ARROW-2843
> URL: https://issues.apache.org/jira/browse/ARROW-2843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Yuliya Feldman
>Assignee: Yuliya Feldman
>Priority: Major
>
> Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785]
> Breaks backward compatibility with schema generated in prior versions of 
> Arrow, as VectorLayout field was removed from Field table.
> This is definitely not supported behavior in FlatBuffers.
> I suggest to put back following field
> {code:java}
> layout: [ VectorLayout ] (deprecated);
> {code}
>  into 
> {code:java}
> table Field 
> {code}
> and mark it "deprecated" - as shown above
> It will also require putting back 
> {code:java}
> table VectorLayout
> {code}
> which can have all the deprecated fields as well.
> code generation footprint will be minimal in this case, as FlatBuffers won't 
> generate getters for deprecated fields, but it will really solve backward 
> compatibility issues.
> Also may be a good idea to introduce "version" into Schema.
>  
>  
>  





[jira] [Commented] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility

2018-07-12 Thread Yuliya Feldman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542383#comment-16542383
 ] 

Yuliya Feldman commented on ARROW-2843:
---

Thank you Wes for the very quick reply and thoughts.

I do understand the "no guarantees" part, but why not do our best to maintain 
backwards compatibility, especially when it is possible?

 

 

> [Format/C++/Java] Removing field layout from Schema.fbs breaks backward 
> compatibility
> -
>
> Key: ARROW-2843
> URL: https://issues.apache.org/jira/browse/ARROW-2843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Yuliya Feldman
>Assignee: Yuliya Feldman
>Priority: Major
>
> Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785]
> Breaks backward compatibility with schema generated in prior versions of 
> Arrow, as VectorLayout field was removed from Field table.
> This is definitely not supported behavior in FlatBuffers.
> I suggest to put back following field
> {code:java}
> layout: [ VectorLayout ] (deprecated);
> {code}
>  into 
> {code:java}
> table Field 
> {code}
> and mark it "deprecated" - as shown above
> It will also require putting back 
> {code:java}
> table VectorLayout
> {code}
> which can have all the deprecated fields as well.
> code generation footprint will be minimal in this case, as FlatBuffers won't 
> generate getters for deprecated fields, but it will really solve backward 
> compatibility issues.
> Also may be a good idea to introduce "version" into Schema.
>  
>  
>  





[jira] [Updated] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them

2018-07-12 Thread Alex Hagerman (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Hagerman updated ARROW-2760:
-
Component/s: Python

> [Python] Remove legacy property definition syntax from parquet module and 
> test them
> ---
>
> Key: ARROW-2760
> URL: https://issues.apache.org/jira/browse/ARROW-2760
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility

2018-07-12 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542323#comment-16542323
 ] 

Wes McKinney commented on ARROW-2843:
-

I'm -0 to -1 on this change. We do already have a metadata version number. 

We've made no guarantees of backwards compatibility in the columnar format 
metadata yet, so we have broken no contracts with users.

To be honest, I'm not pleased that there are still a number of underspecified 
parts of the columnar format and metadata. As an example, I proposed a more 
complete Interval / Timedelta type about a year ago and these changes have 
still not been implemented. 

I would very much appreciate if Dremio and other parties depending on Arrow in 
production would work more proactively with me and others in the Arrow 
community to reach a point where we can declare a stable 1.0 version of the 
columnar format and its metadata. In particular, if any more breaking changes 
are required (I hope that there will not be), that we make them as soon as 
possible to minimize disruption to applications.

> [Format/C++/Java] Removing field layout from Schema.fbs breaks backward 
> compatibility
> -
>
> Key: ARROW-2843
> URL: https://issues.apache.org/jira/browse/ARROW-2843
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Yuliya Feldman
>Assignee: Yuliya Feldman
>Priority: Major
>
> Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785]
> Breaks backward compatibility with schema generated in prior versions of 
> Arrow, as VectorLayout field was removed from Field table.
> This is definitely not supported behavior in FlatBuffers.
> I suggest to put back following field
> {code:java}
> layout: [ VectorLayout ] (deprecated);
> {code}
>  into 
> {code:java}
> table Field 
> {code}
> and mark it "deprecated" - as shown above
> It will also require putting back 
> {code:java}
> table VectorLayout
> {code}
> which can have all the deprecated fields as well.
> code generation footprint will be minimal in this case, as FlatBuffers won't 
> generate getters for deprecated fields, but it will really solve backward 
> compatibility issues.
> Also may be a good idea to introduce "version" into Schema.
>  
>  
>  





[jira] [Created] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility

2018-07-12 Thread Yuliya Feldman (JIRA)
Yuliya Feldman created ARROW-2843:
-

 Summary: [Format/C++/Java] Removing field layout from Schema.fbs 
breaks backward compatibility
 Key: ARROW-2843
 URL: https://issues.apache.org/jira/browse/ARROW-2843
 Project: Apache Arrow
  Issue Type: Bug
  Components: Format
Affects Versions: 0.9.0, 0.8.0
Reporter: Yuliya Feldman
Assignee: Yuliya Feldman


The change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785]
breaks backward compatibility with schemas generated by prior versions of Arrow, 
because the VectorLayout field was removed from the Field table.

Removing a field is definitely not supported behavior in FlatBuffers.

I suggest putting the following field back
{code:java}
layout: [ VectorLayout ] (deprecated);
{code}
 into 
{code:java}
table Field 
{code}
and marking it "deprecated", as shown above.

It will also require putting back 
{code:java}
table VectorLayout
{code}
which can have all the deprecated fields as well.

The code generation footprint will be minimal in this case, as FlatBuffers won't 
generate getters for deprecated fields, but it will really solve the backward 
compatibility issues.

It may also be a good idea to introduce a "version" into Schema.
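
To illustrate why dropping a field from a FlatBuffers table is a problem while deprecating it is not, here is a toy Python model. It assumes that, without explicit `id` attributes, each field's slot follows its declaration order. The field lists are abbreviated and hypothetical (their order is illustrative only, not copied from Schema.fbs), and the helper names are made up for this sketch.

```python
# Toy model of FlatBuffers field-slot assignment (not the real flatc logic).
# Assumption: with no explicit `id` attributes, slots follow declaration order.

def assign_slots(fields):
    """Map each declared field name to its slot index."""
    return {name: slot for slot, (name, _deprecated) in enumerate(fields)}

def readable_fields(fields):
    """Fields that would still get generated accessors (deprecated ones do not)."""
    return [name for name, deprecated in fields if not deprecated]

def shifted_fields(old, new):
    """Names present in both schemas whose slot differs between them."""
    old_slots, new_slots = assign_slots(old), assign_slots(new)
    return sorted(n for n in new_slots
                  if n in old_slots and old_slots[n] != new_slots[n])

# Hypothetical, abbreviated field lists as (name, deprecated) pairs.
old_schema = [("name", False), ("nullable", False),
              ("layout", False), ("custom_metadata", False)]
removed = [("name", False), ("nullable", False), ("custom_metadata", False)]
deprecated = [("name", False), ("nullable", False),
              ("layout", True), ("custom_metadata", False)]

print(shifted_fields(old_schema, removed))     # fields whose slot moved
print(shifted_fields(old_schema, deprecated))  # nothing moves when deprecating
print(readable_fields(deprecated))             # accessors that remain
```

In this model, removing `layout` moves every later field to a different slot, so data written with the old schema would be read from the wrong slot, whereas the deprecated variant keeps all slots in place and only drops the accessor.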

 

 

 





[jira] [Updated] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-12 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2842:
--
Description: 
This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
this down. Basically, given a file with a single row on HDFS, reading it with 
pyarrow yields this error:

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.

  was:
This might be a bug in parquet-cpp, I need to spend a bit more time tracking 
this down but basically given a file with a single row on hdfs, reading it with 
pyarrow yields this error

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.


> [Python] Cannot read parquet files with row group size of 1 From HDFS
> -
>
> Key: ARROW-2842
> URL: https://issues.apache.org/jira/browse/ARROW-2842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
> Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp, I need to spend a bit more time tracking 
> this down but basically given a file with a single row on hdfs, reading it 
> with pyarrow yields this error
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the 
> stream
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
>  @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
>  @ parquet::SerializedFile::ParseMetaData()
>  @ 
> parquet::ParquetFileReader::Contents::Open(std::unique_ptr  std::default_delete >, 
> parquet::ReaderProperties const&, std::shared_ptr 
> const&)
>  @ 
> parquet::ParquetFileReader::Open(std::unique_ptr std::default_delete >, parquet::ReaderProperties 
> const&, std::shared_ptr const&)
>  @ parquet::arrow::OpenFile(std::shared_ptr 
> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
> std::shared_ptr const&, 
> std::unique_ptr std::default_delete >*)
>  @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
> _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>  
> fs = pyarrow.hdfs.connect('my-namenode-url', 

[jira] [Updated] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-12 Thread Robert Gruener (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2842:
--
Description: 
This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
this down. Basically, given a file with a single row on HDFS, reading it with 
pyarrow yields this error:

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

 

I am working on writing a unit test for this. Note that I am using libhdfs3.

  was:
This might be a bug in parquet-cpp, I need to spend a bit more time tracking 
this down but basically given a file with a single row on hdfs, reading it with 
pyarrow yields this error

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect()  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

 

I am working on writing a unit test for this


> [Python] Cannot read parquet files with row group size of 1 From HDFS
> -
>
> Key: ARROW-2842
> URL: https://issues.apache.org/jira/browse/ARROW-2842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Gruener
>Priority: Major
> Attachments: single-row.parquet
>
>
> This might be a bug in parquet-cpp, I need to spend a bit more time tracking 
> this down but basically given a file with a single row on hdfs, reading it 
> with pyarrow yields this error
> ```
> TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
> "10.103.182.28:50010": End of the stream
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ Unknown
>  @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
>  @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
>  @ parquet::SerializedFile::ParseMetaData()
>  @ 
> parquet::ParquetFileReader::Contents::Open(std::unique_ptr  std::default_delete >, 
> parquet::ReaderProperties const&, std::shared_ptr 
> const&)
>  @ 
> parquet::ParquetFileReader::Open(std::unique_ptr std::default_delete >, parquet::ReaderProperties 
> const&, std::shared_ptr const&)
>  @ parquet::arrow::OpenFile(std::shared_ptr 
> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
> std::shared_ptr const&, 
> std::unique_ptr std::default_delete >*)
>  @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
> _object*)
> ```
> The following code causes it:
> ```
> import pyarrow
> import pyarrow.parquet as pq
>  
> fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fil

[jira] [Created] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS

2018-07-12 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2842:
-

 Summary: [Python] Cannot read parquet files with row group size of 
1 From HDFS
 Key: ARROW-2842
 URL: https://issues.apache.org/jira/browse/ARROW-2842
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Gruener
 Attachments: single-row.parquet

This might be a bug in parquet-cpp; I need to spend a bit more time tracking 
this down. Basically, given a file with a single row on HDFS, reading it with 
pyarrow yields this error:

```

TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from 
"10.103.182.28:50010": End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ 
parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ 
parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties 
const&, std::shared_ptr const&)
 @ parquet::arrow::OpenFile(std::shared_ptr 
const&, arrow::MemoryPool*, parquet::ReaderProperties const&, 
std::shared_ptr const&, 
std::unique_ptr >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, 
_object*)

```

The following code causes it:

```
import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect()  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file
pq.read_metadata(file_object)  # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error
```

 

I am working on writing a unit test for this





[jira] [Created] (ARROW-2841) [Go] Fix recent Go build failures in Travis CI

2018-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2841:
---

 Summary: [Go] Fix recent Go build failures in Travis CI
 Key: ARROW-2841
 URL: https://issues.apache.org/jira/browse/ARROW-2841
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Antoine Pitrou
 Fix For: 0.10.0


I'm getting persistent failures in the Go job on Travis-CI:
https://travis-ci.org/pitrou/arrow/jobs/403221354

Is this expected?  Excerpt:

{code}
$ go get -t -v ./...
github.com/apache/arrow (download)
github.com/stretchr/testify (download)
go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed
go/arrow/array/array.go:23:2: use of internal package not allowed
go/arrow/array/array.go:24:2: use of internal package not allowed
go/arrow/math/math_amd64.go:22:2: use of internal package not allowed
go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed
go/arrow/memory/buffer.go:22:2: use of internal package not allowed
The command "eval go get -t -v ./... " failed. Retrying, 2 of 3.
{code}
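
For context, Go rejects an import whose path contains an `internal` element unless the importing package lives in the tree rooted at the parent of that `internal` directory. The Python sketch below models that rule on import paths only. The package paths are illustrative, and the fork path is a hypothetical example of how a checkout under a different import path could trigger the error; this thread does not confirm that as the cause.

```python
def internal_import_allowed(importer: str, imported: str) -> bool:
    """Model of Go's internal-package rule, applied to import paths only."""
    parts = imported.split("/")
    if "internal" not in parts:
        return True  # no restriction applies
    # Use the last "internal" element in the imported path.
    idx = len(parts) - 1 - parts[::-1].index("internal")
    root = "/".join(parts[:idx])  # parent of the internal directory
    # The importer must be the root itself or live somewhere below it.
    return importer == root or importer.startswith(root + "/")

# Illustrative paths (assumptions, not taken from the build log above).
pkg = "github.com/apache/arrow/go/arrow/internal/debug"
print(internal_import_allowed("github.com/apache/arrow/go/arrow/array", pkg))
print(internal_import_allowed("github.com/pitrou/arrow/go/arrow/array", pkg))
```

Under this model the first import is inside the tree rooted at `github.com/apache/arrow/go/arrow` and is allowed, while the second comes from a different root and is rejected.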





[jira] [Created] (ARROW-2840) [C++] See if stream alignment logic can be simplified

2018-07-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2840:
---

 Summary: [C++] See if stream alignment logic can be simplified
 Key: ARROW-2840
 URL: https://issues.apache.org/jira/browse/ARROW-2840
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.10.0


There were a couple of patches related to 64-byte alignment for tensor messages at 
the beginning of April that I've been concerned about: 

ARROW-2308: 
https://github.com/apache/arrow/commit/26bc4ab5a31e3430e1d545068e5a5a5ba5bc7a22#diff-bf4eb56f3cb806c32f60ae67afbb1bf0

ARROW-2437: 
https://github.com/apache/arrow/commit/0f87c12d45250ee763ac8c43b7e57e8f06a0b9f3

In my opinion, the alignment issue should be encapsulated in the protocol if 
possible; having an argument that toggles whether the file is to be aligned after 
reading the metadata seems wrong to me. The bytes written for alignment 
purposes should be accounted for in the metadata size prefix in the IPC message.
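
One way to read this proposal: pad the metadata up to the alignment boundary and let the size prefix cover the padding, so a reader lands on an aligned offset without any extra flag. The sketch below assumes a 4-byte little-endian length prefix and 64-byte alignment. It is a simplified framing with hypothetical function names, not Arrow's actual IPC writer.

```python
import struct

ALIGNMENT = 64   # assumed alignment boundary, in bytes
PREFIX_SIZE = 4  # assumed 4-byte little-endian length prefix

def frame_metadata(metadata: bytes, start_offset: int = 0) -> bytes:
    """Return prefix + metadata + padding so that the body starts aligned.

    The prefix value counts the padding, so readers need no alignment flag.
    """
    unpadded_end = start_offset + PREFIX_SIZE + len(metadata)
    padding = -unpadded_end % ALIGNMENT
    size = len(metadata) + padding
    return struct.pack("<i", size) + metadata + b"\x00" * padding

def read_metadata(buf: bytes, offset: int = 0):
    """Return (padded metadata bytes, offset at which the body begins)."""
    (size,) = struct.unpack_from("<i", buf, offset)
    start = offset + PREFIX_SIZE
    return buf[start:start + size], start + size

framed = frame_metadata(b"x" * 10)
meta, body_offset = read_metadata(framed)
print(len(framed), body_offset, body_offset % ALIGNMENT)
```

The reader only follows the prefix; because the writer already folded the padding into that value, the returned body offset is aligned by construction.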

Since there are unit tests for this, I'd like to take a look and see if I can 
improve this without breaking the tests.





[jira] [Updated] (ARROW-2836) [Packaging] Expand build matrices to multiple tasks

2018-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2836:
--
Labels: pull-request-available  (was: )

> [Packaging] Expand build matrices to multiple tasks
> ---
>
> Key: ARROW-2836
> URL: https://issues.apache.org/jira/browse/ARROW-2836
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Create release tags for each task instead of a single one for the job, and 
> create additional tasks per build variant. 
> This will resolve the package naming conflicts, but increase the config 
> verbosity (which is fine for now).
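
As an illustration of the idea, the sketch below expands a build matrix into one task per variant, each with its own release tag, so that artifact names cannot collide. The matrix keys, task names, and tag format are hypothetical and are not taken from the Arrow packaging configuration.

```python
from itertools import product

def expand_matrix(job: str, matrix: dict) -> list:
    """Create one task per combination of matrix values, each uniquely tagged."""
    keys = sorted(matrix)  # fixed key order keeps task names deterministic
    tasks = []
    for values in product(*(matrix[k] for k in keys)):
        variant = dict(zip(keys, values))
        suffix = "-".join(str(variant[k]) for k in keys)
        tasks.append({
            "task": f"{job}-{suffix}",
            "tag": f"release-{job}-{suffix}",  # hypothetical tag format
            "variant": variant,
        })
    return tasks

# Hypothetical matrix: two operating systems times two Python versions.
tasks = expand_matrix("wheel", {"python": ["2.7", "3.6"], "os": ["linux", "osx"]})
for t in tasks:
    print(t["task"], t["tag"])
```

The cost is visible in the output: four explicit tasks instead of one job, which is the extra config verbosity the description accepts for now.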





[jira] [Updated] (ARROW-2836) [Packaging] Expand build matrices to multiple tasks

2018-07-12 Thread Krisztian Szucs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-2836:
---
Summary: [Packaging] Expand build matrices to multiple tasks  (was: 
[Packaging] Expanding build matrices to multiple tasks)

> [Packaging] Expand build matrices to multiple tasks
> ---
>
> Key: ARROW-2836
> URL: https://issues.apache.org/jira/browse/ARROW-2836
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.10.0
>
>
> Create release tags for each task instead of a single one for the job, and 
> create additional tasks per build variant. 
> This will resolve the package naming conflicts, but increase the config 
> verbosity (which is fine for now).





[jira] [Updated] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer

2018-07-12 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor updated ARROW-2839:
---
Fix Version/s: (was: 0.4.0)
   JS-0.4.0

> [JS] Support whatwg/streams in IPC reader/writer
> 
>
> Key: ARROW-2839
> URL: https://issues.apache.org/jira/browse/ARROW-2839
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: JS-0.3.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
> Fix For: JS-0.4.0
>
>
> We should make it easy to stream Arrow in the browser via 
> [whatwg/streams|https://github.com/whatwg/streams]. I already have this 
> working at Graphistry, but I had to use some of the IPC internal methods. 
> Creating this issue to track back-porting that work and the few minor 
> refactors to the IPC internals that we'll need to do.





[jira] [Created] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer

2018-07-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2839:
--

 Summary: [JS] Support whatwg/streams in IPC reader/writer
 Key: ARROW-2839
 URL: https://issues.apache.org/jira/browse/ARROW-2839
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.0


We should make it easy to stream Arrow in the browser via 
[whatwg/streams|https://github.com/whatwg/streams]. I already have this working 
at Graphistry, but I had to use some of the IPC internal methods. Creating this 
issue to track back-porting that work and the few minor refactors to the IPC 
internals that we'll need to do.





[jira] [Assigned] (ARROW-2771) [JS] Add row proxy object accessor

2018-07-12 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-2771:
--

Assignee: Paul Taylor  (was: Brian Hulette)

> [JS] Add row proxy object accessor
> --
>
> Key: ARROW-2771
> URL: https://issues.apache.org/jira/browse/ARROW-2771
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The {{Table}} class would be much easier to interact with if it returned 
> familiar JavaScript objects representing a row. As Jeff Heer 
> [demonstrated|https://beta.observablehq.com/@jheer/from-apache-arrow-to-javascript-objects]
>  it's possible to create JS Proxy objects that read directly from Arrow 
> memory. We should generate these types of objects in {{Table.get}} and in the 
> {{Table}} iterator.
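
The row-proxy idea is not tied to JavaScript. Below is a minimal sketch in Python, assuming a hypothetical column store held as a dict of lists. `RowProxy` and `ColumnTable` are illustrative names only and are not part of any Arrow API. Each attribute access reads from the underlying column on demand, so no per-row copy is made.

```python
class RowProxy:
    """Lazy view of one row; values are fetched from the columns on access."""

    __slots__ = ("_columns", "_index")

    def __init__(self, columns, index):
        self._columns = columns
        self._index = index

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for column names.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._columns[name][self._index]
        except KeyError:
            raise AttributeError(name) from None


class ColumnTable:
    """Hypothetical columnar table: a dict mapping column name -> list."""

    def __init__(self, columns):
        self._columns = columns
        self._length = len(next(iter(columns.values()))) if columns else 0

    def __len__(self):
        return self._length

    def get(self, index):
        if not 0 <= index < self._length:
            raise IndexError(index)
        return RowProxy(self._columns, index)

    def __iter__(self):
        return (RowProxy(self._columns, i) for i in range(self._length))
```

Because the proxy keeps only a reference to the columns and a row index, a later change to the column data is visible through an existing proxy.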





[jira] [Assigned] (ARROW-2705) [JS] CombinationPredicates should take list of predicates

2018-07-12 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-2705:
--

Assignee: Paul Taylor  (was: Brian Hulette)

> [JS] CombinationPredicates should take list of predicates
> -
>
> Key: ARROW-2705
> URL: https://issues.apache.org/jira/browse/ARROW-2705
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Paul Taylor
>Priority: Minor
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CombinationPredicates like And and Or take only two predicates: left and 
> right. However, it often makes sense to combine more than two predicates. I 
> suggest that `and` and `or` support arbitrarily many arguments. 
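
The variadic form suggested above can be sketched in a few lines. This is a minimal illustration in Python rather than the Arrow JS code; `Predicate`, `and_`, `or_` and `col_gt` are hypothetical names (the trailing underscores are needed only because `and` and `or` are Python keywords).

```python
class Predicate:
    """Wraps a function that maps a row to a boolean."""

    def __init__(self, fn):
        self._fn = fn

    def __call__(self, row):
        return bool(self._fn(row))


def and_(*predicates):
    """Match when every predicate matches (any number of arguments)."""
    return Predicate(lambda row: all(p(row) for p in predicates))


def or_(*predicates):
    """Match when at least one predicate matches (any number of arguments)."""
    return Predicate(lambda row: any(p(row) for p in predicates))


def col_gt(name, value):
    """Hypothetical leaf predicate: row[name] > value."""
    return Predicate(lambda row: row[name] > value)
```

With this shape, three conditions combine as `and_(a, b, c)` instead of nesting two-argument calls.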





[jira] [Assigned] (ARROW-2705) [JS] CombinationPredicates should take list of predicates

2018-07-12 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-2705:
--

Assignee: Brian Hulette

> [JS] CombinationPredicates should take list of predicates
> -
>
> Key: ARROW-2705
> URL: https://issues.apache.org/jira/browse/ARROW-2705
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Brian Hulette
>Priority: Minor
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CombinationPredicates like And and Or take only two predicates: left and 
> right. However, it often makes sense to combine more than two predicates. I 
> suggest that `and` and `or` support arbitrarily many arguments. 





[jira] [Assigned] (ARROW-2767) [JS] Add generic to Table for column names

2018-07-12 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-2767:
--

Assignee: Paul Taylor  (was: Brian Hulette)

> [JS] Add generic to Table for column names
> --
>
> Key: ARROW-2767
> URL: https://issues.apache.org/jira/browse/ARROW-2767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Requested by [~domoritz]
> Something like:
> {code:javascript}
> class Table {
>   ...
>   getColumn(name: ColName): Vector {
>   }
>   ...
> }
> {code}
> It would be even better if we could find a way to map the column names to the 
> actual vector data types, but one thing at a time.





[jira] [Assigned] (ARROW-2767) [JS] Add generic to Table for column names

2018-07-12 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-2767:
--

Assignee: Brian Hulette

> [JS] Add generic to Table for column names
> --
>
> Key: ARROW-2767
> URL: https://issues.apache.org/jira/browse/ARROW-2767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Requested by [~domoritz]
> Something like:
> {code:javascript}
> class Table {
>   ...
>   getColumn(name: ColName): Vector {
>   }
>   ...
> }
> {code}
> It would be even better if we could find a way to map the column names to the 
> actual vector data types, but one thing at a time.





[jira] [Assigned] (ARROW-2819) [JS] Fails to build with TS 2.8.3

2018-07-12 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-2819:
--

Assignee: Paul Taylor

> [JS] Fails to build with TS 2.8.3
> -
>
> Key: ARROW-2819
> URL: https://issues.apache.org/jira/browse/ARROW-2819
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: JS-0.3.1
>Reporter: Brian Hulette
>Assignee: Paul Taylor
>Priority: Major
> Fix For: JS-0.4.0
>
>
> See the [GitHub 
> issue|https://github.com/apache/arrow/issues/2115#issuecomment-403612925]





[jira] [Updated] (ARROW-2838) [Python] Speed up null testing with Pandas semantics

2018-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2838:
--
Labels: pull-request-available  (was: )

> [Python] Speed up null testing with Pandas semantics
> 
>
> Key: ARROW-2838
> URL: https://issues.apache.org/jira/browse/ARROW-2838
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> The {{PandasObjectIsNull}} helper function can be a significant contributor 
> to the running time when converting a Pandas dataframe to Arrow format (e.g. 
> when writing a dataframe to Feather format). We can try to speed up the type 
> checks in that function.





[jira] [Created] (ARROW-2838) [Python] Speed up null testing with Pandas semantics

2018-07-12 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2838:
-

 Summary: [Python] Speed up null testing with Pandas semantics
 Key: ARROW-2838
 URL: https://issues.apache.org/jira/browse/ARROW-2838
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


The {{PandasObjectIsNull}} helper function can be a significant contributor 
to the running time when converting a Pandas dataframe to Arrow format (e.g. 
when writing a dataframe to Feather format). We can try to speed up the type 
checks in that function.
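
One way to picture the kind of speed-up meant here is to test exact types first and fall back to subclass checks only when needed. The sketch below is pure Python and is not the actual {{PandasObjectIsNull}} code (which is C++). `is_null_pandas` is a hypothetical name, and the set of values treated as null (None, float NaN, Decimal NaN) is an assumption made for illustration.

```python
import math
from decimal import Decimal


def is_null_pandas(obj):
    """Return True for None, float NaN and Decimal NaN (assumed null set)."""
    if obj is None:
        return True
    tp = type(obj)
    # Fast path: identity checks on the exact type avoid walking the
    # inheritance chain for the most common element types.
    if tp is float:
        return obj != obj  # NaN is the only float not equal to itself
    if tp is str or tp is bytes or tp is int:
        return False
    # Slow path: subclasses of float, and Decimal values.
    if isinstance(obj, float):
        return math.isnan(obj)
    if isinstance(obj, Decimal):
        return obj.is_nan()
    return False
```

The ordering matters more than the individual checks: strings and plain floats, which dominate typical object columns, return before any subclass test runs.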





[jira] [Commented] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position

2018-07-12 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541972#comment-16541972
 ] 

Antoine Pitrou commented on ARROW-2835:
---

In that case it would mean only advancing the position if we're on Windows :) I 
don't think there's a nice way out of this. If we want to keep the 
parallelization benefits of ReadAt and WriteAt, we'll need to live with the 
platform specifics, I think. [~wesmckinn]

> [C++] ReadAt/WriteAt are inconsistent with moving the files position
> 
>
> Key: ARROW-2835
> URL: https://issues.apache.org/jira/browse/ARROW-2835
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Dimitri Vorona
>Priority: Major
>
> Right now, there is inconsistent behaviour regarding moving the file's 
> position pointer after calling ReadAt or WriteAt. For example, the default 
> implementation of ReadAt seeks to the desired offset and calls Read, which 
> moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change 
> the position. WriteableFile::WriteAt seems to move the position in the current 
> implementation, but there is no docstring which prescribes this behaviour.
> Antoine suggested that the *At methods shouldn't touch the position, which 
> makes more sense, IMHO. The change isn't huge and doesn't seem to break 
> anything internally, but it might break existing user code.





[jira] [Commented] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position

2018-07-12 Thread Dimitri Vorona (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541962#comment-16541962
 ] 

Dimitri Vorona commented on ARROW-2835:
---

OK, then we should at least be consistent across implementations, i.e. advance 
the position in MemoryMappedFile::ReadAt, right?

> [C++] ReadAt/WriteAt are inconsistent with moving the files position
> 
>
> Key: ARROW-2835
> URL: https://issues.apache.org/jira/browse/ARROW-2835
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Dimitri Vorona
>Priority: Major
>
> Right now, there is inconsistent behaviour regarding moving the file's 
> position pointer after calling ReadAt or WriteAt. For example, the default 
> implementation of ReadAt seeks to the desired offset and calls Read, which 
> moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change 
> the position. WriteableFile::WriteAt seems to move the position in the current 
> implementation, but there is no docstring which prescribes this behaviour.
> Antoine suggested that the *At methods shouldn't touch the position, which 
> makes more sense, IMHO. The change isn't huge and doesn't seem to break 
> anything internally, but it might break existing user code.





[jira] [Commented] (ARROW-2553) [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build

2018-07-12 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541955#comment-16541955
 ] 

Wes McKinney commented on ARROW-2553:
-

Sorry -- I had commented on the PR. If we fix the wheels there, then ARROW-2326 
will probably be fixed

> [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build
> 
>
> Key: ARROW-2553
> URL: https://issues.apache.org/jira/browse/ARROW-2553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe L. Korn
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The current `pyarrow` wheels are not usable on older OSX releases due to a 
> problem in the newest Xcode SDK. We need to set {{MACOSX_DEPLOYMENT_TARGET}} 
> to an older OSX release to avoid getting {{Symbol not found: 
> _os_unfair_lock_lock}}.





[jira] [Commented] (ARROW-2553) [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build

2018-07-12 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541943#comment-16541943
 ] 

Krisztian Szucs commented on ARROW-2553:


[~wesmckinn] Uwe has already checked, doesn't work properly yet - see the PR.

> [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build
> 
>
> Key: ARROW-2553
> URL: https://issues.apache.org/jira/browse/ARROW-2553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe L. Korn
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The current `pyarrow` wheels are not usable on older OSX releases due to a 
> problem in the newest Xcode SDK. We need to set {{MACOSX_DEPLOYMENT_TARGET}} 
> to an older OSX release to avoid getting {{Symbol not found: 
> _os_unfair_lock_lock}}.





[jira] [Commented] (ARROW-2059) [Python] Possible performance regression in Feather read/write path

2018-07-12 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541928#comment-16541928
 ] 

Antoine Pitrou commented on ARROW-2059:
---

I used Linux "perf" on a benchmark script, then ran "perf report" to get call 
trees, and then manually fed a paste of that into "c++filt".

The most accessible resource I've found about the "perf" utility is 
http://www.brendangregg.com/perf.html

> [Python] Possible performance regression in Feather read/write path
> ---
>
> Key: ARROW-2059
> URL: https://issues.apache.org/jira/browse/ARROW-2059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
> Fix For: 0.11.0
>
>
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be 
> investigated





[jira] [Commented] (ARROW-2059) [Python] Possible performance regression in Feather read/write path

2018-07-12 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541910#comment-16541910
 ] 

Wes McKinney commented on ARROW-2059:
-

Thanks [~pitrou], do you have some tricks for creating profiles like that (I 
have made FlameGraphs but not much more)?

> [Python] Possible performance regression in Feather read/write path
> ---
>
> Key: ARROW-2059
> URL: https://issues.apache.org/jira/browse/ARROW-2059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
> Fix For: 0.11.0
>
>
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be 
> investigated





[jira] [Commented] (ARROW-2059) [Python] Possible performance regression in Feather read/write path

2018-07-12 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541898#comment-16541898
 ] 

Antoine Pitrou commented on ARROW-2059:
---

Here is a profile tree obtained by exercising the write path on Python 3.6:
{code}
   - arrow::py::NumPyConverter::ConvertObjectStrings()
      - 80,27% arrow::py::AppendObjectStrings(tagPyArrayObject*, tagPyArrayObject*, long, bool, arrow::StringBuilder*, long*, bool*)
         - 50,74% arrow::py::internal::BuilderAppend(arrow::StringBuilder*, _object*, bool, bool*)
            - 24,95% arrow::BinaryBuilder::Append(unsigned char const*, int)
                 7,43% arrow::BinaryBuilder::AppendNextOffset()
               + 6,28% arrow::BufferBuilder::Resize(long, bool)
                 2,30% __memcpy_avx_unaligned
                 0,71% arrow::ArrayBuilder::Reserve(long)
              6,16% PyUnicode_AsUTF8AndSize
            + 4,37% PyErr_Occurred
         - 16,70% arrow::py::internal::PandasObjectIsNull(_object*)
            - 8,29% arrow::py::internal::PyDecimal_Check(_object*)
                 PyType_IsSubtype
            - 4,59% arrow::py::internal::PyFloat_IsNaN(_object*)
                 PyType_IsSubtype
           2,51% PyArray_MultiplyList
           2,41% PyType_IsSubtype
      + 1,57% arrow::ArrayBuilder::Finish(std::shared_ptr*)
{code}


> [Python] Possible performance regression in Feather read/write path
> ---
>
> Key: ARROW-2059
> URL: https://issues.apache.org/jira/browse/ARROW-2059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
> Fix For: 0.11.0
>
>
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be 
> investigated





[jira] [Reopened] (ARROW-2782) [Python] Ongoing Travis CI failures in Plasma unit tests

2018-07-12 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-2782:
-

> [Python] Ongoing Travis CI failures in Plasma unit tests
> 
>
> Key: ARROW-2782
> URL: https://issues.apache.org/jira/browse/ARROW-2782
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> e.g.
> {code}
> _____________________________ test_use_huge_pages _____________________________
>     @pytest.mark.skipif(not os.path.exists("/mnt/hugepages"),
>                         reason="requires hugepage support")
>     def test_use_huge_pages():
>         import pyarrow.plasma as plasma
>         with plasma.start_plasma_store(
>                 plasma_store_memory=2*10**9,
>                 plasma_directory="/mnt/hugepages",
>                 use_hugepages=True) as (plasma_store_name, p):
>             plasma_client = plasma.connect(plasma_store_name, "", 64)
> >           create_object(plasma_client, 10**8)
> pyarrow/tests/test_plasma.py:773: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> pyarrow/tests/test_plasma.py:79: in create_object
>     seal=seal)
> pyarrow/tests/test_plasma.py:68: in create_object_with_id
>     memory_buffer = client.create(object_id, data_size, metadata)
> pyarrow/_plasma.pyx:300: in pyarrow._plasma.PlasmaClient.create
>     check_status(self.client.get().Create(object_id.data, data_size,
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   raise PlasmaStoreFull(message)
> E   PlasmaStoreFull: /home/travis/build/apache/arrow/cpp/src/plasma/client.cc:375 code: ReadCreateReply(buffer.data(), buffer.size(), &id, &object, &store_fd, &mmap_size)
> E   object does not fit in the plasma store
> pyarrow/error.pxi:99: PlasmaStoreFull
> {code}





[jira] [Commented] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer

2018-07-12 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541881#comment-16541881
 ] 

Wes McKinney commented on ARROW-2837:
-

This is used in parquet-cpp 
(https://github.com/apache/parquet-cpp/search?q=PoolBuffer&unscoped_q=PoolBuffer)
 so if we did that, we would want to fix up that library so it can release 
shortly after Arrow 0.10.0 comes out. cc [~xhochy]

> [C++] ArrayBuilder::null_bitmap returns PoolBuffer
> --
>
> Key: ARROW-2837
> URL: https://issues.apache.org/jira/browse/ARROW-2837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Dimitri Vorona
>Priority: Major
> Fix For: 0.10.0
>
>
> A simple buffer (as in the case of ArrayBuilder::Data) seems to be enough to 
> me, and it doesn't break anything.





[jira] [Commented] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer

2018-07-12 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541879#comment-16541879
 ] 

Wes McKinney commented on ARROW-2837:
-

Agreed. We should probably move its implementation into {{memory_pool.cc}} and 
provide only factory APIs that return ResizableBuffer

> [C++] ArrayBuilder::null_bitmap returns PoolBuffer
> --
>
> Key: ARROW-2837
> URL: https://issues.apache.org/jira/browse/ARROW-2837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Dimitri Vorona
>Priority: Major
> Fix For: 0.10.0
>
>
> A simple buffer (as in the case of ArrayBuilder::Data) seems to be enough to 
> me, and it doesn't break anything.





[jira] [Updated] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer

2018-07-12 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2837:

Fix Version/s: 0.10.0

> [C++] ArrayBuilder::null_bitmap returns PoolBuffer
> --
>
> Key: ARROW-2837
> URL: https://issues.apache.org/jira/browse/ARROW-2837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Dimitri Vorona
>Priority: Major
> Fix For: 0.10.0
>
>
> A simple buffer (as in the case of ArrayBuilder::Data) seems to be enough to 
> me, and it doesn't break anything.





[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-07-12 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541803#comment-16541803
 ] 

Robert Gruener commented on ARROW-1983:
---

[~xhochy] I made this dependent task PARQUET-1348

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Major
>  Labels: beginner, parquet
> Fix For: 0.11.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that passes them on as C++ objects to {{parquet-cpp}}, 
> which generates the respective {{_metadata}} file.
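
To make the filtering benefit concrete, here is a minimal sketch in pure Python of how per-row-group min/max statistics from a summary file could be used to skip row groups. It does not read Parquet or use any pyarrow API. `RowGroupStats` and `select_row_groups` are hypothetical names, and the statistics are assumed to be for a single integer column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    path: str        # data file that contains the row group
    row_group: int   # index of the row group within that file
    min_value: int   # smallest value of the column in this row group
    max_value: int   # largest value of the column in this row group


def select_row_groups(stats, lower, upper):
    """Keep the row groups whose [min, max] range overlaps [lower, upper]."""
    return [
        s for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]
```

Only the files named in the result would then need to be opened; the rest are ruled out by the summary alone.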





[jira] [Updated] (ARROW-2825) [C++] Need AllocateBuffer / AllocateResizableBuffer variant with default memory pool

2018-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2825:
--
Labels: pull-request-available  (was: )

> [C++] Need AllocateBuffer / AllocateResizableBuffer variant with default 
> memory pool
> 
>
> Key: ARROW-2825
> URL: https://issues.apache.org/jira/browse/ARROW-2825
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> It's not very practical that you have to pass the default memory pool 
> explicitly to {{AllocateBuffer}}.





[jira] [Commented] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer

2018-07-12 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541580#comment-16541580
 ] 

Antoine Pitrou commented on ARROW-2837:
---

Yes, I think PoolBuffer should stop being visible as an API (it can stay around 
as a private implementation detail). [~wesmckinn] Any thoughts?

> [C++] ArrayBuilder::null_bitmap returns PoolBuffer
> --
>
> Key: ARROW-2837
> URL: https://issues.apache.org/jira/browse/ARROW-2837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Dimitri Vorona
>Priority: Major
>
> A simple buffer (as in the case of ArrayBuilder::Data) seems to be enough to 
> me, and it doesn't break anything.





[jira] [Created] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer

2018-07-12 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-2837:
-

 Summary: [C++] ArrayBuilder::null_bitmap returns PoolBuffer
 Key: ARROW-2837
 URL: https://issues.apache.org/jira/browse/ARROW-2837
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.9.0
Reporter: Dimitri Vorona


A simple buffer (as in the case of ArrayBuilder::Data) seems to be enough to 
me, and it doesn't break anything.





[jira] [Commented] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them

2018-07-12 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541519#comment-16541519
 ] 

Krisztian Szucs commented on ARROW-2760:


See the second block 
https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#properties
 
{code}
cdef class Spam:

    property cheese:

        "A doc string can go here."

        def __get__(self):
            # This is called when the property is read.
            ...

        def __set__(self, value):
            # This is called when the property is written.
            ...

        def __del__(self):
            # This is called when the property is deleted.
            ...
{code}
syntax is deprecated.
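
For comparison, the decorator form that replaces it looks roughly like this. The sketch is written as a plain Python class so that it runs as-is; the linked Cython guide describes the same decorators for extension types. The `_cheese` attribute is an assumption added for illustration.

```python
class Spam:
    def __init__(self):
        self._cheese = None  # backing storage, assumed for this example

    @property
    def cheese(self):
        "A doc string can go here."
        # Called when the property is read.
        return self._cheese

    @cheese.setter
    def cheese(self, value):
        # Called when the property is written.
        self._cheese = value

    @cheese.deleter
    def cheese(self):
        # Called when the property is deleted.
        self._cheese = None
```

Reads, writes and `del spam.cheese` then go through the three methods above, just as with the `property cheese:` block.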

> [Python] Remove legacy property definition syntax from parquet module and 
> test them
> ---
>
> Key: ARROW-2760
> URL: https://issues.apache.org/jira/browse/ARROW-2760
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them

2018-07-12 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541517#comment-16541517
 ] 

Antoine Pitrou commented on ARROW-2760:
---

Can you elaborate on what you call "legacy property definition syntax"?

> [Python] Remove legacy property definition syntax from parquet module and 
> test them
> ---
>
> Key: ARROW-2760
> URL: https://issues.apache.org/jira/browse/ARROW-2760
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-2822) [C++] Zero padding bytes in PoolBuffer::Resize

2018-07-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2822:
--
Labels: pull-request-available  (was: )

> [C++] Zero padding bytes in PoolBuffer::Resize
> --
>
> Key: ARROW-2822
> URL: https://issues.apache.org/jira/browse/ARROW-2822
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> This is a proposed fix which would supersede some of the changes made in 
> ARROW-2790 and elsewhere.





[jira] [Created] (ARROW-2836) [Packaging] Expanding build matrices to multiple tasks

2018-07-12 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2836:
--

 Summary: [Packaging] Expanding build matrices to multiple tasks
 Key: ARROW-2836
 URL: https://issues.apache.org/jira/browse/ARROW-2836
 Project: Apache Arrow
  Issue Type: Task
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.10.0


Create release tags for each task instead of a single one for the job, and 
create additional tasks per build variant. 
This will resolve the package naming conflicts, but increases the config 
verbosity (which is fine for now).
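
As a rough illustration of the expansion, the sketch below turns one job with a build matrix into one task per variant, each with its own release tag. `expand_tasks`, the task naming scheme and the tag format are all hypothetical and are not taken from the packaging scripts.

```python
from itertools import product


def expand_tasks(build_id, job, matrix):
    """Return {task_name: release_tag}, one entry per build variant.

    `matrix` maps an axis name (e.g. "os") to its list of values.
    """
    axes = sorted(matrix)  # fixed axis order gives stable task names
    tasks = {}
    for combo in product(*(matrix[axis] for axis in axes)):
        task = "-".join([job, *map(str, combo)])
        tasks[task] = f"{build_id}-{task}"  # one tag per task, not per job
    return tasks
```

Since every variant gets a distinct task name and tag, two variants can no longer upload artifacts under the same release, at the cost of a longer task list.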





[jira] [Commented] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position

2018-07-12 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541394#comment-16541394
 ] 

Antoine Pitrou commented on ARROW-2835:
---

Unfortunately, things are a bit more complicated, as in some cases ReadAt / 
WriteAt are forced to update the file position anyway (on Windows, see the 
implementation of {{FileRead}} in {{io-util.cc}}).

> [C++] ReadAt/WriteAt are inconsistent with moving the files position
> 
>
> Key: ARROW-2835
> URL: https://issues.apache.org/jira/browse/ARROW-2835
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Dimitri Vorona
>Priority: Major
>
> Right now, there is inconsistent behaviour regarding moving the file's 
> position pointer after calling ReadAt or WriteAt. For example, the default 
> implementation of ReadAt seeks to the desired offset and calls Read, which 
> moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change 
> the position. WriteableFile::WriteAt seems to move the position in the current 
> implementation, but there is no docstring which prescribes this behaviour.
> Antoine suggested that the *At methods shouldn't touch the position, which 
> makes more sense, IMHO. The change isn't huge and doesn't seem to break 
> anything internally, but it might break existing user code.





[jira] [Created] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position

2018-07-12 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-2835:
-

 Summary: [C++] ReadAt/WriteAt are inconsistent with moving the 
files position
 Key: ARROW-2835
 URL: https://issues.apache.org/jira/browse/ARROW-2835
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Dimitri Vorona


Right now, there is inconsistent behaviour regarding moving the file's position 
pointer after calling ReadAt or WriteAt. For example, the default 
implementation of ReadAt seeks to the desired offset and calls Read, which moves 
the position pointer. MemoryMappedFile::ReadAt, however, doesn't change the 
position. WriteableFile::WriteAt seems to move the position in the current 
implementation, but there is no docstring which prescribes this behaviour.

Antoine suggested that the *At methods shouldn't touch the position, which makes 
more sense, IMHO. The change isn't huge and doesn't seem to break anything 
internally, but it might break existing user code.
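
The two behaviours can be reproduced outside Arrow with plain file descriptors. The sketch below uses Python's `os` module and is not the Arrow C++ implementation. `read_at_seek`, `read_at_pread` and `demo` are illustrative names, and `os.pread` is available on Unix only, so this runs on POSIX systems.

```python
import os
import tempfile


def read_at_seek(fd, offset, nbytes):
    """Seek, then read: the file position ends at offset + bytes read."""
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, nbytes)


def read_at_pread(fd, offset, nbytes):
    """Positional read via pread(2): the file position is left untouched."""
    return os.pread(fd, nbytes, offset)


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 0, os.SEEK_SET)  # start from a known position

        first = read_at_pread(fd, 4, 3)
        pos_after_pread = os.lseek(fd, 0, os.SEEK_CUR)

        second = read_at_seek(fd, 4, 3)
        pos_after_seek = os.lseek(fd, 0, os.SEEK_CUR)
        return first, pos_after_pread, second, pos_after_seek
    finally:
        os.close(fd)
        os.remove(path)
```

Both calls return the same bytes, but only the seek-based variant changes what a following plain read would see, which is the kind of difference user code can come to depend on.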


