[jira] [Updated] (ARROW-2841) [Go] Fix recent Go build failures in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2841: -- Labels: pull-request-available (was: ) > [Go] Fix recent Go build failures in Travis CI > -- > > Key: ARROW-2841 > URL: https://issues.apache.org/jira/browse/ARROW-2841 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > I'm getting persistent failures in the Go job on Travis-CI: > https://travis-ci.org/pitrou/arrow/jobs/403221354 > Is this expected? Excerpt: > {code} > $ go get -t -v ./... > github.com/apache/arrow (download) > github.com/stretchr/testify (download) > go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed > go/arrow/array/array.go:23:2: use of internal package not allowed > go/arrow/array/array.go:24:2: use of internal package not allowed > go/arrow/math/math_amd64.go:22:2: use of internal package not allowed > go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed > go/arrow/memory/buffer.go:22:2: use of internal package not allowed > The command "eval go get -t -v ./... " failed. Retrying, 2 of 3. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility
[ https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542495#comment-16542495 ] Wes McKinney commented on ARROW-2843: - In my opinion, we haven't worked hard enough as a community to stabilize the metadata. We have failed to make progress since mid-2017, why is that? I have been trying consistently to advance to a 1.0 release; can we not work together to finalize the binary protocol? > [Format/C++/Java] Removing field layout from Schema.fbs breaks backward > compatibility > - > > Key: ARROW-2843 > URL: https://issues.apache.org/jira/browse/ARROW-2843 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 0.8.0, 0.9.0 >Reporter: Yuliya Feldman >Assignee: Yuliya Feldman >Priority: Major > > Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785] > Breaks backward compatibility with schema generated in prior versions of > Arrow, as VectorLayout field was removed from Field table. > This is definitely not supported behavior in FlatBuffers. > I suggest to put back following field > {code:java} > layout: [ VectorLayout ] (deprecated); > {code} > into > {code:java} > table Field > {code} > and mark it "deprecated" - as shown above > It will also require putting back > {code:java} > table VectorLayout > {code} > which can have all the deprecated fields as well. > code generation footprint will be minimal in this case, as FlatBuffers won't > generate getters for deprecated fields, but it will really solve backward > compatibility issues. > Also may be a good idea to introduce "version" into Schema. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility
[ https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542383#comment-16542383 ] Yuliya Feldman edited comment on ARROW-2843 at 7/13/18 12:55 AM: - Thank you [~wesmckinn] for very quick reply and thoughts. I do understand about "no guarantees" part, but why not do our best to maintain backwards compatibility, especially when it is possible. was (Author: yufeldman): Thank you @Wes for very quick reply and thoughts. I do understand about "no guarantees" part, but why not do our best to maintain backwards compatibility, especially when it is possible. > [Format/C++/Java] Removing field layout from Schema.fbs breaks backward > compatibility > - > > Key: ARROW-2843 > URL: https://issues.apache.org/jira/browse/ARROW-2843 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 0.8.0, 0.9.0 >Reporter: Yuliya Feldman >Assignee: Yuliya Feldman >Priority: Major > > Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785] > Breaks backward compatibility with schema generated in prior versions of > Arrow, as VectorLayout field was removed from Field table. > This is definitely not supported behavior in FlatBuffers. > I suggest to put back following field > {code:java} > layout: [ VectorLayout ] (deprecated); > {code} > into > {code:java} > table Field > {code} > and mark it "deprecated" - as shown above > It will also require putting back > {code:java} > table VectorLayout > {code} > which can have all the deprecated fields as well. > code generation footprint will be minimal in this case, as FlatBuffers won't > generate getters for deprecated fields, but it will really solve backward > compatibility issues. > Also may be a good idea to introduce "version" into Schema. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility
[ https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542383#comment-16542383 ] Yuliya Feldman edited comment on ARROW-2843 at 7/13/18 12:55 AM: - Thank you @Wes for very quick reply and thoughts. I do understand about "no guarantees" part, but why not do our best to maintain backwards compatibility, especially when it is possible. was (Author: yufeldman): Thank you Wes for very quick reply and thoughts. I do understand about "no guarantees" part, but why not do our best to maintain backwards compatibility, especially when it is possible. > [Format/C++/Java] Removing field layout from Schema.fbs breaks backward > compatibility > - > > Key: ARROW-2843 > URL: https://issues.apache.org/jira/browse/ARROW-2843 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 0.8.0, 0.9.0 >Reporter: Yuliya Feldman >Assignee: Yuliya Feldman >Priority: Major > > Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785] > Breaks backward compatibility with schema generated in prior versions of > Arrow, as VectorLayout field was removed from Field table. > This is definitely not supported behavior in FlatBuffers. > I suggest to put back following field > {code:java} > layout: [ VectorLayout ] (deprecated); > {code} > into > {code:java} > table Field > {code} > and mark it "deprecated" - as shown above > It will also require putting back > {code:java} > table VectorLayout > {code} > which can have all the deprecated fields as well. > code generation footprint will be minimal in this case, as FlatBuffers won't > generate getters for deprecated fields, but it will really solve backward > compatibility issues. > Also may be a good idea to introduce "version" into Schema. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility
[ https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542383#comment-16542383 ] Yuliya Feldman commented on ARROW-2843: --- Thank you Wes for very quick reply and thoughts. I do understand about "no guarantees" part, but why not do our best to maintain backwards compatibility, especially when it is possible. > [Format/C++/Java] Removing field layout from Schema.fbs breaks backward > compatibility > - > > Key: ARROW-2843 > URL: https://issues.apache.org/jira/browse/ARROW-2843 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 0.8.0, 0.9.0 >Reporter: Yuliya Feldman >Assignee: Yuliya Feldman >Priority: Major > > Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785] > Breaks backward compatibility with schema generated in prior versions of > Arrow, as VectorLayout field was removed from Field table. > This is definitely not supported behavior in FlatBuffers. > I suggest to put back following field > {code:java} > layout: [ VectorLayout ] (deprecated); > {code} > into > {code:java} > table Field > {code} > and mark it "deprecated" - as shown above > It will also require putting back > {code:java} > table VectorLayout > {code} > which can have all the deprecated fields as well. > code generation footprint will be minimal in this case, as FlatBuffers won't > generate getters for deprecated fields, but it will really solve backward > compatibility issues. > Also may be a good idea to introduce "version" into Schema. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them
[ https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Hagerman updated ARROW-2760: - Component/s: Python > [Python] Remove legacy property definition syntax from parquet module and > test them > --- > > Key: ARROW-2760 > URL: https://issues.apache.org/jira/browse/ARROW-2760 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Time Spent: 4.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility
[ https://issues.apache.org/jira/browse/ARROW-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542323#comment-16542323 ] Wes McKinney commented on ARROW-2843: - I'm -0 to -1 on this change. We do already have a metadata version number. We've made no guarantees of backwards compatibility in the columnar format metadata yet, so we have broken no contracts with users. To be honest, I'm not pleased that there are still a number of underspecified parts of the columnar format and metadata. As an example, I proposed a more complete Interval / Timedelta type about a year ago and these changes have still not been implemented. I would very much appreciate if Dremio and other parties depending on Arrow in production would work more proactively with me and others in the Arrow community to reach a point where we can declare a stable 1.0 version of the columnar format and its metadata. In particular, if any more breaking changes are required (I hope that there will not be), that we make them as soon as possible to minimize disruption to applications. > [Format/C++/Java] Removing field layout from Schema.fbs breaks backward > compatibility > - > > Key: ARROW-2843 > URL: https://issues.apache.org/jira/browse/ARROW-2843 > Project: Apache Arrow > Issue Type: Bug > Components: Format >Affects Versions: 0.8.0, 0.9.0 >Reporter: Yuliya Feldman >Assignee: Yuliya Feldman >Priority: Major > > Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785] > Breaks backward compatibility with schema generated in prior versions of > Arrow, as VectorLayout field was removed from Field table. > This is definitely not supported behavior in FlatBuffers. 
> I suggest to put back following field > {code:java} > layout: [ VectorLayout ] (deprecated); > {code} > into > {code:java} > table Field > {code} > and mark it "deprecated" - as shown above > It will also require putting back > {code:java} > table VectorLayout > {code} > which can have all the deprecated fields as well. > code generation footprint will be minimal in this case, as FlatBuffers won't > generate getters for deprecated fields, but it will really solve backward > compatibility issues. > Also may be a good idea to introduce "version" into Schema. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2843) [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility
Yuliya Feldman created ARROW-2843: - Summary: [Format/C++/Java] Removing field layout from Schema.fbs breaks backward compatibility Key: ARROW-2843 URL: https://issues.apache.org/jira/browse/ARROW-2843 Project: Apache Arrow Issue Type: Bug Components: Format Affects Versions: 0.9.0, 0.8.0 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Change in [ARROW-1785|https://issues.apache.org/jira/browse/ARROW-1785] Breaks backward compatibility with schema generated in prior versions of Arrow, as VectorLayout field was removed from Field table. This is definitely not supported behavior in FlatBuffers. I suggest to put back following field {code:java} layout: [ VectorLayout ] (deprecated); {code} into {code:java} table Field {code} and mark it "deprecated" - as shown above It will also require putting back {code:java} table VectorLayout {code} which can have all the deprecated fields as well. code generation footprint will be minimal in this case, as FlatBuffers won't generate getters for deprecated fields, but it will really solve backward compatibility issues. Also may be a good idea to introduce "version" into Schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS
[ https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Gruener updated ARROW-2842: -- Description: This might be a bug in parquet-cpp, I need to spend a bit more time tracking this down but basically given a file with a single row on hdfs, reading it with pyarrow yields this error ``` TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*) @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*) @ parquet::SerializedFile::ParseMetaData() @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::arrow::OpenFile(std::shared_ptr const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr const&, std::unique_ptr >*) @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*) ``` The following code causes it: ``` import pyarrow import pyarrow.parquet as pq fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in namenode information file_object = fs.open('single-row.parquet') # update for hdfs path of file pq.read_metadata(file_object) # this works parquet_file = pq.ParquetFile(file_object) parquet_file.read_row_group(0) # throws error ``` I am working on writing a unit test for this. Note that I am using libhdfs3. 
was: This might be a bug in parquet-cpp, I need to spend a bit more time tracking this down but basically given a file with a single row on hdfs, reading it with pyarrow yields this error ``` TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from "10.103.182.28:50010": End of the stream @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*) @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*) @ parquet::SerializedFile::ParseMetaData() @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::arrow::OpenFile(std::shared_ptr const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr const&, std::unique_ptr >*) @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*) ``` The following code causes it: ``` import pyarrow import pyarrow.parquet as pq fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in namenode information file_object = fs.open('single-row.parquet') # update for hdfs path of file pq.read_metadata(file_object) # this works parquet_file = pq.ParquetFile(file_object) parquet_file.read_row_group(0) # throws error ``` I am working on writing a unit test for this. Note that I am using libhdfs3. 
> [Python] Cannot read parquet files with row group size of 1 From HDFS > - > > Key: ARROW-2842 > URL: https://issues.apache.org/jira/browse/ARROW-2842 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Robert Gruener >Priority: Major > Attachments: single-row.parquet > > > This might be a bug in parquet-cpp, I need to spend a bit more time tracking > this down but basically given a file with a single row on hdfs, reading it > with pyarrow yields this error > ``` > TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the > stream > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*) > @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*) > @ parquet::SerializedFile::ParseMetaData() > @ > parquet::ParquetFileReader::Contents::Open(std::unique_ptr std::default_delete >, > parquet::ReaderProperties const&, std::shared_ptr > const&) > @ > parquet::ParquetFileReader::Open(std::unique_ptr std::default_delete >, parquet::ReaderProperties > const&, std::shared_ptr const&) > @ parquet::arrow::OpenFile(std::shared_ptr > const&, arrow::MemoryPool*, parquet::ReaderProperties const&, > std::shared_ptr const&, > std::unique_ptr std::default_delete >*) > @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, > _object*) > ``` > The following code causes it: > ``` > import pyarrow > import pyarrow.parquet as pq > > fs = pyarrow.hdfs.connect('my-namenode-url',
[jira] [Updated] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS
[ https://issues.apache.org/jira/browse/ARROW-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Gruener updated ARROW-2842: -- Description: This might be a bug in parquet-cpp, I need to spend a bit more time tracking this down but basically given a file with a single row on hdfs, reading it with pyarrow yields this error ``` TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from "10.103.182.28:50010": End of the stream @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*) @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*) @ parquet::SerializedFile::ParseMetaData() @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::arrow::OpenFile(std::shared_ptr const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr const&, std::unique_ptr >*) @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*) ``` The following code causes it: ``` import pyarrow import pyarrow.parquet as pq fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in namenode information file_object = fs.open('single-row.parquet') # update for hdfs path of file pq.read_metadata(file_object) # this works parquet_file = pq.ParquetFile(file_object) parquet_file.read_row_group(0) # throws error ``` I am working on writing a unit test for this. Note that I am using libhdfs3. 
was: This might be a bug in parquet-cpp, I need to spend a bit more time tracking this down but basically given a file with a single row on hdfs, reading it with pyarrow yields this error ``` TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from "10.103.182.28:50010": End of the stream @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*) @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*) @ parquet::SerializedFile::ParseMetaData() @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::arrow::OpenFile(std::shared_ptr const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr const&, std::unique_ptr >*) @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*) ``` The following code causes it: ``` import pyarrow import pyarrow.parquet as pq fs = pyarrow.hdfs.connect() # fill in namenode information file_object = fs.open('single-row.parquet') # update for hdfs path of file pq.read_metadata(file_object) # this works parquet_file = pq.ParquetFile(file_object) parquet_file.read_row_group(0) # throws error ``` I am working on writing a unit test for this > [Python] Cannot read parquet files with row group size of 1 From HDFS > - > > Key: ARROW-2842 > URL: https://issues.apache.org/jira/browse/ARROW-2842 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Robert Gruener >Priority: Major > Attachments: single-row.parquet > > > This might be a bug in parquet-cpp, I need to spend a bit more time tracking > this down but basically given a file with a single row on hdfs, reading it > with pyarrow yields this error > ``` > TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from > "10.103.182.28:50010": End 
of the stream > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ Unknown > @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*) > @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*) > @ parquet::SerializedFile::ParseMetaData() > @ > parquet::ParquetFileReader::Contents::Open(std::unique_ptr std::default_delete >, > parquet::ReaderProperties const&, std::shared_ptr > const&) > @ > parquet::ParquetFileReader::Open(std::unique_ptr std::default_delete >, parquet::ReaderProperties > const&, std::shared_ptr const&) > @ parquet::arrow::OpenFile(std::shared_ptr > const&, arrow::MemoryPool*, parquet::ReaderProperties const&, > std::shared_ptr const&, > std::unique_ptr std::default_delete >*) > @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, > _object*) > ``` > The following code causes it: > ``` > import pyarrow > import pyarrow.parquet as pq > > fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fil
[jira] [Created] (ARROW-2842) [Python] Cannot read parquet files with row group size of 1 From HDFS
Robert Gruener created ARROW-2842: - Summary: [Python] Cannot read parquet files with row group size of 1 From HDFS Key: ARROW-2842 URL: https://issues.apache.org/jira/browse/ARROW-2842 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Robert Gruener Attachments: single-row.parquet This might be a bug in parquet-cpp, I need to spend a bit more time tracking this down but basically given a file with a single row on hdfs, reading it with pyarrow yields this error ``` TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from "10.103.182.28:50010": End of the stream @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ Unknown @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*) @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*) @ parquet::SerializedFile::ParseMetaData() @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::ParquetFileReader::Open(std::unique_ptr >, parquet::ReaderProperties const&, std::shared_ptr const&) @ parquet::arrow::OpenFile(std::shared_ptr const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr const&, std::unique_ptr >*) @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*) ``` The following code causes it: ``` import pyarrow import pyarrow.parquet as pq fs = pyarrow.hdfs.connect() # fill in namenode information file_object = fs.open('single-row.parquet') # update for hdfs path of file pq.read_metadata(file_object) # this works parquet_file = pq.ParquetFile(file_object) parquet_file.read_row_group(0) # throws error ``` I am working on writing a unit test for this -- This message was sent by Atlassian JIRA (v7.6.3#76005)
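The stack trace above ends in parquet::SerializedFile::ParseMetaData(), which starts by reading the 8-byte Parquet footer tail (a 4-byte little-endian metadata length followed by the magic bytes "PAR1"); the "Read 8 bytes failed ... End of the stream" error is that read coming up short. A stdlib-only sketch of that footer check (an illustration of the file-format invariant, not the actual parquet-cpp code):

```python
import io
import struct

MAGIC = b"PAR1"

def read_footer_len(f):
    """Return the metadata length from a Parquet file's 8-byte tail.

    A Parquet file ends with: <metadata> <4-byte LE length> b"PAR1".
    If the stream ends before those 8 bytes can be read, the reader
    fails the same way the HdfsEndOfStream trace above does.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size < 8:
        raise EOFError("file too small to hold a Parquet footer")
    f.seek(size - 8)
    tail = f.read(8)
    if len(tail) != 8:
        raise EOFError("read of 8-byte footer failed: end of stream")
    length, magic = struct.unpack("<I4s", tail)
    if magic != MAGIC:
        raise ValueError("not a Parquet file (bad magic)")
    return length

# A minimal fake file tail: header magic, 10 bytes of "metadata",
# the 4-byte length, and the closing magic.
fake = io.BytesIO(MAGIC + b"\x00" * 10 + struct.pack("<I", 10) + MAGIC)
```

Whether the truncation here comes from libhdfs3's stream handling or from parquet-cpp's seek offsets for a single-row-group file is exactly what the reporter is still tracking down.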
[jira] [Created] (ARROW-2841) [Go] Fix recent Go build failures in Travis CI
Wes McKinney created ARROW-2841: --- Summary: [Go] Fix recent Go build failures in Travis CI Key: ARROW-2841 URL: https://issues.apache.org/jira/browse/ARROW-2841 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: Antoine Pitrou Fix For: 0.10.0 I'm getting persistent failures in the Go job on Travis-CI: https://travis-ci.org/pitrou/arrow/jobs/403221354 Is this expected? Excerpt: {code} $ go get -t -v ./... github.com/apache/arrow (download) github.com/stretchr/testify (download) go/arrow/type_traits_boolean.go:20:2: use of internal package not allowed go/arrow/array/array.go:23:2: use of internal package not allowed go/arrow/array/array.go:24:2: use of internal package not allowed go/arrow/math/math_amd64.go:22:2: use of internal package not allowed go/arrow/memory/memory_amd64.go:22:2: use of internal package not allowed go/arrow/memory/buffer.go:22:2: use of internal package not allowed The command "eval go get -t -v ./... " failed. Retrying, 2 of 3. {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2840) [C++] See if stream alignment logic can be simplified
Wes McKinney created ARROW-2840: --- Summary: [C++] See if stream alignment logic can be simplified Key: ARROW-2840 URL: https://issues.apache.org/jira/browse/ARROW-2840 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.10.0 There were a couple patches related to 64-byte alignment for tensor messages at the beginning of April that I've been concerned about: ARROW-2308: https://github.com/apache/arrow/commit/26bc4ab5a31e3430e1d545068e5a5a5ba5bc7a22#diff-bf4eb56f3cb806c32f60ae67afbb1bf0 ARROW-2437: https://github.com/apache/arrow/commit/0f87c12d45250ee763ac8c43b7e57e8f06a0b9f3 In my opinion, the alignment issue should be encapsulated in the protocol if possible; to have argument that toggles whether the file is to be aligned after reading the metadata seems wrong to me. The bytes written for alignment purposes should be accounted for in the metadata size prefix in the IPC message. Since there are unit tests for this, I'd like to take a look and see if I can improve this without breaking the tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
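The accounting proposed above (padding bytes folded into the metadata size prefix rather than toggled by a reader-side argument) reduces to rounding the serialized metadata length up to the alignment boundary before writing the prefix. A small sketch of that arithmetic, under the assumption of the 64-byte alignment discussed in the ticket:

```python
def aligned_size(n: int, alignment: int = 64) -> int:
    """Round n up to the next multiple of `alignment`.

    If the writer emits aligned_size(len(metadata)) as the size
    prefix and pads the metadata with zeros to match, the reader
    never needs a separate "align after metadata" flag.
    """
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return n + (-n) % alignment

def padding_for(n: int, alignment: int = 64) -> int:
    """Number of zero bytes to append after n bytes of metadata."""
    return aligned_size(n, alignment) - n
```

This is only an illustration of the encapsulation argument, not the actual IPC writer change.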
[jira] [Updated] (ARROW-2836) [Packaging] Expand build matrices to multiple tasks
[ https://issues.apache.org/jira/browse/ARROW-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2836: -- Labels: pull-request-available (was: ) > [Packaging] Expand build matrices to multiple tasks > --- > > Key: ARROW-2836 > URL: https://issues.apache.org/jira/browse/ARROW-2836 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Create release tags for each task instead a single one for the job, and > create additional tasks per build variant. > This will resolve the package naming conflicts, but increases the config > verbosity (which is fine for now). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2836) [Packaging] Expand build matrices to multiple tasks
[ https://issues.apache.org/jira/browse/ARROW-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-2836: --- Summary: [Packaging] Expand build matrices to multiple tasks (was: [Packaging] Expanding build matrices to multiple tasks) > [Packaging] Expand build matrices to multiple tasks > --- > > Key: ARROW-2836 > URL: https://issues.apache.org/jira/browse/ARROW-2836 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Fix For: 0.10.0 > > > Create release tags for each task instead a single one for the job, and > create additional tasks per build variant. > This will resolve the package naming conflicts, but increases the config > verbosity (which is fine for now). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer
[ https://issues.apache.org/jira/browse/ARROW-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Taylor updated ARROW-2839: --- Fix Version/s: (was: 0.4.0) JS-0.4.0 > [JS] Support whatwg/streams in IPC reader/writer > > > Key: ARROW-2839 > URL: https://issues.apache.org/jira/browse/ARROW-2839 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: JS-0.3.1 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Fix For: JS-0.4.0 > > > We should make it easy to stream Arrow in the browser via > [whatwg/streams|https://github.com/whatwg/streams]. I already have this > working at Graphistry, but I had to use some of the IPC internal methods. > Creating this issue to track back-porting that work and the few minor > refactors to the IPC internals that we'll need to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer
Paul Taylor created ARROW-2839: -- Summary: [JS] Support whatwg/streams in IPC reader/writer Key: ARROW-2839 URL: https://issues.apache.org/jira/browse/ARROW-2839 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Affects Versions: JS-0.3.1 Reporter: Paul Taylor Assignee: Paul Taylor Fix For: 0.4.0 We should make it easy to stream Arrow in the browser via [whatwg/streams|https://github.com/whatwg/streams]. I already have this working at Graphistry, but I had to use some of the IPC internal methods. Creating this issue to track back-porting that work and the few minor refactors to the IPC internals that we'll need to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2771) [JS] Add row proxy object accessor
[ https://issues.apache.org/jira/browse/ARROW-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Taylor reassigned ARROW-2771: -- Assignee: Paul Taylor (was: Brian Hulette) > [JS] Add row proxy object accessor > -- > > Key: ARROW-2771 > URL: https://issues.apache.org/jira/browse/ARROW-2771 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Assignee: Paul Taylor >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The {{Table}} class would be much easier to interact with if it returned > familiar Javascript objects representing a row. As Jeff Heer > [demonstrated|https://beta.observablehq.com/@jheer/from-apache-arrow-to-javascript-objects] > it's possible to create JS Proxy objects that read directly from Arrow > memory. We should generate these types of objects in {{Table.get}} and in the > {{Table}} iterator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
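The row-proxy idea in ARROW-2771 (familiar per-row objects that read lazily from columnar memory rather than copying a row out) can be sketched language-neutrally; the names below are hypothetical, and the real JS implementation uses Proxy objects over Arrow vectors rather than Python dicts:

```python
class RowProxy:
    """Lazy view of one row of a columnar table: attribute access
    indexes into the underlying columns, so no row is materialized."""
    __slots__ = ("_columns", "_index")

    def __init__(self, columns, index):
        object.__setattr__(self, "_columns", columns)
        object.__setattr__(self, "_index", index)

    def __getattr__(self, name):
        try:
            return self._columns[name][self._index]
        except KeyError:
            raise AttributeError(name)

class Table:
    """Minimal columnar table: a mapping of column name -> values."""
    def __init__(self, columns):
        self._columns = columns
        self._length = len(next(iter(columns.values())))

    def get(self, i):
        return RowProxy(self._columns, i)

    def __iter__(self):
        return (RowProxy(self._columns, i) for i in range(self._length))

table = Table({"name": ["a", "b"], "value": [1, 2]})
```

The design point is the same one Jeff Heer's notebook demonstrates: the proxy holds only a column reference and a row index, so iteration stays zero-copy.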
[jira] [Assigned] (ARROW-2705) [JS] CombinationPredicates should take list of predicates
[ https://issues.apache.org/jira/browse/ARROW-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Taylor reassigned ARROW-2705: -- Assignee: Paul Taylor (was: Brian Hulette) > [JS] CombinationPredicates should take list of predicates > - > > Key: ARROW-2705 > URL: https://issues.apache.org/jira/browse/ARROW-2705 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Dominik Moritz >Assignee: Paul Taylor >Priority: Minor > Labels: pull-request-available > Fix For: JS-0.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > CombinationPredicate like And and Or only take two predicates: left and > right. However, it often makes sense to combine more than two predicates. I > suggest that `and` and `or` support arbitrarily many arguments. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2705) [JS] CombinationPredicates should take list of predicates
[ https://issues.apache.org/jira/browse/ARROW-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Taylor reassigned ARROW-2705: -- Assignee: Brian Hulette > [JS] CombinationPredicates should take list of predicates > - > > Key: ARROW-2705 > URL: https://issues.apache.org/jira/browse/ARROW-2705 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Dominik Moritz >Assignee: Brian Hulette >Priority: Minor > Labels: pull-request-available > Fix For: JS-0.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > CombinationPredicate like And and Or only take two predicates: left and > right. However, it often makes sense to combine more than two predicates. I > suggest that `and` and `or` support arbitrarily many arguments. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2767) [JS] Add generic to Table for column names
[ https://issues.apache.org/jira/browse/ARROW-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Taylor reassigned ARROW-2767: -- Assignee: Paul Taylor (was: Brian Hulette) > [JS] Add generic to Table for column names > -- > > Key: ARROW-2767 > URL: https://issues.apache.org/jira/browse/ARROW-2767 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Assignee: Paul Taylor >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Requested by [~domoritz] > Something like: > {code:javascript} > class Table { > ... > getColumn(name: ColName): Vector { > } > ... > } > {code} > It would be even better if we could find a way to map the column names to the > actual vector data types, but one thing at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2767) [JS] Add generic to Table for column names
[ https://issues.apache.org/jira/browse/ARROW-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Taylor reassigned ARROW-2767: -- Assignee: Brian Hulette > [JS] Add generic to Table for column names > -- > > Key: ARROW-2767 > URL: https://issues.apache.org/jira/browse/ARROW-2767 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Requested by [~domoritz] > Something like: > {code:javascript} > class Table { > ... > getColumn(name: ColName): Vector { > } > ... > } > {code} > It would be even better if we could find a way to map the column names to the > actual vector data types, but one thing at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2819) [JS] Fails to build with TS 2.8.3
[ https://issues.apache.org/jira/browse/ARROW-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Taylor reassigned ARROW-2819: -- Assignee: Paul Taylor > [JS] Fails to build with TS 2.8.3 > - > > Key: ARROW-2819 > URL: https://issues.apache.org/jira/browse/ARROW-2819 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: JS-0.3.1 >Reporter: Brian Hulette >Assignee: Paul Taylor >Priority: Major > Fix For: JS-0.4.0 > > > See the [GitHub > issue|https://github.com/apache/arrow/issues/2115#issuecomment-403612925] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2838) [Python] Speed up null testing with Pandas semantics
[ https://issues.apache.org/jira/browse/ARROW-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2838: -- Labels: pull-request-available (was: ) > [Python] Speed up null testing with Pandas semantics > > > Key: ARROW-2838 > URL: https://issues.apache.org/jira/browse/ARROW-2838 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > The {{PandasObjectIsNull}} helper function can be a significant contributor > when converting a Pandas dataframe to Arrow format (e.g. when writing a > dataframe to feather format). We can try to speed up the type checks in that > function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2838) [Python] Speed up null testing with Pandas semantics
Antoine Pitrou created ARROW-2838: - Summary: [Python] Speed up null testing with Pandas semantics Key: ARROW-2838 URL: https://issues.apache.org/jira/browse/ARROW-2838 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 0.9.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou The {{PandasObjectIsNull}} helper function can be a significant contributor when converting a Pandas dataframe to Arrow format (e.g. when writing a dataframe to feather format). We can try to speed up the type checks in that function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
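A rough Python model of the per-object check being optimized (an approximation for illustration only; the real {{PandasObjectIsNull}} is C++ and also covers cases such as NaT and decimal NaN, as the profile in ARROW-2059 shows):

```python
import math

def object_is_null(obj):
    """Approximate Pandas null semantics: None or a float NaN.
    Illustrative sketch only; the C++ helper handles more cases."""
    if obj is None:
        return True
    return isinstance(obj, float) and math.isnan(obj)

print(object_is_null(None))          # True
print(object_is_null(float("nan")))  # True
print(object_is_null(0.0))           # False
print(object_is_null("a"))           # False
```

Since this predicate runs once per cell of an object column, even small savings in the type checks compound over a large dataframe.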
[jira] [Commented] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position
[ https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541972#comment-16541972 ] Antoine Pitrou commented on ARROW-2835: --- In that case it would mean only advance the position if we're on Windows :) I don't think there's a nice way out of this. If we want to keep the parallelization benefits of ReadAt and WriteAt, we'll need to live with the platform specifics, I think. [~wesmckinn] > [C++] ReadAt/WriteAt are inconsistent with moving the files position > > > Key: ARROW-2835 > URL: https://issues.apache.org/jira/browse/ARROW-2835 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Dimitri Vorona >Priority: Major > > Right now, there is inconsistent behaviour regarding moving the files > position pointer after calling ReadAt or WriteAt. For example, the default > implementation of ReadAt seeks to the desired offset and calls Read which > moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change > the position. WriteableFile::WriteAt seem to move the position in the current > implementation, but there is no docstring which prescribes this behaviour. > Antoine suggested that *At methods shouldn't touch the position and it makes > more sense, IMHO. The change isn't huge and doesn't seem to break anything > internally, but it might break the existing user code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position
[ https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541962#comment-16541962 ] Dimitri Vorona commented on ARROW-2835: --- Ok, then we should at least be consistent across implementations, i.e. advance the position in MemoryMappedFile::ReadAt, right? > [C++] ReadAt/WriteAt are inconsistent with moving the files position > > > Key: ARROW-2835 > URL: https://issues.apache.org/jira/browse/ARROW-2835 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Dimitri Vorona >Priority: Major > > Right now, there is inconsistent behaviour regarding moving the files > position pointer after calling ReadAt or WriteAt. For example, the default > implementation of ReadAt seeks to the desired offset and calls Read which > moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change > the position. WriteableFile::WriteAt seem to move the position in the current > implementation, but there is no docstring which prescribes this behaviour. > Antoine suggested that *At methods shouldn't touch the position and it makes > more sense, IMHO. The change isn't huge and doesn't seem to break anything > internally, but it might break the existing user code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2553) [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build
[ https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541955#comment-16541955 ] Wes McKinney commented on ARROW-2553: - Sorry -- I had commented on the PR. If we fix the wheels there, then ARROW-2326 will probably be fixed > [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build > > > Key: ARROW-2553 > URL: https://issues.apache.org/jira/browse/ARROW-2553 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, Python >Reporter: Uwe L. Korn >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.0 > > Time Spent: 1h > Remaining Estimate: 0h > > The current `pyarrow` wheels are not usable on older OSX releases due to a > problem in the newest Xcode SDK. We need to set {{MACOSX_DEPLOYMENT_TARGET}} > to an older OSX release to avoid getting {{Symbol not found: > _os_unfair_lock_lock}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2553) [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build
[ https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541943#comment-16541943 ] Krisztian Szucs commented on ARROW-2553: [~wesmckinn] Uwe has already checked; it doesn't work properly yet - see the PR. > [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build > > > Key: ARROW-2553 > URL: https://issues.apache.org/jira/browse/ARROW-2553 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, Python >Reporter: Uwe L. Korn >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.0 > > Time Spent: 1h > Remaining Estimate: 0h > > The current `pyarrow` wheels are not usable on older OSX releases due to a > problem in the newest Xcode SDK. We need to set {{MACOSX_DEPLOYMENT_TARGET}} > to an older OSX release to avoid getting {{Symbol not found: > _os_unfair_lock_lock}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2059) [Python] Possible performance regression in Feather read/write path
[ https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541928#comment-16541928 ] Antoine Pitrou commented on ARROW-2059: --- I ran Linux "perf" on a benchmark script, then ran "perf report" to get call trees, and then manually fed some of that output into "c++filt". The most accessible resource I've found about the "perf" utility is http://www.brendangregg.com/perf.html > [Python] Possible performance regression in Feather read/write path > --- > > Key: ARROW-2059 > URL: https://issues.apache.org/jira/browse/ARROW-2059 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Jingyuan Wang >Priority: Major > Fix For: 0.11.0 > > > See discussion in https://github.com/wesm/feather/issues/329. Needs to be > investigated -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2059) [Python] Possible performance regression in Feather read/write path
[ https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541910#comment-16541910 ] Wes McKinney commented on ARROW-2059: - Thanks [~pitrou], do you have some tricks for creating profiles like that (I have made FlameGraphs but not much more)? > [Python] Possible performance regression in Feather read/write path > --- > > Key: ARROW-2059 > URL: https://issues.apache.org/jira/browse/ARROW-2059 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Jingyuan Wang >Priority: Major > Fix For: 0.11.0 > > > See discussion in https://github.com/wesm/feather/issues/329. Needs to be > investigated -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2059) [Python] Possible performance regression in Feather read/write path
[ https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541898#comment-16541898 ] Antoine Pitrou commented on ARROW-2059: --- Here is a profile tree obtained by exercising the write path on Python 3.6:
{code}
- arrow::py::NumPyConverter::ConvertObjectStrings()
  - 80,27% arrow::py::AppendObjectStrings(tagPyArrayObject*, tagPyArrayObject*, long, bool, arrow::StringBuilder*, long*, bool*)
    - 50,74% arrow::py::internal::BuilderAppend(arrow::StringBuilder*, _object*, bool, bool*)
      - 24,95% arrow::BinaryBuilder::Append(unsigned char const*, int)
          7,43% arrow::BinaryBuilder::AppendNextOffset()
        + 6,28% arrow::BufferBuilder::Resize(long, bool)
          2,30% __memcpy_avx_unaligned
          0,71% arrow::ArrayBuilder::Reserve(long)
      6,16% PyUnicode_AsUTF8AndSize
    + 4,37% PyErr_Occurred
  - 16,70% arrow::py::internal::PandasObjectIsNull(_object*)
    - 8,29% arrow::py::internal::PyDecimal_Check(_object*)
        PyType_IsSubtype
    - 4,59% arrow::py::internal::PyFloat_IsNaN(_object*)
        PyType_IsSubtype
      2,51% PyArray_MultiplyList
      2,41% PyType_IsSubtype
  + 1,57% arrow::ArrayBuilder::Finish(std::shared_ptr*)
{code}
> [Python] Possible performance regression in Feather read/write path > --- > > Key: ARROW-2059 > URL: https://issues.apache.org/jira/browse/ARROW-2059 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Jingyuan Wang >Priority: Major > Fix For: 0.11.0 > > > See discussion in https://github.com/wesm/feather/issues/329. Needs to be > investigated -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (ARROW-2782) [Python] Ongoing Travis CI failures in Plasma unit tests
[ https://issues.apache.org/jira/browse/ARROW-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-2782: - > [Python] Ongoing Travis CI failures in Plasma unit tests > > > Key: ARROW-2782 > URL: https://issues.apache.org/jira/browse/ARROW-2782 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Time Spent: 40m > Remaining Estimate: 0h > > e.g. > {code} > _ test_use_huge_pages __ > @pytest.mark.skipif(not os.path.exists("/mnt/hugepages"), > reason="requires hugepage support") > def test_use_huge_pages(): > import pyarrow.plasma as plasma > with plasma.start_plasma_store( > plasma_store_memory=2*10**9, > plasma_directory="/mnt/hugepages", > use_hugepages=True) as (plasma_store_name, p): > plasma_client = plasma.connect(plasma_store_name, "", 64) > > create_object(plasma_client, 10**8) > pyarrow/tests/test_plasma.py:773: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > pyarrow/tests/test_plasma.py:79: in create_object > seal=seal) > pyarrow/tests/test_plasma.py:68: in create_object_with_id > memory_buffer = client.create(object_id, data_size, metadata) > pyarrow/_plasma.pyx:300: in pyarrow._plasma.PlasmaClient.create > check_status(self.client.get().Create(object_id.data, data_size, > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ > > raise PlasmaStoreFull(message) > E PlasmaStoreFull: /home/travis/build/apache/arrow/cpp/src/plasma/client.cc:375 code: ReadCreateReply(buffer.data(), buffer.size(), &id, &object, &store_fd, &mmap_size) > E object does not fit in the plasma store > pyarrow/error.pxi:99: PlasmaStoreFull > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer
[ https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541881#comment-16541881 ] Wes McKinney commented on ARROW-2837: - This is used in parquet-cpp (https://github.com/apache/parquet-cpp/search?q=PoolBuffer&unscoped_q=PoolBuffer) so if we did that, we would want to fix up that library so it can release shortly after Arrow 0.10.0 comes out. cc [~xhochy] > [C++] ArrayBuilder::null_bitmap returns PoolBuffer > -- > > Key: ARROW-2837 > URL: https://issues.apache.org/jira/browse/ARROW-2837 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.9.0 >Reporter: Dimitri Vorona >Priority: Major > Fix For: 0.10.0 > > > A simple buffer (like in case of ArrayBuilder::Data) seem to be enough to me, > and it doesn't break anything. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer
[ https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541879#comment-16541879 ] Wes McKinney commented on ARROW-2837: - Agreed. We should probably move its implementation into {{memory_pool.cc}} and provide only factory APIs that return ResizableBuffer > [C++] ArrayBuilder::null_bitmap returns PoolBuffer > -- > > Key: ARROW-2837 > URL: https://issues.apache.org/jira/browse/ARROW-2837 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.9.0 >Reporter: Dimitri Vorona >Priority: Major > Fix For: 0.10.0 > > > A simple buffer (like in case of ArrayBuilder::Data) seem to be enough to me, > and it doesn't break anything. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer
[ https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2837: Fix Version/s: 0.10.0 > [C++] ArrayBuilder::null_bitmap returns PoolBuffer > -- > > Key: ARROW-2837 > URL: https://issues.apache.org/jira/browse/ARROW-2837 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.9.0 >Reporter: Dimitri Vorona >Priority: Major > Fix For: 0.10.0 > > > A simple buffer (like in case of ArrayBuilder::Data) seem to be enough to me, > and it doesn't break anything. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541803#comment-16541803 ] Robert Gruener commented on ARROW-1983: --- [~xhochy] I made this dependent task PARQUET-1348 > [Python] Add ability to write parquet `_metadata` file > -- > > Key: ARROW-1983 > URL: https://issues.apache.org/jira/browse/ARROW-1983 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Jim Crist >Priority: Major > Labels: beginner, parquet > Fix For: 0.11.0 > > > Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file > (mostly just schema information). It would be useful to add the ability to > write a {{_metadata}} file as well. This should include information about > each row group in the dataset, including summary statistics. Having this > summary file would allow filtering of row groups without needing to access > each file beforehand. > This would require that the user is able to get the written RowGroups out of > a {{pyarrow.parquet.write_table}} call and then give these objects as a list > to new function that then passes them on as C++ objects to {{parquet-cpp}} > that generates the respective {{_metadata}} file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2825) [C++] Need AllocateBuffer / AllocateResizableBuffer variant with default memory pool
[ https://issues.apache.org/jira/browse/ARROW-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2825: -- Labels: pull-request-available (was: ) > [C++] Need AllocateBuffer / AllocateResizableBuffer variant with default > memory pool > > > Key: ARROW-2825 > URL: https://issues.apache.org/jira/browse/ARROW-2825 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > It's not very practical that you have to pass the default memory pool > explicitly to {{AllocateBuffer}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer
[ https://issues.apache.org/jira/browse/ARROW-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541580#comment-16541580 ] Antoine Pitrou commented on ARROW-2837: --- Yes, I think PoolBuffer should stop being visible as an API (it can stay around as a private implementation detail). [~wesmckinn] Any thoughts? > [C++] ArrayBuilder::null_bitmap returns PoolBuffer > -- > > Key: ARROW-2837 > URL: https://issues.apache.org/jira/browse/ARROW-2837 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.9.0 >Reporter: Dimitri Vorona >Priority: Major > > A simple buffer (like in case of ArrayBuilder::Data) seem to be enough to me, > and it doesn't break anything. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2837) [C++] ArrayBuilder::null_bitmap returns PoolBuffer
Dimitri Vorona created ARROW-2837: - Summary: [C++] ArrayBuilder::null_bitmap returns PoolBuffer Key: ARROW-2837 URL: https://issues.apache.org/jira/browse/ARROW-2837 Project: Apache Arrow Issue Type: New Feature Components: C++ Affects Versions: 0.9.0 Reporter: Dimitri Vorona A simple buffer (as in the case of ArrayBuilder::Data) seems to be enough to me, and it doesn't break anything. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them
[ https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541519#comment-16541519 ] Krisztian Szucs commented on ARROW-2760: See the second block at https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#properties ; the following syntax is deprecated:
{code}
cdef class Spam:

    property cheese:

        "A doc string can go here."

        def __get__(self):
            # This is called when the property is read.
            ...

        def __set__(self, value):
            # This is called when the property is written.
            ...

        def __del__(self):
            # This is called when the property is deleted.
{code}
> [Python] Remove legacy property definition syntax from parquet module and > test them > --- > > Key: ARROW-2760 > URL: https://issues.apache.org/jira/browse/ARROW-2760 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Time Spent: 2h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
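For reference, the non-deprecated equivalent is the decorator form, which works the same way in modern Cython and in plain Python (`Spam`/`cheese` are the placeholder names from the Cython docs):

```python
class Spam:
    def __init__(self):
        self._cheese = "cheddar"

    @property
    def cheese(self):
        "A doc string can go here."
        return self._cheese           # called when the property is read

    @cheese.setter
    def cheese(self, value):
        self._cheese = value          # called when the property is written

    @cheese.deleter
    def cheese(self):
        del self._cheese              # called when the property is deleted

s = Spam()
s.cheese = "gouda"
print(s.cheese)  # gouda
```

In Cython the class would be declared `cdef class Spam`, but the property decorators are identical.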
[jira] [Commented] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them
[ https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541517#comment-16541517 ] Antoine Pitrou commented on ARROW-2760: --- Can you elaborate on what you call "legacy property definition syntax"? > [Python] Remove legacy property definition syntax from parquet module and > test them > --- > > Key: ARROW-2760 > URL: https://issues.apache.org/jira/browse/ARROW-2760 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Time Spent: 2h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2822) [C++] Zero padding bytes in PoolBuffer::Resize
[ https://issues.apache.org/jira/browse/ARROW-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2822: -- Labels: pull-request-available (was: ) > [C++] Zero padding bytes in PoolBuffer::Resize > -- > > Key: ARROW-2822 > URL: https://issues.apache.org/jira/browse/ARROW-2822 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > > This is a possible proposed fix which would supersede some of the changes > made in ARROW-2790 as elsewhere. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2836) [Packaging] Expanding build matrices to multiple tasks
Krisztian Szucs created ARROW-2836: -- Summary: [Packaging] Expanding build matrices to multiple tasks Key: ARROW-2836 URL: https://issues.apache.org/jira/browse/ARROW-2836 Project: Apache Arrow Issue Type: Task Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 0.10.0 Create release tags for each task instead of a single one for the job, and create additional tasks per build variant. This will resolve the package naming conflicts, but increases the config verbosity (which is fine for now). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position
[ https://issues.apache.org/jira/browse/ARROW-2835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541394#comment-16541394 ] Antoine Pitrou commented on ARROW-2835: --- Unfortunately, things are a bit more complicated, as in some cases ReadAt / WriteAt are forced to update the file position anyway (on Windows, see the implementation of {{FileRead}} in {{io-util.cc}}). > [C++] ReadAt/WriteAt are inconsistent with moving the files position > > > Key: ARROW-2835 > URL: https://issues.apache.org/jira/browse/ARROW-2835 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Dimitri Vorona >Priority: Major > > Right now, there is inconsistent behaviour regarding moving the files > position pointer after calling ReadAt or WriteAt. For example, the default > implementation of ReadAt seeks to the desired offset and calls Read which > moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change > the position. WriteableFile::WriteAt seem to move the position in the current > implementation, but there is no docstring which prescribes this behaviour. > Antoine suggested that *At methods shouldn't touch the position and it makes > more sense, IMHO. The change isn't huge and doesn't seem to break anything > internally, but it might break the existing user code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2835) [C++] ReadAt/WriteAt are inconsistent with moving the files position
Dimitri Vorona created ARROW-2835: - Summary: [C++] ReadAt/WriteAt are inconsistent with moving the files position Key: ARROW-2835 URL: https://issues.apache.org/jira/browse/ARROW-2835 Project: Apache Arrow Issue Type: New Feature Reporter: Dimitri Vorona Right now, there is inconsistent behaviour regarding moving the file's position pointer after calling ReadAt or WriteAt. For example, the default implementation of ReadAt seeks to the desired offset and calls Read, which moves the position pointer. MemoryMappedFile::ReadAt, however, doesn't change the position. WriteableFile::WriteAt seems to move the position in the current implementation, but there is no docstring which prescribes this behaviour. Antoine suggested that *At methods shouldn't touch the position, and that makes more sense, IMHO. The change isn't huge and doesn't seem to break anything internally, but it might break existing user code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
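The position-preserving behaviour Antoine advocates matches POSIX positional I/O. A small Python demonstration (POSIX-only, since `os.pread` is not available on Windows, which is exactly the platform wrinkle discussed in the comments above):

```python
import os
import tempfile

# Positional read: os.pread reads at an explicit offset and leaves the
# file descriptor's current position untouched, unlike seek-then-read.
with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    fd = f.fileno()
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.pread(fd, 4, 6)            # 4 bytes at offset 6
    pos = os.lseek(fd, 0, os.SEEK_CUR)   # position after the pread
    print(data, pos)                     # b'6789' 0
```

Because the offset is explicit and the shared position is never mutated, multiple threads can issue such reads on the same descriptor in parallel, which is the benefit of ReadAt/WriteAt the comments want to keep.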