[jira] [Commented] (ARROW-2313) [GLib] Release builds must define NDEBUG
[ https://issues.apache.org/jira/browse/ARROW-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399956#comment-16399956 ]

ASF GitHub Bot commented on ARROW-2313:
---------------------------------------

kou opened a new pull request #1752: ARROW-2313: [C++] Add -DNDEBUG flag to arrow.pc
URL: https://github.com/apache/arrow/pull/1752

   Arrow C++ users should use the same -DNDEBUG flag as Arrow C++ itself.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [GLib] Release builds must define NDEBUG
> ----------------------------------------
>
>                 Key: ARROW-2313
>                 URL: https://issues.apache.org/jira/browse/ARROW-2313
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: GLib
>            Reporter: Wes McKinney
>            Assignee: Kouhei Sutou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> Ran into another problem with {{verify-release-candidate.sh 0.9.0 0}} -- the
> GLib build is not defining NDEBUG -- depending on whether Arrow was built in
> release or debug mode, some symbols (like {{Buffer::mutable_data}}) may not
> be inlined:
> {code}
>   CXX      libarrow_glib_la-compute.lo
>   CC       enums.lo
>   CXXLD    libarrow-glib.la
> ar: `u' modifier ignored since `D' is the default (see `U')
>   GISCAN   Arrow-1.0.gir
> ./.libs/libarrow-glib.so: undefined reference to `arrow::Buffer::mutable_data()'
> collect2: error: ld returned 1 exit status
> linking of temporary binary failed: Command '['/bin/bash', '../libtool',
> '--mode=link', '--tag=CC', '--silent', 'gcc', '-o',
> '/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0',
> '-export-dynamic', '-g', '-O2',
> 'tmp-introspect7g38ad/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0.o',
> '-L.', 'libarrow-glib.la', '-Wl,--export-dynamic', '-lgmodule-2.0',
> '-pthread', '-lgio-2.0', '-lgobject-2.0', '-lglib-2.0']' returned non-zero
> exit status 1
> /usr/share/gobject-introspection-1.0/Makefile.introspection:155: recipe for
> target 'Arrow-1.0.gir' failed
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
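The undefined reference above is the classic symptom of mixing translation units built with different NDEBUG settings: a header-defined inline function that contains assert() gets a different body (and different inlining decisions) on each side of the flag. A minimal sketch with a hypothetical Buffer type (not Arrow's actual class):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical header-style class, loosely modeled on arrow::Buffer.
struct Buffer {
  uint8_t* data;
  bool is_mutable;

  // Defined in the header: every includer compiles its own copy. Under
  // -DNDEBUG the assert disappears, so debug and release builds see
  // different bodies -- which is why arrow.pc needs to advertise the same
  // -DNDEBUG setting that Arrow C++ itself was built with.
  uint8_t* mutable_data() {
    assert(is_mutable && "buffer must be writable");
    return data;
  }
};
```

If one half of a program is compiled with -DNDEBUG and the other half without it, the two halves can disagree about whether this function was emitted inline or as an out-of-line symbol, surfacing exactly as the missing-symbol link error in the log.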
[jira] [Updated] (ARROW-2313) [GLib] Release builds must define NDEBUG
[ https://issues.apache.org/jira/browse/ARROW-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2313:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Assigned] (ARROW-2313) [GLib] Release builds must define NDEBUG
[ https://issues.apache.org/jira/browse/ARROW-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou reassigned ARROW-2313:
-----------------------------------
    Assignee: Kouhei Sutou
[jira] [Assigned] (ARROW-2312) [JS] verify-release-candidate-sh must be updated to include JS in integration tests
[ https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-2312:
-----------------------------------
    Assignee: Paul Taylor

> [JS] verify-release-candidate-sh must be updated to include JS in integration tests
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-2312
>                 URL: https://issues.apache.org/jira/browse/ARROW-2312
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: JavaScript
>            Reporter: Wes McKinney
>            Assignee: Paul Taylor
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> I was unable to run verify-release-candidate.sh when working on the first
> iteration of the 0.9.0 release.
> JavaScript was added to the integration tests, but the verification script
> has not been updated yet.
[jira] [Commented] (ARROW-2312) [JS] verify-release-candidate-sh must be updated to include JS in integration tests
[ https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399909#comment-16399909 ]

ASF GitHub Bot commented on ARROW-2312:
---------------------------------------

wesm closed pull request #1751: ARROW-2312: [JS] run test_js before test_integration
URL: https://github.com/apache/arrow/pull/1751

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh
index cb9b01b37..0b278e7cf 100755
--- a/dev/release/verify-release-candidate.sh
+++ b/dev/release/verify-release-candidate.sh
@@ -246,12 +246,11 @@ cd ${DIST_NAME}
 test_package_java
 setup_miniconda
 test_and_install_cpp
+test_js
 test_integration
 test_glib
 install_parquet_cpp
 test_python
-test_js
-
 echo 'Release candidate looks good!'
 exit 0
[jira] [Resolved] (ARROW-2312) [JS] verify-release-candidate-sh must be updated to include JS in integration tests
[ https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-2312.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 1751
[https://github.com/apache/arrow/pull/1751]
[jira] [Created] (ARROW-2313) [GLib] Release builds must define NDEBUG
Wes McKinney created ARROW-2313:
-----------------------------------

             Summary: [GLib] Release builds must define NDEBUG
                 Key: ARROW-2313
                 URL: https://issues.apache.org/jira/browse/ARROW-2313
             Project: Apache Arrow
          Issue Type: Bug
          Components: GLib
            Reporter: Wes McKinney
             Fix For: 0.9.0

Ran into another problem with {{verify-release-candidate.sh 0.9.0 0}} -- the GLib build is not defining NDEBUG -- depending on whether Arrow was built in release or debug mode, some symbols (like {{Buffer::mutable_data}}) may not be inlined:

{code}
  CXX      libarrow_glib_la-compute.lo
  CC       enums.lo
  CXXLD    libarrow-glib.la
ar: `u' modifier ignored since `D' is the default (see `U')
  GISCAN   Arrow-1.0.gir
./.libs/libarrow-glib.so: undefined reference to `arrow::Buffer::mutable_data()'
collect2: error: ld returned 1 exit status
linking of temporary binary failed: Command '['/bin/bash', '../libtool', '--mode=link', '--tag=CC', '--silent', 'gcc', '-o', '/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0', '-export-dynamic', '-g', '-O2', 'tmp-introspect7g38ad/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0.o', '-L.', 'libarrow-glib.la', '-Wl,--export-dynamic', '-lgmodule-2.0', '-pthread', '-lgio-2.0', '-lgobject-2.0', '-lglib-2.0']' returned non-zero exit status 1
/usr/share/gobject-introspection-1.0/Makefile.introspection:155: recipe for target 'Arrow-1.0.gir' failed
{code}
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399783#comment-16399783 ]

Alex Hagerman commented on ARROW-640:
-------------------------------------

Sounds good. Just to verify: integers only, or Number types in general? I've got a deployment happening during the day right now, so I'll hopefully be able to wrap up a first version this weekend and do a PR for review.

You mentioned that for items like StructValue the as_py fallback won't work. Similarly with ListValue, I would expect both of these to raise "TypeError: unhashable type", but I'll check the current behavior. Depending on what that is, do you have any thoughts on whether hash() should raise TypeError on mutable types, matching standard Python behavior? I wanted to check so I don't conflict with existing expected behavior, in case this has been handled previously, and to look at tying it in with __eq__.

> [Python] Arrow scalar values should have a sensible __hash__ and comparison
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-640
>                 URL: https://issues.apache.org/jira/browse/ARROW-640
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Miki Tebeka
>            Assignee: Alex Hagerman
>            Priority: Major
>             Fix For: 0.10.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
>
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
>
> In [88]: arr[0] == arr[1]
> Out[88]: False
>
> In [89]: arr
> Out[89]:
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}
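The contract under discussion, that equal values must hash equal, is language-neutral. As a hedged illustration of the same invariant (a toy scalar wrapper, not pyarrow's actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

// Toy integer scalar; stands in for a pyarrow Int64Value-style wrapper.
struct Int64Scalar {
  long long value;
  bool operator==(const Int64Scalar& other) const { return value == other.value; }
};

// The contract the JIRA asks for: a == b must imply hash(a) == hash(b).
// Delegating to the hash of the underlying value guarantees this.
struct Int64ScalarHash {
  std::size_t operator()(const Int64Scalar& s) const {
    return std::hash<long long>{}(s.value);
  }
};

using ScalarSet = std::unordered_set<Int64Scalar, Int64ScalarHash>;
```

With this, inserting the values 1, 1, 1, 2 yields a set of size 2, which is what the `set(arr)` example in the issue description expects instead of `{1, 2, 1, 1}`.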
[jira] [Commented] (ARROW-2312) [JS] verify-release-candidate-sh must be updated to include JS in integration tests
[ https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399757#comment-16399757 ]

ASF GitHub Bot commented on ARROW-2312:
---------------------------------------

trxcllnt opened a new pull request #1751: ARROW-2312: [JS] run test_js before test_integration
URL: https://github.com/apache/arrow/pull/1751
[jira] [Updated] (ARROW-2312) [JS] verify-release-candidate-sh must be updated to include JS in integration tests
[ https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2312:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Created] (ARROW-2312) [JS] verify-release-candidate-sh must be updated to include JS in integration tests
Wes McKinney created ARROW-2312:
-----------------------------------

             Summary: [JS] verify-release-candidate-sh must be updated to include JS in integration tests
                 Key: ARROW-2312
                 URL: https://issues.apache.org/jira/browse/ARROW-2312
             Project: Apache Arrow
          Issue Type: Bug
          Components: JavaScript
            Reporter: Wes McKinney
             Fix For: 0.9.0

I was unable to run verify-release-candidate.sh when working on the first iteration of the 0.9.0 release.

JavaScript was added to the integration tests, but the verification script has not been updated yet.
[jira] [Assigned] (ARROW-1886) [Python] Add function to "flatten" structs within tables
[ https://issues.apache.org/jira/browse/ARROW-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou reassigned ARROW-1886:
-------------------------------------
    Assignee: Antoine Pitrou

> [Python] Add function to "flatten" structs within tables
> --------------------------------------------------------
>
>                 Key: ARROW-1886
>                 URL: https://issues.apache.org/jira/browse/ARROW-1886
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>             Fix For: 0.10.0
>
>
> See discussion in https://issues.apache.org/jira/browse/ARROW-1873
> When a user has a struct column, it may be more efficient to flatten the
> struct into multiple columns of the form {{struct_name.field_name}} for each
> field in the struct. Then when you call {{to_pandas}}, Python dictionaries do
> not have to be created, and the conversion will be much more efficient.
[jira] [Created] (ARROW-2311) [Python] Struct array slicing defective
Antoine Pitrou created ARROW-2311:
-------------------------------------

             Summary: [Python] Struct array slicing defective
                 Key: ARROW-2311
                 URL: https://issues.apache.org/jira/browse/ARROW-2311
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
            Reporter: Antoine Pitrou
            Assignee: Antoine Pitrou

{code:python}
>>> arr = pa.array([(1, 2.0), (3, 4.0), (5, 6.0)],
...                type=pa.struct([pa.field('x', pa.int16()), pa.field('y', pa.float32())]))
>>> arr
[
  {'x': 1, 'y': 2.0},
  {'x': 3, 'y': 4.0},
  {'x': 5, 'y': 6.0}
]
>>> arr[1:]
[
  {'x': 1, 'y': 2.0},
  {'x': 3, 'y': 4.0}
]
{code}
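The reproduction is consistent with a slice that remembers its new length but reads child values without applying its offset: `arr[1:]` returns the first two elements instead of the last two. A toy sketch of offset-aware struct slicing (a deliberate simplification for illustration, not Arrow's actual layout):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy column-of-structs: parallel child vectors plus an offset/length view,
// loosely modeled on Arrow's struct layout.
struct StructSlice {
  const std::vector<int>* xs;
  const std::vector<double>* ys;
  int64_t offset;  // where this view starts in the parent storage
  int64_t length;

  // Correct element access must add the view's offset; dropping `offset`
  // here reproduces the off-by-slice behavior shown in the issue.
  int x_at(int64_t i) const { return (*xs)[offset + i]; }
  double y_at(int64_t i) const { return (*ys)[offset + i]; }

  // Slicing shares the child storage and shifts the offset.
  StructSlice slice(int64_t start) const {
    return {xs, ys, offset + start, length - start};
  }
};
```

With offset propagation in place, slicing at 1 yields {'x': 3, ...} as its first element, matching what the Python repro should have printed.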
[jira] [Created] (ARROW-2310) Source release scripts fail with Java8
Wes McKinney created ARROW-2310:
-----------------------------------

             Summary: Source release scripts fail with Java8
                 Key: ARROW-2310
                 URL: https://issues.apache.org/jira/browse/ARROW-2310
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Wes McKinney
             Fix For: 0.10.0

It's getting harder and harder to install Java7 these days. On a new install of Ubuntu 16.04 I am not even sure how to get Oracle's Java7 installed (though Java8 can be installed through a PPA).

In lieu of fixing all the javadoc problems, it would be great if there were some other workaround to build the release on Java8.
[jira] [Resolved] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-2307.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 0.9.0

Issue resolved by pull request 1747
[https://github.com/apache/arrow/pull/1747]

> [Python] Unable to read arrow stream containing 0 record batches
> ----------------------------------------------------------------
>
>                 Key: ARROW-2307
>                 URL: https://issues.apache.org/jira/browse/ARROW-2307
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Benjamin Duffield
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> Using java arrow I'm creating an arrow stream, using the stream writer.
>
> Sometimes I don't have anything to serialize, and so I don't write any record
> batches. My arrow stream thus consists of just a schema message.
> {code:java}
>
>
> {code}
> I am able to deserialize this arrow stream correctly using the java stream
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at
> least one record batch, even if it's empty. However, I think it would be nice
> to either support a stream without record batches or explicitly disallow this
> and then match behaviour in java.
[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399687#comment-16399687 ]

ASF GitHub Bot commented on ARROW-2307:
---------------------------------------

wesm closed pull request #1747: ARROW-2307: [Python] Allow reading record batch streams with zero record batches
URL: https://github.com/apache/arrow/pull/1747

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat
index beefee6c0..a29ef0bad 100644
--- a/ci/msvc-build.bat
+++ b/ci/msvc-build.bat
@@ -69,7 +69,8 @@ if "%JOB%" == "Build_Debug" (
 )
 
 conda create -n arrow -q -y python=%PYTHON% ^
-      six pytest setuptools numpy pandas cython ^
+      six pytest setuptools numpy pandas ^
+      cython=0.27.3 ^
       thrift-cpp=0.11.0
 
 call activate arrow
diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index a776c4263..247d10278 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -41,7 +41,7 @@ conda install -y -q pip \
       cloudpickle \
       numpy=1.13.1 \
       pandas \
-      cython
+      cython=0.27.3
 
 # ARROW-2093: PyTorch increases the size of our conda dependency stack
 # significantly, and so we have disabled these tests in Travis CI for now
diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc
index 24c8d5e15..b1cf6e59a 100644
--- a/cpp/src/arrow/table-test.cc
+++ b/cpp/src/arrow/table-test.cc
@@ -374,6 +374,17 @@ TEST_F(TestTable, FromRecordBatches) {
   ASSERT_RAISES(Invalid, Table::FromRecordBatches({batch1, batch2}, &result));
 }
 
+TEST_F(TestTable, FromRecordBatchesZeroLength) {
+  // ARROW-2307
+  MakeExample1(10);
+
+  std::shared_ptr<Table> result;
+  ASSERT_OK(Table::FromRecordBatches(schema_, {}, &result));
+
+  ASSERT_EQ(0, result->num_rows());
+  ASSERT_TRUE(result->schema()->Equals(*schema_));
+}
+
 TEST_F(TestTable, ConcatenateTables) {
   const int64_t length = 10;
diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index ed5858624..f6ac6dd3b 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -297,14 +297,9 @@ std::shared_ptr<Table> Table::Make(const std::shared_ptr<Schema>& schema,
   return std::make_shared<Table>(schema, arrays, num_rows);
 }
 
-Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
+Status Table::FromRecordBatches(const std::shared_ptr<Schema>& schema,
+                                const std::vector<std::shared_ptr<RecordBatch>>& batches,
                                 std::shared_ptr<Table>* table) {
-  if (batches.size() == 0) {
-    return Status::Invalid("Must pass at least one record batch");
-  }
-
-  std::shared_ptr<Schema> schema = batches[0]->schema();
-
   const int nbatches = static_cast<int>(batches.size());
   const int ncolumns = static_cast<int>(schema->num_fields());
@@ -332,6 +327,15 @@
   return Status::OK();
 }
 
+Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
+                                std::shared_ptr<Table>* table) {
+  if (batches.size() == 0) {
+    return Status::Invalid("Must pass at least one record batch");
+  }
+
+  return FromRecordBatches(batches[0]->schema(), batches, table);
+}
+
 Status ConcatenateTables(const std::vector<std::shared_ptr<Table>>& tables,
                          std::shared_ptr<Table>* table) {
   if (tables.size() == 0) {
diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h
index 7274fca4d..20d027d6a 100644
--- a/cpp/src/arrow/table.h
+++ b/cpp/src/arrow/table.h
@@ -169,9 +169,25 @@ class ARROW_EXPORT Table {
       const std::vector<std::shared_ptr<Array>>& arrays, int64_t num_rows = -1);
 
-  // Construct table from RecordBatch, but only if all of the batch schemas are
-  // equal. Returns Status::Invalid if there is some problem
+  /// \brief Construct table from RecordBatches, using schema supplied by the first
+  /// RecordBatch.
+  ///
+  /// \param[in] batches a std::vector of record batches
+  /// \param[out] table the returned table
+  /// \return Status Returns Status::Invalid if there is some problem
+  static Status FromRecordBatches(
+      const std::vector<std::shared_ptr<RecordBatch>>& batches,
+      std::shared_ptr<Table>* table);
+
+  /// Construct table from RecordBatches, using supplied schema. There may be
+  /// zero record batches
+  ///
+  /// \param[in] schema the arrow::Schema for each batch
+  /// \param[in] batches a std::vector of record batches
+  /// \param[out] table the returned table
+  /// \return Status
   static Status FromRecordBatches(
+      const std::shared_ptr<Schema>& schema,
       const std::vector<std::shared_ptr<RecordBatch>>& batches,
       std::shared_ptr<Table>* table);
 
diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd
index 3d0c02b89..01a641896 100644
--- a/pyt
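The heart of the fix is an overload that takes the schema explicitly, so an empty batch vector no longer needs `batches[0]` to discover it. A self-contained sketch of that shape with stand-in types (the real signatures live in arrow/table.h; these names are illustrative only):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Minimal stand-ins for the Arrow types involved.
struct Schema { std::vector<std::string> field_names; };
struct RecordBatch {
  std::shared_ptr<Schema> schema;
  int64_t num_rows;
};
struct Table {
  std::shared_ptr<Schema> schema;
  int64_t num_rows;
};

// New-style overload: the caller supplies the schema, so zero batches is
// valid -- the result is simply an empty table with that schema.
Table FromRecordBatches(const std::shared_ptr<Schema>& schema,
                        const std::vector<RecordBatch>& batches) {
  int64_t rows = 0;
  for (const auto& b : batches) rows += b.num_rows;
  return Table{schema, rows};
}
```

The old single-argument form can then be expressed as a thin wrapper that rejects an empty vector and forwards `batches[0].schema`, mirroring the structure of the diff above.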
[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399684#comment-16399684 ]

ASF GitHub Bot commented on ARROW-2307:
---------------------------------------

wesm commented on issue #1747: ARROW-2307: [Python] Allow reading record batch streams with zero record batches
URL: https://github.com/apache/arrow/pull/1747#issuecomment-373223120

   +1. Appveyor build looking good: https://ci.appveyor.com/project/wesm/arrow/build/1.0.1776
[jira] [Updated] (ARROW-2309) [C++] Use std::make_unsigned
[ https://issues.apache.org/jira/browse/ARROW-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2309:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Use std::make_unsigned
> ----------------------------
>
>                 Key: ARROW-2309
>                 URL: https://issues.apache.org/jira/browse/ARROW-2309
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++
>    Affects Versions: 0.8.0
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Trivial
>              Labels: pull-request-available
>
>
> {{arrow/util/bit-util.h}} has a reimplementation of {{boost::make_unsigned}},
> but we could simply use {{std::make_unsigned}}, which is C++11.
[jira] [Commented] (ARROW-2309) [C++] Use std::make_unsigned
[ https://issues.apache.org/jira/browse/ARROW-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399439#comment-16399439 ]

ASF GitHub Bot commented on ARROW-2309:
---------------------------------------

pitrou opened a new pull request #1748: ARROW-2309: [C++] Use std::make_unsigned
URL: https://github.com/apache/arrow/pull/1748

   No need for our own reimplementation.
[jira] [Created] (ARROW-2309) [C++] Use std::make_unsigned
Antoine Pitrou created ARROW-2309:
-------------------------------------

             Summary: [C++] Use std::make_unsigned
                 Key: ARROW-2309
                 URL: https://issues.apache.org/jira/browse/ARROW-2309
             Project: Apache Arrow
          Issue Type: Task
          Components: C++
    Affects Versions: 0.8.0
            Reporter: Antoine Pitrou
            Assignee: Antoine Pitrou

{{arrow/util/bit-util.h}} has a reimplementation of {{boost::make_unsigned}}, but we could simply use {{std::make_unsigned}}, which is C++11.
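For reference, the standard trait does at compile time exactly what the hand-rolled boost-style helper did: map each signed integer type to its unsigned counterpart. A small sketch:

```cpp
#include <cassert>
#include <type_traits>

// std::make_unsigned<T>::type is the unsigned counterpart of the signed
// integer type T (C++11; C++14 adds the std::make_unsigned_t alias).
template <typename T>
using UnsignedOf = typename std::make_unsigned<T>::type;

static_assert(std::is_same<UnsignedOf<int>, unsigned int>::value,
              "int maps to unsigned int");
static_assert(std::is_same<UnsignedOf<long long>, unsigned long long>::value,
              "long long maps to unsigned long long");

// Typical use in bit-twiddling code: reinterpret a signed value as the
// same-width unsigned type before shifting or masking.
template <typename T>
UnsignedOf<T> as_unsigned(T v) {
  return static_cast<UnsignedOf<T>>(v);
}
```

Since the trait is resolved entirely at compile time, swapping it in for the hand-rolled version changes no generated code, only removes maintenance burden.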
[jira] [Updated] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2307:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399055#comment-16399055 ]

ASF GitHub Bot commented on ARROW-2307:
---------------------------------------

wesm opened a new pull request #1747: ARROW-2307: [Python] Allow reading record batch streams with zero record batches
URL: https://github.com/apache/arrow/pull/1747

   This is a pretty rough edge case -- it would be good to get this fix into 0.9.0.
[jira] [Assigned] (ARROW-2307) Unable to read arrow stream containing 0 record batches using pyarrow
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2307: --- Assignee: Wes McKinney
[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398813#comment-16398813 ] Wes McKinney commented on ARROW-2307: - Working on a fix for this
[jira] [Updated] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2307: Component/s: (was: C)
[jira] [Updated] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches
[ https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2307: Summary: [Python] Unable to read arrow stream containing 0 record batches (was: Unable to read arrow stream containing 0 record batches using pyarrow)
[jira] [Commented] (ARROW-1701) [Serialization] Support zero copy PyTorch Tensor serialization
[ https://issues.apache.org/jira/browse/ARROW-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398697#comment-16398697 ] ASF GitHub Bot commented on ARROW-1701: --- ppwwyyxx commented on issue #1223: ARROW-1701: [Serialization] Support zero copy PyTorch Tensor serialization URL: https://github.com/apache/arrow/pull/1223#issuecomment-373049215 Can we push the update to pypi? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Serialization] Support zero copy PyTorch Tensor serialization > -- > > Key: ARROW-1701 > URL: https://issues.apache.org/jira/browse/ARROW-1701 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0 > > > see http://pytorch.org/docs/master/tensors.html > This should be optional and only included if the user has PyTorch installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2140) [Python] Conversion from Numpy float16 array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398732#comment-16398732 ] ASF GitHub Bot commented on ARROW-2140: --- pitrou commented on issue #1744: ARROW-2140: [Python] Improve float16 support URL: https://github.com/apache/arrow/pull/1744#issuecomment-373059009 The Travis-CI failure is due to a regression in a Cython 0.28: https://github.com/cython/cython/issues/2148 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Conversion from Numpy float16 array unimplemented > -- > > Key: ARROW-2140 > URL: https://issues.apache.org/jira/browse/ARROW-2140 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > {code} > >>> arr = np.array([1.5], dtype=np.float16) > >>> pa.array(arr, type=pa.float16()) > Traceback (most recent call last): > File "", line 1, in > pa.array(arr) > File "array.pxi", line 177, in pyarrow.lib.array > File "array.pxi", line 84, in pyarrow.lib._ndarray_to_array > File "public-api.pxi", line 158, in pyarrow.lib.pyarrow_wrap_array > KeyError: 10 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2140) [Python] Conversion from Numpy float16 array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2140: -- Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-1701) [Serialization] Support zero copy PyTorch Tensor serialization
[ https://issues.apache.org/jira/browse/ARROW-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398699#comment-16398699 ] ASF GitHub Bot commented on ARROW-1701: --- ppwwyyxx commented on issue #1223: ARROW-1701: [Serialization] Support zero copy PyTorch Tensor serialization URL: https://github.com/apache/arrow/pull/1223#issuecomment-373049215 Can we push the update to pypi? (Just bitten by the same issue again) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (ARROW-2140) [Python] Conversion from Numpy float16 array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398688#comment-16398688 ] ASF GitHub Bot commented on ARROW-2140: --- pitrou opened a new pull request #1744: ARROW-2140: [Python] Improve float16 support URL: https://github.com/apache/arrow/pull/1744 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (ARROW-45) Python: Add unnest/flatten function for List types
[ https://issues.apache.org/jira/browse/ARROW-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398663#comment-16398663 ] Wes McKinney commented on ARROW-45: --- Yes > Python: Add unnest/flatten function for List types > -- > > Key: ARROW-45 > URL: https://issues.apache.org/jira/browse/ARROW-45 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.
[ https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398658#comment-16398658 ] Wes McKinney commented on ARROW-2308: - Making tensors 64-byte aligned makes sense to me. There's some ongoing refactoring related to this in ARROW-1860 -- I suggest we work on resolving all of these issues together > Serialized tensor data should be 64-byte aligned. > - > > Key: ARROW-2308 > URL: https://issues.apache.org/jira/browse/ARROW-2308 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Robert Nishihara >Priority: Major > > See [https://github.com/ray-project/ray/issues/1658] for an example of this > issue. Non-aligned data can trigger a copy when fed into TensorFlow and > things like that. > {code} > import pyarrow as pa > import numpy as np > x = np.zeros(10) > y = pa.deserialize(pa.serialize(x).to_buffer()) > x.ctypes.data % 64 # 0 (it starts out aligned) > y.ctypes.data % 64 # 48 (it is no longer aligned) > {code} > It should be possible to fix this by calling something like > {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. > Note that we already do this before writing the tensor header, but the tensor > header is not necessarily a multiple of 64 bytes, so the subsequent data can > be unaligned. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
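The `pa.serialize`/`pa.deserialize` API shown in the report was deprecated and later removed; the same 64-byte alignment concern applies to the tensor IPC path, sketched below with the current `pa.ipc.write_tensor`/`read_tensor` functions. The alignment assertion reflects the post-ARROW-2308 behaviour and Arrow's 64-byte-aligned allocator, so treat it as illustrative rather than guaranteed on every platform:

```python
import numpy as np
import pyarrow as pa

x = np.arange(64, dtype=np.float64)
tensor = pa.Tensor.from_numpy(x)

# Write the tensor into an Arrow-allocated (64-byte aligned) buffer.
sink = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, sink)
buf = sink.getvalue()

# Read it back zero-copy; after the fix the body starts on a 64-byte boundary,
# so frameworks like TensorFlow do not need to copy the data.
y = pa.ipc.read_tensor(pa.BufferReader(buf)).to_numpy()
assert np.array_equal(x, y)
assert y.ctypes.data % 64 == 0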
[jira] [Commented] (ARROW-1886) [Python] Add function to "flatten" structs within tables
[ https://issues.apache.org/jira/browse/ARROW-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398620#comment-16398620 ] Wes McKinney commented on ARROW-1886: - I believe so, yes > [Python] Add function to "flatten" structs within tables > > > Key: ARROW-1886 > URL: https://issues.apache.org/jira/browse/ARROW-1886 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > See discussion in https://issues.apache.org/jira/browse/ARROW-1873 > When a user has a struct column, it may be more efficient to flatten the > struct into multiple columns of the form {{struct_name.field_name}} for each > field in the struct. Then when you call {{to_pandas}}, Python dictionaries do > not have to be created, and the conversion will be much more efficient -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1886) [Python] Add function to "flatten" structs within tables
[ https://issues.apache.org/jira/browse/ARROW-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398563#comment-16398563 ] Antoine Pitrou commented on ARROW-1886: --- Should this happen on the C++ side as well?
[jira] [Commented] (ARROW-45) Python: Add unnest/flatten function for List types
[ https://issues.apache.org/jira/browse/ARROW-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398556#comment-16398556 ] Antoine Pitrou commented on ARROW-45: - Should this happen on the C++ side as well?
[jira] [Commented] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk
[ https://issues.apache.org/jira/browse/ARROW-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398362#comment-16398362 ] ASF GitHub Bot commented on ARROW-2304: --- xhochy closed pull request #1743: ARROW-2304: [C++] Fix HDFS MultipleClients unit test URL: https://github.com/apache/arrow/pull/1743 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 610a91fbc..e02215b5e 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -181,6 +181,8 @@ TYPED_TEST(TestHadoopFileSystem, ConnectsAgain) { TYPED_TEST(TestHadoopFileSystem, MultipleClients) { SKIP_IF_NO_DRIVER(); + ASSERT_OK(this->MakeScratchDir()); + std::shared_ptr<HadoopFileSystem> client1; std::shared_ptr<HadoopFileSystem> client2; ASSERT_OK(HadoopFileSystem::Connect(&this->conf_, &client1)); @@ -189,7 +191,7 @@ TYPED_TEST(TestHadoopFileSystem, MultipleClients) { // client2 continues to function after equivalent client1 has shutdown std::vector<HdfsPathInfo> listing; - EXPECT_OK(client2->ListDirectory(this->scratch_dir_, &listing)); + ASSERT_OK(client2->ListDirectory(this->scratch_dir_, &listing)); ASSERT_OK(client2->Disconnect()); } This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] MultipleClients test in io-hdfs-test fails on trunk > - > > Key: ARROW-2304 > URL: https://issues.apache.org/jira/browse/ARROW-2304 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Critical > Labels: pull-request-available > Fix For: 0.9.0 > > > This fails for me locally: > {code} > [ RUN ] TestHadoopFileSystem/0.MultipleClients > ../src/arrow/io/io-hdfs-test.cc:192: Failure > Value of: s.ok() > Actual: false > Expected: true > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk
[ https://issues.apache.org/jira/browse/ARROW-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-2304. Resolution: Fixed Issue resolved by pull request 1743 [https://github.com/apache/arrow/pull/1743]
[jira] [Resolved] (ARROW-2306) [Python] HDFS test failures
[ https://issues.apache.org/jira/browse/ARROW-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-2306. Resolution: Fixed Issue resolved by pull request 1742 [https://github.com/apache/arrow/pull/1742] > [Python] HDFS test failures > --- > > Key: ARROW-2306 > URL: https://issues.apache.org/jira/browse/ARROW-2306 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.9.0 > > > These weren't caught because we aren't running the HDFS tests in Travis CI > {code} > pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_no_partitions > FAILED > >>> traceback > >>> > self = testMethod=test_write_to_dataset_no_partitions> > @test_parquet.parquet > def test_write_to_dataset_no_partitions(self): > tmpdir = pjoin(self.tmp_path, 'write-no_partitions-' + guid()) > self.hdfs.mkdir(tmpdir) > test_parquet._test_write_to_dataset_no_partitions( > > tmpdir, filesystem=self.hdfs) > pyarrow/tests/test_hdfs.py:367: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > pyarrow/tests/test_parquet.py:1475: in _test_write_to_dataset_no_partitions > filesystem=filesystem) > pyarrow/parquet.py:1059: in write_to_dataset > _mkdir_if_not_exists(fs, root_path) > pyarrow/parquet.py:1006: in _mkdir_if_not_exists > if fs._isfilestore() and not fs.exists(path): > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > self = > def _isfilestore(self): > """ > Returns True if this FileSystem is a unix-style file store with > directories. 
> """ > > raise NotImplementedError > E NotImplementedError > pyarrow/filesystem.py:143: NotImplementedError > >> entering PDB > >> >> > > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore() > -> raise NotImplementedError > (Pdb) c > pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_with_partitions > FAILED > >>> traceback > >>> > self = testMethod=test_write_to_dataset_with_partitions> > @test_parquet.parquet > def test_write_to_dataset_with_partitions(self): > tmpdir = pjoin(self.tmp_path, 'write-partitions-' + guid()) > self.hdfs.mkdir(tmpdir) > test_parquet._test_write_to_dataset_with_partitions( > > tmpdir, filesystem=self.hdfs) > pyarrow/tests/test_hdfs.py:360: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > pyarrow/tests/test_parquet.py:1433: in _test_write_to_dataset_with_partitions > filesystem=filesystem) > pyarrow/parquet.py:1059: in write_to_dataset > _mkdir_if_not_exists(fs, root_path) > pyarrow/parquet.py:1006: in _mkdir_if_not_exists > if fs._isfilestore() and not fs.exists(path): > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > self = > def _isfilestore(self): > """ > Returns True if this FileSystem is a unix-style file store with > directories. > """ > > raise NotImplementedError > E NotImplementedError > pyarrow/filesystem.py:143: NotImplementedError > >> entering PDB > >> >> > > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore() > -> raise NotImplementedError > (Pdb) c > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2306) [Python] HDFS test failures
[ https://issues.apache.org/jira/browse/ARROW-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398360#comment-16398360 ] ASF GitHub Bot commented on ARROW-2306: --- xhochy closed pull request #1742: ARROW-2306: [Python] Fix partitioned Parquet test against HDFS URL: https://github.com/apache/arrow/pull/1742 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/pyarrow/hdfs.py b/python/pyarrow/hdfs.py index 3f2014b65..34ddfaef3 100644 --- a/python/pyarrow/hdfs.py +++ b/python/pyarrow/hdfs.py @@ -40,6 +40,13 @@ def __reduce__(self): return (HadoopFileSystem, (self.host, self.port, self.user, self.kerb_ticket, self.driver)) +def _isfilestore(self): +""" +Returns True if this FileSystem is a unix-style file store with +directories. +""" +return True + @implements(FileSystem.isdir) def isdir(self, path): return super(HadoopFileSystem, self).isdir(path) diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index fd9c740f1..0929a1549 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -1103,6 +1103,9 @@ def write_metadata(schema, where, version='1.0', coerce_timestamps : string, default None Cast timestamps a particular resolution. 
Valid values: {None, 'ms', 'us'} +filesystem : FileSystem, default None +If nothing passed, paths assumed to be found in the local on-disk +filesystem """ writer = ParquetWriter( where, schema, version=version, diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index a3da05fe3..b301de606 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -1431,8 +1431,15 @@ def _test_write_to_dataset_with_partitions(base_path, filesystem=None): output_table = pa.Table.from_pandas(output_df) pq.write_to_dataset(output_table, base_path, partition_by, filesystem=filesystem) -pq.write_metadata(output_table.schema, - os.path.join(base_path, '_common_metadata')) + +metadata_path = os.path.join(base_path, '_common_metadata') + +if filesystem is not None: +with filesystem.open(metadata_path, 'wb') as f: +pq.write_metadata(output_table.schema, f) +else: +pq.write_metadata(output_table.schema, metadata_path) + dataset = pq.ParquetDataset(base_path, filesystem=filesystem) # ARROW-2209: Ensure the dataset schema also includes the partition columns dataset_cols = set(dataset.schema.to_arrow_schema().names) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398355#comment-16398355 ] Antoine Pitrou commented on ARROW-640: -- I don't think we're concerned about particular workloads for now. Something like {{%timeit hash\(x)}} (in IPython syntax) is a good micro-benchmark for this. Integer is the main type that I think might be use in a hashing context so you may want to write a native hash implementation for them, while letting other types defer to {{as_py}}. Also in some cases (such as StructValue), the {{as_py}} fallback won't work. We may or may not care about this immediately (i.e. if you only want to implement numbers, we can open an issue for the other types). > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)