[jira] [Created] (ARROW-7004) [Plasma] Make it possible to bump up object in LRU cache
Philipp Moritz created ARROW-7004: - Summary: [Plasma] Make it possible to bump up object in LRU cache Key: ARROW-7004 URL: https://issues.apache.org/jira/browse/ARROW-7004 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Philipp Moritz Assignee: Philipp Moritz To avoid evicting objects too early, we sometimes want to bump a number of objects up in the LRU cache. While it would be possible to call Get() on these objects, this can be undesirable, since Get() blocks if the objects are not available and makes it necessary to call Release() on them afterwards. -- This message was sent by Atlassian Jira (v8.3.4#803005)
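To illustrate what "bumping" means mechanically, here is a standalone sketch of the usual LRU bookkeeping trick (a std::list of ids plus an index map), independent of Plasma's actual eviction code; the LruIndex class and its method names are purely illustrative.
{code:java}
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

// Minimal LRU bookkeeping: most recently used ids live at the front of the
// list; eviction would pick victims from the back.
class LruIndex {
 public:
  void Add(const std::string& id) {
    order_.push_front(id);
    position_[id] = order_.begin();
  }
  // "Bump" an object: move it to the most-recently-used position without
  // otherwise touching it (no blocking, no reference counting).
  void Bump(const std::string& id) {
    auto it = position_.find(id);
    if (it == position_.end()) return;  // unknown object: no-op
    order_.splice(order_.begin(), order_, it->second);
  }
  std::string LeastRecentlyUsed() const { return order_.back(); }

 private:
  std::list<std::string> order_;
  std::unordered_map<std::string, std::list<std::string>::iterator> position_;
};

int main() {
  LruIndex lru;
  lru.Add("object-a");
  lru.Add("object-b");   // eviction order is now: object-a first
  lru.Bump("object-a");  // protect object-a from being evicted next
  std::cout << "next eviction candidate: " << lru.LeastRecentlyUsed() << "\n";
  return 0;
}
{code}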
[jira] [Created] (ARROW-5904) [Java] [Plasma] Fix compilation of Plasma Java client
Philipp Moritz created ARROW-5904: - Summary: [Java] [Plasma] Fix compilation of Plasma Java client Key: ARROW-5904 URL: https://issues.apache.org/jira/browse/ARROW-5904 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz This is broken since the introduction of user-defined Status messages: {code:java} external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc: In function '_jobject* Java_org_apache_arrow_plasma_PlasmaClientJNI_create(JNIEnv*, jclass, jlong, jbyteArray, jint, jbyteArray)': external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:114:9: error: 'class arrow::Status' has no member named 'IsPlasmaObjectExists' if (s.IsPlasmaObjectExists()) { ^ external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:120:9: error: 'class arrow::Status' has no member named 'IsPlasmaStoreFull' if (s.IsPlasmaStoreFull()) { ^{code} [~guoyuhong85] Can you add this codepath to the test so we can catch this kind of breakage more quickly in the future? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5751) [Packaging][Python] Python 2.7 wheels broken on macOS: libcares.2.dylib not found
Philipp Moritz created ARROW-5751: - Summary: [Packaging][Python] Python 2.7 wheels broken on macOS: libcares.2.dylib not found Key: ARROW-5751 URL: https://issues.apache.org/jira/browse/ARROW-5751 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz I'm afraid that while [https://github.com/apache/arrow/pull/4685] fixed the macOS wheels for Python 3, the Python 2.7 wheel is still broken (with a different error): {code:java} ImportError: dlopen(/Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/lib.cpython-36m-darwin.so, 2): Library not loaded: /usr/local/opt/c-ares/lib/libcares.2.dylib Referenced from: /Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow_python.14.dylib Reason: image not found{code} I tried the same hack as in [https://github.com/apache/arrow/pull/4685] for libcares but it doesn't work (removing the .dylib fails one of the earlier build steps). I think the only way to go forward on this is to compile grpc ourselves. My attempt to do this in [https://github.com/apache/arrow/compare/master...pcmoritz:mac-wheels-py2] fails because OpenSSL is not found even though I'm specifying the OPENSSL_ROOT_DIR (see [https://travis-ci.org/pcmoritz/crossbow/builds/550603543]). Let me know if you have any ideas on how to fix this! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5690) [Packaging] macOS wheels broken: libprotobuf.18.dylib missing
Philipp Moritz created ARROW-5690: - Summary: [Packaging] macOS wheels broken: libprotobuf.18.dylib missing Key: ARROW-5690 URL: https://issues.apache.org/jira/browse/ARROW-5690 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz If I build macOS arrow wheels with crossbow from the latest master (a77257f4790c562dcb74724fc4a22c157ab36018) and install them, importing pyarrow gives the following error message: {code:java} In [1]: import pyarrow --- ImportError Traceback (most recent call last) in > 1 import pyarrow ~/anaconda3/lib/python3.6/site-packages/pyarrow/__init__.py in 47 import pyarrow.compat as compat 48 ---> 49 from pyarrow.lib import cpu_count, set_cpu_count 50 from pyarrow.lib import (null, bool_, 51 int8, int16, int32, int64, ImportError: dlopen(/Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/lib.cpython-36m-darwin.so, 2): Library not loaded: /usr/local/opt/protobuf/lib/libprotobuf.18.dylib Referenced from: /Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow.14.dylib Reason: image not found{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5671) [crossbow] mac os python wheels failing
Philipp Moritz created ARROW-5671: - Summary: [crossbow] mac os python wheels failing Key: ARROW-5671 URL: https://issues.apache.org/jira/browse/ARROW-5671 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz The building of (all?) macOS python wheels is currently failing with {code:java} Traceback (most recent call last): File "", line 3, in File "/Users/travis/build/pcmoritz/crossbow/venv/lib/python3.7/site-packages/pyarrow/__init__.py", line 49, in from pyarrow.lib import cpu_count, set_cpu_count ImportError: dlopen(/Users/travis/build/pcmoritz/crossbow/venv/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-darwin.so, 2): Library not loaded: @rpath/libarrow_boost_system.dylib Referenced from: /Users/travis/build/pcmoritz/crossbow/venv/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib Reason: image not found{code} Not sure where this was introduced :( -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5670) [crossbow] mac os python 3.5 wheel failing
Philipp Moritz created ARROW-5670: - Summary: [crossbow] mac os python 3.5 wheel failing Key: ARROW-5670 URL: https://issues.apache.org/jira/browse/ARROW-5670 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Currently the macOS python 3.5 is failing with {code:java} Downloading Apache Thrift from Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1254, in do_open h.request(req.get_method(), req.selector, req.data, headers) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1107, in request self._send_request(method, url, body, headers) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1152, in _send_request self.endheaders(body) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1103, in endheaders self._send_output(message_body) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 934, in _send_output self.send(msg) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 877, in send self.connect() File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1261, in connect server_hostname=server_hostname) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 385, in wrap_socket _context=self) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 760, in __init__ self.do_handshake() File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 996, in do_handshake self._sslobj.do_handshake() File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 641, in do_handshake self._sslobj.do_handshake() ssl.SSLError: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:719){code} I've been looking into this error and will try to push a fix (the openssl version that is used with python 3.5 on macos is too old I think). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5669) [crossbow] manylinux1 wheel building failing
Philipp Moritz created ARROW-5669: - Summary: [crossbow] manylinux1 wheel building failing Key: ARROW-5669 URL: https://issues.apache.org/jira/browse/ARROW-5669 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz I tried to set up a crossbow queue (on a0e1fbb9ef51d05a3f28e221cf8c5d4031a50c93), and right now building the manylinux1 wheels seems to be failing because of the arrow flight tests: {code:java} ___ test_tls_do_get def test_tls_do_get(): """Try a simple do_get call over TLS.""" table = simple_ints_table() > certs = example_tls_certs() usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:563: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:64: in example_tls_certs "root_cert": read_flight_resource("root-ca.pem"), usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:48: in read_flight_resource root = resource_root() _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ def resource_root(): """Get the path to the test resources directory.""" if not os.environ.get("ARROW_TEST_DATA"): > raise RuntimeError("Test resources not found; set " "ARROW_TEST_DATA to /testing") E RuntimeError: Test resources not found; set ARROW_TEST_DATA to /testing usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:41: RuntimeError{code} This may have been introduced in [https://github.com/apache/arrow/pull/4594|https://github.com/apache/arrow/pull/4594.] Any thoughts how we should proceed with this? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5027) [Python] Add JSON Reader
Philipp Moritz created ARROW-5027: - Summary: [Python] Add JSON Reader Key: ARROW-5027 URL: https://issues.apache.org/jira/browse/ARROW-5027 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Philipp Moritz Add bindings for the JSON reader. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5022) [C++] Implement more "Datum" types for AggregateKernel
Philipp Moritz created ARROW-5022: - Summary: [C++] Implement more "Datum" types for AggregateKernel Key: ARROW-5022 URL: https://issues.apache.org/jira/browse/ARROW-5022 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Currently it gives the following error if the datum isn't an array: {code:java} AggregateKernel expects Array datum{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5002) [C++] Implement GroupBy
Philipp Moritz created ARROW-5002: - Summary: [C++] Implement GroupBy Key: ARROW-5002 URL: https://issues.apache.org/jira/browse/ARROW-5002 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Dear all, I wonder what the best way forward is for implementing GroupBy kernels. Initially this was part of https://issues.apache.org/jira/browse/ARROW-4124 but is not contained in the current implementation as far as I can tell. It seems that the part of group by that just returns indices could be conveniently implemented with the HashKernel. That seems useful in any case. Is that indeed the best way forward, and should this be done? GroupBy + Aggregate could then be implemented either with that plus the Take kernel plus aggregation (which involves more memory copies than necessary), or as part of the aggregate kernel. The latter is probably preferred; any thoughts on that? Am I missing any other JIRAs related to this? Best, Philipp. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
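To make the two-phase idea above concrete, here is a standalone C++ sketch (plain STL, not Arrow kernel code) of "group by produces dense group ids, then aggregate off those ids"; all names and data are illustrative.
{code:java}
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  // Key column and value column of equal length.
  std::vector<std::string> keys = {"a", "b", "a", "c", "b", "a"};
  std::vector<int64_t> values = {1, 2, 3, 4, 5, 6};

  // Phase 1 (HashKernel-like): map each row to a dense group id.
  std::unordered_map<std::string, int32_t> group_of_key;
  std::vector<int32_t> group_ids(keys.size());
  for (size_t i = 0; i < keys.size(); ++i) {
    auto inserted =
        group_of_key.emplace(keys[i], static_cast<int32_t>(group_of_key.size()));
    group_ids[i] = inserted.first->second;
  }

  // Phase 2 (aggregate): sum the values of each group using the group ids.
  std::vector<int64_t> sums(group_of_key.size(), 0);
  for (size_t i = 0; i < values.size(); ++i) {
    sums[group_ids[i]] += values[i];
  }

  for (const auto& kv : group_of_key) {
    std::cout << kv.first << " -> " << sums[kv.second] << "\n";
  }
  return 0;
}
{code}
Aggregating directly off the group ids avoids materializing per-group index lists and the extra Take/copy step mentioned above, which is roughly the trade-off between the two options discussed in the ticket.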
[jira] [Created] (ARROW-4983) [Plasma] Unmap memory when the client is destroyed
Philipp Moritz created ARROW-4983: - Summary: [Plasma] Unmap memory when the client is destroyed Key: ARROW-4983 URL: https://issues.apache.org/jira/browse/ARROW-4983 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Affects Versions: 0.12.1 Reporter: Philipp Moritz Assignee: Philipp Moritz Currently the plasma memory mapped into the client is not unmapped upon destruction of the client, which can cause memory mapped files to be kept around longer than necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4958) [C++] Purely static linking broken
Philipp Moritz created ARROW-4958: - Summary: [C++] Purely static linking broken Key: ARROW-4958 URL: https://issues.apache.org/jira/browse/ARROW-4958 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz On the current master, 816c10d030842a1a0da4d00f95a5e3749c86a74f (#3965), running {code:java} docker-compose build cpp docker-compose run cpp-static-only{code} yields {code:java} [357/382] Linking CXX executable debug/parquet-encoding-benchmark FAILED: debug/parquet-encoding-benchmark : && /opt/conda/bin/ccache /usr/bin/g++ -Wno-noexcept-type -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Werror -msse4.2 -g -rdynamic src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o -o debug/parquet-encoding-benchmark -Wl,-rpath,/opt/conda/lib /opt/conda/lib/libbenchmark_main.a debug/libparquet.a /opt/conda/lib/libbenchmark.a debug/libarrow.a /opt/conda/lib/libdouble-conversion.a /opt/conda/lib/libbrotlienc.so /opt/conda/lib/libbrotlidec.so /opt/conda/lib/libbrotlicommon.so /opt/conda/lib/libbz2.so /opt/conda/lib/liblz4.so /opt/conda/lib/libsnappy.so.1.1.7 /opt/conda/lib/libz.so /opt/conda/lib/libzstd.so orc_ep-install/lib/liborc.a /opt/conda/lib/libprotobuf.so /opt/conda/lib/libglog.so /opt/conda/lib/libboost_system.so /opt/conda/lib/libboost_filesystem.so jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt /opt/conda/lib/libboost_regex.so /opt/conda/lib/libthrift.so && : src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: In function `testing::AssertionResult::AppendMessage(testing::Message const&)': /opt/conda/include/gtest/gtest.h:352: undefined reference to `testing::Message::GetString[abi:cxx11]() const' src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: In function `parquet::BenchmarkDecodeArrow::InitDataInputs()': /arrow/cpp/src/parquet/encoding-benchmark.cc:201: undefined reference to `arrow::random::RandomArrayGenerator::StringWithRepeats(long, long, int, int, double)' src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: In function `parquet::BM_DictDecodingByteArray::DoEncodeData()': /arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to `testing::internal::AlwaysTrue()' /arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to `testing::internal::AlwaysTrue()' /arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to `testing::Message::Message()' /arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to `testing::internal::AssertHelper::AssertHelper(testing::TestPartResult::Type, char const*, int, char const*)' /arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to `testing::internal::AssertHelper::operator=(testing::Message const&) const' /arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to `testing::internal::AssertHelper::~AssertHelper()' /arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to `testing::Message::Message()' /arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to `testing::internal::AssertHelper::AssertHelper(testing::TestPartResult::Type, char const*, int, char const*)' /arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to `testing::internal::AssertHelper::operator=(testing::Message const&) const' /arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to `testing::internal::AssertHelper::~AssertHelper()' 
/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to `testing::internal::AssertHelper::~AssertHelper()' /arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to `testing::internal::AssertHelper::~AssertHelper()' src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: In function `testing::internal::scoped_ptr, std::allocator > >::reset(std::__cxx11::basic_string, std::allocator >*)': /opt/conda/include/gtest/internal/gtest-port.h:1215: undefined reference to `testing::internal::IsTrue(bool)' src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: In function `testing::AssertionResult testing::internal::CmpHelperNE >*, decltype(nullptr)>(char const*, char const*, parquet::DictEncoder >* const&, decltype(nullptr) const&)': /opt/conda/include/gtest/gtest.h:1573: undefined reference to `testing::AssertionSuccess()' src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: In function `testing::internal::scoped_ptr, std::allocator > >::reset(std::__cxx11::basic_stringstream, std::allocator >*)': /opt/conda/include/gtest/internal/gtest-port.h:1215: undefined reference to `testing::internal::IsTrue(bool)'
[jira] [Created] (ARROW-4912) [C++, Python] Allow specifying column names to CSV reader
Philipp Moritz created ARROW-4912: - Summary: [C++, Python] Allow specifying column names to CSV reader Key: ARROW-4912 URL: https://issues.apache.org/jira/browse/ARROW-4912 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Currently I think there is no way to specify custom column names for CSV files. It's possible to specify the full schema of the file, but not just column names. See the related discussion here: ARROW-3722 The goal of this is to re-use the CSV type-inference but still allow people to specify custom names for the columns. As far as I know, there is currently no way to set column names post-hoc, so we should provide a way to specify them before reading the file. Related to this, ParseOptions(header_rows=0) is not currently implemented. Is there any current way to do this or does this need to be implemented? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4905) [C++][Plasma] Remove dlmalloc from client library
Philipp Moritz created ARROW-4905: - Summary: [C++][Plasma] Remove dlmalloc from client library Key: ARROW-4905 URL: https://issues.apache.org/jira/browse/ARROW-4905 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Affects Versions: 0.12.1 Reporter: Philipp Moritz Assignee: Philipp Moritz While working on the Ray build system, I noticed that dlmalloc symbols are leaking into the plasma client library. They should be separated out and only linked into the store. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4797) [Plasma] Avoid store crash if not enough memory is available
Philipp Moritz created ARROW-4797: - Summary: [Plasma] Avoid store crash if not enough memory is available Key: ARROW-4797 URL: https://issues.apache.org/jira/browse/ARROW-4797 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Currently, the plasma server exits with a fatal check if not enough memory is available. This can lead to errors that are hard to diagnose, see [https://github.com/ray-project/ray/issues/3670] Instead, we should keep the store alive in these circumstances, taking up some of the remaining memory and allowing the client to check whether enough memory has been allocated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
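As a rough sketch of the behavior being asked for (generic code, not the actual store implementation): the fatal check becomes a result the caller can inspect, reporting how much memory could actually be reserved. The AllocationResult type and ReserveMemory function are hypothetical.
{code:java}
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical sketch: instead of aborting when the requested capacity cannot
// be reserved, report the shortfall so the client can retry or degrade.
struct AllocationResult {
  bool ok;
  int64_t granted_bytes;
  std::string message;
};

AllocationResult ReserveMemory(int64_t requested_bytes, int64_t available_bytes) {
  if (requested_bytes <= available_bytes) {
    return {true, requested_bytes, ""};
  }
  // Previously this path was a fatal check; here we grant what is available
  // and tell the caller how much it actually got.
  return {false, available_bytes,
          "could not reserve " + std::to_string(requested_bytes) + " bytes"};
}

int main() {
  AllocationResult r = ReserveMemory(/*requested_bytes=*/8LL << 30,
                                     /*available_bytes=*/2LL << 30);
  if (!r.ok) {
    std::cout << r.message << "; granted " << r.granted_bytes << " bytes instead\n";
  }
  return 0;
}
{code}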
[jira] [Created] (ARROW-4757) Nested chunked array support
Philipp Moritz created ARROW-4757: - Summary: Nested chunked array support Key: ARROW-4757 URL: https://issues.apache.org/jira/browse/ARROW-4757 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Dear all, I'm currently trying to lift the 2GB limit on the python serialization. For this, I implemented a chunked union builder to split the array into smaller arrays. However, some of the children of the union array can be ListArrays, which can themselves contain UnionArrays which can contain ListArrays etc. I'm at a bit of a loss how to handle this. In principle I'd like to chunk the children too. However, currently UnionArrays can only have children of type Array, and there is no way to treat a chunked array (which is a vector of Arrays) as an Array to store it as a child of a UnionArray. Any ideas how to best support this use case? -- Philipp. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4690) Building TensorFlow compatible wheels for Arrow
Philipp Moritz created ARROW-4690: - Summary: Building TensorFlow compatible wheels for Arrow Key: ARROW-4690 URL: https://issues.apache.org/jira/browse/ARROW-4690 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Since the inclusion of LLVM, arrow wheels stopped working with TensorFlow again (on some configurations at least). While we are continuing to discuss a more permanent solution in [https://groups.google.com/a/tensorflow.org/d/topic/developers/TMqRaT-H2bI/discussion], I made some progress in creating TensorFlow-compatible wheels for an unmodified pyarrow. They won't adhere to the manylinux1 standard, but they should be as compatible as the TensorFlow wheels because they use the same build environment (ubuntu 14.04). I'll create a PR with the necessary changes. I don't propose to ship these wheels, but it might be a good idea to include the docker image and instructions on how to build them in the tree for organizations that want to use tensorflow with pyarrow on top of pip. The official recommendation should probably be to use conda if the average user wants to do this for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4491) [Python] Remove usage of std::to_string and std::stoi
Philipp Moritz created ARROW-4491: - Summary: [Python] Remove usage of std::to_string and std::stoi Key: ARROW-4491 URL: https://issues.apache.org/jira/browse/ARROW-4491 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Not sure why this is happening, but for some older compilers I'm seeing {code:java} terminate called after throwing an instance of 'std::invalid_argument' what(): stoi{code} since [https://github.com/apache/arrow/pull/3423]. A possible cause is that there is no int8_t version of [https://en.cppreference.com/w/cpp/string/basic_string/to_string], so it might not convert the value to a proper string representation of the number. Any insight on why this could be happening is appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
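For illustration, one way this kind of failure can arise when (u)int8_t values are formatted as strings: streaming an int8_t writes a character rather than digits, and std::stoi on the result throws std::invalid_argument. This is only a plausible mechanism sketched for context, not necessarily what the PR above is actually hitting.
{code:java}
#include <cstdint>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

int main() {
  int8_t value = 3;

  // int8_t is typically an alias for signed char, so streaming it writes the
  // character with code 3 rather than the digits "3".
  std::ostringstream os;
  os << value;
  std::string text = os.str();

  try {
    int parsed = std::stoi(text);  // not a numeric string -> throws
    std::cout << "parsed " << parsed << "\n";
  } catch (const std::invalid_argument&) {
    std::cout << "std::stoi threw std::invalid_argument (what(): stoi)\n";
  }

  // Promoting to int first avoids the problem:
  std::string fixed = std::to_string(static_cast<int>(value));
  std::cout << "std::to_string(int) gives: " << fixed << "\n";
  return 0;
}
{code}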
[jira] [Created] (ARROW-4453) [Python] Create Cython wrappers for sparse array
Philipp Moritz created ARROW-4453: - Summary: [Python] Create Cython wrappers for sparse array Key: ARROW-4453 URL: https://issues.apache.org/jira/browse/ARROW-4453 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Philipp Moritz We should have cython wrappers for [https://github.com/apache/arrow/pull/2546] This is related to support for https://issues.apache.org/jira/browse/ARROW-4223 and https://issues.apache.org/jira/browse/ARROW-4224 I imagine the code would be similar to https://github.com/apache/arrow/blob/5a502d281545402240e818d5fd97a9aaf36363f2/python/pyarrow/array.pxi#L748 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4452) [Python] Serializing sparse torch tensors
Philipp Moritz created ARROW-4452: - Summary: [Python] Serializing sparse torch tensors Key: ARROW-4452 URL: https://issues.apache.org/jira/browse/ARROW-4452 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Using the pytorch serialization handler on sparse Tensors: {code:java} import torch import pyarrow from pyarrow.serialization import register_torch_serialization_handlers i = torch.LongTensor([[0, 2], [1, 0], [1, 2]]) v = torch.FloatTensor([3, 4, 5]) tensor = torch.sparse.FloatTensor(i.t(), v, torch.Size([2, 3])) register_torch_serialization_handlers(pyarrow.serialization._default_serialization_context) s = pyarrow.serialize(tensor, context=pyarrow.serialization._default_serialization_context) {code} Produces this result: {code:java} TypeError: can't convert sparse tensor to numpy. Use Tensor.to_dense() to convert to a dense tensor first.{code} We should provide a way to serialize sparse torch tensors, especially now that we are getting support for sparse Tensors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4378) [Plasma] Release objects upon Create
Philipp Moritz created ARROW-4378: - Summary: [Plasma] Release objects upon Create Key: ARROW-4378 URL: https://issues.apache.org/jira/browse/ARROW-4378 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.13.0 Reporter: Philipp Moritz Similar to the way that {code:java} Get(const std::vector<ObjectID>& object_ids, int64_t timeout_ms, std::vector<ObjectBuffer>* out){code} releases the object when the shared_ptr inside of ObjectBuffer goes out of scope, the same should happen for {code} Status Create(const ObjectID& object_id, int64_t data_size, const uint8_t* metadata, int64_t metadata_size, std::shared_ptr<Buffer>* data); {code} At the moment people have to remember to call Release() after they have created and sealed the object, which can make the C++ API cumbersome to use. Thanks to [~anuragkh] for reporting this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
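For context, a sketch of the workflow the ticket wants to simplify, assuming roughly the client API quoted above; the socket path and object id are made up, and the exact signatures should be checked against the current plasma/client.h.
{code:java}
#include <memory>
#include <string>

#include <arrow/buffer.h>
#include <arrow/status.h>
#include <plasma/client.h>

// Sketch of today's workflow: after Create() + Seal(), the caller must also
// remember to call Release(); the proposal is to tie that Release() to the
// lifetime of the returned buffer, the way Get() already does.
int main() {
  plasma::PlasmaClient client;
  if (!client.Connect("/tmp/plasma", "").ok()) return 1;

  plasma::ObjectID id = plasma::ObjectID::from_binary(std::string(20, 'x'));
  std::shared_ptr<arrow::Buffer> data;
  if (!client.Create(id, /*data_size=*/64, /*metadata=*/nullptr,
                     /*metadata_size=*/0, &data).ok()) return 1;
  // ... write the payload into data->mutable_data() ...
  if (!client.Seal(id).ok()) return 1;
  if (!client.Release(id).ok()) return 1;  // the extra step that is easy to forget
  return client.Disconnect().ok() ? 0 : 1;
}
{code}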
[jira] [Created] (ARROW-4285) [Python] Use proper builder interface for serialization
Philipp Moritz created ARROW-4285: - Summary: [Python] Use proper builder interface for serialization Key: ARROW-4285 URL: https://issues.apache.org/jira/browse/ARROW-4285 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.12.0 Reporter: Philipp Moritz As a preparation for ARROW-3919, refactor the python serialization code such that the default builder interface is used. In the next step we can then plug in ChunkedBuilders to make sure that the generated arrays are properly chunked. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4269) [Python] AttributeError: module 'pandas.core' has no attribute 'arrays'
Philipp Moritz created ARROW-4269: - Summary: [Python] AttributeError: module 'pandas.core' has no attribute 'arrays' Key: ARROW-4269 URL: https://issues.apache.org/jira/browse/ARROW-4269 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz This happens with pandas 0.22: ``` In [1]: import pyarrow --- AttributeError Traceback (most recent call last) in () > 1 import pyarrow ~/arrow/python/pyarrow/__init__.py in () 174 localfs = LocalFileSystem.get_instance() 175 --> 176 from pyarrow.serialization import (default_serialization_context, 177 register_default_serialization_handlers, 178 register_torch_serialization_handlers) ~/arrow/python/pyarrow/serialization.py in () 303 304 --> 305 register_default_serialization_handlers(_default_serialization_context) ~/arrow/python/pyarrow/serialization.py in register_default_serialization_handlers(serialization_context) 294 custom_deserializer=_deserialize_pyarrow_table) 295 --> 296 _register_custom_pandas_handlers(serialization_context) 297 298 ~/arrow/python/pyarrow/serialization.py in _register_custom_pandas_handlers(context) 175 custom_deserializer=_load_pickle_from_buffer) 176 --> 177 if hasattr(pd.core.arrays, 'interval'): 178 context.register_type( 179 pd.core.arrays.interval.IntervalArray, AttributeError: module 'pandas.core' has no attribute 'arrays' ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4249) [Plasma] Remove reference to logging.h from plasma/common.h
Philipp Moritz created ARROW-4249: - Summary: [Plasma] Remove reference to logging.h from plasma/common.h Key: ARROW-4249 URL: https://issues.apache.org/jira/browse/ARROW-4249 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.11.1 Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.13.0 It is not needed there and pollutes the namespace of applications that use the plasma client with arrow's DCHECK macros (DCHECK is a name widely used in other projects). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4217) [Plasma] Remove custom object metadata
Philipp Moritz created ARROW-4217: - Summary: [Plasma] Remove custom object metadata Key: ARROW-4217 URL: https://issues.apache.org/jira/browse/ARROW-4217 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.11.1 Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.13.0 Currently, Plasma supports custom metadata for objects. This doesn't seem to be used at the moment, and it will simplify the interface and implementation to remove it. Removing the custom metadata will also make eviction to other blob stores easier (most other stores don't support custom metadata). My personal use case was to store arrow schemata in there, but they are now stored as part of the object itself. If nobody else is using this, I'd suggest removing it. If people really want metadata, they could always store it as a separate object if desired. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4025) [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some settings
Philipp Moritz created ARROW-4025: - Summary: [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some settings Key: ARROW-4025 URL: https://issues.apache.org/jira/browse/ARROW-4025 Project: Apache Arrow Issue Type: Improvement Affects Versions: 0.11.1 Reporter: Philipp Moritz See the bug report in [https://github.com/ray-project/ray/issues/3520] I wonder if we can revisit this issue and try to get rid of the workarounds we tried to deploy in the past. See also the discussion in [https://github.com/apache/arrow/pull/2096] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4024) [Python] Cython compilation error on cython==0.27.3
Philipp Moritz created ARROW-4024: - Summary: [Python] Cython compilation error on cython==0.27.3 Key: ARROW-4024 URL: https://issues.apache.org/jira/browse/ARROW-4024 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz On the latest master, I'm getting the following error: {code:java} [ 11%] Compiling Cython CXX source for lib... Error compiling Cython file: ... out.init(type) return out cdef object pyarrow_wrap_metadata( ^ pyarrow/public-api.pxi:95:5: Function signature does not match previous declaration CMakeFiles/lib_pyx.dir/build.make:57: recipe for target 'CMakeFiles/lib_pyx' failed{code} With 0.29.0 it is working. This might have been introduced in [https://github.com/apache/arrow/commit/12201841212967c78e31b2d2840b55b1707c4e7b] but I'm not sure. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3958) [Plasma] Reduce number of IPCs
Philipp Moritz created ARROW-3958: - Summary: [Plasma] Reduce number of IPCs Key: ARROW-3958 URL: https://issues.apache.org/jira/browse/ARROW-3958 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.11.1 Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.12.0 Currently we ship file descriptors of objects from the store to the client every time an object is created or gotten. There are relatively few distinct file descriptors, so caching them can get rid of one IPC in the majority of cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3950) [Plasma] Don't force loading the TensorFlow op on import
Philipp Moritz created ARROW-3950: - Summary: [Plasma] Don't force loading the TensorFlow op on import Key: ARROW-3950 URL: https://issues.apache.org/jira/browse/ARROW-3950 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Assignee: Philipp Moritz In certain situations, users want more control over when the TensorFlow op is loaded, so we should make it optional (even if it exists). This happens in Ray, for example, where we need to make sure that if multiple Python workers try to compile and import the TensorFlow op in parallel, there is no race condition (e.g. one worker could try to import a half-built version of the op). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3934) [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off
Philipp Moritz created ARROW-3934: - Summary: [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off Key: ARROW-3934 URL: https://issues.apache.org/jira/browse/ARROW-3934 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.12.0 Currently the precompiled tests are compiled in any case, even if ARROW_GANDIVA_BUILD_TESTS=off. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3919) [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize
Philipp Moritz created ARROW-3919: - Summary: [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize Key: ARROW-3919 URL: https://issues.apache.org/jira/browse/ARROW-3919 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz see https://github.com/modin-project/modin/issues/266 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3746) [Gandiva] [Python] Make it possible to list all functions registered with Gandiva
Philipp Moritz created ARROW-3746: - Summary: [Gandiva] [Python] Make it possible to list all functions registered with Gandiva Key: ARROW-3746 URL: https://issues.apache.org/jira/browse/ARROW-3746 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz This will also be useful for documentation purposes (right now it is not very easy to get a list of all the functions that are registered). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3718) [Gandiva] Remove spurious gtest include
Philipp Moritz created ARROW-3718: - Summary: [Gandiva] Remove spurious gtest include Key: ARROW-3718 URL: https://issues.apache.org/jira/browse/ARROW-3718 Project: Apache Arrow Issue Type: Improvement Components: Gandiva Affects Versions: 0.11.1 Reporter: Philipp Moritz Fix For: 0.12.0 At the moment, cpp/src/gandiva/expr_decomposer.h includes a gtest header, which can prevent gandiva from being built without the gtest dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3721) [Gandiva] [Python] Support all Gandiva literals
Philipp Moritz created ARROW-3721: - Summary: [Gandiva] [Python] Support all Gandiva literals Key: ARROW-3721 URL: https://issues.apache.org/jira/browse/ARROW-3721 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Support all the literals from [https://github.com/apache/arrow/blob/5b116ab175292fe70ed3c8727bcc6868b9695f4a/cpp/src/gandiva/tree_expr_builder.h#L35] in the Cython bindings. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3659) Clang Travis build (matrix entry 2) might not actually be using clang
Philipp Moritz created ARROW-3659: - Summary: Clang Travis build (matrix entry 2) might not actually be using clang Key: ARROW-3659 URL: https://issues.apache.org/jira/browse/ARROW-3659 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz See for example [https://travis-ci.org/apache/arrow/jobs/448267169]: {code:java} Setting environment variables from .travis.yml $ export ANACONDA_TOKEN=[secure] $ export ARROW_TRAVIS_USE_TOOLCHAIN=1 $ export ARROW_TRAVIS_VALGRIND=1 $ export ARROW_TRAVIS_PLASMA=1 $ export ARROW_TRAVIS_ORC=1 $ export ARROW_TRAVIS_COVERAGE=1 $ export ARROW_TRAVIS_PARQUET=1 $ export ARROW_TRAVIS_PYTHON_DOCS=1 $ export ARROW_BUILD_WARNING_LEVEL=CHECKIN $ export ARROW_TRAVIS_PYTHON_JVM=1 $ export ARROW_TRAVIS_JAVA_BUILD_ONLY=1 $ export CC="clang-6.0" $ export CXX="clang++-6.0" $ export TRAVIS_COMPILER=gcc $ export CXX=g++ $ export CC=gcc $ export PATH=/usr/lib/ccache:$PATH cache.1 Setting up build cache{code} The CC and CXX environment variables are overwritten by Travis (because the Travis toolchain is set to gcc). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3602) [Gandiva] [Python] Add preliminary Cython bindings for Gandiva
Philipp Moritz created ARROW-3602: - Summary: [Gandiva] [Python] Add preliminary Cython bindings for Gandiva Key: ARROW-3602 URL: https://issues.apache.org/jira/browse/ARROW-3602 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.11.1 Reporter: Philipp Moritz Fix For: 0.12.0 Adding a first version of Cython bindings to Gandiva so it can be called from Python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3589) [Gandiva] Make it possible to compile gandiva without JNI
Philipp Moritz created ARROW-3589: - Summary: [Gandiva] Make it possible to compile gandiva without JNI Key: ARROW-3589 URL: https://issues.apache.org/jira/browse/ARROW-3589 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz When trying to compile arrow with {code:java} cmake -DARROW_PYTHON=on -DARROW_GANDIVA=on -DARROW_PLASMA=on ..{code} I'm seeing the following error right now: {code:java} CMake Error at /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:137 (message): Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH) Call Stack (most recent call first): /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE) /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindJNI.cmake:356 (FIND_PACKAGE_HANDLE_STANDARD_ARGS) src/gandiva/jni/CMakeLists.txt:21 (find_package) -- Configuring incomplete, errors occurred{code} It should be possible to compile the C++ gandiva code without JNI bindings; how about we introduce a new flag "-DARROW_GANDIVA_JAVA=off" (which could be on by default if desired)? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3243) [C++] Upgrade jemalloc to version 5
Philipp Moritz created ARROW-3243: - Summary: [C++] Upgrade jemalloc to version 5 Key: ARROW-3243 URL: https://issues.apache.org/jira/browse/ARROW-3243 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Is it possible/feasible to upgrade jemalloc to version 5 and assume that version? I'm asking because I've been working towards replacing dlmalloc in plasma with jemalloc, which makes some of the code much nicer and removes some of the issues we had with dlmalloc. However, it requires jemalloc APIs that are only available starting from jemalloc version 5; in particular, I'm using the extent_hooks_t capability. For now I can submit a patch that uses a different version of jemalloc in plasma and then we can figure out how to deal with it (maybe there is a way to make it work with older versions). What are your thoughts? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3199) [Plasma] Check for EAGAIN in recvmsg and sendmsg
Philipp Moritz created ARROW-3199: - Summary: [Plasma] Check for EAGAIN in recvmsg and sendmsg Key: ARROW-3199 URL: https://issues.apache.org/jira/browse/ARROW-3199 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Fix For: 0.10.0 It turns out that [https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L63] and probably also [https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L49] can block and give an EAGAIN error. This was discovered during stress tests by https://github.com/stephanie-wang/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
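For reference, the common way to handle this is to treat EINTR/EAGAIN/EWOULDBLOCK as retryable rather than fatal. The following is a generic POSIX sketch of that pattern, not the actual patch applied to plasma/fling.cc.
{code:java}
#include <cerrno>
#include <cstring>
#include <iostream>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Retry sendmsg() while the call is interrupted or would block.
ssize_t SendmsgRetry(int fd, const struct msghdr* msg) {
  while (true) {
    ssize_t n = sendmsg(fd, msg, 0);
    if (n >= 0) return n;
    if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) {
      continue;  // transient condition: retry (optionally poll() or back off here)
    }
    return -1;  // real error; let the caller inspect errno
  }
}

int main() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;

  char payload[] = "hello";
  struct iovec iov;
  iov.iov_base = payload;
  iov.iov_len = sizeof(payload);
  struct msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;

  ssize_t sent = SendmsgRetry(fds[0], &msg);
  std::cout << "sent " << sent << " bytes\n";
  close(fds[0]);
  close(fds[1]);
  return sent >= 0 ? 0 : 1;
}
{code}
The same retry treatment applies to recvmsg(); a full fix would typically also bound the retries or wait with poll() instead of spinning.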
[jira] [Created] (ARROW-3159) [Plasma] Plasma C++ and Python integration test for tensors
Philipp Moritz created ARROW-3159: - Summary: [Plasma] Plasma C++ and Python integration test for tensors Key: ARROW-3159 URL: https://issues.apache.org/jira/browse/ARROW-3159 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz This is motivated by ARROW-3127, we should have an integration test for this to make sure it won't break in the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3157) [C++] Improve buffer creation for typed data
Philipp Moritz created ARROW-3157: - Summary: [C++] Improve buffer creation for typed data Key: ARROW-3157 URL: https://issues.apache.org/jira/browse/ARROW-3157 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz While looking into [https://github.com/apache/arrow/pull/2481], I noticed this pattern: {code:java} const uint8_t* bytes_array = reinterpret_cast<const uint8_t*>(input); auto buffer = std::make_shared<Buffer>(bytes_array, sizeof(float)*input_length);{code} It's not the end of the world but seems a little verbose to me. It would be great to have something like this: {code:java} auto buffer = MakeBuffer(input, input_length);{code} I couldn't find it; does it already exist somewhere? Any thoughts on the API? Potentially specializations to make a buffer out of a std::vector would also be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
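A minimal sketch of what such a helper could look like; the name MakeBuffer and the overloads are hypothetical, and only the arrow::Buffer(const uint8_t*, int64_t) constructor used in the snippet above is assumed.
{code:java}
#include <cstdint>
#include <memory>
#include <vector>

#include <arrow/buffer.h>

// Hypothetical convenience helpers wrapping typed data in an arrow::Buffer.
// Note: these wrap the caller's memory without copying, so the data must
// outlive the buffer (same caveat as the verbose pattern above).
template <typename T>
std::shared_ptr<arrow::Buffer> MakeBuffer(const T* data, int64_t length) {
  return std::make_shared<arrow::Buffer>(
      reinterpret_cast<const uint8_t*>(data),
      length * static_cast<int64_t>(sizeof(T)));
}

template <typename T>
std::shared_ptr<arrow::Buffer> MakeBuffer(const std::vector<T>& data) {
  return MakeBuffer(data.data(), static_cast<int64_t>(data.size()));
}

int main() {
  std::vector<float> input = {1.0f, 2.0f, 3.0f};
  auto buffer = MakeBuffer(input);  // 12 bytes, zero-copy view of the vector
  return buffer->size() == 12 ? 0 : 1;
}
{code}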
[jira] [Created] (ARROW-3116) [Plasma] Add "ls" to object store
Philipp Moritz created ARROW-3116: - Summary: [Plasma] Add "ls" to object store Key: ARROW-3116 URL: https://issues.apache.org/jira/browse/ARROW-3116 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Philipp Moritz Assignee: Philipp Moritz Add a facility to list all the objects in the store and information about them (object ids, sizes, number of clients using them etc.). This is very useful for debugging applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3105) [Plasma] Improve flushing error message
Philipp Moritz created ARROW-3105: - Summary: [Plasma] Improve flushing error message Key: ARROW-3105 URL: https://issues.apache.org/jira/browse/ARROW-3105 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.10.0 Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.11.0 This helps us diagnose the flushing policy better. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3062) [Python] Extend fast libtensorflow_framework.so compatibility workaround to Python 2.7
Philipp Moritz created ARROW-3062: - Summary: [Python] Extend fast libtensorflow_framework.so compatibility workaround to Python 2.7 Key: ARROW-3062 URL: https://issues.apache.org/jira/browse/ARROW-3062 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.10.0 Reporter: Philipp Moritz Assignee: Philipp Moritz The workaround from ARROW-2657 should be optimized a little bit to load libtensorflow_framework.so directly (instead of doing a full "import tensorflow"), also for Python 2.7. We are running into this because "import tensorflow" spawns a number of threads; without this optimization, using many Python processes with pyarrow will hit OS limits on the number of threads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3018) [Plasma] Improve random ObjectID generation
Philipp Moritz created ARROW-3018: - Summary: [Plasma] Improve random ObjectID generation Key: ARROW-3018 URL: https://issues.apache.org/jira/browse/ARROW-3018 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.10.0 Reporter: Philipp Moritz As pointed out by [~pitrou], the Mersenne Twister in Plasma is currently not seeded appropriately (I just saw the comment recently): https://github.com/apache/arrow/pull/2039 I can submit a patch for Plasma, but I'm also wondering if we should have a properly seeded random number generator in Arrow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2976) [Python] Directory in pyarrow.get_library_dirs() on Travis doesn't contain libarrow.so
Philipp Moritz created ARROW-2976: - Summary: [Python] Directory in pyarrow.get_library_dirs() on Travis doesn't contain libarrow.so Key: ARROW-2976 URL: https://issues.apache.org/jira/browse/ARROW-2976 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Concerning the way pyarrow is built in `travis_script_python.sh`: The directory in pyarrow.get_library_dirs() doesn't seem to contain libarrow.so. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2975) [Plasma] TensorFlow op: Compilation only working if arrow found by pkg-config
Philipp Moritz created ARROW-2975: - Summary: [Plasma] TensorFlow op: Compilation only working if arrow found by pkg-config Key: ARROW-2975 URL: https://issues.apache.org/jira/browse/ARROW-2975 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz Assignee: Philipp Moritz Currently the pyarrow/tensorflow/build.sh script uses pyarrow to discover the arrow libraries to link against. However, this is not working on the pip package of pyarrow (since the .pc files are not shipped with it). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2954) [Plasma] Store object_id only once in object table
Philipp Moritz created ARROW-2954: - Summary: [Plasma] Store object_id only once in object table Key: ARROW-2954 URL: https://issues.apache.org/jira/browse/ARROW-2954 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.10.0 This is the first part of ARROW-2953, i.e. the duplicated storage of the object id both in the key and the value of the object hash table. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2953) [Plasma] Store memory usage
Philipp Moritz created ARROW-2953: - Summary: [Plasma] Store memory usage Key: ARROW-2953 URL: https://issues.apache.org/jira/browse/ARROW-2953 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz While doing some memory profiling on the store, it became clear that at the moment the metadata of the objects takes up much more space than it should. In particular, for each object:
* The object id (20 bytes) is stored three times
* The object checksum (8 bytes) is stored twice
* data_size and metadata_size (each 8 bytes) are stored twice
We can therefore significantly reduce the metadata overhead with some refactoring. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2940) [Python] Import error with pytorch 0.3
Philipp Moritz created ARROW-2940: - Summary: [Python] Import error with pytorch 0.3 Key: ARROW-2940 URL: https://issues.apache.org/jira/browse/ARROW-2940 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz The fix in ARROW-2920 doesn't work in versions strictly before pytorch 0.4: {code:java} >>> import pyarrow Traceback (most recent call last): File "", line 1, in File "/home/ubuntu/arrow/python/pyarrow/__init__.py", line 57, in compat.import_pytorch_extension() File "/home/ubuntu/arrow/python/pyarrow/compat.py", line 249, in import_pytorch_extension ctypes.CDLL(os.path.join(path, "lib/libcaffe2.so")) File "/home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/ctypes/__init__.py", line 351, in __init__ self._handle = _dlopen(self._name, mode) OSError: /home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/site-packages/torch/lib/libcaffe2.so: cannot open shared object file: No such file or directory{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2920) [Python] Segfault with pytorch 0.4
Philipp Moritz created ARROW-2920: - Summary: [Python] Segfault with pytorch 0.4 Key: ARROW-2920 URL: https://issues.apache.org/jira/browse/ARROW-2920 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz See also [https://github.com/ray-project/ray/issues/2447] How to reproduce: * Start the Ubuntu Deep Learning AMI (version 12.0) on EC2 * Create a new env with {{conda create -y -n breaking-env python=3.5}} * Install pytorch with {{source activate breaking-env && conda install pytorch torchvision cuda91 -c pytorch}} * Compile and install manylinux1 pyarrow wheels from latest arrow master as described here: https://github.com/apache/arrow/blob/2876a3fdd1fb9ef6918b7214d6e8d1e3017b42ad/python/manylinux1/README.md * In the breaking-env just created, run the following: {code:java} Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35) [GCC 7.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pyarrow >>> import torch >>> torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, >>> bias=False).cuda() Segmentation fault (core dumped){code} Backtrace: {code:java} >>> torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, >>> bias=False).cuda() Program received signal SIGSEGV, Segmentation fault. 0x in ?? () (gdb) bt #0 0x in ?? () #1 0x77bc8a99 in __pthread_once_slow (once_control=0x7fffdb791e50 , init_routine=0x7fffe46aafe1 ) at pthread_once.c:116 #2 0x7fffda95c302 in at::Type::toBackend(at::Backend) const () from /home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/site-packages/torch/lib/libcaffe2.so #3 0x7fffdc59b231 in torch::autograd::VariableType::toBackend (this=, b=) at torch/csrc/autograd/generated/VariableType.cpp:145 #4 0x7fffdc8dbe8a in torch::autograd::THPVariable_cuda (self=0x76dbff78, args=0x76daf828, kwargs=0x0) at torch/csrc/autograd/generated/python_variable_methods.cpp:333 #5 0x5569f4e8 in PyCFunction_Call () #6 0x556f67cc in PyEval_EvalFrameEx () #7 0x556fbe08 in PyEval_EvalFrameEx () #8 0x556f6e90 in PyEval_EvalFrameEx () #9 0x556fbe08 in PyEval_EvalFrameEx () #10 0x5570103d in PyEval_EvalCodeEx () #11 0x55701f5c in PyEval_EvalCode () #12 0x5575e454 in run_mod () #13 0x5562ab5e in PyRun_InteractiveOneObject () #14 0x5562ad01 in PyRun_InteractiveLoopFlags () #15 0x5562ad62 in PyRun_AnyFileExFlags.cold.2784 () #16 0x5562b080 in Py_Main.cold.2785 () #17 0x5562b871 in main (){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2892) [Plasma] Implement interface to get Java arrow objects from Plasma
Philipp Moritz created ARROW-2892: - Summary: [Plasma] Implement interface to get Java arrow objects from Plasma Key: ARROW-2892 URL: https://issues.apache.org/jira/browse/ARROW-2892 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Currently we have a low level interface to access bytes stored in plasma from Java, using the JNI: [https://github.com/apache/arrow/pull/2065/] As a followup, we should implement reading (and writing) Java arrow objects from plasma, if possible using zero-copy. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2890) [Plasma] Make Python PlasmaClient.release private
Philipp Moritz created ARROW-2890: - Summary: [Plasma] Make Python PlasmaClient.release private Key: ARROW-2890 URL: https://issues.apache.org/jira/browse/ARROW-2890 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz It should normally not be called by the user, since it is automatically called upon buffer destruction, see also https://github.com/apache/arrow/blob/7d2fbeba31763c978d260a9771184a13a63aaaf7/python/pyarrow/_plasma.pyx#L222. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2866) [Plasma] TensorFlow op: Investigate outputting multiple output Tensors for the reading op
Philipp Moritz created ARROW-2866: - Summary: [Plasma] TensorFlow op: Investigate outputting multiple output Tensors for the reading op Key: ARROW-2866 URL: https://issues.apache.org/jira/browse/ARROW-2866 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz see discussion in https://github.com/apache/arrow/pull/2104#discussion_r197308266 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2811) [Python] Test serialization for determinism
Philipp Moritz created ARROW-2811: - Summary: [Python] Test serialization for determinism Key: ARROW-2811 URL: https://issues.apache.org/jira/browse/ARROW-2811 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz see discussion in https://github.com/apache/arrow/pull/2216 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2805) [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed
Philipp Moritz created ARROW-2805: - Summary: [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed Key: ARROW-2805 URL: https://issues.apache.org/jira/browse/ARROW-2805 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz TensorFlow version: 1.7 (GPU enabled but CUDA is not installed) tensorflow-gpu was installed via pip install ``` import ray File "/home/eric/Desktop/ray-private/python/ray/__init__.py", line 28, in import pyarrow # noqa: F401 File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/__init__.py", line 55, in compat.import_tensorflow_extension() File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/compat.py", line 193, in import_tensorflow_extension ctypes.CDLL(ext) File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__ self._handle = _dlopen(self._name, mode) OSError: libcublas.so.9.0: cannot open shared object file: No such file or directory ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2803) [C++] Put hashing function into src/arrow/util
Philipp Moritz created ARROW-2803: - Summary: [C++] Put hashing function into src/arrow/util Key: ARROW-2803 URL: https://issues.apache.org/jira/browse/ARROW-2803 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz See [https://github.com/apache/arrow/pull/2220] We should decide what our default go-to hash function should be (maybe murmur3?) and put it into src/arrow/util -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2794) [Plasma] Add Delete method for multiple objects
Philipp Moritz created ARROW-2794: - Summary: [Plasma] Add Delete method for multiple objects Key: ARROW-2794 URL: https://issues.apache.org/jira/browse/ARROW-2794 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz This improves efficiency since multiple objects can be deleted with a single RPC. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2788) [Plasma] Defining Delete semantics
Philipp Moritz created ARROW-2788: - Summary: [Plasma] Defining Delete semantics Key: ARROW-2788 URL: https://issues.apache.org/jira/browse/ARROW-2788 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz We should define what the semantics of Plasma's Delete operation are, especially in the presence of errors (an object in use is deleted, a non-existing object is deleted). My current take on this is the following: Delete should be a hint to the store to delete, so if the object is not present, it should be a no-op. If an object that is in use is deleted, the store should delete it as soon as the reference count goes to zero (it would also be ok, but less desirable in my opinion, to not delete it). I think this is a good application of the "Defining errors away" principle from John Ousterhout's book (A Philosophy of Software Design). Please comment in this thread if you have different opinions so we can discuss! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2758) [Plasma] Use Scope enum in Plasma
Philipp Moritz created ARROW-2758: - Summary: [Plasma] Use Scope enum in Plasma Key: ARROW-2758 URL: https://issues.apache.org/jira/browse/ARROW-2758 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz Fix For: 0.10.0 Modernize our usage of enums in plasma:
# add the "--scoped-enum" option to the FlatBuffers compiler invocation.
# change the old-style C++ enums to C++11 scoped enums.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
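For illustration, the difference between the old-style and C++11 scoped enums referred to above, in plain C++ (not the generated flatbuffers code; the enumerator names are made up):
{code:java}
#include <iostream>

// Old-style enum: enumerators leak into the enclosing namespace and
// implicitly convert to int.
enum PlasmaErrorOld { OK, ObjectExists, ObjectNotFound };

// C++11 scoped enum (what a scoped-enum compiler option would generate):
// enumerators must be qualified and do not implicitly convert to int.
enum class PlasmaError : int { OK, ObjectExists, ObjectNotFound };

int main() {
  int old_value = ObjectExists;                    // compiles: implicit conversion
  PlasmaError scoped = PlasmaError::ObjectExists;  // must be qualified
  int new_value = static_cast<int>(scoped);        // conversion is explicit
  std::cout << old_value << " " << new_value << "\n";
  return 0;
}
{code}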
[jira] [Created] (ARROW-2757) [Plasma] Huge pages test failing
Philipp Moritz created ARROW-2757: - Summary: [Plasma] Huge pages test failing Key: ARROW-2757 URL: https://issues.apache.org/jira/browse/ARROW-2757 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz See ``` === FAILURES === _ test_use_huge_pages __ @pytest.mark.skipif(not os.path.exists("/mnt/hugepages"), reason="requires hugepage support") def test_use_huge_pages(): import pyarrow.plasma as plasma with plasma.start_plasma_store( plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY, plasma_directory="/mnt/hugepages", use_hugepages=True) as (plasma_store_name, p): plasma_client = plasma.connect(plasma_store_name, "", 64) > create_object(plasma_client, 1) pyarrow/tests/test_plasma.py:773: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pyarrow/tests/test_plasma.py:79: in create_object seal=seal) pyarrow/tests/test_plasma.py:68: in create_object_with_id memory_buffer = client.create(object_id, data_size, metadata) pyarrow/_plasma.pyx:300: in pyarrow._plasma.PlasmaClient.create check_status(self.client.get().Create(object_id.data, data_size, _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > raise PlasmaStoreFull(message) E PlasmaStoreFull: /home/travis/build/apache/arrow/cpp/src/plasma/client.cc:375 code: ReadCreateReply(buffer.data(), buffer.size(), , , _fd, _size) E object does not fit in the plasma store ``` seems to be failing consistently since [https://github.com/apache/arrow/pull/2062] (which is unrelated) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2737) [Plasma] Integrate TensorFlow Op with arrow packaging scripts
Philipp Moritz created ARROW-2737: - Summary: [Plasma] Integrate TensorFlow Op with arrow packaging scripts Key: ARROW-2737 URL: https://issues.apache.org/jira/browse/ARROW-2737 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Not sure what is involved here and what the best steps forward are. We should first collect experience from deploying the current op with Ray and then see what the right deployment strategy is. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2629) [Plasma] Iterator invalidation for pending_notifications_
Philipp Moritz created ARROW-2629: - Summary: [Plasma] Iterator invalidation for pending_notifications_ Key: ARROW-2629 URL: https://issues.apache.org/jira/browse/ARROW-2629 Project: Apache Arrow Issue Type: Bug Components: Plasma (C++) Reporter: Philipp Moritz Fix For: 0.10.0 This was discovered when running the Ray integration tests. In send_notifications we are modifying pending_notifications_, which invalidates the iterator in the for-each loop in push_notification. It's not easy to reproduce, so unfortunately I don't have a regression test, but I'll post a patch that fixes it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
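As a generic illustration of the failure mode (this is not the actual Plasma code): modifying a container while a range-based for loop is iterating over it invalidates the loop's iterators; one common fix is to snapshot the items into a local container before processing them.
```
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<int, std::vector<std::string>> pending_notifications;

void send_notification(int client, const std::string& msg) {
  // Imagine this can re-enter and erase from / insert into pending_notifications,
  // e.g. when a client disconnects -- that is the invalidation hazard.
  std::cout << "client " << client << ": " << msg << "\n";
}

void push_notification_safe(const std::string& msg) {
  // Take a snapshot of the keys first, so any modification performed by
  // send_notification cannot invalidate the iterator we are looping with.
  std::vector<int> clients;
  clients.reserve(pending_notifications.size());
  for (const auto& kv : pending_notifications) clients.push_back(kv.first);
  for (int client : clients) send_notification(client, msg);
}

int main() {
  pending_notifications[1] = {};
  pending_notifications[2] = {};
  push_notification_safe("object sealed");
}
```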
[jira] [Created] (ARROW-2612) [Plasma] Fix deprecated PLASMA_DEFAULT_RELEASE_DELAY
Philipp Moritz created ARROW-2612: - Summary: [Plasma] Fix deprecated PLASMA_DEFAULT_RELEASE_DELAY Key: ARROW-2612 URL: https://issues.apache.org/jira/browse/ARROW-2612 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz The deprecated PLASMA_DEFAULT_RELEASE_DELAY is currently broken, since it refers to kDeprecatedPlasmaDefaultReleaseDelay without the plasma:: namespace qualifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
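A minimal sketch of the issue and the fix; the constant's value below is made up for illustration, only the missing qualifier matters:
```
#include <cstdint>

namespace plasma {
constexpr int64_t kDeprecatedPlasmaDefaultReleaseDelay = 64;  // value assumed for illustration
}

// Broken: refers to the constant without the plasma:: qualifier, so the macro
// does not compile at use sites outside the plasma namespace.
// #define PLASMA_DEFAULT_RELEASE_DELAY kDeprecatedPlasmaDefaultReleaseDelay

// Fixed: qualify the constant with its namespace.
#define PLASMA_DEFAULT_RELEASE_DELAY plasma::kDeprecatedPlasmaDefaultReleaseDelay

int main() { return PLASMA_DEFAULT_RELEASE_DELAY == 64 ? 0 : 1; }
```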
[jira] [Created] (ARROW-2611) [Python] Python 2 integer serialization
Philipp Moritz created ARROW-2611: - Summary: [Python] Python 2 integer serialization Key: ARROW-2611 URL: https://issues.apache.org/jira/browse/ARROW-2611 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.9.0 Reporter: Philipp Moritz In Python 2, serializing a Python int with pyarrow.serialize and then deserializing it returns a {{long}} instead of an integer. Note that this is not an issue in Python 3, where the long type does not exist. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2595) [Plasma] operator[] creates entries in map
Philipp Moritz created ARROW-2595: - Summary: [Plasma] operator[] creates entries in map Key: ARROW-2595 URL: https://issues.apache.org/jira/browse/ARROW-2595 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz Problem: Using object_get_requests_[object_id] in PlasmaStore::return_from_get will produce a lot of garbage data in the map. During the measurement process, we found a lot of memory growth at this point. Solution: Use an iterator instead of operator[]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
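A small self-contained example of why operator[] is problematic here: it default-constructs and inserts an entry for every key it is asked about, whereas find() only performs a lookup (generic code, not the actual Plasma map):
```
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  std::unordered_map<std::string, std::vector<int>> requests;

  // operator[] inserts an empty vector for a key that was never added,
  // which is exactly the kind of garbage entry described above.
  if (requests["never-added"].empty()) { /* entry now exists anyway */ }
  std::cout << "size after operator[]: " << requests.size() << "\n";  // 1

  // find() performs a pure lookup and leaves the map untouched.
  requests.clear();
  auto it = requests.find("never-added");
  if (it == requests.end()) { /* nothing inserted */ }
  std::cout << "size after find():     " << requests.size() << "\n";  // 0
}
```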
[jira] [Created] (ARROW-2577) [Plasma] Add ASV benchmarks
Philipp Moritz created ARROW-2577: - Summary: [Plasma] Add ASV benchmarks Key: ARROW-2577 URL: https://issues.apache.org/jira/browse/ARROW-2577 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz We are about to merge some PRs that potentially impact plasma performance, so we should set up ASV to track the changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2542) [Plasma] Refactor object notification code
Philipp Moritz created ARROW-2542: - Summary: [Plasma] Refactor object notification code Key: ARROW-2542 URL: https://issues.apache.org/jira/browse/ARROW-2542 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Replace unique_ptr with vector. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2541) [Plasma] Clean up macro usage
Philipp Moritz created ARROW-2541: - Summary: [Plasma] Clean up macro usage Key: ARROW-2541 URL: https://issues.apache.org/jira/browse/ARROW-2541 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz There are still a lot of macros being used as constants in the plasma codebase. This should be cleaned up and replaced with constexpr (deprecating them where appropriate). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
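A minimal generic example of the kind of cleanup meant here; the identifier and value are illustrative, not actual Plasma constants:
```
#include <cstdint>

// Before: a macro used as a constant -- untyped and unscoped.
// #define EXAMPLE_BUFFER_SIZE 4096

// After: a typed, namespaced constexpr constant; the old macro can be kept
// around temporarily (and marked deprecated) if external code still uses it.
namespace plasma {
constexpr int64_t kExampleBufferSize = 4096;
}

#define EXAMPLE_BUFFER_SIZE plasma::kExampleBufferSize  // deprecated alias (illustrative)

int main() { return EXAMPLE_BUFFER_SIZE == 4096 ? 0 : 1; }
```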
[jira] [Created] (ARROW-2508) [Python] pytest API changes make tests fail
Philipp Moritz created ARROW-2508: - Summary: [Python] pytest API changes make tests fail Key: ARROW-2508 URL: https://issues.apache.org/jira/browse/ARROW-2508 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Philipp Moritz Seems like there is a new pytest on PyPI; it produces the following failures:
```
=== FAILURES ===
__ TestConvertDateTimeLikeTypes.test_pandas_datetime_to_date64_failures[None] __

self =
mask = None

    @pytest.mark.parametrize('mask', [
        None,
        np.ones(3),
        np.array([True, False, False])
    ])
    def test_pandas_datetime_to_date64_failures(self, mask):
        s = pd.to_datetime([
            '2018-05-10T10:24:01',
            '2018-05-11T10:24:01',
            '2018-05-12T10:24:01',
        ])
        expected_msg = 'Timestamp value had non-zero intraday milliseconds'
>       with pytest.raises(pa.ArrowInvalid, msg=expected_msg):
E       TypeError: Unexpected keyword arguments passed to pytest.raises: msg

pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_convert_pandas.py:862: TypeError
_ TestConvertDateTimeLikeTypes.test_pandas_datetime_to_date64_failures[mask1] __

self =
mask = array([ 1., 1., 1.])

    @pytest.mark.parametrize('mask', [
        None,
        np.ones(3),
        np.array([True, False, False])
    ])
    def test_pandas_datetime_to_date64_failures(self, mask):
        s = pd.to_datetime([
            '2018-05-10T10:24:01',
            '2018-05-11T10:24:01',
            '2018-05-12T10:24:01',
        ])
        expected_msg = 'Timestamp value had non-zero intraday milliseconds'
>       with pytest.raises(pa.ArrowInvalid, msg=expected_msg):
E       TypeError: Unexpected keyword arguments passed to pytest.raises: msg

pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_convert_pandas.py:862: TypeError
_ TestConvertDateTimeLikeTypes.test_pandas_datetime_to_date64_failures[mask2] __

self =
mask = array([ True, False, False], dtype=bool)

    @pytest.mark.parametrize('mask', [
        None,
        np.ones(3),
        np.array([True, False, False])
    ])
    def test_pandas_datetime_to_date64_failures(self, mask):
        s = pd.to_datetime([
            '2018-05-10T10:24:01',
            '2018-05-11T10:24:01',
            '2018-05-12T10:24:01',
        ])
        expected_msg = 'Timestamp value had non-zero intraday milliseconds'
>       with pytest.raises(pa.ArrowInvalid, msg=expected_msg):
E       TypeError: Unexpected keyword arguments passed to pytest.raises: msg

pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_convert_pandas.py:862: TypeError
=== short test summary info
```
I think we can just change msg to message and it should work again. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2506) [Plasma] Build error on macOS
Philipp Moritz created ARROW-2506: - Summary: [Plasma] Build error on macOS Key: ARROW-2506 URL: https://issues.apache.org/jira/browse/ARROW-2506 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Since the upgrade to flatbuffers 1.9.0, I'm seeing this error on the Ray CI: arrow/cpp/src/plasma/format/plasma.fbs:234:0: error: default value of 0 for field status is not part of enum ObjectStatus I'm planning to just remove the '= 1' from 'Local = 1'. This will break the protocol however, so if we prefer to just put in a 'Dummy = 0' object at the beginning of the enum, that would also be fine with me. However, the ObjectStatus API is not stable yet and not even exposed to Python, so I think breaking it is fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2458) [Plasma] PlasmaClient uses global variable
Philipp Moritz created ARROW-2458: - Summary: [Plasma] PlasmaClient uses global variable Key: ARROW-2458 URL: https://issues.apache.org/jira/browse/ARROW-2458 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.9.0 Reporter: Philipp Moritz The thread pool threadpool_ that PlasmaClient is using is currently a global variable. This prevents us from using multiple PlasmaClients in the same process (one per thread). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
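A rough sketch of the direction implied here, with placeholder types: make the pool a member of the client rather than a file-scope global, so that each PlasmaClient instance owns its own threads. The class name and the work the threads do are assumptions for illustration.
```
#include <thread>
#include <vector>

// Before (sketch): a file-scope global shared by every client in the process.
// static std::vector<std::thread> threadpool_;

// After (sketch): each client owns its own pool, so one client per thread works.
class PlasmaClientLike {
 public:
  explicit PlasmaClientLike(int num_threads) {
    for (int i = 0; i < num_threads; ++i) {
      threadpool_.emplace_back([] { /* per-client copy work would go here */ });
    }
  }
  ~PlasmaClientLike() {
    for (auto& t : threadpool_) t.join();
  }

 private:
  std::vector<std::thread> threadpool_;  // per-instance, not global
};

int main() {
  PlasmaClientLike a(2), b(2);  // two independent clients in one process
}
```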
[jira] [Created] (ARROW-2386) [Plasma] Change PlasmaClient::Create API
Philipp Moritz created ARROW-2386: - Summary: [Plasma] Change PlasmaClient::Create API Key: ARROW-2386 URL: https://issues.apache.org/jira/browse/ARROW-2386 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz Now that the Get API is refactored in [https://github.com/apache/arrow/pull/1807], we should do the same for the Create API. Proposal: Have a MutablePlasmaBuffer class, which is returned by Create:
{code:java}
Status Create(int64_t data_size, int64_t metadata_size, std::shared_ptr<MutablePlasmaBuffer>* buffer)
{code}
This allocates the data in shared memory, but does not associate it with the object id yet. This way we can get rid of the Abort() call. Move the Seal() method into the MutablePlasmaBuffer and let it return the object ID. This is very similar to what [~pitrou] suggested here: https://github.com/apache/arrow/pull/1807 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
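A toy, self-contained model of the proposed Create/Seal flow; the class and method names follow the proposal above, but everything else (the ID generation, the storage) is a stand-in and not the real Plasma API:
```
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Sketch: Create() hands out writable memory that is not yet tied to an object
// ID; Seal() binds the ID and publishes the object, so an unsealed buffer can
// simply be dropped instead of needing an explicit Abort() call.
using ObjectID = std::string;

class MutablePlasmaBuffer {
 public:
  explicit MutablePlasmaBuffer(int64_t size) : data_(size) {}
  uint8_t* mutable_data() { return data_.data(); }
  int64_t size() const { return static_cast<int64_t>(data_.size()); }
  ObjectID Seal() {
    sealed_ = true;
    return "toy-id-" + std::to_string(data_.size());  // stand-in for real ID generation
  }
  bool sealed() const { return sealed_; }

 private:
  std::vector<uint8_t> data_;
  bool sealed_ = false;
};

std::shared_ptr<MutablePlasmaBuffer> Create(int64_t data_size) {
  return std::make_shared<MutablePlasmaBuffer>(data_size);
}

int main() {
  auto buffer = Create(128);
  buffer->mutable_data()[0] = 42;  // fill the object in place
  ObjectID id = buffer->Seal();    // the ID is only bound at seal time
  std::cout << id << " sealed=" << buffer->sealed() << "\n";
}
```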
[jira] [Created] (ARROW-2215) [Plasma] Error when using huge pages
Philipp Moritz created ARROW-2215: - Summary: [Plasma] Error when using huge pages Key: ARROW-2215 URL: https://issues.apache.org/jira/browse/ARROW-2215 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz see https://github.com/ray-project/ray/issues/1592 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store
Philipp Moritz created ARROW-2195: - Summary: [Plasma] Segfault when retrieving RecordBatch from plasma store Key: ARROW-2195 URL: https://issues.apache.org/jira/browse/ARROW-2195 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz It can be reproduced with the following script:
```
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)
    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()
    print(batch)
    print(batch.schema)
    print(batch[0])
    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())
batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())
stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])
```
Preliminary backtrace:
```
CESS (code=1, address=0x38158)
    frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
    0x10e645800 <+32>: callq  0x10e698170  ; symbol stub for: PyInt_FromLong
    0x10e645805 <+37>: testq  %rax, %rax
    0x10e645808 <+40>: je     0x10e64580c  ; <+44>
(lldb) bt
* thread #1: tid = 0xf1378e, 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x38158)
  * frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
    frame #2: 0x00010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
```
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2127) [Plasma] Transfer of objects between CPUs and GPUs
Philipp Moritz created ARROW-2127: - Summary: [Plasma] Transfer of objects between CPUs and GPUs Key: ARROW-2127 URL: https://issues.apache.org/jira/browse/ARROW-2127 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz It should be possible to transfer an object that was created on the CPU to the GPU and vice versa. One natural implementation is to introduce a flag to plasma::Get that specifies where the object should end up and then transfer the object under the hood and return the appropriate buffer. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
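A minimal sketch of the flag-based Get idea described above; the enum, the buffer type, and the transfer logic are placeholders, not actual Plasma or CUDA code:
```
#include <cstdint>
#include <iostream>
#include <vector>

// Placeholder types -- illustrative only.
enum class Device { kCpu, kGpu };

struct ToyBuffer {
  std::vector<uint8_t> bytes;
  Device device;
};

// Sketch of a Get() that takes a target device: if the object lives on the
// other device, transfer it under the hood and hand back a buffer on the
// requested device.
ToyBuffer Get(const ToyBuffer& stored, Device target) {
  if (stored.device == target) return stored;
  ToyBuffer moved = stored;  // stand-in for a device-to-device/host copy
  moved.device = target;
  return moved;
}

int main() {
  ToyBuffer cpu_object{{1, 2, 3}, Device::kCpu};
  ToyBuffer on_gpu = Get(cpu_object, Device::kGpu);
  std::cout << (on_gpu.device == Device::kGpu) << "\n";  // 1
}
```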
[jira] [Created] (ARROW-2126) [Plasma] Hashing for GPU objects
Philipp Moritz created ARROW-2126: - Summary: [Plasma] Hashing for GPU objects Key: ARROW-2126 URL: https://issues.apache.org/jira/browse/ARROW-2126 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz We should have a CUDA function that computes a hash for objects, similar to the way it is done for CPU objects at the moment. Is there a fast hash/checksum function available for CUDA, similar to xxhash? Maybe this can be implemented as an arrow::compute kernel? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2125) [Plasma] Implement eviction policy for GPU objects
Philipp Moritz created ARROW-2125: - Summary: [Plasma] Implement eviction policy for GPU objects Key: ARROW-2125 URL: https://issues.apache.org/jira/browse/ARROW-2125 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz This is a followup to https://github.com/apache/arrow/pull/1445 Right now, objects allocated on GPUs are never evicted. There should be a flag specifying the maximum amount of GPU memory that Plasma may use. If this limit is exceeded, objects should be evicted according to the eviction policy (which is pluggable). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
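A compact sketch of the cap-plus-eviction idea, assuming an LRU-style policy as a stand-in; this is a toy model, not the pluggable policy interface in the store:
```
#include <cstdint>
#include <iostream>
#include <list>
#include <string>
#include <utility>

// Toy LRU eviction under a fixed GPU memory cap (illustrative only).
struct GpuEvictionPolicy {
  int64_t capacity_bytes = 0;
  int64_t used_bytes = 0;
  std::list<std::pair<std::string, int64_t>> lru;  // front = least recently used

  // Returns false if the object can never fit; otherwise evicts until it fits.
  bool RequireSpace(int64_t request_bytes) {
    if (request_bytes > capacity_bytes) return false;
    while (used_bytes + request_bytes > capacity_bytes && !lru.empty()) {
      std::cout << "evicting " << lru.front().first << "\n";
      used_bytes -= lru.front().second;  // evict the coldest object
      lru.pop_front();
    }
    return used_bytes + request_bytes <= capacity_bytes;
  }

  void Add(const std::string& id, int64_t size) {
    lru.emplace_back(id, size);
    used_bytes += size;
  }
};

int main() {
  GpuEvictionPolicy policy;
  policy.capacity_bytes = 1000;
  policy.RequireSpace(600); policy.Add("a", 600);
  policy.RequireSpace(600); policy.Add("b", 600);        // evicts "a" first
  std::cout << "used = " << policy.used_bytes << "\n";   // 600
}
```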
[jira] [Created] (ARROW-2042) [Plasma] Revert API change of plasma::Create to output a MutableBuffer
Philipp Moritz created ARROW-2042: - Summary: [Plasma] Revert API change of plasma::Create to output a MutableBuffer Key: ARROW-2042 URL: https://issues.apache.org/jira/browse/ARROW-2042 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz Assignee: Philipp Moritz Reverts a part of the changes from [https://github.com/apache/arrow/pull/1479] concerning the plasma::Create API. It should output a shared pointer to a Buffer instead of a shared pointer to a MutableBuffer. This is needed for [https://github.com/apache/arrow/pull/1445] so we can return a CudaBuffer from plasma::Create. It also seems to be more in line with how Buffers are intended to be used and avoids API breakage from 0.8.0 to 0.9.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-1944) FindArrow has wrong ARROW_STATIC_LIB
Philipp Moritz created ARROW-1944: - Summary: FindArrow has wrong ARROW_STATIC_LIB Key: ARROW-1944 URL: https://issues.apache.org/jira/browse/ARROW-1944 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.8.0 Reporter: Philipp Moritz It seems that in https://github.com/apache/arrow/blob/a0555c04dd5c43230a1c50d0d0a94e06d8ad9ff0/cpp/cmake_modules/FindArrow.cmake#L100 ARROW_PYTHON_LIB_PATH should be replaced with ARROW_LIBS -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1927) [Plasma] Implement delete function
Philipp Moritz created ARROW-1927: - Summary: [Plasma] Implement delete function Key: ARROW-1927 URL: https://issues.apache.org/jira/browse/ARROW-1927 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++), Python Reporter: Philipp Moritz The function should check if the reference count of the object is zero and, if so, delete it from the store. If not, it should raise an exception or return an error status. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1924) [Python] Bring back pickle=True option for serialization
Philipp Moritz created ARROW-1924: - Summary: [Python] Bring back pickle=True option for serialization Key: ARROW-1924 URL: https://issues.apache.org/jira/browse/ARROW-1924 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz We need to revert https://issues.apache.org/jira/browse/ARROW-1758 The reason is that the semantics of pickle=True are slightly different from just using (cloud-)pickle as the custom serializer: If pickle=True is used, the object can be deserialized in any process, even if a deserializer for that type_id has not been registered in that process. On the other hand, if (cloud-)pickle is used as a custom serializer, the object can only be deserialized if pyarrow has the type_id registered and can call the deserializer. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1919) Plasma hanging if object id is not 20 bytes
Philipp Moritz created ARROW-1919: - Summary: Plasma hanging if object id is not 20 bytes Key: ARROW-1919 URL: https://issues.apache.org/jira/browse/ARROW-1919 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz Assignee: Philipp Moritz Priority: Minor This happens when Plasma's capability to put an object with a user-defined object ID is used and the object ID is not 20 bytes long. Plasma will hang on Get in that case; we should return an error instead. See https://github.com/ray-project/ray/issues/1315 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
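A sketch of the kind of up-front validation suggested here; the 20-byte length comes from the issue description, while the constant and function names are assumptions for illustration:
```
#include <iostream>
#include <string>

// Plasma object IDs are 20 bytes; rejecting other lengths up front on the
// client side avoids the hang described in this issue.
constexpr size_t kObjectIdSize = 20;  // assumed constant name

bool ValidateObjectId(const std::string& binary_id, std::string* error) {
  if (binary_id.size() != kObjectIdSize) {
    *error = "object id must be " + std::to_string(kObjectIdSize) +
             " bytes, got " + std::to_string(binary_id.size());
    return false;
  }
  return true;
}

int main() {
  std::string error;
  if (!ValidateObjectId("too-short", &error)) std::cout << error << "\n";
}
```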
[jira] [Created] (ARROW-1853) [Plasma] Fix off-by-one error in retry processing
Philipp Moritz created ARROW-1853: - Summary: [Plasma] Fix off-by-one error in retry processing Key: ARROW-1853 URL: https://issues.apache.org/jira/browse/ARROW-1853 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz Priority: Minor Fix For: 0.8.0 When a user constructs a Plasma client that should not perform a single retry by passing num_retries = 0, nothing happens due to an off-by-one error in the retry processing. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
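A generic illustration of this class of bug; the loop below is not the actual Plasma retry code, it only shows how a `<` versus `<=` bound changes the behavior for num_retries = 0:
```
#include <iostream>

// Generic illustration of an off-by-one in retry handling (not the actual
// Plasma code): the intent is "1 initial attempt + num_retries retries".
bool AttemptConnect() { return false; }  // stand-in that always fails

int ConnectWithRetries(int num_retries) {
  int attempts = 0;
  // Buggy variant: `attempts < num_retries` makes zero attempts when
  // num_retries == 0. Using `<=` gives the intended single attempt.
  while (attempts <= num_retries) {
    ++attempts;
    if (AttemptConnect()) break;
  }
  return attempts;
}

int main() {
  std::cout << ConnectWithRetries(0) << "\n";  // 1 attempt, no retries
}
```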
[jira] [Created] (ARROW-1758) [Python] Remove pickle=True option for object serialization
Philipp Moritz created ARROW-1758: - Summary: [Python] Remove pickle=True option for object serialization Key: ARROW-1758 URL: https://issues.apache.org/jira/browse/ARROW-1758 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz As pointed out in https://github.com/apache/arrow/pull/1272#issuecomment-340738439, we don't really need this option; the same thing can already be done with pickle.dumps as the custom serializer and pickle.loads as the deserializer. This has the additional benefit that it is very clear to the user which pickler is used, and the user can easily substitute a custom pickler. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1744) [Plasma] Provide TensorFlow operator to read tensors from plasma
Philipp Moritz created ARROW-1744: - Summary: [Plasma] Provide TensorFlow operator to read tensors from plasma Key: ARROW-1744 URL: https://issues.apache.org/jira/browse/ARROW-1744 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz see https://www.tensorflow.org/extend/adding_an_op -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1701) [Serialization] Support zero copy PyTorch Tensor serialization
Philipp Moritz created ARROW-1701: - Summary: [Serialization] Support zero copy PyTorch Tensor serialization Key: ARROW-1701 URL: https://issues.apache.org/jira/browse/ARROW-1701 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz see http://pytorch.org/docs/master/tensors.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1695) [Serialization] Fix reference counting of numpy arrays created in custom serializer
Philipp Moritz created ARROW-1695: - Summary: [Serialization] Fix reference counting of numpy arrays created in custom serializer Key: ARROW-1695 URL: https://issues.apache.org/jira/browse/ARROW-1695 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.7.1 Reporter: Philipp Moritz Fix For: 0.8.0 The problem happens with the following code:
{code}
import numpy as np
import pyarrow
import sys

class Bar(object):
    pass

def bar_custom_serializer(obj):
    x = np.zeros(4)
    return x

def bar_custom_deserializer(serialized_obj):
    return serialized_obj

pyarrow._default_serialization_context.register_type(
    Bar, "Bar", pickle=False,
    custom_serializer=bar_custom_serializer,
    custom_deserializer=bar_custom_deserializer)

pyarrow.serialize(Bar())
{code}
After execution of pyarrow.serialize, the interpreter crashes in the garbage collection routine. This happens if a numpy array is returned in the custom serializer but there is no other reference to the numpy array. The reason this is not a problem in the current code is that so far we haven't created new numpy arrays in the custom serializer. I think the problem here is that the numpy array hits reference count zero between the end of SerializeSequences in python_to_arrow.cc and the call to NdarrayToTensor. I'll push a fix later today, which just increases and decreases the reference counts at the appropriate places. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1692) [Python, Java] UnionArray round trip not working
Philipp Moritz created ARROW-1692: - Summary: [Python, Java] UnionArray round trip not working Key: ARROW-1692 URL: https://issues.apache.org/jira/browse/ARROW-1692 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz Attachments: union_array.arrow I'm currently working on making pyarrow.serialization data available from the Java side; one problem I was running into is that it seems the Java implementation cannot read UnionArrays generated from C++. To make this easily reproducible I created a clean Python implementation for creating UnionArrays: https://github.com/apache/arrow/pull/1216 The data is generated with the following script:
```
import pyarrow as pa

binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
batch = pa.RecordBatch.from_arrays([result], ["test"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
sink.close()
b = sink.get_result()

with open("union_array.arrow", "wb") as f:
    f.write(b)

# Sanity check: Read the batch in again
with open("union_array.arrow", "rb") as f:
    b = f.read()
reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
batch = reader.read_next_batch()
print("union array is", batch.column(0))
```
I attached the file generated by that script. Then when I run the following code in Java:
```
RootAllocator allocator = new RootAllocator(10);
ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));
ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
reader.loadNextBatch()
```
I get the following error:
```
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can not truncate buffer to a larger size 7: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#7:1)
```
It seems like Java is not picking up that the UnionArray is Dense instead of Sparse. After changing the default in java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I get this:
```
jshell> reader.getVectorSchemaRoot().getSchema()
$9 ==> Schema
```
but then reading doesn't work:
```
jshell> reader.loadNextBatch()
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field list: Union(Dense, [1])<: Struct>>>. error message: can not truncate buffer to a larger size 1: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#8:1)
```
Any help with this is appreciated! -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1687) [Python] Expose UnionArray to pyarrow
Philipp Moritz created ARROW-1687: - Summary: [Python] Expose UnionArray to pyarrow Key: ARROW-1687 URL: https://issues.apache.org/jira/browse/ARROW-1687 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz We should expose UnionArray to Python via pyarrow. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1677) [Blog] Add blog post on Ray and Arrow Python serialization
Philipp Moritz created ARROW-1677: - Summary: [Blog] Add blog post on Ray and Arrow Python serialization Key: ARROW-1677 URL: https://issues.apache.org/jira/browse/ARROW-1677 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz To give pyarrow.serialization some more exposure and get others involved. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1673) [Python] NumPy boolean arrays get converted to uint8 arrays on NdarrayToTensor roundtrip
Philipp Moritz created ARROW-1673: - Summary: [Python] NumPy boolean arrays get converted to uint8 arrays on NdarrayToTensor roundtrip Key: ARROW-1673 URL: https://issues.apache.org/jira/browse/ARROW-1673 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz see https://github.com/ray-project/ray/issues/1121 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1670) [Serialization] Speed up deserialization code path
Philipp Moritz created ARROW-1670: - Summary: [Serialization] Speed up deserialization code path Key: ARROW-1670 URL: https://issues.apache.org/jira/browse/ARROW-1670 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz Priority: Minor At the moment we are using smart pointers for keeping track of UnionArray types and values. We can get rid of this overhead. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1665) [Serialization] Support more custom datatypes in the default serialization context
Philipp Moritz created ARROW-1665: - Summary: [Serialization] Support more custom datatypes in the default serialization context Key: ARROW-1665 URL: https://issues.apache.org/jira/browse/ARROW-1665 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz At the moment, custom types are registered in the tests in an ad-hoc way. Instead, they should use the default serialization context introduced in ARROW-1503 to make it possible to reuse the same code in other projects. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1630) [Serialization] Support Python datetime objects
Philipp Moritz created ARROW-1630: - Summary: [Serialization] Support Python datetime objects Key: ARROW-1630 URL: https://issues.apache.org/jira/browse/ARROW-1630 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz This was brought up in https://github.com/ray-project/ray/issues/1041 It is related but not the same as https://issues.apache.org/jira/projects/ARROW/issues/ARROW-1628 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1625) [Serialization] Support OrderedDict properly
Philipp Moritz created ARROW-1625: - Summary: [Serialization] Support OrderedDict properly Key: ARROW-1625 URL: https://issues.apache.org/jira/browse/ARROW-1625 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Philipp Moritz At the moment, when we serialize an OrderedDict and then deserialize it, it becomes a normal dict! This can be reproduced with
{code}
import pyarrow
import collections

d = collections.OrderedDict([("hello", 1), ("world", 2)])
type(pyarrow.serialize(d).deserialize())
{code}
which will return "dict". See also https://github.com/ray-project/ray/issues/1034. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1622) [Plasma] Plasma doesn't compile with XCode 9
Philipp Moritz created ARROW-1622: - Summary: [Plasma] Plasma doesn't compile with XCode 9 Key: ARROW-1622 URL: https://issues.apache.org/jira/browse/ARROW-1622 Project: Apache Arrow Issue Type: Bug Components: Plasma (C++) Reporter: Philipp Moritz Compiling the latest arrow with the following commands:
```
cmake -DARROW_PLASMA=on ..
make
```
we get this error:
```
[ 61%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/client.cc.o
In file included from /Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/plasma/client.cc:20:
In file included from /Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/plasma/client.h:31:
In file included from /Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/plasma/common.h:30:
In file included from /Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/arrow/util/logging.h:22:
In file included from /Library/Developer/CommandLineTools/usr/include/c++/v1/iostream:38:
In file included from /Library/Developer/CommandLineTools/usr/include/c++/v1/ios:216:
In file included from /Library/Developer/CommandLineTools/usr/include/c++/v1/__locale:18:
In file included from /Library/Developer/CommandLineTools/usr/include/c++/v1/mutex:189:
In file included from /Library/Developer/CommandLineTools/usr/include/c++/v1/__mutex_base:17:
/Library/Developer/CommandLineTools/usr/include/c++/v1/__threading_support:156:1: error: unknown type name 'mach_port_t'
mach_port_t __libcpp_thread_get_port();
^
/Library/Developer/CommandLineTools/usr/include/c++/v1/__threading_support:300:1: error: unknown type name 'mach_port_t'
mach_port_t __libcpp_thread_get_port() {
^
/Library/Developer/CommandLineTools/usr/include/c++/v1/__threading_support:301:12: error: use of undeclared identifier 'pthread_mach_thread_np'
return pthread_mach_thread_np(pthread_self());
^
3 errors generated.
make[2]: *** [src/plasma/CMakeFiles/plasma_objlib.dir/client.cc.o] Error 1
make[1]: *** [src/plasma/CMakeFiles/plasma_objlib.dir/all] Error 2
make: *** [all] Error 2
```
The problem was discovered and diagnosed in https://github.com/apache/arrow/pull/1139 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1480) [Python] Improve performance of serializing sets
Philipp Moritz created ARROW-1480: - Summary: [Python] Improve performance of serializing sets Key: ARROW-1480 URL: https://issues.apache.org/jira/browse/ARROW-1480 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz See this: https://github.com/ray-project/ray/issues/938 There is a PR here which I'll submit: https://github.com/apache/arrow/compare/master...pcmoritz:serialize-sets Let me know what you think! I think supporting sets natively is good; we may also want a good way to support efficient serialization of more general iterables without converting them to a list. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1457) [C++] Optimize strided WriteTensor
Philipp Moritz created ARROW-1457: - Summary: [C++] Optimize strided WriteTensor Key: ARROW-1457 URL: https://issues.apache.org/jira/browse/ARROW-1457 Project: Apache Arrow Issue Type: Bug Reporter: Philipp Moritz At the moment, if we call WriteTensor on a strided Tensor, it will write the tensor element by element; this can be optimized by combining multiple consecutive writes together. If there are long stretches of contiguous data, this might even be able to take advantage of the multithreaded memory copy we have in the FixedSizeBufferWriter. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
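A small sketch of the coalescing idea: instead of writing element by element, detect when the innermost dimension is contiguous in memory and copy each such run in one call. The 2-D layout below is a simplified stand-in for arrow::Tensor, not the actual writer code:
```
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Simplified 2-D strided tensor; strides are in elements.
struct StridedTensor {
  const int64_t* data;
  int64_t rows, cols;
  int64_t row_stride, col_stride;
};

// Writes the tensor into `out`. If the innermost dimension is contiguous
// (col_stride == 1), each row is copied with a single memcpy instead of
// one write per element.
void WriteTensor(const StridedTensor& t, std::vector<int64_t>* out) {
  out->resize(t.rows * t.cols);
  for (int64_t r = 0; r < t.rows; ++r) {
    const int64_t* src = t.data + r * t.row_stride;
    int64_t* dst = out->data() + r * t.cols;
    if (t.col_stride == 1) {
      std::memcpy(dst, src, t.cols * sizeof(int64_t));  // one contiguous run
    } else {
      for (int64_t c = 0; c < t.cols; ++c) dst[c] = src[c * t.col_stride];
    }
  }
}

int main() {
  // A 2x3 view with a padded row stride of 4 elements: each row is contiguous,
  // but the tensor as a whole is not.
  std::vector<int64_t> storage = {1, 2, 3, 0, 4, 5, 6, 0};
  StridedTensor t{storage.data(), 2, 3, 4, 1};
  std::vector<int64_t> out;
  WriteTensor(t, &out);
  for (int64_t v : out) std::cout << v << " ";  // 1 2 3 4 5 6
  std::cout << "\n";
}
```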
[jira] [Created] (ARROW-1453) [Python] Implement WriteTensor for non-contiguous tensors
Philipp Moritz created ARROW-1453: - Summary: [Python] Implement WriteTensor for non-contiguous tensors Key: ARROW-1453 URL: https://issues.apache.org/jira/browse/ARROW-1453 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.6.0 Reporter: Philipp Moritz Priority: Minor This should be implemented: https://github.com/apache/arrow/blob/5cda6934999f9f79368f3fc3f68895fc0f4e0b24/cpp/src/arrow/ipc/writer.cc#L569 It is needed to support non-contiguous arrays in the Python serialization module. -- This message was sent by Atlassian JIRA (v6.4.14#64029)