[jira] [Created] (ARROW-7004) [Plasma] Make it possible to bump up object in LRU cache

2019-10-28 Thread Philipp Moritz (Jira)
Philipp Moritz created ARROW-7004:
-

 Summary: [Plasma] Make it possible to bump up object in LRU cache
 Key: ARROW-7004
 URL: https://issues.apache.org/jira/browse/ARROW-7004
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Philipp Moritz
Assignee: Philipp Moritz


To avoid evicting objects too early, we sometimes want to bump a number of 
objects up in the LRU cache. While it would be possible to call Get() on these 
objects, this can be undesirable, since Get() blocks if the objects are not 
available and makes it necessary to call Release() on them afterwards.
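The desired operation can be sketched with a toy LRU structure (a conceptual Python sketch; the class and the `bump` method are hypothetical names, not Plasma's actual eviction-policy API):

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache illustrating a non-blocking "bump" operation."""

    def __init__(self):
        self._entries = OrderedDict()  # oldest first, newest last

    def add(self, object_id, obj):
        self._entries[object_id] = obj

    def bump(self, object_id):
        # Move the object to the most-recently-used end without blocking
        # and without taking a reference, unlike Get().
        if object_id in self._entries:
            self._entries.move_to_end(object_id)
            return True
        return False  # absent: no blocking, just report failure

    def evict_oldest(self):
        # Pop the least-recently-used entry.
        return self._entries.popitem(last=False)

cache = LRUCache()
cache.add("a", 1)
cache.add("b", 2)
cache.bump("a")                  # "a" becomes most recently used
evicted_id, _ = cache.evict_oldest()
print(evicted_id)                # → b
```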



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-5904) [Java] [Plasma] Fix compilation of Plasma Java client

2019-07-10 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5904:
-

 Summary: [Java] [Plasma] Fix compilation of Plasma Java client
 Key: ARROW-5904
 URL: https://issues.apache.org/jira/browse/ARROW-5904
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This is broken since the introduction of user-defined Status messages:
{code:java}
external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:
 In function '_jobject* 
Java_org_apache_arrow_plasma_PlasmaClientJNI_create(JNIEnv*, jclass, jlong, 
jbyteArray, jint, jbyteArray)':
external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:114:9:
 error: 'class arrow::Status' has no member named 'IsPlasmaObjectExists'
   if (s.IsPlasmaObjectExists()) {
 ^
external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:120:9:
 error: 'class arrow::Status' has no member named 'IsPlasmaStoreFull'
   if (s.IsPlasmaStoreFull()) {
 ^{code}
[~guoyuhong85] Can you add this codepath to the test so we can catch this kind 
of breakage more quickly in the future?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5751) [Packaging][Python] Python 2.7 wheels broken on macOS: libcares.2.dylib not found

2019-06-26 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5751:
-

 Summary: [Packaging][Python] Python 2.7 wheels broken on macOS: 
libcares.2.dylib not found
 Key: ARROW-5751
 URL: https://issues.apache.org/jira/browse/ARROW-5751
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


I'm afraid that while [https://github.com/apache/arrow/pull/4685] fixed the 
macOS wheels for Python 3, the Python 2.7 wheel is still broken (with a 
different error):
{code:java}
ImportError: 
dlopen(/Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/lib.cpython-36m-darwin.so,
 2): Library not loaded: /usr/local/opt/c-ares/lib/libcares.2.dylib

  Referenced from: 
/Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow_python.14.dylib

  Reason: image not found{code}
I tried the same hack as in [https://github.com/apache/arrow/pull/4685] for 
libcares, but it doesn't work (removing the .dylib causes one of the earlier 
build steps to fail). I think the only way forward on this is to compile grpc 
ourselves. My attempt to do this in 
[https://github.com/apache/arrow/compare/master...pcmoritz:mac-wheels-py2] 
fails because OpenSSL is not found even though I'm specifying the 
OPENSSL_ROOT_DIR (see 
[https://travis-ci.org/pcmoritz/crossbow/builds/550603543]). Let me know if you 
have any ideas how to fix this!
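For diagnosing this class of breakage, one can inspect a library's dependencies with `otool -L` and flag Homebrew-style install names that won't exist on user machines. A hedged sketch of such a check; the helper and the abbreviated sample output below are illustrative, not taken from the actual wheel:

```python
def external_deps(otool_output):
    """Extract dependency install names from `otool -L` text output and
    return the /usr/local/opt (Homebrew) paths that won't resolve on a
    machine without those packages installed."""
    deps = []
    for line in otool_output.splitlines()[1:]:  # first line names the binary
        line = line.strip()
        if not line:
            continue
        path = line.split(" (")[0]  # drop "(compatibility version ...)"
        deps.append(path)
    return [d for d in deps if d.startswith("/usr/local/opt/")]

sample = """libarrow_python.14.dylib:
\t/usr/local/opt/c-ares/lib/libcares.2.dylib (compatibility version 3.0.0)
\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0)"""
print(external_deps(sample))  # → ['/usr/local/opt/c-ares/lib/libcares.2.dylib']
```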





[jira] [Created] (ARROW-5690) [Packaging] macOS wheels broken: libprotobuf.18.dylib missing

2019-06-22 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5690:
-

 Summary: [Packaging] macOS wheels broken: libprotobuf.18.dylib 
missing
 Key: ARROW-5690
 URL: https://issues.apache.org/jira/browse/ARROW-5690
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


If I build macOS arrow wheels with crossbow from the latest master 
(a77257f4790c562dcb74724fc4a22c157ab36018) and install them, importing pyarrow 
gives the following error message:
{code:java}
In [1]: import pyarrow                                                          
                                                                                
                     

---

ImportError                               Traceback (most recent call last)

 in 

> 1 import pyarrow




~/anaconda3/lib/python3.6/site-packages/pyarrow/__init__.py in 

     47 import pyarrow.compat as compat

     48

---> 49 from pyarrow.lib import cpu_count, set_cpu_count

     50 from pyarrow.lib import (null, bool_,

     51                          int8, int16, int32, int64,




ImportError: 
dlopen(/Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/lib.cpython-36m-darwin.so,
 2): Library not loaded: /usr/local/opt/protobuf/lib/libprotobuf.18.dylib

  Referenced from: 
/Users/pcmoritz/anaconda3/lib/python3.6/site-packages/pyarrow/libarrow.14.dylib

  Reason: image not found{code}
 





[jira] [Created] (ARROW-5671) [crossbow] mac os python wheels failing

2019-06-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5671:
-

 Summary: [crossbow] mac os python wheels failing
 Key: ARROW-5671
 URL: https://issues.apache.org/jira/browse/ARROW-5671
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


The building of (all?) macOS python wheels is currently failing with
{code:java}
Traceback (most recent call last):

  File "<string>", line 3, in <module>

  File 
"/Users/travis/build/pcmoritz/crossbow/venv/lib/python3.7/site-packages/pyarrow/__init__.py",
 line 49, in <module>

from pyarrow.lib import cpu_count, set_cpu_count

ImportError: 
dlopen(/Users/travis/build/pcmoritz/crossbow/venv/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-darwin.so,
 2): Library not loaded: @rpath/libarrow_boost_system.dylib

  Referenced from: 
/Users/travis/build/pcmoritz/crossbow/venv/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib

  Reason: image not found{code}
Not sure where this was introduced :(





[jira] [Created] (ARROW-5670) [crossbow] mac os python 3.5 wheel failing

2019-06-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5670:
-

 Summary: [crossbow] mac os python 3.5 wheel failing
 Key: ARROW-5670
 URL: https://issues.apache.org/jira/browse/ARROW-5670
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Currently the macOS Python 3.5 wheel is failing with
{code:java}
Downloading Apache Thrift from Traceback (most recent call last):
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py",
 line 1254, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py",
 line 1107, in request
self._send_request(method, url, body, headers)
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py",
 line 1152, in _send_request
self.endheaders(body)
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py",
 line 1103, in endheaders
self._send_output(message_body)
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py",
 line 934, in _send_output
self.send(msg)
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py",
 line 877, in send
self.connect()
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py",
 line 1261, in connect
server_hostname=server_hostname)
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 
385, in wrap_socket
_context=self)
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 
760, in __init__
self.do_handshake()
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 
996, in do_handshake
self._sslobj.do_handshake()
  File 
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/ssl.py", line 
641, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version 
(_ssl.c:719){code}
I've been looking into this error and will try to push a fix (I think the 
OpenSSL version used with Python 3.5 on macOS is too old).
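To confirm that hypothesis on a given interpreter, one can check which OpenSSL build Python is linked against and whether it can speak TLS 1.2 (servers that reject old clients typically send exactly this TLSV1_ALERT_PROTOCOL_VERSION alert). A small diagnostic sketch:

```python
import ssl

# Report the OpenSSL build this Python was linked against.
print(ssl.OPENSSL_VERSION)

def supports_tls12():
    # ssl.HAS_TLSv1_2 exists on newer Pythons; fall back to probing the
    # protocol constant on older versions.
    if hasattr(ssl, "HAS_TLSv1_2"):
        return ssl.HAS_TLSv1_2
    return hasattr(ssl, "PROTOCOL_TLSv1_2")

print(supports_tls12())
```

If this prints an OpenSSL 0.9.x version and False, the interpreter cannot complete a TLS 1.2 handshake and the download fails exactly as in the traceback above.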





[jira] [Created] (ARROW-5669) [crossbow] manylinux1 wheel building failing

2019-06-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5669:
-

 Summary: [crossbow] manylinux1 wheel building failing
 Key: ARROW-5669
 URL: https://issues.apache.org/jira/browse/ARROW-5669
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


I tried to set up a crossbow queue (on 
a0e1fbb9ef51d05a3f28e221cf8c5d4031a50c93), and right now building the 
manylinux1 wheels seems to be failing because of the arrow flight tests:

 
{code:java}
___ test_tls_do_get 
def test_tls_do_get():
"""Try a simple do_get call over TLS."""
table = simple_ints_table()
>   certs = example_tls_certs()
usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:563: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:64: in 
example_tls_certs
"root_cert": read_flight_resource("root-ca.pem"),
usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:48: in 
read_flight_resource
root = resource_root()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
def resource_root():
"""Get the path to the test resources directory."""
if not os.environ.get("ARROW_TEST_DATA"):
>   raise RuntimeError("Test resources not found; set "
   "ARROW_TEST_DATA to /testing")
E   RuntimeError: Test resources not found; set ARROW_TEST_DATA to 
/testing
usr/local/lib/python3.6/site-packages/pyarrow/tests/test_flight.py:41: 
RuntimeError{code}
This may have been introduced in 
[https://github.com/apache/arrow/pull/4594].

Any thoughts on how we should proceed with this?
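The guard in `resource_root` boils down to the following (a stdlib sketch mirroring the test's behavior; the wheel build would need to export the variable, or the TLS tests would need to be skipped when it is absent):

```python
import os

def resource_root(env=os.environ):
    """Resolve the flight test resource directory, mirroring the guard in
    test_flight.py: fail early with a clear message when ARROW_TEST_DATA
    is unset instead of failing cryptically later."""
    root = env.get("ARROW_TEST_DATA")
    if not root:
        raise RuntimeError(
            "Test resources not found; set ARROW_TEST_DATA "
            "to the arrow testing directory")
    return root

# The wheel build image could export the variable before running the tests:
print(resource_root({"ARROW_TEST_DATA": "/arrow/testing"}))  # → /arrow/testing
```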





[jira] [Created] (ARROW-5027) [Python] Add JSON Reader

2019-03-27 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5027:
-

 Summary: [Python] Add JSON Reader
 Key: ARROW-5027
 URL: https://issues.apache.org/jira/browse/ARROW-5027
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Philipp Moritz


Add bindings for the JSON reader.





[jira] [Created] (ARROW-5022) [C++] Implement more "Datum" types for AggregateKernel

2019-03-26 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5022:
-

 Summary: [C++] Implement more "Datum" types for AggregateKernel
 Key: ARROW-5022
 URL: https://issues.apache.org/jira/browse/ARROW-5022
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Currently it gives the following error if the datum isn't an array:
{code:java}
AggregateKernel expects Array datum{code}





[jira] [Created] (ARROW-5002) [C++] Implement GroupBy

2019-03-24 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5002:
-

 Summary: [C++] Implement GroupBy
 Key: ARROW-5002
 URL: https://issues.apache.org/jira/browse/ARROW-5002
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Dear all,

I wonder what the best way forward is for implementing GroupBy kernels. 
Initially this was part of

https://issues.apache.org/jira/browse/ARROW-4124

but is not contained in the current implementation as far as I can tell.

It seems that the part of group by that just returns indices could be 
conveniently implemented with the HashKernel. That seems useful in any case. Is 
that indeed the best way forward/should this be done?

GroupBy + Aggregate could then be implemented either with that plus the Take 
kernel plus aggregation (though this involves more memory copies than 
necessary), or as part of the aggregate kernel. The latter is probably 
preferred; any thoughts on that?

Am I missing any other JIRAs related to this?

Best, Philipp.





[jira] [Created] (ARROW-4983) [Plasma] Unmap memory when the client is destroyed

2019-03-21 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4983:
-

 Summary: [Plasma] Unmap memory when the client is destroyed
 Key: ARROW-4983
 URL: https://issues.apache.org/jira/browse/ARROW-4983
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Affects Versions: 0.12.1
Reporter: Philipp Moritz
Assignee: Philipp Moritz


Currently the plasma memory mapped into the client is not unmapped upon 
destruction of the client, which can cause memory mapped files to be kept 
around longer than necessary.
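The intended lifetime coupling can be illustrated with a toy client that closes its mapping on disconnect (a Python sketch with hypothetical names, not the actual Plasma C++ client):

```python
import mmap
import tempfile

class ClientMapping:
    """Sketch of tying a memory mapping's lifetime to a client object:
    the mapping is unmapped explicitly on disconnect (or scope exit)
    instead of lingering until garbage collection."""

    def __init__(self, path, size):
        self._f = open(path, "r+b")
        self._mm = mmap.mmap(self._f.fileno(), size)

    def disconnect(self):
        # Explicitly unmap so the file can be reclaimed promptly.
        self._mm.close()
        self._f.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.disconnect()

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096)
    path = f.name

with ClientMapping(path, 4096) as c:
    assert not c._mm.closed
print(c._mm.closed)  # → True
```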





[jira] [Created] (ARROW-4958) [C++] Purely static linking broken

2019-03-18 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4958:
-

 Summary: [C++] Purely static linking broken
 Key: ARROW-4958
 URL: https://issues.apache.org/jira/browse/ARROW-4958
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


On the current master, 816c10d030842a1a0da4d00f95a5e3749c86a74f (#3965), running

 
{code:java}
docker-compose build cpp
docker-compose run cpp-static-only{code}
yields
{code:java}
[357/382] Linking CXX executable debug/parquet-encoding-benchmark

FAILED: debug/parquet-encoding-benchmark

: && /opt/conda/bin/ccache /usr/bin/g++  -Wno-noexcept-type  
-fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion 
-Wno-sign-conversion -Werror -msse4.2  -g  -rdynamic 
src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o  
-o debug/parquet-encoding-benchmark  -Wl,-rpath,/opt/conda/lib 
/opt/conda/lib/libbenchmark_main.a debug/libparquet.a 
/opt/conda/lib/libbenchmark.a debug/libarrow.a 
/opt/conda/lib/libdouble-conversion.a /opt/conda/lib/libbrotlienc.so 
/opt/conda/lib/libbrotlidec.so /opt/conda/lib/libbrotlicommon.so 
/opt/conda/lib/libbz2.so /opt/conda/lib/liblz4.so 
/opt/conda/lib/libsnappy.so.1.1.7 /opt/conda/lib/libz.so 
/opt/conda/lib/libzstd.so orc_ep-install/lib/liborc.a 
/opt/conda/lib/libprotobuf.so /opt/conda/lib/libglog.so 
/opt/conda/lib/libboost_system.so /opt/conda/lib/libboost_filesystem.so 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt 
/opt/conda/lib/libboost_regex.so /opt/conda/lib/libthrift.so && :

src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: 
In function `testing::AssertionResult::AppendMessage(testing::Message const&)':

/opt/conda/include/gtest/gtest.h:352: undefined reference to 
`testing::Message::GetString[abi:cxx11]() const'

src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: 
In function `parquet::BenchmarkDecodeArrow::InitDataInputs()':

/arrow/cpp/src/parquet/encoding-benchmark.cc:201: undefined reference to 
`arrow::random::RandomArrayGenerator::StringWithRepeats(long, long, int, int, 
double)'

src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: 
In function `parquet::BM_DictDecodingByteArray::DoEncodeData()':

/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to 
`testing::internal::AlwaysTrue()'

/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to 
`testing::internal::AlwaysTrue()'

/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to 
`testing::Message::Message()'

/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to 
`testing::internal::AssertHelper::AssertHelper(testing::TestPartResult::Type, 
char const*, int, char const*)'

/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to 
`testing::internal::AssertHelper::operator=(testing::Message const&) const'

/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to 
`testing::internal::AssertHelper::~AssertHelper()'

/arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to 
`testing::Message::Message()'

/arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to 
`testing::internal::AssertHelper::AssertHelper(testing::TestPartResult::Type, 
char const*, int, char const*)'

/arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to 
`testing::internal::AssertHelper::operator=(testing::Message const&) const'

/arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to 
`testing::internal::AssertHelper::~AssertHelper()'

/arrow/cpp/src/parquet/encoding-benchmark.cc:317: undefined reference to 
`testing::internal::AssertHelper::~AssertHelper()'

/arrow/cpp/src/parquet/encoding-benchmark.cc:321: undefined reference to 
`testing::internal::AssertHelper::~AssertHelper()'

src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: 
In function `testing::internal::scoped_ptr<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > 
>::reset(std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >*)':

/opt/conda/include/gtest/internal/gtest-port.h:1215: undefined reference to 
`testing::internal::IsTrue(bool)'

src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: 
In function `testing::AssertionResult 
testing::internal::CmpHelperNE
 >*, decltype(nullptr)>(char const*, char const*, 
parquet::DictEncoder >* const&, 
decltype(nullptr) const&)':

/opt/conda/include/gtest/gtest.h:1573: undefined reference to 
`testing::AssertionSuccess()'

src/parquet/CMakeFiles/parquet-encoding-benchmark.dir/encoding-benchmark.cc.o: 
In function 
`testing::internal::scoped_ptr<std::__cxx11::basic_stringstream<char, 
std::char_traits<char>, std::allocator<char> > 
>::reset(std::__cxx11::basic_stringstream<char, std::char_traits<char>, 
std::allocator<char> >*)':

/opt/conda/include/gtest/internal/gtest-port.h:1215: undefined reference to 
`testing::internal::IsTrue(bool)'{code}


[jira] [Created] (ARROW-4912) [C++, Python] Allow specifying column names to CSV reader

2019-03-15 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4912:
-

 Summary: [C++, Python] Allow specifying column names to CSV reader
 Key: ARROW-4912
 URL: https://issues.apache.org/jira/browse/ARROW-4912
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Currently I think there is no way to specify custom column names for CSV files. 
It's possible to specify the full schema of the file, but not just column names.

See the related discussion here: ARROW-3722

The goal of this is to re-use the CSV type-inference but still allow people to 
specify custom names for the columns. As far as I know, there is currently no 
way to set column names post-hoc, so we should provide a way to specify them 
before reading the file.

Related to this, ParseOptions(header_rows=0) is not currently implemented.

Is there any current way to do this or does this need to be implemented?
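The requested behavior — ignore the file's header row but apply caller-supplied names — can be sketched with the stdlib csv module (illustrative only; not pyarrow's API, and without the type inference the real reader would keep):

```python
import csv
import io

def read_with_column_names(csv_text, column_names):
    """Read a CSV whose header row should be discarded and replaced with
    caller-supplied column names. The values stay untouched so a
    downstream type-inference pass could still run on them."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    if len(column_names) != len(header):
        raise ValueError("need exactly one name per column")
    return [dict(zip(column_names, row)) for row in data]

table = read_with_column_names("x,y\n1,2\n3,4\n", ["col_a", "col_b"])
print(table[0])  # → {'col_a': '1', 'col_b': '2'}
```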





[jira] [Created] (ARROW-4905) [C++][Plasma] Remove dlmalloc from client library

2019-03-15 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4905:
-

 Summary: [C++][Plasma] Remove dlmalloc from client library
 Key: ARROW-4905
 URL: https://issues.apache.org/jira/browse/ARROW-4905
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Affects Versions: 0.12.1
Reporter: Philipp Moritz
Assignee: Philipp Moritz


While working on the Ray build system, I noticed that dlmalloc symbols are 
leaking into the plasma client library. They should be separated out and only 
linked into the store.





[jira] [Created] (ARROW-4797) [Plasma] Avoid store crash if not enough memory is available

2019-03-07 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4797:
-

 Summary: [Plasma] Avoid store crash if not enough memory is 
available
 Key: ARROW-4797
 URL: https://issues.apache.org/jira/browse/ARROW-4797
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Currently, the plasma store exits with a fatal check if not enough memory is 
available. This can lead to errors that are hard to diagnose, see

[https://github.com/ray-project/ray/issues/3670]

Instead, we should keep the store alive in these circumstances, have it take up 
some of the remaining memory, and allow the client to check whether enough 
memory has been allocated.
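The proposed behavior — return an out-of-memory status to the client instead of killing the store — can be sketched as follows (hypothetical names, not Plasma's wire protocol):

```python
class ToyStore:
    """Sketch of graceful out-of-memory handling: create() reports
    failure to the caller instead of aborting the whole process."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def create(self, size):
        if self.used + size > self.capacity:
            # Old behavior: a fatal check would kill the store here.
            # New behavior: stay alive and let the client react.
            return None, "OutOfMemory"
        self.used += size
        return bytearray(size), "OK"

store = ToyStore(capacity=100)
_, status = store.create(80)
print(status)              # → OK
_, status = store.create(40)
print(status)              # → OutOfMemory
```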





[jira] [Created] (ARROW-4757) Nested chunked array support

2019-03-04 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4757:
-

 Summary: Nested chunked array support
 Key: ARROW-4757
 URL: https://issues.apache.org/jira/browse/ARROW-4757
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Dear all,

I'm currently trying to lift the 2GB limit on the python serialization. For 
this, I implemented a chunked union builder to split the array into smaller 
arrays.

However, some of the children of the union array can be ListArrays, which can 
themselves contain UnionArrays which can contain ListArrays etc. I'm at a bit 
of a loss how to handle this. In principle I'd like to chunk the children too. 
However, currently UnionArrays can only have children of type Array, and there 
is no way to treat a chunked array (which is a vector of Arrays) as an Array to 
store it as a child of a UnionArray. Any ideas how to best support this use 
case?
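The basic chunking move — splitting a sequence so that each chunk's byte size stays under a limit — looks like the sketch below; the open question above is precisely how to apply it recursively when union/list children must themselves become chunked:

```python
def chunked(items, sizes, max_chunk_bytes):
    """Split a sequence into chunks whose summed byte sizes stay under a
    limit, the core move behind a chunked builder. Pure-Python sketch;
    an oversized single item still gets its own chunk."""
    chunks, current, current_bytes = [], [], 0
    for item, size in zip(items, sizes):
        if current and current_bytes + size > max_chunk_bytes:
            chunks.append(current)       # flush the full chunk
            current, current_bytes = [], 0
        current.append(item)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

print(chunked(["a", "b", "c", "d"], [3, 3, 3, 3], 6))  # → [['a', 'b'], ['c', 'd']]
```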

-- Philipp.





[jira] [Created] (ARROW-4690) Building TensorFlow compatible wheels for Arrow

2019-02-26 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4690:
-

 Summary: Building TensorFlow compatible wheels for Arrow
 Key: ARROW-4690
 URL: https://issues.apache.org/jira/browse/ARROW-4690
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Since the inclusion of LLVM, arrow wheels stopped working with TensorFlow again 
(on some configurations at least).

While we are continuing to discuss a more permanent solution in 
[https://groups.google.com/a/tensorflow.org/d/topic/developers/TMqRaT-H2bI/discussion],
 I made some progress in creating TensorFlow-compatible wheels for an 
unmodified pyarrow.

They won't adhere to the manylinux1 standard, but they should be as compatible 
as the TensorFlow wheels because they use the same build environment (Ubuntu 
14.04).

I'll create a PR with the necessary changes. I don't propose to ship these 
wheels, but it might be a good idea to include the docker image and 
instructions on how to build them in the tree, for organizations that want to 
use TensorFlow with pyarrow on top of pip. For now, the official recommendation 
for the average user should probably be to use conda.





[jira] [Created] (ARROW-4491) [Python] Remove usage of std::to_string and std::stoi

2019-02-05 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4491:
-

 Summary: [Python] Remove usage of std::to_string and std::stoi
 Key: ARROW-4491
 URL: https://issues.apache.org/jira/browse/ARROW-4491
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Not sure why this is happening, but for some older compilers I'm seeing
{code:java}
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi{code}
since 
[https://github.com/apache/arrow/pull/3423].

A possible cause is that there is no int8_t overload of 
[https://en.cppreference.com/w/cpp/string/basic_string/to_string], so it might 
not convert the value to a proper string representation of the number.

Any insight on why this could be happening is appreciated.





[jira] [Created] (ARROW-4453) [Python] Create Cython wrappers for sparse array

2019-02-01 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4453:
-

 Summary: [Python] Create Cython wrappers for sparse array
 Key: ARROW-4453
 URL: https://issues.apache.org/jira/browse/ARROW-4453
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Philipp Moritz


We should have cython wrappers for [https://github.com/apache/arrow/pull/2546]

This is related to support for https://issues.apache.org/jira/browse/ARROW-4223 
and https://issues.apache.org/jira/browse/ARROW-4224

I imagine the code would be similar to 
https://github.com/apache/arrow/blob/5a502d281545402240e818d5fd97a9aaf36363f2/python/pyarrow/array.pxi#L748





[jira] [Created] (ARROW-4452) [Python] Serializing sparse torch tensors

2019-02-01 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4452:
-

 Summary: [Python] Serializing sparse torch tensors
 Key: ARROW-4452
 URL: https://issues.apache.org/jira/browse/ARROW-4452
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Using the pytorch serialization handler on sparse Tensors:
{code:java}
import torch
i = torch.LongTensor([[0, 2], [1, 0], [1, 2]])
v = torch.FloatTensor([3,      4,      5    ])
tensor = torch.sparse.FloatTensor(i.t(), v, torch.Size([2,3]))

register_torch_serialization_handlers(pyarrow.serialization._default_serialization_context)

s = pyarrow.serialize(tensor, 
context=pyarrow.serialization._default_serialization_context) {code}
Produces this result:
{code:java}
TypeError: can't convert sparse tensor to numpy. Use Tensor.to_dense() to 
convert to a dense tensor first.{code}
We should provide a way to serialize sparse torch tensors, especially now that 
we are getting support for sparse Tensors.
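A handler could represent the sparse tensor as a plain COO triple (indices, values, shape). The sketch below is pure Python using the coordinates of the example tensor above; a real handler would register serialize/deserialize callbacks with pyarrow's serialization context and rebuild a torch.sparse.FloatTensor instead of a nested list:

```python
def serialize_sparse_coo(indices, values, shape):
    """Represent a sparse COO tensor as a plain triple of
    (coordinate pairs, values, dense shape)."""
    return {"indices": indices, "values": values, "shape": shape}

def deserialize_sparse_coo(payload):
    # Rebuild a dense nested-list tensor from the COO triple (for
    # checking; a real handler would reconstruct a torch sparse tensor).
    rows, cols = payload["shape"]
    dense = [[0.0] * cols for _ in range(rows)]
    for (r, c), v in zip(payload["indices"], payload["values"]):
        dense[r][c] = v
    return dense

# Coordinates (0,2), (1,0), (1,2) with values 3, 4, 5 in a 2x3 tensor,
# matching the torch example above.
payload = serialize_sparse_coo([(0, 2), (1, 0), (1, 2)], [3.0, 4.0, 5.0], (2, 3))
print(deserialize_sparse_coo(payload))  # → [[0.0, 0.0, 3.0], [4.0, 0.0, 5.0]]
```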





[jira] [Created] (ARROW-4378) [Plasma] Release objects upon Create

2019-01-25 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4378:
-

 Summary: [Plasma] Release objects upon Create
 Key: ARROW-4378
 URL: https://issues.apache.org/jira/browse/ARROW-4378
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.13.0
Reporter: Philipp Moritz


Similar to the way that
{code:java}
Get(const std::vector<ObjectID>& object_ids, int64_t timeout_ms, 
std::vector<ObjectBuffer>* out){code}
 releases the object when the std::shared_ptr<Buffer> inside of ObjectBuffer 
goes out of scope, the same should happen for
{code}
  Status Create(const ObjectID& object_id, int64_t data_size, const uint8_t* 
metadata,
int64_t metadata_size, std::shared_ptr<Buffer>* data);
{code}
At the moment, people have to remember to call Release() after they have 
created and sealed the object, which can make the C++ API cumbersome to use.

Thanks to [~anuragkh] for reporting this.
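The proposed release-on-scope-exit behavior can be illustrated with a context-manager sketch (a hypothetical toy client in Python, not the C++ API):

```python
from contextlib import contextmanager

class ToyClient:
    """Toy stand-in for a client whose created buffers hold a reference
    until released."""

    def __init__(self):
        self.ref_counts = {}

    def create(self, object_id, size):
        self.ref_counts[object_id] = 1   # creation takes a reference
        return bytearray(size)

    def release(self, object_id):
        self.ref_counts[object_id] -= 1

@contextmanager
def created(client, object_id, size):
    # Hand out the buffer for writing; Release() runs automatically when
    # the scope ends, so callers cannot forget it.
    buf = client.create(object_id, size)
    try:
        yield buf
    finally:
        client.release(object_id)

client = ToyClient()
with created(client, "obj1", 8) as buf:
    buf[:3] = b"abc"
print(client.ref_counts["obj1"])  # → 0
```

In C++ the same effect falls out naturally if the returned shared_ptr's deleter performs the release, which is what the Get() path already does.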





[jira] [Created] (ARROW-4285) [Python] Use proper builder interface for serialization

2019-01-17 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4285:
-

 Summary: [Python] Use proper builder interface for serialization
 Key: ARROW-4285
 URL: https://issues.apache.org/jira/browse/ARROW-4285
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.12.0
Reporter: Philipp Moritz


As a preparation for ARROW-3919, refactor the python serialization code such 
that the default builder interface is used. In the next step we can then plug 
in ChunkedBuilders to make sure that the generated arrays are properly chunked.





[jira] [Created] (ARROW-4269) [Python] AttributeError: module 'pandas.core' has no attribute 'arrays'

2019-01-15 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4269:
-

 Summary: [Python] AttributeError: module 'pandas.core' has no 
attribute 'arrays'
 Key: ARROW-4269
 URL: https://issues.apache.org/jira/browse/ARROW-4269
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This happens with pandas 0.22:

{code:java}

In [1]: import pyarrow
---
AttributeError Traceback (most recent call last)
 in ()
> 1 import pyarrow

~/arrow/python/pyarrow/__init__.py in ()
 174 localfs = LocalFileSystem.get_instance()
 175 
--> 176 from pyarrow.serialization import (default_serialization_context,
 177 register_default_serialization_handlers,
 178 register_torch_serialization_handlers)

~/arrow/python/pyarrow/serialization.py in ()
 303 
 304 
--> 305 register_default_serialization_handlers(_default_serialization_context)

~/arrow/python/pyarrow/serialization.py in 
register_default_serialization_handlers(serialization_context)
 294 custom_deserializer=_deserialize_pyarrow_table)
 295 
--> 296 _register_custom_pandas_handlers(serialization_context)
 297 
 298

~/arrow/python/pyarrow/serialization.py in 
_register_custom_pandas_handlers(context)
 175 custom_deserializer=_load_pickle_from_buffer)
 176 
--> 177 if hasattr(pd.core.arrays, 'interval'):
 178 context.register_type(
 179 pd.core.arrays.interval.IntervalArray,

AttributeError: module 'pandas.core' has no attribute 'arrays'

{code}
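The fix amounts to feature-detecting pd.core.arrays before touching it, so pandas versions that lack the attribute are skipped instead of raising. A minimal sketch using a stand-in module object (not a real pandas import):

```python
import types

def register_interval_handlers(pd_core):
    """Guarded registration mirroring the failing line: only touch
    pd_core.arrays if it exists, so old pandas (e.g. 0.22, which lacks
    pandas.core.arrays) is skipped instead of raising AttributeError."""
    arrays = getattr(pd_core, "arrays", None)
    if arrays is not None and hasattr(arrays, "interval"):
        return "registered"
    return "skipped"

old_pandas_core = types.SimpleNamespace()           # pandas 0.22: no .arrays
print(register_interval_handlers(old_pandas_core))  # → skipped
```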





[jira] [Created] (ARROW-4249) [Plasma] Remove reference to logging.h from plasma/common.h

2019-01-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4249:
-

 Summary: [Plasma] Remove reference to logging.h from 
plasma/common.h
 Key: ARROW-4249
 URL: https://issues.apache.org/jira/browse/ARROW-4249
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.11.1
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.13.0


It is not needed there, and it pollutes the namespace of applications that use 
the plasma client with arrow's DCHECK macros (DCHECK is a name widely used in 
other projects).





[jira] [Created] (ARROW-4217) [Plasma] Remove custom object metadata

2019-01-09 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4217:
-

 Summary: [Plasma] Remove custom object metadata
 Key: ARROW-4217
 URL: https://issues.apache.org/jira/browse/ARROW-4217
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.11.1
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.13.0


Currently, Plasma supports custom metadata for objects. This doesn't seem to be 
used at the moment, and it will simplify the interface and implementation to 
remove it. Removing the custom metadata will also make eviction to other blob 
stores easier (most other stores don't support custom metadata).

My personal use case was to store arrow schemata in there, but they are now 
stored as part of the object itself.

If nobody else is using this, I'd suggest removing it. If people really want 
metadata, they could always store it as a separate object.
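The suggested workaround — metadata as an ordinary separate object — only needs a deterministic companion id for the metadata object. A sketch with a purely illustrative derivation scheme:

```python
import hashlib

def metadata_object_id(object_id):
    """Derive a deterministic 20-byte companion id for an object's
    metadata, so the metadata can live as an ordinary separate object.
    The hashing scheme here is illustrative, not a proposed standard."""
    return hashlib.sha1(object_id + b":meta").digest()[:20]

store = {}
oid = b"\x01" * 20
store[oid] = b"payload"
store[metadata_object_id(oid)] = b'{"schema": "..."}'
print(len(store))  # → 2
```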

 





[jira] [Created] (ARROW-4025) [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some settings

2018-12-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4025:
-

 Summary: [Python] TensorFlow/PyTorch arrow ThreadPool workarounds 
not working in some settings
 Key: ARROW-4025
 URL: https://issues.apache.org/jira/browse/ARROW-4025
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.11.1
Reporter: Philipp Moritz


See the bug report in [https://github.com/ray-project/ray/issues/3520]

I wonder if we can revisit this issue and try to get rid of the workarounds we 
tried to deploy in the past.

See also the discussion in [https://github.com/apache/arrow/pull/2096]





[jira] [Created] (ARROW-4024) [Python] Cython compilation error on cython==0.27.3

2018-12-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-4024:
-

 Summary: [Python] Cython compilation error on cython==0.27.3
 Key: ARROW-4024
 URL: https://issues.apache.org/jira/browse/ARROW-4024
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


On the latest master, I'm getting the following error:
{code:java}
[ 11%] Compiling Cython CXX source for lib...

Error compiling Cython file:
------------------------------------------------------------
...
    out.init(type)
    return out


cdef object pyarrow_wrap_metadata(
    ^
------------------------------------------------------------

pyarrow/public-api.pxi:95:5: Function signature does not match previous declaration

CMakeFiles/lib_pyx.dir/build.make:57: recipe for target 'CMakeFiles/lib_pyx' failed{code}
With 0.29.0 it is working. This might have been introduced in 
[https://github.com/apache/arrow/commit/12201841212967c78e31b2d2840b55b1707c4e7b]
 but I'm not sure.





[jira] [Created] (ARROW-3958) [Plasma] Reduce number of IPCs

2018-12-07 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3958:
-

 Summary: [Plasma] Reduce number of IPCs
 Key: ARROW-3958
 URL: https://issues.apache.org/jira/browse/ARROW-3958
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.11.1
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.12.0


Currently we ship file descriptors of objects from the store to the client 
every time an object is created or gotten. There are relatively few distinct 
file descriptors, so caching them can get rid of one IPC in the majority of 
cases.
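A minimal sketch of the caching idea (names are illustrative, not the actual Plasma client code): the client remembers which store-side file descriptors it has already received, so a repeated Create/Get only needs the descriptor index and can skip the fd-passing IPC.

```cpp
#include <unordered_map>

// Hypothetical fd cache on the client side. A store fd is only transferred
// over the socket the first time it is seen; afterwards we reuse the local fd.
class FdCache {
 public:
  // Returns the locally received fd if store_fd was seen before, or -1.
  int Lookup(int store_fd) const {
    auto it = cache_.find(store_fd);
    return it == cache_.end() ? -1 : it->second;
  }
  // Record the fd received over the socket for this store-side fd.
  void Insert(int store_fd, int received_fd) { cache_[store_fd] = received_fd; }

 private:
  std::unordered_map<int, int> cache_;
};
```

On a cache hit, the create/get reply only has to carry the store-side fd number, not the descriptor itself.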





[jira] [Created] (ARROW-3950) [Plasma] Don't force loading the TensorFlow op on import

2018-12-06 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3950:
-

 Summary: [Plasma] Don't force loading the TensorFlow op on import
 Key: ARROW-3950
 URL: https://issues.apache.org/jira/browse/ARROW-3950
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz
Assignee: Philipp Moritz


In certain situations, users want more control over when the TensorFlow op is 
loaded, so we should make loading it optional (even if the op exists). This 
happens in Ray, for example, where we need to make sure that if multiple Python 
workers try to compile and import the TensorFlow op in parallel, there is no 
race condition (e.g. one worker could try to import a half-built version of the 
op).





[jira] [Created] (ARROW-3934) [Gandiva] Don't compile precompiled tests if ARROW_GANDIVA_BUILD_TESTS=off

2018-12-03 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3934:
-

 Summary: [Gandiva] Don't compile precompiled tests if 
ARROW_GANDIVA_BUILD_TESTS=off
 Key: ARROW-3934
 URL: https://issues.apache.org/jira/browse/ARROW-3934
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.12.0


Currently the precompiled tests are compiled unconditionally, even if 
ARROW_GANDIVA_BUILD_TESTS=off.





[jira] [Created] (ARROW-3919) [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize

2018-11-30 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3919:
-

 Summary: [Python] Support 64 bit indices for pyarrow.serialize and 
pyarrow.deserialize
 Key: ARROW-3919
 URL: https://issues.apache.org/jira/browse/ARROW-3919
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


see https://github.com/modin-project/modin/issues/266





[jira] [Created] (ARROW-3746) [Gandiva] [Python] Make it possible to list all functions registered with Gandiva

2018-11-09 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3746:
-

 Summary: [Gandiva] [Python] Make it possible to list all functions 
registered with Gandiva
 Key: ARROW-3746
 URL: https://issues.apache.org/jira/browse/ARROW-3746
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This will also be useful for documentation purposes (right now it is not very 
easy to get a list of all the functions that are registered).





[jira] [Created] (ARROW-3718) [Gandiva] Remove spurious gtest include

2018-11-08 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3718:
-

 Summary: [Gandiva] Remove spurious gtest include
 Key: ARROW-3718
 URL: https://issues.apache.org/jira/browse/ARROW-3718
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Gandiva
Affects Versions: 0.11.1
Reporter: Philipp Moritz
 Fix For: 0.12.0


At the moment, cpp/src/gandiva/expr_decomposer.h includes a gtest header, which 
can prevent Gandiva from being built without the gtest dependency.





[jira] [Created] (ARROW-3721) [Gandiva] [Python] Support all Gandiva literals

2018-11-08 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3721:
-

 Summary: [Gandiva] [Python] Support all Gandiva literals
 Key: ARROW-3721
 URL: https://issues.apache.org/jira/browse/ARROW-3721
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Support all the literals from 
[https://github.com/apache/arrow/blob/5b116ab175292fe70ed3c8727bcc6868b9695f4a/cpp/src/gandiva/tree_expr_builder.h#L35]
 in the Cython bindings.





[jira] [Created] (ARROW-3659) Clang Travis build (matrix entry 2) might not actually be using clang

2018-10-30 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3659:
-

 Summary: Clang Travis build (matrix entry 2) might not actually be 
using clang
 Key: ARROW-3659
 URL: https://issues.apache.org/jira/browse/ARROW-3659
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See for example [https://travis-ci.org/apache/arrow/jobs/448267169]:
{code:java}
Setting environment variables from .travis.yml
$ export ANACONDA_TOKEN=[secure]
$ export ARROW_TRAVIS_USE_TOOLCHAIN=1
$ export ARROW_TRAVIS_VALGRIND=1
$ export ARROW_TRAVIS_PLASMA=1
$ export ARROW_TRAVIS_ORC=1
$ export ARROW_TRAVIS_COVERAGE=1
$ export ARROW_TRAVIS_PARQUET=1
$ export ARROW_TRAVIS_PYTHON_DOCS=1
$ export ARROW_BUILD_WARNING_LEVEL=CHECKIN
$ export ARROW_TRAVIS_PYTHON_JVM=1
$ export ARROW_TRAVIS_JAVA_BUILD_ONLY=1
$ export CC="clang-6.0"
$ export CXX="clang++-6.0"
$ export TRAVIS_COMPILER=gcc
$ export CXX=g++
$ export CC=gcc
$ export PATH=/usr/lib/ccache:$PATH
cache.1
Setting up build cache{code}
The CC and CXX environment variables are overwritten by Travis (because the 
Travis toolchain is set to gcc).





[jira] [Created] (ARROW-3602) [Gandiva] [Python] Add preliminary Cython bindings for Gandiva

2018-10-23 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3602:
-

 Summary: [Gandiva] [Python] Add preliminary Cython bindings for 
Gandiva
 Key: ARROW-3602
 URL: https://issues.apache.org/jira/browse/ARROW-3602
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.11.1
Reporter: Philipp Moritz
 Fix For: 0.12.0


Adding a first version of Cython bindings to Gandiva so it can be called from 
Python.





[jira] [Created] (ARROW-3589) [Gandiva] Make it possible to compile gandiva without JNI

2018-10-22 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3589:
-

 Summary: [Gandiva] Make it possible to compile gandiva without JNI
 Key: ARROW-3589
 URL: https://issues.apache.org/jira/browse/ARROW-3589
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


When trying to compile arrow with
{code:java}
cmake -DARROW_PYTHON=on -DARROW_GANDIVA=on -DARROW_PLASMA=on ..{code}
I'm seeing the following error right now:
{code:java}
CMake Error at 
/home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:137
 (message):

  Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY

  JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)

Call Stack (most recent call first):

  
/home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:378
 (_FPHSA_FAILURE_MESSAGE)

  /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindJNI.cmake:356 
(FIND_PACKAGE_HANDLE_STANDARD_ARGS)

  src/gandiva/jni/CMakeLists.txt:21 (find_package)





-- Configuring incomplete, errors occurred{code}
It should be possible to compile the C++ Gandiva code without the JNI bindings. 
How about we introduce a new flag "-DARROW_GANDIVA_JAVA=off" (which could be on 
by default if desired)?





[jira] [Created] (ARROW-3243) [C++] Upgrade jemalloc to version 5

2018-09-16 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3243:
-

 Summary: [C++] Upgrade jemalloc to version 5
 Key: ARROW-3243
 URL: https://issues.apache.org/jira/browse/ARROW-3243
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Is it possible/feasible to upgrade jemalloc to version 5 and assume that 
version? I'm asking because I've been working towards replacing dlmalloc in 
plasma with jemalloc. That makes some of the code much nicer and removes some 
of the issues we had with dlmalloc, but it requires jemalloc APIs that are only 
available starting from version 5; in particular, I'm using the extent_hooks_t 
capability.

For now I can submit a patch that uses a different version of jemalloc in 
plasma, and then we can figure out how to deal with it (maybe there is a way to 
make it work with older versions). What are your thoughts?





[jira] [Created] (ARROW-3199) [Plasma] Check for EAGAIN in recvmsg and sendmsg

2018-09-08 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3199:
-

 Summary: [Plasma] Check for EAGAIN in recvmsg and sendmsg
 Key: ARROW-3199
 URL: https://issues.apache.org/jira/browse/ARROW-3199
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz
 Fix For: 0.10.0


It turns out that 
[https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L63]
 and probably also 
[https://github.com/apache/arrow/blob/673125fd416cbd2e5c2cb9cb6a4c925adecdaf2c/cpp/src/plasma/fling.cc#L49]
 can block and give an EAGAIN error.

This was discovered during stress tests by https://github.com/stephanie-wang/
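A sketch of the retry pattern such a fix could use (the callable stands in for the real sendmsg/recvmsg call; this is not the actual fling.cc code). In a real fix one would typically poll() the socket before retrying on EAGAIN rather than spinning.

```cpp
#include <cerrno>
#include <functional>

// Retry an I/O call while it reports EINTR or EAGAIN/EWOULDBLOCK, as
// sendmsg/recvmsg can on a busy or interrupted socket. The std::function
// stands in for the syscall so the pattern is testable in isolation.
long RetryIo(const std::function<long()>& io_call) {
  while (true) {
    long n = io_call();
    if (n >= 0) return n;            // success
    if (errno == EINTR) continue;    // interrupted by a signal: retry
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
      continue;  // would block: retry (real code should poll() first)
    }
    return n;  // genuine error: propagate to the caller
  }
}
```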





[jira] [Created] (ARROW-3159) [Plasma] Plasma C++ and Python integration test for tensors

2018-09-01 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3159:
-

 Summary: [Plasma] Plasma C++ and Python integration test for 
tensors
 Key: ARROW-3159
 URL: https://issues.apache.org/jira/browse/ARROW-3159
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This is motivated by ARROW-3127, we should have an integration test for this to 
make sure it won't break in the future.





[jira] [Created] (ARROW-3157) [C++] Improve buffer creation for typed data

2018-09-01 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3157:
-

 Summary: [C++] Improve buffer creation for typed data
 Key: ARROW-3157
 URL: https://issues.apache.org/jira/browse/ARROW-3157
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


While looking into [https://github.com/apache/arrow/pull/2481], I noticed this 
pattern:
{code:java}
const uint8_t* bytes_array = reinterpret_cast<const uint8_t*>(input);
auto buffer = std::make_shared<Buffer>(bytes_array, sizeof(float) * input_length);{code}
It's not the end of the world but seems a little verbose to me. It would be 
great to have something like this:
{code:java}
auto buffer = MakeBuffer(input, input_length);{code}
I couldn't find it, does it already exist somewhere? Any thoughts on the API? 
Potentially specializations to make a buffer out of a std::vector would also 
be helpful.
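A rough sketch of what the proposed helper could look like. `Buffer` here is a minimal non-owning stand-in for arrow::Buffer, just enough to show the shape of the template; the name MakeBuffer is the one suggested above, not an existing API.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Minimal stand-in for arrow::Buffer (a non-owning view over bytes).
struct Buffer {
  Buffer(const uint8_t* data, int64_t size) : data(data), size(size) {}
  const uint8_t* data;
  int64_t size;
};

// Wrap a typed array in a Buffer without the reinterpret_cast boilerplate.
template <typename T>
std::shared_ptr<Buffer> MakeBuffer(const T* input, int64_t length) {
  return std::make_shared<Buffer>(reinterpret_cast<const uint8_t*>(input),
                                  static_cast<int64_t>(sizeof(T)) * length);
}

// Possible specialization for std::vector, as suggested in the issue.
template <typename T>
std::shared_ptr<Buffer> MakeBuffer(const std::vector<T>& input) {
  return MakeBuffer(input.data(), static_cast<int64_t>(input.size()));
}
```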

 





[jira] [Created] (ARROW-3116) [Plasma] Add "ls" to object store

2018-08-24 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3116:
-

 Summary: [Plasma] Add "ls" to object store
 Key: ARROW-3116
 URL: https://issues.apache.org/jira/browse/ARROW-3116
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Philipp Moritz
Assignee: Philipp Moritz


Add a facility to list all the objects in the store and information about them 
(object ids, sizes, number of clients using them etc.). This is very useful for 
debugging applications.





[jira] [Created] (ARROW-3105) [Plasma] Improve flushing error message

2018-08-21 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3105:
-

 Summary: [Plasma] Improve flushing error message
 Key: ARROW-3105
 URL: https://issues.apache.org/jira/browse/ARROW-3105
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.10.0
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.11.0


This helps us diagnose the flushing policy better.





[jira] [Created] (ARROW-3062) [Python] Extend fast libtensorflow_framework.so compatibility workaround to Python 2.7

2018-08-15 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3062:
-

 Summary: [Python] Extend fast libtensorflow_framework.so 
compatibility workaround to Python 2.7
 Key: ARROW-3062
 URL: https://issues.apache.org/jira/browse/ARROW-3062
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.10.0
Reporter: Philipp Moritz
Assignee: Philipp Moritz


The workaround from ARROW-2657 should be optimized a bit to load 
libtensorflow_framework.so directly (instead of doing a full "import 
tensorflow") for Python 2.7 as well.

We are running into this because "import tensorflow" spawns a number of 
threads, so without this optimization, using many Python processes with pyarrow 
will hit OS limits on the number of threads.





[jira] [Created] (ARROW-3018) [Plasma] Improve random ObjectID generation

2018-08-07 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3018:
-

 Summary: [Plasma] Improve random ObjectID generation
 Key: ARROW-3018
 URL: https://issues.apache.org/jira/browse/ARROW-3018
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.10.0
Reporter: Philipp Moritz


As pointed out by [~pitrou], the Mersenne Twister in Plasma is currently not 
seeded appropriately (I only saw the comment recently): 
https://github.com/apache/arrow/pull/2039

I can submit a patch for Plasma but I'm also wondering if we should have a 
properly seeded random number in Arrow.
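One way to seed properly, sketched with the standard library only (not actual Arrow code): fill the engine's entire state from std::random_device via std::seed_seq, instead of using a single 32-bit seed or none at all.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <random>

// Build a std::mt19937 whose full 624-word state is seeded from the OS
// entropy source, so two processes started at the same time diverge.
std::mt19937 MakeSeededEngine() {
  std::random_device rd;
  std::array<std::uint32_t, std::mt19937::state_size> seed_data;
  std::generate(seed_data.begin(), seed_data.end(), std::ref(rd));
  std::seed_seq seq(seed_data.begin(), seed_data.end());
  return std::mt19937(seq);
}
```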





[jira] [Created] (ARROW-2976) [Python] Directory in pyarrow.get_library_dirs() on Travis doesn't contain libarrow.so

2018-08-03 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2976:
-

 Summary: [Python] Directory in pyarrow.get_library_dirs() on 
Travis doesn't contain libarrow.so
 Key: ARROW-2976
 URL: https://issues.apache.org/jira/browse/ARROW-2976
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Concerning the way pyarrow is built in `travis_script_python.sh`: the directory 
returned by pyarrow.get_library_dirs() doesn't seem to contain libarrow.so.





[jira] [Created] (ARROW-2975) [Plasma] TensorFlow op: Compilation only working if arrow found by pkg-config

2018-08-03 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2975:
-

 Summary: [Plasma] TensorFlow op: Compilation only working if arrow 
found by pkg-config
 Key: ARROW-2975
 URL: https://issues.apache.org/jira/browse/ARROW-2975
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz
Assignee: Philipp Moritz


Currently the pyarrow/tensorflow/build.sh script uses pyarrow to discover the 
arrow libraries to link against. However, this is not working on the pip 
package of pyarrow (since the .pc files are not shipped with it).





[jira] [Created] (ARROW-2954) [Plasma] Store object_id only once in object table

2018-07-31 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2954:
-

 Summary: [Plasma] Store object_id only once in object table
 Key: ARROW-2954
 URL: https://issues.apache.org/jira/browse/ARROW-2954
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.10.0


This is the first part of ARROW-2953, i.e. the duplicated storage of the object 
id both in the key and the value of the object hash table.





[jira] [Created] (ARROW-2953) [Plasma] Store memory usage

2018-07-31 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2953:
-

 Summary: [Plasma] Store memory usage
 Key: ARROW-2953
 URL: https://issues.apache.org/jira/browse/ARROW-2953
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


While doing some memory profiling on the store, it became clear that at the 
moment the metadata of the objects takes up much more space than it should. In 
particular, for each object:
 * The object id (20 bytes) is stored three times
 * The object checksum (8 bytes) is stored twice
 * data_size and metadata_size (each 8 bytes) are stored twice

We can therefore significantly reduce the metadata overhead with some 
refactoring.





[jira] [Created] (ARROW-2940) [Python] Import error with pytorch 0.3

2018-07-30 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2940:
-

 Summary: [Python] Import error with pytorch 0.3
 Key: ARROW-2940
 URL: https://issues.apache.org/jira/browse/ARROW-2940
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz


The fix in ARROW-2920 doesn't work in versions strictly before pytorch 0.4:
{code:java}
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/arrow/python/pyarrow/__init__.py", line 57, in <module>
    compat.import_pytorch_extension()
  File "/home/ubuntu/arrow/python/pyarrow/compat.py", line 249, in import_pytorch_extension
    ctypes.CDLL(os.path.join(path, "lib/libcaffe2.so"))
  File "/home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/ctypes/__init__.py", line 351, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/site-packages/torch/lib/libcaffe2.so: cannot open shared object file: No such file or directory{code}





[jira] [Created] (ARROW-2920) [Python] Segfault with pytorch 0.4

2018-07-26 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2920:
-

 Summary: [Python] Segfault with pytorch 0.4
 Key: ARROW-2920
 URL: https://issues.apache.org/jira/browse/ARROW-2920
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz


See also [https://github.com/ray-project/ray/issues/2447]

How to reproduce:
 * Start the Ubuntu Deep Learning AMI (version 12.0) on EC2
 * Create a new env with {{conda create -y -n breaking-env python=3.5}}
 * Install pytorch with {{source activate breaking-env && conda install pytorch torchvision cuda91 -c pytorch}}
 * Compile and install manylinux1 pyarrow wheels from latest arrow master as 
described here: 
https://github.com/apache/arrow/blob/2876a3fdd1fb9ef6918b7214d6e8d1e3017b42ad/python/manylinux1/README.md
 * In the breaking-env just created, run the following:
{code:java}
Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import torch
>>> torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()
Segmentation fault (core dumped){code}
 

Backtrace:
{code:java}
>>> torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) bt
#0  0x in ?? ()
#1  0x77bc8a99 in __pthread_once_slow (once_control=0x7fffdb791e50 , init_routine=0x7fffe46aafe1 ) at pthread_once.c:116
#2  0x7fffda95c302 in at::Type::toBackend(at::Backend) const () from /home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/site-packages/torch/lib/libcaffe2.so
#3  0x7fffdc59b231 in torch::autograd::VariableType::toBackend (this=, b=) at torch/csrc/autograd/generated/VariableType.cpp:145
#4  0x7fffdc8dbe8a in torch::autograd::THPVariable_cuda (self=0x76dbff78, args=0x76daf828, kwargs=0x0) at torch/csrc/autograd/generated/python_variable_methods.cpp:333
#5  0x5569f4e8 in PyCFunction_Call ()
#6  0x556f67cc in PyEval_EvalFrameEx ()
#7  0x556fbe08 in PyEval_EvalFrameEx ()
#8  0x556f6e90 in PyEval_EvalFrameEx ()
#9  0x556fbe08 in PyEval_EvalFrameEx ()
#10 0x5570103d in PyEval_EvalCodeEx ()
#11 0x55701f5c in PyEval_EvalCode ()
#12 0x5575e454 in run_mod ()
#13 0x5562ab5e in PyRun_InteractiveOneObject ()
#14 0x5562ad01 in PyRun_InteractiveLoopFlags ()
#15 0x5562ad62 in PyRun_AnyFileExFlags.cold.2784 ()
#16 0x5562b080 in Py_Main.cold.2785 ()
#17 0x5562b871 in main (){code}





[jira] [Created] (ARROW-2892) [Plasma] Implement interface to get Java arrow objects from Plasma

2018-07-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2892:
-

 Summary: [Plasma] Implement interface to get Java arrow objects 
from Plasma
 Key: ARROW-2892
 URL: https://issues.apache.org/jira/browse/ARROW-2892
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Currently we have a low level interface to access bytes stored in plasma from 
Java, using the JNI: [https://github.com/apache/arrow/pull/2065/]

 

As a followup, we should implement reading (and writing) Java arrow objects 
from plasma, if possible using zero-copy.

 





[jira] [Created] (ARROW-2890) [Plasma] Make Python PlasmaClient.release private

2018-07-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2890:
-

 Summary: [Plasma] Make Python PlasmaClient.release private
 Key: ARROW-2890
 URL: https://issues.apache.org/jira/browse/ARROW-2890
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


It should normally not be called by the user, since it is automatically called 
upon buffer destruction, see also 
https://github.com/apache/arrow/blob/7d2fbeba31763c978d260a9771184a13a63aaaf7/python/pyarrow/_plasma.pyx#L222.
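The underlying idea is plain RAII, sketched here with a hypothetical handle type (not the actual pyarrow or Plasma code): release runs exactly once, when the buffer handle is destroyed, so user code never needs to call it by hand.

```cpp
#include <functional>
#include <utility>

// Hypothetical buffer handle: the release callback stands in for the
// client's release call, and fires automatically on destruction.
class PlasmaBufferHandle {
 public:
  explicit PlasmaBufferHandle(std::function<void()> release)
      : release_(std::move(release)) {}
  ~PlasmaBufferHandle() {
    if (release_) release_();  // released exactly once, at destruction
  }
  // Non-copyable, so the release cannot fire twice.
  PlasmaBufferHandle(const PlasmaBufferHandle&) = delete;
  PlasmaBufferHandle& operator=(const PlasmaBufferHandle&) = delete;

 private:
  std::function<void()> release_;
};
```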





[jira] [Created] (ARROW-2866) [Plasma] TensorFlow op: Investigate outputting multiple output Tensors for the reading op

2018-07-17 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2866:
-

 Summary: [Plasma] TensorFlow op: Investigate outputting multiple 
output Tensors for the reading op
 Key: ARROW-2866
 URL: https://issues.apache.org/jira/browse/ARROW-2866
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


see discussion in 
https://github.com/apache/arrow/pull/2104#discussion_r197308266





[jira] [Created] (ARROW-2811) [Python] Test serialization for determinism

2018-07-07 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2811:
-

 Summary: [Python] Test serialization for determinism
 Key: ARROW-2811
 URL: https://issues.apache.org/jira/browse/ARROW-2811
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


see discussion in https://github.com/apache/arrow/pull/2216





[jira] [Created] (ARROW-2805) [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed

2018-07-06 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2805:
-

 Summary: [Python] TensorFlow import workaround not working with 
tensorflow-gpu if CUDA is not installed
 Key: ARROW-2805
 URL: https://issues.apache.org/jira/browse/ARROW-2805
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


TensorFlow version: 1.7 (GPU enabled but CUDA is not installed)

tensorflow-gpu was installed via pip install

```
import ray
  File "/home/eric/Desktop/ray-private/python/ray/__init__.py", line 28, in <module>
    import pyarrow  # noqa: F401
  File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/__init__.py", line 55, in <module>
    compat.import_tensorflow_extension()
  File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/compat.py", line 193, in import_tensorflow_extension
    ctypes.CDLL(ext)
  File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.9.0: cannot open shared object file: No such file or directory
```





[jira] [Created] (ARROW-2803) [C++] Put hashing function into src/arrow/util

2018-07-06 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2803:
-

 Summary: [C++] Put hashing function into src/arrow/util
 Key: ARROW-2803
 URL: https://issues.apache.org/jira/browse/ARROW-2803
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See [https://github.com/apache/arrow/pull/2220]

We should decide what our default go-to hash function should be (maybe 
murmur3?) and put it into src/arrow/util.





[jira] [Created] (ARROW-2794) [Plasma] Add Delete method for multiple objects

2018-07-04 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2794:
-

 Summary: [Plasma] Add Delete method for multiple objects
 Key: ARROW-2794
 URL: https://issues.apache.org/jira/browse/ARROW-2794
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This improves efficiency since multiple objects can be deleted with a single 
RPC.





[jira] [Created] (ARROW-2788) [Plasma] Defining Delete semantics

2018-07-03 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2788:
-

 Summary: [Plasma] Defining Delete semantics
 Key: ARROW-2788
 URL: https://issues.apache.org/jira/browse/ARROW-2788
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


We should define what the semantics of Plasma's Delete operation is, especially 
in the presence of errors (object in use is deleted, non-existing object is 
deleted).

My current take on this is the following:

Delete should be a hint to the store to delete, so if the object is not 
present, it should be a no-op. If an object that is in use is deleted, the 
store should delete it as soon as the reference count goes to zero (it would 
also be ok, but less desirable in my opinion, to not delete it).

I think this is a good application of "defining errors out of existence" from 
John Ousterhout's book, A Philosophy of Software Design.

Please comment in this thread if you have different opinions so we can discuss!
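A small sketch of the semantics proposed above, with illustrative names rather than the store's real data structures: Delete of an absent object is a no-op, and Delete of an in-use object is deferred until its reference count drops to zero.

```cpp
#include <string>
#include <unordered_map>

// Per-object bookkeeping: how many clients use it, and whether a delete
// is pending until they are done.
struct Entry {
  int ref_count = 0;
  bool delete_pending = false;
};

class Store {
 public:
  void Create(const std::string& id) { objects_[id] = Entry{}; }
  void AddRef(const std::string& id) { ++objects_[id].ref_count; }
  void Release(const std::string& id) {
    auto it = objects_.find(id);
    if (it == objects_.end()) return;
    if (--it->second.ref_count == 0 && it->second.delete_pending) {
      objects_.erase(it);  // deferred delete fires at refcount zero
    }
  }
  void Delete(const std::string& id) {
    auto it = objects_.find(id);
    if (it == objects_.end()) return;  // absent object: no-op
    if (it->second.ref_count == 0) {
      objects_.erase(it);  // unused: delete immediately
    } else {
      it->second.delete_pending = true;  // in use: defer
    }
  }
  bool Contains(const std::string& id) const { return objects_.count(id) > 0; }

 private:
  std::unordered_map<std::string, Entry> objects_;
};
```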





[jira] [Created] (ARROW-2758) [Plasma] Use Scope enum in Plasma

2018-06-27 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2758:
-

 Summary: [Plasma] Use Scope enum in Plasma
 Key: ARROW-2758
 URL: https://issues.apache.org/jira/browse/ARROW-2758
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz
 Fix For: 0.10.0


Modernize our usage of enums in plasma:
 # add option "--scoped-enum" to Flat Buffer Compiler.
 # change the old-styled c++ enum to c++11 style.





[jira] [Created] (ARROW-2757) [Plasma] Huge pages test failing

2018-06-27 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2757:
-

 Summary: [Plasma] Huge pages test failing
 Key: ARROW-2757
 URL: https://issues.apache.org/jira/browse/ARROW-2757
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See:

```
=== FAILURES ===
_ test_use_huge_pages _

@pytest.mark.skipif(not os.path.exists("/mnt/hugepages"), reason="requires hugepage support")
def test_use_huge_pages():
    import pyarrow.plasma as plasma
    with plasma.start_plasma_store(
            plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY,
            plasma_directory="/mnt/hugepages",
            use_hugepages=True) as (plasma_store_name, p):
        plasma_client = plasma.connect(plasma_store_name, "", 64)
>       create_object(plasma_client, 1)

pyarrow/tests/test_plasma.py:773:
pyarrow/tests/test_plasma.py:79: in create_object
    seal=seal)
pyarrow/tests/test_plasma.py:68: in create_object_with_id
    memory_buffer = client.create(object_id, data_size, metadata)
pyarrow/_plasma.pyx:300: in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
>   raise PlasmaStoreFull(message)
E   PlasmaStoreFull: /home/travis/build/apache/arrow/cpp/src/plasma/client.cc:375 code: ReadCreateReply(buffer.data(), buffer.size(), , , _fd, _size)
E   object does not fit in the plasma store
```

seems to be failing consistently since 
[https://github.com/apache/arrow/pull/2062] (which is unrelated)





[jira] [Created] (ARROW-2737) [Plasma] Integrate TensorFlow Op with arrow packaging scripts

2018-06-24 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2737:
-

 Summary: [Plasma] Integrate TensorFlow Op with arrow packaging 
scripts
 Key: ARROW-2737
 URL: https://issues.apache.org/jira/browse/ARROW-2737
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Not sure what is involved here and what the best steps forward are. We should 
first collect experience from deploying the current op with Ray and then see 
what the right deployment strategy is.





[jira] [Created] (ARROW-2629) [Plasma] Iterator invalidation for pending_notifications_

2018-05-22 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2629:
-

 Summary: [Plasma] Iterator invalidation for pending_notifications_
 Key: ARROW-2629
 URL: https://issues.apache.org/jira/browse/ARROW-2629
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Philipp Moritz
 Fix For: 0.10.0


This was discovered when running the Ray integration tests. In 
send_notifications we are modifying pending_notifications_, which invalidates 
the iterator in the for each loop in push_notification.

It's not easy to reproduce, so I don't have a regression test unfortunately, 
but I'll post a patch that fixes it.
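The class of bug, sketched generically (this is not the actual push_notification code): erasing from a map while range-iterating it invalidates the iterator. One safe pattern is the erase-returns-next idiom below; another is to collect keys first and mutate afterwards.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Remove entries with no pending notifications without invalidating the
// loop iterator: unordered_map::erase returns the next valid iterator.
void DropEmpty(std::unordered_map<std::string, std::vector<int>>& pending) {
  for (auto it = pending.begin(); it != pending.end();) {
    if (it->second.empty()) {
      it = pending.erase(it);  // safe: advance via erase's return value
    } else {
      ++it;
    }
  }
}
```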





[jira] [Created] (ARROW-2612) [Plasma] Fix deprecated PLASMA_DEFAULT_RELEASE_DELAY

2018-05-17 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2612:
-

 Summary: [Plasma] Fix deprecated PLASMA_DEFAULT_RELEASE_DELAY
 Key: ARROW-2612
 URL: https://issues.apache.org/jira/browse/ARROW-2612
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


The deprecated PLASMA_DEFAULT_RELEASE_DELAY is currently broken, since it 
refers to kDeprecatedPlasmaDefaultReleaseDelay without the plasma:: namespace 
qualifier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2611) [Python] Python 2 integer serialization

2018-05-17 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2611:
-

 Summary: [Python] Python 2 integer serialization
 Key: ARROW-2611
 URL: https://issues.apache.org/jira/browse/ARROW-2611
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.9.0
Reporter: Philipp Moritz


In Python 2, serializing a Python int with pyarrow.serialize and then 
deserializing it returns a {{long}} instead of an {{int}}. Note that this is 
not an issue in Python 3, where the long type does not exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2595) [Plasma] operator[] creates entries in map

2018-05-16 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2595:
-

 Summary: [Plasma] operator[] creates entries in map
 Key: ARROW-2595
 URL: https://issues.apache.org/jira/browse/ARROW-2595
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz


* Problem
 ** Using object_get_requests_[object_id] in PlasmaStore::return_from_get produces a lot of garbage entries in the map. During measurement, we found significant memory growth at this point.
* Solution
 ** Use an iterator (find()) instead of operator[].
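The pitfall generalizes beyond C++: collections.defaultdict mimics the insert-on-miss behavior of std::map::operator[], so the leak and the iterator-based fix can be sketched in Python (illustrative only, not the Plasma code):

```python
from collections import defaultdict

# std::map::operator[] default-constructs a value when the key is missing,
# much like defaultdict: a mere lookup grows the map.
get_requests = defaultdict(list)
_ = get_requests["missing-object-id"]
assert "missing-object-id" in get_requests  # garbage entry was inserted

# The fix: look up with find()/iterators (here: dict.get), which never inserts.
get_requests_fixed = {}
_ = get_requests_fixed.get("missing-object-id")
assert "missing-object-id" not in get_requests_fixed
```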



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2577) [Plasma] Add ASV benchmarks

2018-05-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2577:
-

 Summary: [Plasma] Add ASV benchmarks
 Key: ARROW-2577
 URL: https://issues.apache.org/jira/browse/ARROW-2577
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


We are about to merge some PRs that potentially impact plasma performance, so 
we should set up ASV to track the changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2542) [Plasma] Refactor object notification code

2018-05-04 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2542:
-

 Summary: [Plasma] Refactor object notification code
 Key: ARROW-2542
 URL: https://issues.apache.org/jira/browse/ARROW-2542
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Replace unique_ptr with vector



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2541) [Plasma] Clean up macro usage

2018-05-04 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2541:
-

 Summary: [Plasma] Clean up macro usage
 Key: ARROW-2541
 URL: https://issues.apache.org/jira/browse/ARROW-2541
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


There are still a lot of macros being used as constants in the plasma codebase. 
This should be cleaned up and replaced with constexpr (deprecating them where 
appropriate).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2508) [Python] pytest API changes make tests fail

2018-04-25 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2508:
-

 Summary: [Python] pytest API changes make tests fail
 Key: ARROW-2508
 URL: https://issues.apache.org/jira/browse/ARROW-2508
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Philipp Moritz


Seems like there is a new pytest release on PyPI; it produces the following failures:

```
=== FAILURES ===
__ TestConvertDateTimeLikeTypes.test_pandas_datetime_to_date64_failures[None] __
 
self = 
mask = None
 
 @pytest.mark.parametrize('mask', [
 None,
 np.ones(3),
 np.array([True, False, False])
 ])
 def test_pandas_datetime_to_date64_failures(self, mask):
 s = pd.to_datetime([
 '2018-05-10T10:24:01',
 '2018-05-11T10:24:01',
 '2018-05-12T10:24:01',
 ])
 
 expected_msg = 'Timestamp value had non-zero intraday milliseconds'
> with pytest.raises(pa.ArrowInvalid, msg=expected_msg):
E TypeError: Unexpected keyword arguments passed to pytest.raises: msg
 
pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_convert_pandas.py:862:
 TypeError
_ TestConvertDateTimeLikeTypes.test_pandas_datetime_to_date64_failures[mask1] __
 
self = 
mask = array([ 1., 1., 1.])
 
 @pytest.mark.parametrize('mask', [
 None,
 np.ones(3),
 np.array([True, False, False])
 ])
 def test_pandas_datetime_to_date64_failures(self, mask):
 s = pd.to_datetime([
 '2018-05-10T10:24:01',
 '2018-05-11T10:24:01',
 '2018-05-12T10:24:01',
 ])
 
 expected_msg = 'Timestamp value had non-zero intraday milliseconds'
> with pytest.raises(pa.ArrowInvalid, msg=expected_msg):
E TypeError: Unexpected keyword arguments passed to pytest.raises: msg
 
pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_convert_pandas.py:862:
 TypeError
_ TestConvertDateTimeLikeTypes.test_pandas_datetime_to_date64_failures[mask2] __
 
self = 
mask = array([ True, False, False], dtype=bool)
 
 @pytest.mark.parametrize('mask', [
 None,
 np.ones(3),
 np.array([True, False, False])
 ])
 def test_pandas_datetime_to_date64_failures(self, mask):
 s = pd.to_datetime([
 '2018-05-10T10:24:01',
 '2018-05-11T10:24:01',
 '2018-05-12T10:24:01',
 ])
 
 expected_msg = 'Timestamp value had non-zero intraday milliseconds'
> with pytest.raises(pa.ArrowInvalid, msg=expected_msg):
E TypeError: Unexpected keyword arguments passed to pytest.raises: msg
 
pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_convert_pandas.py:862:
 TypeError
=== short test summary info 
```

I think we can just change msg to message and it should work again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2506) [Plasma] Build error on macOS

2018-04-24 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2506:
-

 Summary: [Plasma] Build error on macOS
 Key: ARROW-2506
 URL: https://issues.apache.org/jira/browse/ARROW-2506
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Since the upgrade to flatbuffers 1.9.0, I'm seeing this error on the Ray CI:

arrow/cpp/src/plasma/format/plasma.fbs:234:0: error: default value of 0 for 
field status is not part of enum ObjectStatus

I'm planning to just remove the '= 1' from 'Local = 1'. This will break the 
protocol however, so if we prefer to just put in a 'Dummy = 0' object at the 
beginning of the enum, that would also be fine with me. However, the 
ObjectStatus API is not stable yet and not even exposed to Python, so I think 
breaking it is fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2458) [Plasma] PlasmaClient uses global variable

2018-04-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2458:
-

 Summary: [Plasma] PlasmaClient uses global variable
 Key: ARROW-2458
 URL: https://issues.apache.org/jira/browse/ARROW-2458
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.9.0
Reporter: Philipp Moritz


The threadpool threadpool_ that PlasmaClient is using is global at the moment. 
This prevents us from using multiple PlasmaClients in the same process (one per 
thread).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2386) [Plasma] Change PlasmaClient::Create API

2018-04-03 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2386:
-

 Summary: [Plasma] Change PlasmaClient::Create API
 Key: ARROW-2386
 URL: https://issues.apache.org/jira/browse/ARROW-2386
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz


Now that the Get API is refactored in 
[https://github.com/apache/arrow/pull/1807], we should do the same for the 
Create API.

Proposal:

Have a MutablePlasmaBuffer class, which is returned by Create:
{code:java}
Status Create(int64_t data_size, int64_t metadata_size,
              std::shared_ptr<MutablePlasmaBuffer>* buffer)
{code}
This allocates the data in shared memory, but does not associate it with the 
object ID yet. This way we can get rid of the Abort() call.

Move the Seal() method into the MutablePlasmaBuffer and let it return the 
object ID.

 

This is very similar to what [~pitrou] suggested here: 
https://github.com/apache/arrow/pull/1807
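A rough Python sketch of the proposed object lifecycle (class and method names are hypothetical, taken from the proposal above; the real implementation is C++ over shared memory):

```python
class MutablePlasmaBuffer:
    """Sketch: a writable buffer allocated first, bound to an ID only on seal."""

    def __init__(self, store, data_size):
        self._store = store                # stand-in for shared memory
        self.data = bytearray(data_size)   # allocated, but no object ID yet
        self.sealed = False

    def seal(self, object_id):
        # Associates the buffer with the object ID and makes it immutable.
        # Dropping an unsealed buffer needs no Abort(): it was never
        # registered under any ID.
        self._store[object_id] = bytes(self.data)
        self.sealed = True
        return object_id


store = {}
buf = MutablePlasmaBuffer(store, 4)
buf.data[:] = b"abcd"
oid = buf.seal("object-1")
assert store[oid] == b"abcd"
```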



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2215) [Plasma] Error when using huge pages

2018-02-26 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2215:
-

 Summary: [Plasma] Error when using huge pages
 Key: ARROW-2215
 URL: https://issues.apache.org/jira/browse/ARROW-2215
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz


see https://github.com/ray-project/ray/issues/1592



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2195:
-

 Summary: [Plasma] Segfault when retrieving RecordBatch from plasma 
store
 Key: ARROW-2195
 URL: https://issues.apache.org/jira/browse/ARROW-2195
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


It can be reproduced with the following script:

```
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()

    print(batch)
    print(batch.schema)
    print(batch[0])

    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])
```

 

Preliminary backtrace:

 

```
* thread #1: tid = 0xf1378e, 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28,
  queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x38158)
    frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: PyInt_FromLong
    0x10e645805 <+37>: testq  %rax, %rax
    0x10e645808 <+40>: je     0x10e64580c               ; <+44>
(lldb) bt
* thread #1: tid = 0xf1378e, 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x38158)
  * frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
    frame #2: 0x00010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2127) [Plasma] Transfer of objects between CPUs and GPUs

2018-02-10 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2127:
-

 Summary: [Plasma] Transfer of objects between CPUs and GPUs
 Key: ARROW-2127
 URL: https://issues.apache.org/jira/browse/ARROW-2127
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


It should be possible to transfer an object that was created on the CPU to the 
GPU and vice versa. One natural implementation is to introduce a flag to 
plasma::Get that specifies where the object should end up and then transfer the 
object under the hood and return the appropriate buffer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2126) [Plasma] Hashing for GPU objects

2018-02-10 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2126:
-

 Summary: [Plasma] Hashing for GPU objects
 Key: ARROW-2126
 URL: https://issues.apache.org/jira/browse/ARROW-2126
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


We should have a CUDA function that computes a hash for objects, similar to the 
way it is done for CPU objects at the moment. Is there a fast hash/checksum 
function available for CUDA, similar to xxhash? Maybe this can be implemented 
as an arrow::compute kernel?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2125) [Plasma] Implement eviction policy for GPU objects

2018-02-10 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2125:
-

 Summary: [Plasma] Implement eviction policy for GPU objects
 Key: ARROW-2125
 URL: https://issues.apache.org/jira/browse/ARROW-2125
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This is a followup to https://github.com/apache/arrow/pull/1445

Right now, objects allocated on GPUs are never evicted. There should be a flag 
with the maximum amount of memory that plasma can take on the GPU. If this 
memory is exceeded, objects should be evicted according to the policy (which is 
pluggable).
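A minimal sketch of what a pluggable, capacity-bounded LRU policy could look like (the class and method names here are invented for illustration and are not Plasma's API):

```python
from collections import OrderedDict

class LRUEvictionPolicy:
    """Sketch of an LRU policy bounded by a GPU memory cap (hypothetical API)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.sizes = OrderedDict()  # object_id -> size, least recent first

    def object_created(self, object_id, size):
        # Record the allocation, then evict oldest objects until under the cap
        # (never evicting the object that was just created).
        self.sizes[object_id] = size
        self.used += size
        evicted = []
        while self.used > self.capacity and len(self.sizes) > 1:
            victim, victim_size = self.sizes.popitem(last=False)
            self.used -= victim_size
            evicted.append(victim)
        return evicted

policy = LRUEvictionPolicy(capacity_bytes=10)
assert policy.object_created("a", 6) == []
assert policy.object_created("b", 6) == ["a"]  # cap exceeded, oldest evicted
```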



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2042) [Plasma] Revert API change of plasma::Create to output a MutableBuffer

2018-01-26 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2042:
-

 Summary: [Plasma] Revert API change of plasma::Create to output a 
MutableBuffer
 Key: ARROW-2042
 URL: https://issues.apache.org/jira/browse/ARROW-2042
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz
Assignee: Philipp Moritz


Reverts a part of the changes from [https://github.com/apache/arrow/pull/1479] 
concerning the plasma::Create API. It should output a shared pointer to a 
Buffer instead of a shared pointer to a MutableBuffer. This is needed for 
[https://github.com/apache/arrow/pull/1445] so we can return a CudaBuffer from 
plasma::Create. It also seems to be more in line with how Buffers are intended 
to be used and avoids API breakage from 0.8.0 to 0.9.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-1944) FindArrow has wrong ARROW_STATIC_LIB

2017-12-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1944:
-

 Summary: FindArrow has wrong ARROW_STATIC_LIB
 Key: ARROW-1944
 URL: https://issues.apache.org/jira/browse/ARROW-1944
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.8.0
Reporter: Philipp Moritz


It seems that in

https://github.com/apache/arrow/blob/a0555c04dd5c43230a1c50d0d0a94e06d8ad9ff0/cpp/cmake_modules/FindArrow.cmake#L100

ARROW_PYTHON_LIB_PATH should be replaced with ARROW_LIBS.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1927) [Plasma] Implement delete function

2017-12-14 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1927:
-

 Summary: [Plasma] Implement delete function
 Key: ARROW-1927
 URL: https://issues.apache.org/jira/browse/ARROW-1927
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++), Python
Reporter: Philipp Moritz


The function should check if the reference count of the object is zero and if 
yes, delete it from the store. If no, it should raise an exception or return a 
status value.
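The proposed semantics, sketched in Python (the function and data structures are hypothetical stand-ins for the store's internals):

```python
def delete_object(store, ref_counts, object_id):
    """Delete an object only if no client still holds a reference."""
    if ref_counts.get(object_id, 0) != 0:
        # Alternatively, return a Status value instead of raising.
        raise RuntimeError("object %r is still in use" % (object_id,))
    store.pop(object_id, None)

store = {"oid-1": b"data"}
ref_counts = {"oid-1": 0}
delete_object(store, ref_counts, "oid-1")
assert "oid-1" not in store          # ref count was zero: deleted

store["oid-2"] = b"data"
ref_counts["oid-2"] = 2
try:
    delete_object(store, ref_counts, "oid-2")
    assert False, "expected an error for an in-use object"
except RuntimeError:
    assert "oid-2" in store          # in-use object survives
```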



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1924) [Python] Bring back pickle=True option for serialization

2017-12-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1924:
-

 Summary: [Python] Bring back pickle=True option for serialization
 Key: ARROW-1924
 URL: https://issues.apache.org/jira/browse/ARROW-1924
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz


We need to revert https://issues.apache.org/jira/browse/ARROW-1758

The reason is that the semantics for pickle=True is slightly different from 
just using (cloud-)pickle as the custom serializer:

If pickle=True is used, the object can be deserialized in any process, even if 
a deserializer for that type_id has not been registered in that process. On the 
other hand, if (cloud-)pickle is used as a custom serializer, the object can 
only be deserialized if pyarrow has the type_id registered and can call the 
deserializer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1919) Plasma hanging if object id is not 20 bytes

2017-12-12 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1919:
-

 Summary: Plasma hanging if object id is not 20 bytes
 Key: ARROW-1919
 URL: https://issues.apache.org/jira/browse/ARROW-1919
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz
Assignee: Philipp Moritz
Priority: Minor


This happens when plasma's capability to put an object with a user-defined 
object ID is used and the object ID is not 20 bytes long. Plasma will hang 
upon get in that case; we should raise an error instead.

See https://github.com/ray-project/ray/issues/1315
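The fix amounts to validating the ID length up front; a hedged sketch (the constant and function names are invented for illustration):

```python
PLASMA_ID_SIZE = 20  # plasma object IDs are exactly 20 bytes

def validate_object_id(object_id):
    # Reject wrong-length IDs immediately instead of hanging later in get().
    if len(object_id) != PLASMA_ID_SIZE:
        raise ValueError("object ID must be %d bytes, got %d"
                         % (PLASMA_ID_SIZE, len(object_id)))
    return object_id

assert validate_object_id(b"keynumber1keynumber1") == b"keynumber1keynumber1"
try:
    validate_object_id(b"too-short")
    assert False, "expected ValueError"
except ValueError:
    pass
```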



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1853) [Plasma] Fix off-by-one error in retry processing

2017-11-24 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1853:
-

 Summary: [Plasma] Fix off-by-one error in retry processing
 Key: ARROW-1853
 URL: https://issues.apache.org/jira/browse/ARROW-1853
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz
Priority: Minor
 Fix For: 0.8.0


When a user constructs a plasma client that should not perform a single retry 
by passing num_retries = 0, nothing happens due to an off-by-one error in the 
retry processing.
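The shape of the bug and of the fix, sketched in Python (the real code is C++; names are illustrative): if the loop is bounded by num_retries alone, num_retries = 0 performs zero attempts instead of one.

```python
def fetch_with_retries(fetch, num_retries):
    # num_retries counts retries *in addition to* the first attempt,
    # so the total attempt count must be num_retries + 1; a loop bounded
    # by num_retries alone makes zero attempts when num_retries == 0.
    for _ in range(num_retries + 1):
        result = fetch()
        if result is not None:
            return result
    return None

attempts = []
def fetch():
    attempts.append(1)
    return None

fetch_with_retries(fetch, num_retries=0)
assert len(attempts) == 1   # exactly one attempt, no retries
```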



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1758) [Python] Remove pickle=True option for object serialization

2017-10-31 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1758:
-

 Summary: [Python] Remove pickle=True option for object 
serialization
 Key: ARROW-1758
 URL: https://issues.apache.org/jira/browse/ARROW-1758
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


As pointed out in 
https://github.com/apache/arrow/pull/1272#issuecomment-340738439, we don't 
really need this option; it can already be done with pickle.dumps as the custom 
serializer and pickle.loads as the deserializer.

This has the additional benefit that it will be very clear to the user which 
pickler will be used and the user can use a custom pickler easily.
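What "use pickle as the custom serializer" means, in a standalone sketch (no pyarrow involved; the registration mechanics are elided):

```python
import pickle

# Equivalent of pickle=True: register pickle.dumps / pickle.loads as the
# custom serializer and deserializer for a type.
custom_serializer = pickle.dumps
custom_deserializer = pickle.loads

obj = {"x": 42, "payload": [1, 2, 3]}
restored = custom_deserializer(custom_serializer(obj))
assert restored == obj  # faithful round trip through the chosen pickler
```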



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1744) [Plasma] Provide TensorFlow operator to read tensors from plasma

2017-10-28 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1744:
-

 Summary: [Plasma] Provide TensorFlow operator to read tensors from 
plasma
 Key: ARROW-1744
 URL: https://issues.apache.org/jira/browse/ARROW-1744
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz


see https://www.tensorflow.org/extend/adding_an_op



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1701) [Serialization] Support zero copy PyTorch Tensor serialization

2017-10-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1701:
-

 Summary: [Serialization] Support zero copy PyTorch Tensor 
serialization
 Key: ARROW-1701
 URL: https://issues.apache.org/jira/browse/ARROW-1701
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


see http://pytorch.org/docs/master/tensors.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1695) [Serialization] Fix reference counting of numpy arrays created in custom serializer

2017-10-20 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1695:
-

 Summary: [Serialization] Fix reference counting of numpy arrays 
created in custom serializer
 Key: ARROW-1695
 URL: https://issues.apache.org/jira/browse/ARROW-1695
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.1
Reporter: Philipp Moritz
 Fix For: 0.8.0


The problem happens with the following code:

{code}
import numpy as np
import pyarrow
import sys

class Bar(object):
    pass

def bar_custom_serializer(obj):
    x = np.zeros(4)
    return x

def bar_custom_deserializer(serialized_obj):
    return serialized_obj

pyarrow._default_serialization_context.register_type(
    Bar, "Bar", pickle=False,
    custom_serializer=bar_custom_serializer,
    custom_deserializer=bar_custom_deserializer)

pyarrow.serialize(Bar())
{code}

After execution of pyarrow.serialize, the interpreter crashes in the garbage 
collection routine.

This happens if a numpy array is returned in the custom serializer but there is 
no other reference to the numpy array. The reason this is not a problem in the 
current code is that so far we haven't created new numpy arrays in the custom 
serializer.

I think the problem here is that the numpy array hits reference count zero 
between the end of SerializeSequences in python_to_arrow.cc and the call to 
NdarrayToTensor. I'll push a fix later today, which just increases and 
decreases the reference counts at the appropriate places.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1692) [Python, Java] UnionArray round trip not working

2017-10-19 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1692:
-

 Summary: [Python, Java] UnionArray round trip not working
 Key: ARROW-1692
 URL: https://issues.apache.org/jira/browse/ARROW-1692
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz
 Attachments: union_array.arrow

I'm currently working on making pyarrow.serialization data available from the 
Java side, one problem I was running into is that it seems the Java 
implementation cannot read UnionArrays generated from C++. To make this easily 
reproducible I created a clean Python implementation for creating UnionArrays: 
https://github.com/apache/arrow/pull/1216

The data is generated with the following script:

```
import pyarrow as pa

binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)

batch = pa.RecordBatch.from_arrays([result], ["test"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)

writer.write_batch(batch)

sink.close()

b = sink.get_result()

with open("union_array.arrow", "wb") as f:
    f.write(b)

# Sanity check: Read the batch in again

with open("union_array.arrow", "rb") as f:
    b = f.read()
reader = pa.RecordBatchStreamReader(pa.BufferReader(b))

batch = reader.read_next_batch()

print("union array is", batch.column(0))
```

I attached the file generated by that script. Then when I run the following 
code in Java:

```
RootAllocator allocator = new RootAllocator(10);

ByteArrayInputStream in = new 
ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));

ArrowStreamReader reader = new ArrowStreamReader(in, allocator);

reader.loadNextBatch()
```

I get the following error:

```
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field 
test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can 
not truncate buffer to a larger size 7: 0
|at VectorLoader.loadBuffers (VectorLoader.java:83)
|at VectorLoader.load (VectorLoader.java:62)
|at ArrowReader$1.visit (ArrowReader.java:125)
|at ArrowReader$1.visit (ArrowReader.java:111)
|at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|at ArrowReader.loadNextBatch (ArrowReader.java:137)
|at (#7:1)
```

It seems like Java is not picking up that the UnionArray is Dense instead of 
Sparse. After changing the default in 
java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I 
get this:

```
jshell> reader.getVectorSchemaRoot().getSchema()
$9 ==> Schema
```

but then reading doesn't work:

```
jshell> reader.loadNextBatch()
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field 
list: Union(Dense, [1])<: Struct>>>. error message: can not truncate buffer to a larger size 1: 0
|at VectorLoader.loadBuffers (VectorLoader.java:83)
|at VectorLoader.load (VectorLoader.java:62)
|at ArrowReader$1.visit (ArrowReader.java:125)
|at ArrowReader$1.visit (ArrowReader.java:111)
|at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|at ArrowReader.loadNextBatch (ArrowReader.java:137)
|at (#8:1)
```

Any help with this is appreciated!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1687) [Python] Expose UnionArray to pyarrow

2017-10-18 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1687:
-

 Summary: [Python] Expose UnionArray to pyarrow
 Key: ARROW-1687
 URL: https://issues.apache.org/jira/browse/ARROW-1687
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


We should expose UnionArray to Python via pyarrow.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1677) [Blog] Add blog post on Ray and Arrow Python serialization

2017-10-16 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1677:
-

 Summary: [Blog] Add blog post on Ray and Arrow Python serialization
 Key: ARROW-1677
 URL: https://issues.apache.org/jira/browse/ARROW-1677
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


To give pyarrow.serialization some more exposure and get others involved.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1673) [Python] NumPy boolean arrays get converted to uint8 arrays on NdarrayToTensor roundtrip

2017-10-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1673:
-

 Summary: [Python] NumPy boolean arrays get converted to uint8 
arrays on NdarrayToTensor roundtrip
 Key: ARROW-1673
 URL: https://issues.apache.org/jira/browse/ARROW-1673
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz


see https://github.com/ray-project/ray/issues/1121



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1670) [Serialization] Speed up deserialization code path

2017-10-12 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1670:
-

 Summary: [Serialization] Speed up deserialization code path
 Key: ARROW-1670
 URL: https://issues.apache.org/jira/browse/ARROW-1670
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz
Priority: Minor


At the moment we are using smart pointers for keeping track of UnionArray types 
and values. We can get rid of this overhead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1665) [Serialization] Support more custom datatypes in the default serialization context

2017-10-11 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1665:
-

 Summary: [Serialization] Support more custom datatypes in the 
default serialization context
 Key: ARROW-1665
 URL: https://issues.apache.org/jira/browse/ARROW-1665
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


At the moment, custom types are registered in the tests in an ad-hoc way. 
Instead, they should use the default serialization context introduced in 
ARROW-1503 to make it possible to reuse the same code in other projects.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1630) [Serialization] Support Python datetime objects

2017-10-01 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1630:
-

 Summary: [Serialization] Support Python datetime objects
 Key: ARROW-1630
 URL: https://issues.apache.org/jira/browse/ARROW-1630
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This was brought up in https://github.com/ray-project/ray/issues/1041

It is related but not the same as 
https://issues.apache.org/jira/projects/ARROW/issues/ARROW-1628



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1625) [Serialization] Support OrderedDict properly

2017-09-29 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1625:
-

 Summary: [Serialization] Support OrderedDict properly
 Key: ARROW-1625
 URL: https://issues.apache.org/jira/browse/ARROW-1625
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Philipp Moritz


At the moment, when we serialize an OrderedDict and then deserialize it, it 
becomes a normal dict! This can be reproduced with

{code}
import pyarrow
import collections
d = collections.OrderedDict([("hello", 1), ("world", 2)])
type(pyarrow.serialize(d).deserialize())
{code}

which will return "dict". See also 
https://github.com/ray-project/ray/issues/1034.
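The observable effect, sketched without pyarrow (the serializer stored dicts as generic maps, so the round trip degrades the type while preserving the entries):

```python
import collections

d = collections.OrderedDict([("hello", 1), ("world", 2)])

# Effectively what the round trip did: rebuild the mapping as a plain dict.
roundtripped = dict(d.items())

assert type(roundtripped) is dict   # OrderedDict-ness is lost
assert roundtripped == d            # the key/value pairs survive
```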



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1622) [Plasma] Plasma doesn't compile with XCode 9

2017-09-27 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1622:
-

 Summary: [Plasma] Plasma doesn't compile with XCode 9
 Key: ARROW-1622
 URL: https://issues.apache.org/jira/browse/ARROW-1622
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Philipp Moritz


Compiling the latest arrow with the following flags:

```
cmake -DARROW_PLASMA=on ..
make
```
we get this error:

```
[ 61%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/client.cc.o
In file included from 
/Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/plasma/client.cc:20:
In file included from 
/Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/plasma/client.h:31:
In file included from 
/Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/plasma/common.h:30:
In file included from 
/Users/rliaw/Research/riselab/ray/src/thirdparty/arrow/cpp/src/arrow/util/logging.h:22:
In file included from 
/Library/Developer/CommandLineTools/usr/include/c++/v1/iostream:38:
In file included from 
/Library/Developer/CommandLineTools/usr/include/c++/v1/ios:216:
In file included from 
/Library/Developer/CommandLineTools/usr/include/c++/v1/__locale:18:
In file included from 
/Library/Developer/CommandLineTools/usr/include/c++/v1/mutex:189:
In file included from 
/Library/Developer/CommandLineTools/usr/include/c++/v1/__mutex_base:17:
/Library/Developer/CommandLineTools/usr/include/c++/v1/__threading_support:156:1: error: unknown type name 'mach_port_t'
mach_port_t __libcpp_thread_get_port();
^
/Library/Developer/CommandLineTools/usr/include/c++/v1/__threading_support:300:1: error: unknown type name 'mach_port_t'
mach_port_t __libcpp_thread_get_port() {
^
/Library/Developer/CommandLineTools/usr/include/c++/v1/__threading_support:301:12: error: use of undeclared identifier 'pthread_mach_thread_np'
   return pthread_mach_thread_np(pthread_self());
  ^
3 errors generated.
make[2]: *** [src/plasma/CMakeFiles/plasma_objlib.dir/client.cc.o] Error 1
make[1]: *** [src/plasma/CMakeFiles/plasma_objlib.dir/all] Error 2
make: *** [all] Error 2
```

The problem was discovered and diagnosed in 
https://github.com/apache/arrow/pull/1139



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1480) [Python] Improve performance of serializing sets

2017-09-06 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1480:
-

 Summary: [Python] Improve performance of serializing sets
 Key: ARROW-1480
 URL: https://issues.apache.org/jira/browse/ARROW-1480
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See this:

https://github.com/ray-project/ray/issues/938

There is a PR here which I'll submit:

https://github.com/apache/arrow/compare/master...pcmoritz:serialize-sets

Let me know what you think! Supporting sets natively is good, I think; we may 
also want a good way to support efficient serialization of more general 
iterables without converting them to a list.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1457) [C++] Optimize strided WriteTensor

2017-09-03 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1457:
-

 Summary: [C++] Optimize strided WriteTensor
 Key: ARROW-1457
 URL: https://issues.apache.org/jira/browse/ARROW-1457
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz


At the moment, if we call WriteTensor on a strided Tensor, it will write the 
tensor element by element; this can be optimized by combining multiple 
consecutive writes together.

If there are long stretches of contiguous data, this might even be able to take 
advantage of the multithreaded memory copy we have in the 
FixedSizeBufferWriter.
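The optimization is classic write coalescing: merge runs of element writes whose offsets are contiguous into a single larger write. A hedged sketch of the idea (offsets and lengths in bytes; the function name is invented):

```python
def coalesce_writes(writes):
    """Merge consecutive (offset, length) writes into contiguous spans."""
    merged = []
    for offset, length in sorted(writes):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            merged[-1][1] += length          # extend the previous span
        else:
            merged.append([offset, length])  # start a new span
    return [tuple(span) for span in merged]

# Three element writes, two of them adjacent, become two actual writes.
assert coalesce_writes([(0, 4), (4, 4), (12, 4)]) == [(0, 8), (12, 4)]
```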



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1453) [Python] Implement WriteTensor for non-contiguous tensors

2017-09-02 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1453:
-

 Summary: [Python] Implement WriteTensor for non-contiguous tensors
 Key: ARROW-1453
 URL: https://issues.apache.org/jira/browse/ARROW-1453
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Philipp Moritz
Priority: Minor


This should be implemented:

https://github.com/apache/arrow/blob/5cda6934999f9f79368f3fc3f68895fc0f4e0b24/cpp/src/arrow/ipc/writer.cc#L569

It is needed to support non-contiguous arrays in the Python serialization 
module.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

