[jira] [Created] (ARROW-2459) pyarrow: Segfault with pyarrow.deserialize_pandas
Travis Brady created ARROW-2459:
------------------------------------

Summary: pyarrow: Segfault with pyarrow.deserialize_pandas
Key: ARROW-2459
URL: https://issues.apache.org/jira/browse/ARROW-2459
Project: Apache Arrow
Issue Type: Bug
Components: Python
Environment: OS X, Linux
Reporter: Travis Brady

Following up from [https://github.com/apache/arrow/issues/1884], wherein I found that calling deserialize_pandas in the app.py script of the repo linked below causes the app.py process to segfault. I initially observed this on OS X, but have since confirmed the behavior exists on Linux as well.

Repo containing the example: [https://github.com/travisbrady/sanic-arrow]

And more generally: what is the right way to get a Java-based HTTP microservice to talk to a Python-based HTTP microservice using Arrow as the serialization format? I'm exchanging DataFrame-like objects (pandas.DataFrames on the Python side) between the two services for real-time scoring in a few xgboost models implemented in Python.
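For readers hitting the same question: here is a minimal sketch of the Python half of that round trip, independent of the HTTP layer. It assumes only the public pyarrow serialize_pandas/deserialize_pandas calls; the DataFrame contents and variable names are illustrative, not taken from the linked repo.

{code:python}
# Minimal sketch of the DataFrame round trip using only public pyarrow calls.
# A sanic handler like the one in the linked repo would send `payload` as the
# HTTP response body.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'score': [0.1, 0.9, 0.5]})

# serialize_pandas returns an Arrow Buffer; to_pybytes() yields wire-safe bytes.
payload = pa.serialize_pandas(df).to_pybytes()

# On the receiving side, deserialize back into a pandas.DataFrame.
restored = pa.deserialize_pandas(payload)
assert restored.equals(df)
{code}

Since serialize_pandas writes the standard Arrow record batch stream format, the Java side should be able to produce or consume the same bytes with Arrow's Java stream reader/writer classes.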
[jira] [Created] (ARROW-2458) [Plasma] PlasmaClient uses global variable
Philipp Moritz created ARROW-2458:
------------------------------------

Summary: [Plasma] PlasmaClient uses global variable
Key: ARROW-2458
URL: https://issues.apache.org/jira/browse/ARROW-2458
Project: Apache Arrow
Issue Type: Improvement
Components: Plasma (C++)
Affects Versions: 0.9.0
Reporter: Philipp Moritz

The thread pool {{threadpool_}} that PlasmaClient uses is currently a global variable. This prevents us from using multiple PlasmaClients in the same process (one per thread).
[jira] [Created] (ARROW-2457) garrow_array_builder_append_values() won't work for large arrays
Haralampos Gavriilidis created ARROW-2457:
------------------------------------

Summary: garrow_array_builder_append_values() won't work for large arrays
Key: ARROW-2457
URL: https://issues.apache.org/jira/browse/ARROW-2457
Project: Apache Arrow
Issue Type: Bug
Components: C, C++, GLib
Affects Versions: 0.9.0, 0.8.0
Reporter: Haralampos Gavriilidis

I am using garrow_array_builder_append_values() to transform a native C array into an Arrow array without calling garrow_array_builder_append() multiple times. When calling garrow_array_builder_append_values() in array-builder.cpp with the following signature:

{code:java}
garrow_array_builder_append_values(GArrowArrayBuilder *builder,
                                   const VALUE *values,
                                   gint64 values_length,
                                   const gboolean *is_valids,
                                   gint64 is_valids_length,
                                   GError **error,
                                   const gchar *context)
{code}

it will fail for large arrays. This probably happens because the is_valids array is copied into the valid_bytes array (of a different type), whose memory is allocated on the stack rather than on the heap, as shown in the snippet below:

{code:java}
uint8_t valid_bytes[is_valids_length];
for (gint64 i = 0; i < is_valids_length; ++i) {
  valid_bytes[i] = is_valids[i];
}
{code}

A way to avoid this problem would be to allocate the memory for the valid_bytes array using malloc() or something similar. Is this behavior intended, perhaps because no large arrays should be handed to that function, or is it rather a bug?
[jira] [Created] (ARROW-2456) garrow_array_builder_append_values does not work for large arrays
Haralampos Gavriilidis created ARROW-2456:
------------------------------------

Summary: garrow_array_builder_append_values does not work for large arrays
Key: ARROW-2456
URL: https://issues.apache.org/jira/browse/ARROW-2456
Project: Apache Arrow
Issue Type: Bug
Components: C++, GLib
Reporter: Haralampos Gavriilidis

When calling

{code:java}
garrow_array_builder_append_values(GArrowArrayBuilder *builder,
                                   const VALUE *values,
                                   gint64 values_length,
                                   const gboolean *is_valids,
                                   gint64 is_valids_length,
                                   GError **error,
                                   const gchar *context)
{code}
[jira] [Created] (ARROW-2455) The bytes_allocated_ in CudaContextImpl isn't initialized
Tao He created ARROW-2455:
------------------------------------

Summary: The bytes_allocated_ in CudaContextImpl isn't initialized
Key: ARROW-2455
URL: https://issues.apache.org/jira/browse/ARROW-2455
Project: Apache Arrow
Issue Type: Bug
Components: GPU
Reporter: Tao He

The atomic counter `bytes_allocated_` in `CudaContextImpl` isn't initialized, leading to failure of cuda-test on Windows.
[jira] [Created] (ARROW-2454) [Python] Empty chunked array slice crashes
Antoine Pitrou created ARROW-2454:
------------------------------------

Summary: [Python] Empty chunked array slice crashes
Key: ARROW-2454
URL: https://issues.apache.org/jira/browse/ARROW-2454
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou

{code:python}
>>> col = pa.Column.from_array('ints', pa.array([1,2,3]))
>>> col
chunk 0: [ 1, 2, 3 ]
>>> col.data
>>> col.data[:1]
>>> col.data[:0]
Segmentation fault (core dumped)
{code}
[jira] [Created] (ARROW-2453) [Python] Improve Table column access
Antoine Pitrou created ARROW-2453:
------------------------------------

Summary: [Python] Improve Table column access
Key: ARROW-2453
URL: https://issues.apache.org/jira/browse/ARROW-2453
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou

Suppose you have a table column named "nulls". Right now, to access it on a table, you need to do something like this:

{code:python}
>>> table.column(table.schema.get_field_index('nulls'))
chunk 0: [ NA, NA, NA ]
{code}

Also, if you mistype the column name, instead of getting an error you get an arbitrary column (get_field_index() returns -1 for an unknown name):

{code}
>>> table.column(table.schema.get_field_index('z'))
chunk 0: [ 0, 1, 2 ]
{code}

{{Table.column()}} should accept a string object and return the column with the corresponding name. A {{KeyError}} should be raised if there is no column with such a name.
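Until that lands, a stopgap helper sketching the proposed behavior is easy to write against the current API (the helper name is hypothetical, not part of pyarrow):

{code:python}
# Hypothetical helper sketching the proposed behavior: name-based column
# lookup that raises KeyError instead of silently indexing with the -1
# that Schema.get_field_index returns for unknown names.
import pyarrow as pa

def column_by_name(table, name):
    idx = table.schema.get_field_index(name)
    if idx == -1:
        raise KeyError(name)
    return table.column(idx)

table = pa.Table.from_arrays([pa.array([None, None, None])], ['nulls'])
column_by_name(table, 'nulls')   # returns the "nulls" column

try:
    column_by_name(table, 'z')
except KeyError as exc:
    print('no such column:', exc)
{code}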
Re: Continuous benchmarking setup
Nice! Are the benchmark results published somewhere?

On 13/04/2018 at 02:50, Tom Augspurger wrote:
> https://github.com/TomAugspurger/asv-runner/ is the setup for the projects
> currently running. Adding arrow to
> https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml might
> work. I'll have to redeploy with the update.
>
> From: Wes McKinney
> Sent: Thursday, April 12, 2018 7:24:20 PM
> To: dev@arrow.apache.org
> Subject: Re: Continuous benchmarking setup
>
> hi Antoine,
>
> I have a bare metal machine at home (affectionately known as the
> "pandabox") that's available via SSH that we've been using for
> continuous benchmarking for other projects. Arrow is welcome to use
> it. I can give you access to the machine if you would like. Hopefully,
> we can suitably document the process of setting up a continuous
> benchmarking machine so that if we need to migrate to a new machine,
> it is not too much of a hardship to do so.
>
> Thanks
> Wes
>
> On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou wrote:
>>
>> Hello
>>
>> With the following changes, it seems we might reach the point where
>> we're able to run the Python-based benchmark suite across multiple
>> commits (at least commits made after those changes):
>> https://github.com/apache/arrow/pull/1775
>>
>> To make this truly useful, we would need a dedicated host. Ideally a
>> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
>> If running virtualized, the VM should have dedicated physical CPU cores.
>>
>> That machine would run the benchmarks on a regular basis (perhaps once
>> per night) and publish the results in static HTML form somewhere.
>>
>> (note: nice to have in the future might be access to NVidia hardware,
>> but right now there are no CUDA benchmarks in the Python benchmarks)
>>
>> What should be the procedure here?
>>
>> Regards
>>
>> Antoine.
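For anyone unfamiliar with what such a suite looks like: asv (airspeed velocity) discovers plain Python classes and times every method whose name starts with time_. A minimal hypothetical benchmark, illustrative rather than taken from the PR, looks like this:

{code:python}
# Illustrative asv benchmark; asv calls setup() before timing each
# repetition of the time_* methods.
import numpy as np
import pyarrow as pa

class TimeArrayConversion:
    def setup(self):
        self.data = np.random.randn(1000000)

    def time_array_from_numpy(self):
        # Convert a NumPy array to an Arrow array; asv reports the timing.
        pa.array(self.data)
{code}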
[jira] [Created] (ARROW-2452) [TEST] Spark integration test fails with permission error
Krisztian Szucs created ARROW-2452:
------------------------------------

Summary: [TEST] Spark integration test fails with permission error
Key: ARROW-2452
URL: https://issues.apache.org/jira/browse/ARROW-2452
Project: Apache Arrow
Issue Type: Bug
Reporter: Krisztian Szucs

Running

{code}
arrow/dev/run_docker_compose.sh spark_integration
{code}

fails with:

{code}
Scanning dependencies of target lib
[ 66%] Building CXX object CMakeFiles/lib.dir/lib.cxx.o
[100%] Linking CXX shared module release/lib.so
[100%] Built target lib
-- Finished cmake --build for pyarrow
Bundling includes: release/include
('Moving built C-extension', 'release/lib.so', 'to build path', '/apache-arrow/arrow/python/build/lib.linux-x86_64-2.7/pyarrow/lib.so')
release/_parquet.so
Cython module _parquet failure permitted
release/_orc.so
Cython module _orc failure permitted
release/plasma.so
Cython module plasma failure permitted
running install
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 13] Permission denied: '/home/ubuntu/miniconda/envs/pyarrow-dev/lib/python2.7/site-packages/test-easy-install-1855.write-test'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /home/ubuntu/miniconda/envs/pyarrow-dev/lib/python2.7/site-packages/

Perhaps your account does not have write access to this directory? If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account. If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.
{code}
Re: Pickle data from python
There's already https://issues.apache.org/jira/browse/ARROW-1715

As for pickling Buffers, it's a bit more contentious. Perhaps we should
stick to pickling higher-level types (arrays, batches, etc.).

Regards

Antoine.

On 13/04/2018 at 03:22, Wes McKinney wrote:
> hi Alberto,
>
> If you cannot find a JIRA about pickling RecordBatch objects, could
> you please create one? A patch would be welcome for this; it is
> certainly in scope for the project.
>
> If you encounter any new problems, please open a bug report.
>
> Thanks!
> Wes
>
> On Thu, Apr 12, 2018 at 3:13 PM, ALBERTO Bocchinfuso wrote:
>> Hello,
>>
>> I cannot pickle RecordBatches, Buffers, etc.
>>
>> I found issue 1654 in the issue tracker, which was resolved by pull
>> request 1238. But that fix appears to apply only to the types listed
>> there (schemas, DataTypes, etc.). When I try to pickle Buffers etc.
>> I get exactly the same error reported in that issue.
>> Is support for pickling all pyarrow data types (in particular
>> RecordBatches) on the agenda?
>>
>> Thank you,
>> Alberto
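Until ARROW-1715 is implemented, one workaround is to round-trip a batch through the IPC stream format and pickle the resulting bytes. A sketch, assuming a reasonably recent pyarrow (the exact reader/writer entry points have shifted a little across versions):

{code:python}
# Workaround sketch: serialize a RecordBatch with the IPC stream format,
# pickle the raw bytes, and rebuild the batch on the other side.
import pickle
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['ints'])

# Write the batch into an in-memory buffer using the stream format.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
payload = pickle.dumps(sink.getvalue().to_pybytes())

# Reconstruct the batch from the pickled bytes.
reader = pa.RecordBatchStreamReader(pa.BufferReader(pickle.loads(payload)))
restored = reader.read_next_batch()
assert restored.equals(batch)
{code}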