[jira] [Created] (ARROW-2459) pyarrow: Segfault with pyarrow.deserialize_pandas

2018-04-13 Thread Travis Brady (JIRA)
Travis Brady created ARROW-2459:
---

 Summary: pyarrow: Segfault with pyarrow.deserialize_pandas
 Key: ARROW-2459
 URL: https://issues.apache.org/jira/browse/ARROW-2459
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
 Environment: OS X, Linux
Reporter: Travis Brady


Following up from [https://github.com/apache/arrow/issues/1884], where I found 
that calling deserialize_pandas in the app.py script of the repo linked below 
causes the app.py process to segfault.

I initially observed this on OS X, but have since confirmed that the behavior 
exists on Linux as well.

Repo containing example: [https://github.com/travisbrady/sanic-arrow] 

And more generally: what is the right way to get a Java-based HTTP microservice 
to talk to a Python-based HTTP microservice using Arrow as the serialization 
format? I'm exchanging DataFrame-like objects (pandas.DataFrames on the Python 
side) between the two services for real-time scoring with a few xgboost models 
implemented in Python.
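For reference, a minimal sketch of the Python side of such an exchange, using 
the serialize_pandas/deserialize_pandas pair; the HTTP transport and payload 
framing are assumptions for illustration, not part of this issue:
{code:python}
import pandas as pd
import pyarrow as pa

# Round-trip a DataFrame through the same serialization path the app uses.
df = pd.DataFrame({"id": [1, 2], "score": [0.1, 0.7]})

buf = pa.serialize_pandas(df)   # pyarrow.Buffer holding the Arrow IPC payload
payload = buf.to_pybytes()      # raw bytes, e.g. an HTTP request/response body

# On the receiving service: rebuild the DataFrame from the bytes.
df2 = pa.deserialize_pandas(pa.py_buffer(payload))
assert df.equals(df2)
{code}
Since serialize_pandas produces the standard Arrow stream format, the same 
bytes should be readable on the Java side with Arrow's Java stream reader.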





[jira] [Created] (ARROW-2458) [Plasma] PlasmaClient uses global variable

2018-04-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2458:
-

 Summary: [Plasma] PlasmaClient uses global variable
 Key: ARROW-2458
 URL: https://issues.apache.org/jira/browse/ARROW-2458
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.9.0
Reporter: Philipp Moritz


The thread pool {{threadpool_}} that PlasmaClient uses is currently a global 
variable. This prevents us from using multiple PlasmaClients in the same 
process (e.g., one per thread).
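A minimal sketch of the suggested direction; the member type and surrounding 
class layout are assumptions for illustration, not the actual Plasma code:
{code:cpp}
#include <thread>
#include <vector>

class PlasmaClient {
  // ...
 private:
  // Previously a file-scope global shared by every client in the process;
  // as an instance member, each PlasmaClient owns an independent pool, so
  // several clients (e.g., one per thread) no longer contend on one pool.
  std::vector<std::thread> threadpool_;
};
{code}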





[jira] [Created] (ARROW-2457) garrow_array_builder_append_values() won't work for large arrays

2018-04-13 Thread Haralampos Gavriilidis (JIRA)
Haralampos Gavriilidis created ARROW-2457:
-

 Summary: garrow_array_builder_append_values() won't work for large 
arrays
 Key: ARROW-2457
 URL: https://issues.apache.org/jira/browse/ARROW-2457
 Project: Apache Arrow
  Issue Type: Bug
  Components: C, C++, GLib
Affects Versions: 0.9.0, 0.8.0
Reporter: Haralampos Gavriilidis


I am using garrow_array_builder_append_values() to transform a native C array 
into an Arrow array without calling arrow_array_builder_append() multiple 
times. When calling garrow_array_builder_append_values() in array-builder.cpp 
with the following signature:
{code:java}
garrow_array_builder_append_values(GArrowArrayBuilder *builder,
                                   const VALUE *values,
                                   gint64 values_length,
                                   const gboolean *is_valids,
                                   gint64 is_valids_length,
                                   GError **error,
                                   const gchar *context)
{code}
it will fail for large arrays. This is probably because the is_valids array is 
copied into the valid_bytes array (which has a different element type), and the 
memory for valid_bytes is allocated on the stack rather than on the heap, as 
shown in the snippet below:
{code:java}
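// is_valids_length bytes are reserved on the stack (a variable-length
// array), which overflows the stack for large arrays: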
uint8_t valid_bytes[is_valids_length];
for (gint64 i = 0; i < is_valids_length; ++i) {
  valid_bytes[i] = is_valids[i];
}
{code}
A way to avoid this problem would be to allocate the memory for the 
valid_bytes array using malloc() or something similar. Is this behavior 
intended, perhaps because large arrays are not supposed to be handed to that 
function, or is it rather a bug?
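One possible shape of the fix, sketched under the assumption that a heap 
allocation is acceptable here (std::unique_ptr stands in for a raw malloc() to 
keep cleanup automatic):
{code:cpp}
#include <memory>

// Heap-allocate the temporary validity buffer so its size is bounded by
// available memory rather than by the stack.
std::unique_ptr<uint8_t[]> valid_bytes(new uint8_t[is_valids_length]);
for (gint64 i = 0; i < is_valids_length; ++i) {
  valid_bytes[i] = is_valids[i];
}
// ... pass valid_bytes.get() where the stack array was used before ...
{code}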





[jira] [Created] (ARROW-2456) garrow_array_builder_append_values does not work for large arrays

2018-04-13 Thread Haralampos Gavriilidis (JIRA)
Haralampos Gavriilidis created ARROW-2456:
-

 Summary: garrow_array_builder_append_values does not work for 
large arrays
 Key: ARROW-2456
 URL: https://issues.apache.org/jira/browse/ARROW-2456
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, GLib
Reporter: Haralampos Gavriilidis


When calling 
{code:java}
garrow_array_builder_append_values(GArrowArrayBuilder *builder,
                                   const VALUE *values,
                                   gint64 values_length,
                                   const gboolean *is_valids,
                                   gint64 is_valids_length,
                                   GError **error,
                                   const gchar *context)
{code}





[jira] [Created] (ARROW-2455) The bytes_allocated_ in CudaContextImpl isn't initialized

2018-04-13 Thread Tao He (JIRA)
Tao He created ARROW-2455:
-

 Summary: The bytes_allocated_ in CudaContextImpl isn't initialized
 Key: ARROW-2455
 URL: https://issues.apache.org/jira/browse/ARROW-2455
 Project: Apache Arrow
  Issue Type: Bug
  Components: GPU
Reporter: Tao He


The atomic counter {{bytes_allocated_}} in {{CudaContextImpl}} isn't 
initialized, leading to failures of cuda-test on Windows.
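A minimal sketch of the fix; the counter's exact integer width is an 
assumption for illustration:
{code:cpp}
#include <atomic>
#include <cstdint>

class CudaContextImpl {
  // Explicitly initialize to 0: a default-initialized std::atomic holds an
  // indeterminate value, which is what makes cuda-test fail on Windows.
  std::atomic<int64_t> bytes_allocated_{0};
};
{code}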





[jira] [Created] (ARROW-2454) [Python] Empty chunked array slice crashes

2018-04-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2454:
-

 Summary: [Python] Empty chunked array slice crashes
 Key: ARROW-2454
 URL: https://issues.apache.org/jira/browse/ARROW-2454
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


{code:python}
>>> col = pa.Column.from_array('ints', pa.array([1,2,3]))
>>> col

chunk 0: 
[
  1,
  2,
  3
]
>>> col.data

>>> col.data[:1]

>>> col.data[:0]
Segmentation fault (core dumped)
{code}






[jira] [Created] (ARROW-2453) [Python] Improve Table column access

2018-04-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2453:
-

 Summary: [Python] Improve Table column access
 Key: ARROW-2453
 URL: https://issues.apache.org/jira/browse/ARROW-2453
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


Suppose you have a table column named "nulls". Right now, to access it on a 
table, you need to do something like this:
{code:python}
>>> table.column(table.schema.get_field_index('nulls'))

chunk 0: 
[
  NA,
  NA,
  NA
]
{code}

Also, if you mistype the column name, instead of getting an error you get an 
arbitrary column:
{code}
>>> table.column(table.schema.get_field_index('z'))

chunk 0: 
[
  0,
  1,
  2
]
{code}

{{Table.column()}} should accept a string and return the column with the 
corresponding name, raising KeyError if there is no column with such a name.
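A sketch of the proposed usage (not current behavior):
{code:python}
col = table.column('nulls')   # look the column up by name
table.column('z')             # raises KeyError: no column named 'z'
{code}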






Re: Continuous benchmarking setup

2018-04-13 Thread Antoine Pitrou

Nice! Are the benchmark results published somewhere?



On 13/04/2018 at 02:50, Tom Augspurger wrote:
> https://github.com/TomAugspurger/asv-runner/ is the setup for the projects 
> currently running. Adding arrow to  
> https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml might 
> work. I'll have to redeploy with the update.
> 
> 
> From: Wes McKinney 
> Sent: Thursday, April 12, 2018 7:24:20 PM
> To: dev@arrow.apache.org
> Subject: Re: Continuous benchmarking setup
> 
> hi Antoine,
> 
> I have a bare-metal machine at home (affectionately known as the
> "pandabox") that's available via SSH and that we've been using for
> continuous benchmarking of other projects. Arrow is welcome to use
> it. I can give you access to the machine if you would like. Hopefully,
> we can suitably document the process of setting up a continuous benchmarking
> machine so that if we need to migrate to a new machine, it is not too
> much of a hardship to do so.
> 
> Thanks
> Wes
> 
> On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou  wrote:
>>
>> Hello
>>
>> With the following changes, it seems we might reach the point where
>> we're able to run the Python-based benchmark suite across multiple
>> commits (at least the ones not predating those changes):
>> https://github.com/apache/arrow/pull/1775
>>
>> To make this truly useful, we would need a dedicated host.  Ideally a
>> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
>> If running virtualized, the VM should have dedicated physical CPU cores.
>>
>> That machine would run the benchmarks on a regular basis (perhaps once
>> per night) and publish the results in static HTML form somewhere.
>>
>> (note: nice to have in the future might be access to NVidia hardware,
>> but right now there are no CUDA benchmarks in the Python benchmarks)
>>
>> What should be the procedure here?
>>
>> Regards
>>
>> Antoine.
> 


[jira] [Created] (ARROW-2452) [TEST] Spark integration test fails with permission error

2018-04-13 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2452:
--

 Summary: [TEST] Spark integration test fails with permission error
 Key: ARROW-2452
 URL: https://issues.apache.org/jira/browse/ARROW-2452
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Krisztian Szucs


{{arrow/dev/run_docker_compose.sh spark_integration}}

{code}
Scanning dependencies of target lib
[ 66%] Building CXX object CMakeFiles/lib.dir/lib.cxx.o
[100%] Linking CXX shared module release/lib.so
[100%] Built target lib
-- Finished cmake --build for pyarrow
Bundling includes: release/include
('Moving built C-extension', 'release/lib.so', 'to build path', 
'/apache-arrow/arrow/python/build/lib.linux-x86_64-2.7/pyarrow/lib.so')
release/_parquet.so
Cython module _parquet failure permitted
release/_orc.so
Cython module _orc failure permitted
release/plasma.so
Cython module plasma failure permitted
running install
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

[Errno 13] Permission denied: 
'/home/ubuntu/miniconda/envs/pyarrow-dev/lib/python2.7/site-packages/test-easy-install-1855.write-test'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

/home/ubuntu/miniconda/envs/pyarrow-dev/lib/python2.7/site-packages/

Perhaps your account does not have write access to this directory?  If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account.  If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.
{code}





Re: Pickle data from python

2018-04-13 Thread Antoine Pitrou

There's already https://issues.apache.org/jira/browse/ARROW-1715

As for pickling Buffers, it's a bit more contentious.  Perhaps we should
stick to pickling higher-level types (arrays, batches, etc.).

Regards

Antoine.


Le 13/04/2018 à 03:22, Wes McKinney a écrit :
> hi Alberto,
> 
> If you cannot find a JIRA about pickling RecordBatch objects, could
> you please create one? A patch would be welcome for this; it is
> certainly in scope for the project.
> 
> If you encounter any new problems, please open a bug report.
> 
> Thanks!
> Wes
> 
> On Thu, Apr 12, 2018 at 3:13 PM, ALBERTO Bocchinfuso
>  wrote:
>> Hello,
>>
>> I cannot pickle RecordBatches, Buffers etc.
>>
>> I found Issue 1654 in the issue tracker, which was resolved by pull
>> request 1238. But that fix appears to apply only to the types listed there
>> (schemas, DataTypes, etc.): when I try to pickle Buffers and the like, I
>> get exactly the same error reported in that issue.
>> Is support for pickling all of pyarrow's data types (RecordBatches in
>> particular) on the agenda?
>>
>> Thank you,
>> Alberto
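
For illustration, a sketch of a workaround that is possible today with the 
stream IPC API: pickle the IPC bytes of a RecordBatch rather than the batch 
object itself. The names and framing here are illustrative, not a committed 
design for ARROW-1715:
{code:python}
import pickle
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['ints'])

# Pickling the batch directly fails today; pickle the IPC stream bytes instead.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
payload = pickle.dumps(sink.getvalue().to_pybytes())

# ...and rebuild the batch on the other side.
reader = pa.RecordBatchStreamReader(pa.BufferReader(pickle.loads(payload)))
batch2 = reader.read_next_batch()
assert batch.equals(batch2)
{code}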