[jira] [Created] (ARROW-2097) [Python] Suppress valgrind stdout/stderr in Travis CI builds when there are no errors
Wes McKinney created ARROW-2097: --- Summary: [Python] Suppress valgrind stdout/stderr in Travis CI builds when there are no errors Key: ARROW-2097 URL: https://issues.apache.org/jira/browse/ARROW-2097 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney See https://travis-ci.org/apache/arrow/jobs/33265#L7858. It might be nice to have an environment variable so that this can be toggled on or off, for debugging purposes. See also ARROW-1380 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2096) [C++] Turn off Boost_DEBUG to trim build output
Wes McKinney created ARROW-2096: --- Summary: [C++] Turn off Boost_DEBUG to trim build output Key: ARROW-2096 URL: https://issues.apache.org/jira/browse/ARROW-2096 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.9.0

We are setting {{Boost_DEBUG}} in {{ThirdpartyToolchain.cmake}}. This makes our build logs more verbose than necessary. We should explicitly set it to FALSE and leave a comment so that people who are debugging Boost issues can re-enable it to see the logs.
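A minimal sketch of what the change in {{ThirdpartyToolchain.cmake}} could look like (exact surrounding context assumed, not taken from the Arrow source):

```cmake
# Boost_DEBUG makes CMake's FindBoost module extremely chatty.
# Keep it off by default; set to TRUE when debugging Boost detection issues.
set(Boost_DEBUG FALSE)
```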
[jira] [Created] (ARROW-2095) [C++] Suppress ORC EP build logging by default
Wes McKinney created ARROW-2095: --- Summary: [C++] Suppress ORC EP build logging by default Key: ARROW-2095 URL: https://issues.apache.org/jira/browse/ARROW-2095 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.9.0

See build logs: https://travis-ci.org/apache/arrow/jobs/33265#L9569. This logging should be made equivalent to that of the other EP builds (see e.g. the protobuf build preceding ORC).
[jira] [Created] (ARROW-2094) [Python] Use toolchain libraries and PROTOBUF_HOME for protocol buffers
Wes McKinney created ARROW-2094: --- Summary: [Python] Use toolchain libraries and PROTOBUF_HOME for protocol buffers Key: ARROW-2094 URL: https://issues.apache.org/jira/browse/ARROW-2094 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney

This is being built from source in Travis CI at the moment; using a toolchain build could help with build times.

Speaking of which, libprotobuf could use some TLC in conda-forge -- I ran out of bandwidth to do this myself: https://github.com/conda-forge/staged-recipes/pull/3087. [~Max Risuhin] do you have time to look into adding a C++-only conda-forge package? cc [~jim.crist]
Re: [Python] Retrieving a RecordBatch from plasma inside a function
Hey Alberto,

Thanks for your message! I'm trying to reproduce it. Can you attach the code you use to write the batch into the store? Also, can you say which versions of Python and Arrow you are using? On my installation, I get

```
In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in ()
----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))

plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()

ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
```

(the canonical way to do this would be plasma.ObjectID(b"keynumber1keynumber1"))

Best,
Philipp.

On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <alberto_boc...@hotmail.it> wrote:

> Good morning,
>
> I am experiencing problems with the RecordBatches stored in plasma in a
> particular situation.
>
> If I return a RecordBatch as the result of a Python function, I am able to
> read just the metadata, while I get an error when reading the columns.
>
> For example, the following code
>
> def retrieve1():
>     client = plasma.connect('test', "", 0)
>
>     key = "keynumber1keynumber1"
>     pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
>
>     [buff] = client.get_buffers([pid])
>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>     return batch
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> represents a simple Python program in which a function is in charge of
> retrieving the RecordBatch from the plasma store and then returns it to
> the caller. Running the previous example I get:
>
> FIELD1: int32
> metadata
> {}
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> FIELD1: int32
> metadata
> {}
> Segmentation fault (core dumped)
>
> If I retrieve and use the data in the same part of the code (as I do in
> the function retrieve1(); it also works when I put everything in the
> main program), everything runs without problems.
>
> Also, the problem seems to be related to the particular case in which I
> retrieve the RecordBatch from the plasma store, since the following
> (simpler) code:
>
> def create():
>     test1 = [1, 12, 23, 3, 21, 34]
>     test1 = pa.array(test1, pa.int32())
>
>     batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>     print(batch)
>     print(batch.schema)
>     print(batch[0])
>     return batch
>
> batch1 = create()
> print(batch1)
> print(batch1.schema)
> print(batch1[0])
>
> prints:
>
> FIELD1: int32
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> FIELD1: int32
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
>
> which is what I expect.
>
> Is this issue known, or am I doing something wrong when retrieving the
> RecordBatch from plasma?
>
> Also, I would like to point out that this problem was as easy to find as
> it was hard to reproduce. For this reason, there may be other situations
> in which the same problem arises that I have not experienced, since I
> mostly deal with plasma and have only been using Python so far: the
> description I gave is not intended to be complete.
>
> Thank you,
> Alberto
[jira] [Created] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI
Wes McKinney created ARROW-2093: --- Summary: [Python] Possibly do not test pytorch serialization in Travis CI Key: ARROW-2093 URL: https://issues.apache.org/jira/browse/ARROW-2093 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.9.0

I am not sure it is worth downloading ~400MB in binaries

{code}
The following packages will be downloaded:

    package            |  build
    -------------------|------------------------
    libgcc-5.2.0       |  0                          1.1 MB  defaults
    pillow-5.0.0       |  py27_0                     958 KB  conda-forge
    libtiff-4.0.9      |  0                          511 KB  conda-forge
    libtorch-0.1.12    |  nomkl_0                    1.7 MB  defaults
    olefile-0.44       |  py27_0                      50 KB  conda-forge
    torchvision-0.1.9  |  py27hdb88a65_1              86 KB  soumith
    openblas-0.2.19    |  2                         14.1 MB  conda-forge
    numpy-1.13.1       |  py27_blas_openblas_200     8.4 MB  conda-forge
    pytorch-0.2.0      |  py27ha262b23_4cu75       312.2 MB  soumith
    mkl-2017.0.3       |  0                        129.5 MB  defaults

    Total:                                        468.6 MB
{code}

Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561
Re: Delta dictionaries: implementation
hi Dimitri,

No one is working on it yet in C++, nor have we worked on any API design sketches. I think there may be some work in JavaScript. Please feel free to open some JIRAs and propose APIs / behavior, or work on an implementation.

Thanks,
Wes

On Mon, Feb 5, 2018 at 11:37 AM, Dimitri Vorona wrote:

> Hi,
>
> ARROW-1727 added format support for delta dictionaries. It makes it
> possible to interleave record batches that contain a dictionary-encoded
> field with delta dictionary batches that add new dictionary entries.
>
> As far as I can see, there is no implementation of this feature in C++
> yet. Is anyone working on it right now? Are there any ideas what the API
> should look like?
>
> Cheers,
> Dimitri.
[jira] [Created] (ARROW-2092) [Python] Enhance benchmark suite
Antoine Pitrou created ARROW-2092: - Summary: [Python] Enhance benchmark suite Key: ARROW-2092 URL: https://issues.apache.org/jira/browse/ARROW-2092 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.8.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou

We need to test more operations in the ASV-based benchmarks suite.
[Python] Retrieving a RecordBatch from plasma inside a function
Good morning,

I am experiencing problems with the RecordBatches stored in plasma in a particular situation.

If I return a RecordBatch as the result of a Python function, I am able to read just the metadata, while I get an error when reading the columns.

For example, the following code

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()
    return batch

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])

represents a simple Python program in which a function is in charge of retrieving the RecordBatch from the plasma store and then returns it to the caller. Running the previous example I get:

FIELD1: int32
metadata
{}
[
  1,
  12,
  23,
  3,
  21,
  34
]
FIELD1: int32
metadata
{}
Segmentation fault (core dumped)

If I retrieve and use the data in the same part of the code (as I do in the function retrieve1(); it also works when I put everything in the main program), everything runs without problems.

Also, the problem seems to be related to the particular case in which I retrieve the RecordBatch from the plasma store, since the following (simpler) code:

def create():
    test1 = [1, 12, 23, 3, 21, 34]
    test1 = pa.array(test1, pa.int32())

    batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
    print(batch)
    print(batch.schema)
    print(batch[0])
    return batch

batch1 = create()
print(batch1)
print(batch1.schema)
print(batch1[0])

prints:

FIELD1: int32
[
  1,
  12,
  23,
  3,
  21,
  34
]
FIELD1: int32
[
  1,
  12,
  23,
  3,
  21,
  34
]

which is what I expect.

Is this issue known, or am I doing something wrong when retrieving the RecordBatch from plasma?

Also, I would like to point out that this problem was as easy to find as it was hard to reproduce. For this reason, there may be other situations in which the same problem arises that I have not experienced, since I mostly deal with plasma and have only been using Python so far: the description I gave is not intended to be complete.

Thank you,
Alberto
Delta dictionaries: implementation
Hi,

ARROW-1727 added format support for delta dictionaries. It makes it possible to interleave record batches that contain a dictionary-encoded field with delta dictionary batches that add new dictionary entries.

As far as I can see, there is no implementation of this feature in C++ yet. Is anyone working on it right now? Are there any ideas what the API should look like?

Cheers,
Dimitri.
Spark DataFrame <--> Arrow Roundtrip
Hi all,

I would like to make some changes (updates) to the data stored in Spark data frames, which I get as the results of different queries. Afterwards, I would like to operate with these changed data frames as with normal data frames in Spark, e.g. use them for further transformations.

I would like to use Apache Arrow as an intermediate representation of the data I am going to update. My idea was to call ds.toArrowPayload() and afterwards operate on the RDD, i.e. get the batch for each payload and perform the update operation on the batch.

Question: Can I update individual values of some column vector? Or is it better to rewrite the whole column?

And the final question is how to get all the batches back to Spark, I mean, create a data frame. Can I use the method ArrowConverters.toDataFrame(arrowRDD, ds.schema(), ...) for that? Is it going to work?

Does anybody have any better ideas? Any assistance would be greatly appreciated!

Best,
Michael
[jira] [Created] (ARROW-2091) Interacting with arrow/pyarrow in C++
Jun created ARROW-2091: -- Summary: Interacting with arrow/pyarrow in C++ Key: ARROW-2091 URL: https://issues.apache.org/jira/browse/ARROW-2091 Project: Apache Arrow Issue Type: Improvement Reporter: Jun

I've been searching online for a while but cannot figure out how to do this. Please help if this is already a resolved issue.

I have a C++/Python application that interacts with arrow/pyarrow. I want to write a C++ API that takes Python objects directly and operates on them in C++.

{code:java}
PyObject* process_table(PyObject* table) {
  // process the arrow table
  std::shared_ptr<arrow::Table> tablePtr = table; // How?
}
{code}

The problem here is: how do I extract the internal std::shared_ptr<arrow::Table> from the PyObject? Unfortunately we are not using Cython in our stack; we operate on PyObject* directly in C++. I can easily do this with numpy arrays:

{code:java}
PyObject* process_array(PyObject* arr) {
  PyArray_Check(arr);
  // process the PyArrayObject directly
  ...
}
{code}

I wonder, is there any way to achieve this level of C++ integration without using Cython? Thanks!