[jira] [Created] (ARROW-2097) [Python] Suppress valgrind stdout/stderr in Travis CI builds when there are no errors

2018-02-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2097:
---

 Summary: [Python] Suppress valgrind stdout/stderr in Travis CI 
builds when there are no errors
 Key: ARROW-2097
 URL: https://issues.apache.org/jira/browse/ARROW-2097
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


See https://travis-ci.org/apache/arrow/jobs/33265#L7858. It might be nice 
to have an environment variable so that this can be toggled on or off for 
debugging purposes. See also ARROW-1380.
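
For illustration, a minimal sketch of such a toggle in a Python driver script; the command line and the variable name {{PLASMA_VALGRIND_VERBOSE}} are hypothetical:

{code}
import os
import subprocess
import sys

# Run the tests under valgrind, capturing output. Only echo the output when
# valgrind reports errors or the (hypothetical) verbosity toggle is set.
verbose = os.environ.get("PLASMA_VALGRIND_VERBOSE", "0") == "1"
proc = subprocess.run(
    ["valgrind", "--error-exitcode=1", "python", "-m", "pytest", "pyarrow"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if proc.returncode != 0 or verbose:
    sys.stdout.buffer.write(proc.stdout)
    sys.stderr.buffer.write(proc.stderr)
sys.exit(proc.returncode)
{code}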





[jira] [Created] (ARROW-2096) [C++] Turn off Boost_DEBUG to trim build output

2018-02-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2096:
---

 Summary: [C++] Turn off Boost_DEBUG to trim build output
 Key: ARROW-2096
 URL: https://issues.apache.org/jira/browse/ARROW-2096
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


We are setting {{Boost_DEBUG}} in {{ThirdpartyToolchain.cmake}}. This makes our 
build logs more verbose than necessary. We should explicitly set it to FALSE 
and leave a comment so that people who are debugging Boost issues can re-enable 
it to see the logs.





[jira] [Created] (ARROW-2095) [C++] Suppress ORC EP build logging by default

2018-02-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2095:
---

 Summary: [C++] Suppress ORC EP build logging by default
 Key: ARROW-2095
 URL: https://issues.apache.org/jira/browse/ARROW-2095
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


See build logs: https://travis-ci.org/apache/arrow/jobs/33265#L9569. This 
logging should be made equivalent to that of the other EP builds (see e.g. the 
protobuf build preceding ORC).





[jira] [Created] (ARROW-2094) [Python] Use toolchain libraries and PROTOBUF_HOME for protocol buffers

2018-02-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2094:
---

 Summary: [Python] Use toolchain libraries and PROTOBUF_HOME for 
protocol buffers
 Key: ARROW-2094
 URL: https://issues.apache.org/jira/browse/ARROW-2094
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


This is being built from source in Travis CI at the moment; using a toolchain 
build could help with build times.

Speaking of which, libprotobuf could use some TLC in conda-forge -- I ran out 
of bandwidth to do this myself: 
https://github.com/conda-forge/staged-recipes/pull/3087. [~Max Risuhin], do you 
have time to look into adding a C++-only conda-forge package?

cc [~jim.crist]





Re: [Python] Retrieving a RecordBatch from plasma inside a function

2018-02-05 Thread Philipp Moritz
Hey Alberto,

Thanks for your message! I'm trying to reproduce it.

Can you attach the code you use to write the batch into the store?

Also can you say which version of Python and Arrow you are using? On my
installation, I get

```

In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-...> in <module>()
----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))

plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()

ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
```

(The canonical way to do this would be plasma.ObjectID(b"keynumber1keynumber1").)

Best,
Philipp.

On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
alberto_boc...@hotmail.it> wrote:

> Good morning,
>
> I am experiencing problems with the RecordBatches stored in plasma in a
> particular situation.
>
> If I return a RecordBatch as the result of a Python function, I am able to
> read just the metadata, while I get an error when reading the columns.
>
> For example, the following code:
> def retrieve1():
>     client = plasma.connect('test', "", 0)
>
>     key = "keynumber1keynumber1"
>     pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
>
>     [buff] = client.get_buffers([pid])
>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>     return batch
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> The code above represents a simple Python program in which a function is in
> charge of retrieving the RecordBatch from the plasma store and returning it
> to the caller. Running the previous example I get:
> 
> FIELD1: int32
> metadata
> 
> {}
> 
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> 
> FIELD1: int32
> metadata
> 
> {}
> Segmentation fault (core dumped)
>
>
> If I retrieve and use the data in the same part of the code (as I do in
> the function retrieve1(); it also works when I put everything in the
> main program), everything runs without problems.
>
> Also, the problem seems to be specific to the case in which I
> retrieve the RecordBatch from the plasma store, since the following
> (simpler) code:
> def create():
>     test1 = [1, 12, 23, 3, 21, 34]
>     test1 = pa.array(test1, pa.int32())
>
>     batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>     print(batch)
>     print(batch.schema)
>     print(batch[0])
>     return batch
>
> batch1 = create()
> print(batch1)
> print(batch1.schema)
> print(batch1[0])
>
> Prints:
>
> 
> FIELD1: int32
> 
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> 
> FIELD1: int32
> 
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
>
> Which is what I expect.
>
> Is this issue known or am I doing something wrong when retrieving the
> RecordBatch from plasma?
>
> I would also like to point out that this problem was as easy to find as
> it was hard to reproduce. For this reason, there may be other situations
> in which the same problem arises that I have not experienced, since I
> mostly deal with plasma and have only been using Python so far: the
> description I gave is not intended to be complete.
>
> Thank you,
> Alberto
>


[jira] [Created] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2093:
---

 Summary: [Python] Possibly do not test pytorch serialization in 
Travis CI
 Key: ARROW-2093
 URL: https://issues.apache.org/jira/browse/ARROW-2093
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


I am not sure it is worth downloading ~400MB in binaries:

{code}
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libgcc-5.2.0               |                0         1.1 MB  defaults
    pillow-5.0.0               |           py27_0         958 KB  conda-forge
    libtiff-4.0.9              |                0         511 KB  conda-forge
    libtorch-0.1.12            |          nomkl_0         1.7 MB  defaults
    olefile-0.44               |           py27_0          50 KB  conda-forge
    torchvision-0.1.9          |   py27hdb88a65_1          86 KB  soumith
    openblas-0.2.19            |                2        14.1 MB  conda-forge
    numpy-1.13.1               |py27_blas_openblas_200     8.4 MB  conda-forge
    pytorch-0.2.0              |py27ha262b23_4cu75       312.2 MB  soumith
    mkl-2017.0.3               |                0       129.5 MB  defaults

                                           Total:       468.6 MB
{code}

Follow-up from ARROW-2071: https://github.com/apache/arrow/pull/1561





Re: Delta dictionaries: implementation

2018-02-05 Thread Wes McKinney
Hi Dimitri,

No one is working on it yet in C++, nor have we worked on any API
design sketches. I think there may be some work in JavaScript.

Please feel free to open some JIRAs and propose APIs / behavior or
work on an implementation.

Thanks,
Wes

On Mon, Feb 5, 2018 at 11:37 AM, Dimitri Vorona  wrote:
> Hi,
>
> ARROW-1727 added format support for delta dictionaries. It makes it
> possible to interleave record batches that contain dictionary-encoded
> fields with delta dictionary batches that add new dictionary entries.
>
> As far as I can see, there is no implementation of this feature in C++
> yet. Is anyone working on it right now? Are there any ideas about what
> the API should look like?
>
> Cheers,
> Dimitri.


[jira] [Created] (ARROW-2092) [Python] Enhance benchmark suite

2018-02-05 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2092:
-

 Summary: [Python] Enhance benchmark suite
 Key: ARROW-2092
 URL: https://issues.apache.org/jira/browse/ARROW-2092
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


We need to test more operations in the ASV-based benchmark suite.
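
For reference, a minimal sketch of what an added benchmark could look like; the class and method names are hypothetical, following the ASV convention of a {{setup}} method plus {{time_*}} methods:

{code}
import pyarrow as pa

class ConvertPyList(object):
    # Hypothetical ASV benchmark: converting a Python list to an Arrow array.

    def setup(self):
        self.data = list(range(100000))

    def time_python_list_to_array(self):
        pa.array(self.data, type=pa.int64())
{code}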





[Python] Retrieving a RecordBatch from plasma inside a function

2018-02-05 Thread ALBERTO Bocchinfuso
Good morning,

I am experiencing problems with the RecordBatches stored in plasma in a 
particular situation.

If I return a RecordBatch as the result of a Python function, I am able to read 
just the metadata, while I get an error when reading the columns.

For example, the following code:
def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()
    return batch

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])

The code above represents a simple Python program in which a function is in 
charge of retrieving the RecordBatch from the plasma store and returning it to 
the caller. Running the previous example I get:

FIELD1: int32
metadata

{}

[
  1,
  12,
  23,
  3,
  21,
  34
]

FIELD1: int32
metadata

{}
Segmentation fault (core dumped)


If I retrieve and use the data in the same part of the code (as I do in the 
function retrieve1(); it also works when I put everything in the main 
program), everything runs without problems.
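
A possible workaround, assuming (and this is only an assumption, not something verified here) that the crash comes from the plasma buffer and client being released when retrieve1() returns while the batch still points into that memory, is to keep references to them alive alongside the batch:

```
def retrieve1_keepalive():
    client = plasma.connect('test', "", 0)
    pid = plasma.ObjectID(bytearray("keynumber1keynumber1", 'UTF-8'))
    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()
    # Assumption: returning the buffer and client too keeps the memory
    # backing the zero-copy batch valid while the batch is still in use.
    return batch, buff, client
```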

Also, the problem seems to be specific to the case in which I retrieve 
the RecordBatch from the plasma store, since the following (simpler) code:
def create():
    test1 = [1, 12, 23, 3, 21, 34]
    test1 = pa.array(test1, pa.int32())

    batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
    print(batch)
    print(batch.schema)
    print(batch[0])
    return batch

batch1 = create()
print(batch1)
print(batch1.schema)
print(batch1[0])

Prints:


FIELD1: int32

[
  1,
  12,
  23,
  3,
  21,
  34
]

FIELD1: int32

[
  1,
  12,
  23,
  3,
  21,
  34
]

Which is what I expect.

Is this issue known or am I doing something wrong when retrieving the 
RecordBatch from plasma?

I would also like to point out that this problem was as easy to find as it was 
hard to reproduce. For this reason, there may be other situations in which the 
same problem arises that I have not experienced, since I mostly deal with 
plasma and have only been using Python so far: the description I gave is not 
intended to be complete.

Thank you,
Alberto


Delta dictionaries: implementation

2018-02-05 Thread Dimitri Vorona
Hi,

ARROW-1727 added format support for delta dictionaries. It makes it possible
to interleave record batches that contain dictionary-encoded fields with
delta dictionary batches that add new dictionary entries.
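
To make the idea concrete, here is a plain-Python illustration of the concept (pyarrow does not expose delta dictionaries, so this only mimics the stream semantics):

```
# Stream order: dictionary batch, record batch, delta dictionary batch, ...
dictionary = ["foo", "bar"]        # initial dictionary batch: ids 0 and 1
batch1_indices = [0, 1, 1, 0]      # record batch referencing ids 0..1

delta = ["baz"]                    # delta dictionary batch: appends id 2
dictionary += delta                # the reader extends its dictionary
batch2_indices = [2, 0, 2]         # later batches may reference ids 0..2

decoded = [dictionary[i] for i in batch1_indices + batch2_indices]
# ['foo', 'bar', 'bar', 'foo', 'baz', 'foo', 'baz']
```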

As far as I can see, there is no implementation of this feature in C++
yet. Is anyone working on it right now? Are there any ideas about what the
API should look like?

Cheers,
Dimitri.


Spark DataFrame <--> Arrow Roundtrip

2018-02-05 Thread Michael Shtelma
Hi all,

I would like to make some changes (updates) to the data stored in
Spark data frames, which I get as a result of different queries.
Afterwards, I would like to operate with these changed data frames as
with normal data frames in Spark, e.g. use them for further
transformations.

I would like to use Apache Arrow as an intermediate representation of
the data I am going to update. My idea was to call
ds.toArrowPayload() and afterwards operate on the resulting RDD of
payloads, i.e. get the batch for each payload and perform the update
operation on the batch. Question: can I update individual values for some
column vector? Or is it better to rewrite the whole column?
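
For concreteness, a minimal pyarrow sketch of the rewrite-the-whole-column variant (Arrow arrays are immutable, so an update means building a new array; the field names are made up):

```
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 12, 23], type=pa.int32())], ['FIELD1'])

# Arrow arrays are immutable: "updating" a value means rebuilding the column.
values = batch.column(0).to_pylist()
values[1] = 99                                  # the update
new_col = pa.array(values, type=pa.int32())
new_batch = pa.RecordBatch.from_arrays([new_col], batch.schema.names)
```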

And the final question is how to get all the batches back into Spark, I
mean create a data frame from them. Can I use the method
ArrowConverters.toDataFrame(arrowRDD, ds.schema(), ...) for that?

Is it going to work? Does anybody have any better ideas?
Any assistance would be greatly appreciated!

Best,
Michael


[jira] [Created] (ARROW-2091) Interacting with arrow/pyarrow in C++

2018-02-05 Thread Jun (JIRA)
Jun created ARROW-2091:
--

 Summary: Interacting with arrow/pyarrow in C++
 Key: ARROW-2091
 URL: https://issues.apache.org/jira/browse/ARROW-2091
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jun


I've been searching online for a while but cannot figure out how to do this. 
Please help if this is already a resolved issue.

I have a C++/Python application that interacts with arrow/pyarrow. I want to 
write a C++ API that takes Python objects directly and operates on them in C++.
{code:java}
PyObject* process_table(PyObject* table)
{
    // process the arrow table
    std::shared_ptr<arrow::Table> tablePtr = table; // How?
}{code}
The problem here is: how do I extract the internal std::shared_ptr<arrow::Table> 
from the PyObject?

Unfortunately we are not using Cython in our stack; we operate on PyObject* 
directly in C++.

I can easily do this on numpy arrays:
{code:java}
PyObject* process_array(PyObject* arr)
{
    PyArray_Check(arr);
    // process the PyArrayObject directly
    ...
}{code}

I wonder whether there is any way to achieve this level of C++ integration 
without using Cython. Thanks!


