[jira] [Created] (ARROW-2866) [Plasma] TensorFlow op: Investigate outputting multiple output Tensors for the reading op

2018-07-16 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2866:
-

 Summary: [Plasma] TensorFlow op: Investigate outputting multiple 
output Tensors for the reading op
 Key: ARROW-2866
 URL: https://issues.apache.org/jira/browse/ARROW-2866
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


see discussion in 
https://github.com/apache/arrow/pull/2104#discussion_r197308266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Passing Arrow object across language

2018-07-16 Thread 周宇睿(闻拙)
Hi Wes:

Thank you for the response. Yes the examples you provided are very helpful. 

But I still have a question regarding memory management. Let’s say I passed 
memory addresses from C++ to the JVM and constructed the data structure in Java. 
Since this is off-heap memory, how can I make sure the memory will be 
released when necessary?

thanks
Yurui

from Alimail macOS
 --Original Mail--
Sender: Wes McKinney 
Send Date: Tue Jul 17 02:09:51 2018
Recipients: 
Subject:Re: Passing Arrow object across language
I discussed some of these things at a high level in my talk at SciPy
2018 last week

https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919

On Mon, Jul 16, 2018 at 2:08 PM, Wes McKinney  wrote:
> hi Yurui,
>
> You can also share data structures through JNI without using the IPC
> tools at all, which could require memory copying to produce the IPC
> messages.
>
> What you can do is obtain the memory addresses for the component
> buffers of an array (or vector, as called in Java) and construct the
> data structure from the memory addresses on the other side. We are
> doing exactly this already in Python using JPype (which is JNI-based):
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/jvm.py
>
> The Gandiva project uses JNI to pass Java Netty buffer memory
> addresses to C++, you can see the code for creating the arrays from
> the memory addresses and then constructing a RecordBatch:
>
> https://github.com/dremio/gandiva/blob/master/cpp/src/jni/native_builder.cc#L602
>
> I believe as time goes on we will have better and more standardized
> APIs to deal with JNI<->C++ zero-copy passing, these implementations
> have only been done relatively recently. Your contributions to the
> Arrow project around this would be most welcomed!
>
> Thanks,
> Wes
>
> On Mon, Jul 16, 2018 at 2:00 PM, Philipp Moritz  wrote:
>> Hey Yuri,
>>
>> you can use the Arrow IPC mechanism to do this:
>>
>> - https://github.com/apache/arrow/blob/master/format/IPC.md
>> - Python: https://arrow.apache.org/docs/python/ipc.html
>> - C++: https://arrow.apache.org/docs/cpp/namespacearrow_1_1ipc.html
>> - For Java, see the org.apache.arrow.vector.ipc namespace
>>
>> On the C++ side, you can for example use a RecordBatchStreamWriter to write
>> the IPC message, and then on the Java side you could use the
>> ArrowStreamReader to read it.
>>
>> There are some tests here:
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc
>> https://github.com/apache/arrow/tree/master/java/vector/src/test/java/org/apache/arrow/vector/ipc
>>
>> There are also integration tests here, although I'm not really familiar with
>> them:
>>
>> https://github.com/apache/arrow/tree/master/integration
>>
>> If you could write a little tutorial/intro on how to do this (maybe using
>> Plasma for exchanging the data) and contribute it to the documentation,
>> that would be amazing!
>>
>> Best,
>> Philipp.
>>
>> On Mon, Jul 16, 2018 at 4:14 AM, 周宇睿(闻拙)  wrote:
>>
>>> Hi guys:
>>>
>>> I might be missing something quite obvious, but how does Arrow pass objects
>>> across languages? Let’s say I have a Java program that invokes a C++ function
>>> via JNI; how does the C++ function pass an Arrow RecordBatch object back to
>>> the Java runtime without memory copies?
>>>
>>> Any advice would be appreciated.
>>> Thanks
>>> Yurui
>>>
>>> from Alimail macOS

[jira] [Created] (ARROW-2865) [C++/Python] Reduce some duplicated code in python/builtin_convert.cc

2018-07-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2865:
---

 Summary: [C++/Python] Reduce some duplicated code in 
python/builtin_convert.cc
 Key: ARROW-2865
 URL: https://issues.apache.org/jira/browse/ARROW-2865
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


See discussion in https://github.com/apache/arrow/pull/2270



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2863) [Python] Add context manager APIs to RecordBatch*Writer/Reader classes

2018-07-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2863:
---

 Summary: [Python] Add context manager APIs to 
RecordBatch*Writer/Reader classes
 Key: ARROW-2863
 URL: https://issues.apache.org/jira/browse/ARROW-2863
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


This would cause the {{close}} method to be called when the scope exits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2864) Add deletion cache to delete objects later

2018-07-16 Thread Yuhong Guo (JIRA)
Yuhong Guo created ARROW-2864:
-

 Summary: Add deletion cache to delete objects later
 Key: ARROW-2864
 URL: https://issues.apache.org/jira/browse/ARROW-2864
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Yuhong Guo
Assignee: Yuhong Guo


Currently, the Delete function will skip the objects that are in use. If you 
want to delete the objects later, you need to call the Delete function again. 
We need to guarantee that the objects will be deleted when they are not in use 
later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2862) [C++] `wget -c` doesn't work when using thirdparty/download_thirdparty.sh for the first time

2018-07-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2862:
---

 Summary: [C++] `wget -c` doesn't work when using 
thirdparty/download_thirdparty.sh for the first time 
 Key: ARROW-2862
 URL: https://issues.apache.org/jira/browse/ARROW-2862
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.10.0


Missed this when I was working on the offline build. Is there a way to use 
{{wget -c}} that works when there is no file there yet? 
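
One possible workaround (an editor's sketch, untested against the actual build scripts; `download_resumable` is a hypothetical helper, not a function in thirdparty/download_thirdparty.sh) is to pass {{-c}} only when a partial file already exists:

```shell
# Sketch: only use wget's resume flag when there is a partial file to
# resume; otherwise do a plain download.
download_resumable() {
  url="$1"
  out="$2"
  if [ -f "$out" ]; then
    wget -c -O "$out" "$url"
  else
    wget -O "$out" "$url"
  fi
}
```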



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2861) [Python] Add extra tips about using Parquet to store index-less pandas data

2018-07-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2861:
---

 Summary: [Python] Add extra tips about using Parquet to store 
index-less pandas data
 Key: ARROW-2861
 URL: https://issues.apache.org/jira/browse/ARROW-2861
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.10.0


See https://github.com/apache/arrow/pull/2248



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Pyarrow Plasma client.release() fault

2018-07-16 Thread Wes McKinney
Seems like we might want to write down some best practices for this
level of large-scale usage, essentially a supercomputer-like rig. I
wouldn't even know where to come by a machine with >2TB of memory
for scalability / concurrency load testing

On Mon, Jul 16, 2018 at 2:59 PM, Robert Nishihara
 wrote:
> Are you using the same plasma client from all of the different threads? If
> so, that could cause race conditions as the client is not thread safe.
>
> Alternatively, if you have a separate plasma client for each thread, then
> you may be running out of file descriptors somewhere (either the client
> process or the store).
>
> Can you check if the object store is evicting objects (it prints something to
> stdout/stderr when this happens)? Could you be running out of memory but
> failing to release the objects?
>
> On Tue, Jul 10, 2018 at 9:48 AM Corey Nolet  wrote:
>
>> Update:
>>
>> I'm investigating the possibility that I've reached the overcommit limit in
>> the kernel as a result of all the parallel processes.
>>
>> This still doesn't fix the client.release() problem but it might explain
>> why the processing appears to halt, after some time, until I restart the
>> Jupyter kernel.
>>
>> On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet  wrote:
>>
>> > Wes,
>> >
>> > Unfortunately, my code is on a separate network. I'll try to explain what
>> > I'm doing and if you need further detail, I can certainly pseudocode
>> > specifics.
>> >
>> > I am using multiprocessing.Pool() to fire up a bunch of threads for
>> > different filenames. In each thread, I'm performing a pd.read_csv(),
>> > sorting by the timestamp field (rounded to the day) and chunking the
>> > Dataframe into separate Dataframes. I create a new Plasma ObjectID for
>> each
>> > of the chunked Dataframes, convert them to RecordBuffer objects, stream
>> the
>> > bytes to Plasma and seal the objects. Only the objectIDs are returned to
>> > the orchestration thread.
>> >
>> > In follow-on processing, I'm combining the ObjectIDs for each of the
>> > unique day timestamps into lists and I'm passing those into a function in
>> > parallel using multiprocessing.Pool(). In this function, I'm iterating
>> > through the lists of objectIds, loading them back into Dataframes,
>> > appending them together until their size
>> > is > some predefined threshold, and performing a df.to_parquet().
>> >
>> > The steps in the 2 paragraphs above are performed in a loop, batching up
>> > 500-1k files at a time for each iteration.
>> >
>> > When I run this iteration a few times, it eventually locks up the Plasma
>> > client. With regards to the release() fault, it doesn't seem to matter
>> when
>> > or where I run it (in the orchestration thread or in other threads), it
>> > always seems to crash the Jupyter kernel. I'm thinking I might be using
>> it
>> > wrong, I'm just trying to figure out where and what I'm doing.
>> >
>> > Thanks again!
>> >
>> > On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney 
>> wrote:
>> >
>> >> hi Corey,
>> >>
>> >> Can you provide the code (or a simplified version thereof) that shows
>> >> how you're using Plasma?
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet 
>> wrote:
>> >> > I'm on a system with 12TB of memory and attempting to use Pyarrow's
>> >> Plasma
>> >> > client to convert a series of CSV files (via Pandas) into a Parquet
>> >> store.
>> >> >
>> >> > I've got a little over 20k CSV files to process which are about 1-2gb
>> >> each.
>> >> > I'm loading 500 to 1000 files at a time.
>> >> >
>> >> > In each iteration, I'm loading a series of files, partitioning them
>> by a
>> >> > time field into separate dataframes, then writing parquet files in
>> >> > directories for each day.
>> >> >
>> >> > The problem I'm having is that the Plasma client & server appear to
>> >> lock up
>> >> > after about 2-3 iterations. It locks up to the point where I can't
>> even
>> >> > CTRL+C the server. I am able to stop the notebook and re-trying the
>> code
>> >> > just continues to lock up when interacting with Jupyter. There are no
>> >> > errors in my logs to tell me something's wrong.
>> >> >
>> >> > Just to make sure I'm not just being impatient and possibly need to
>> wait
>> >> > for some background services to finish, I allowed the code to run
>> >> overnight
>> >> > and it was still in the same state when I came in to work this
>> morning.
>> >> I'm
>> >> > running the Plasma server with 4TB max.
>> >> >
>> >> > In an attempt to pro-actively free up some of the object ids that I no
>> >> > longer need, I also attempted to use the client.release() function
>> but I
>> >> > cannot seem to figure out how to make this work properly. It crashes
>> my
>> >> > Jupyter kernel each time I try.
>> >> >
>> >> > I'm using Pyarrow 0.9.0
>> >> >
>> >> > Thanks in advance.
>> >>
>> >
>>


[jira] [Created] (ARROW-2860) Null values in a single partition of dataset, results in invalid schema on read

2018-07-16 Thread Sam Oluwalana (JIRA)
Sam Oluwalana created ARROW-2860:


 Summary: Null values in a single partition of dataset, results in 
invalid schema on read
 Key: ARROW-2860
 URL: https://issues.apache.org/jira/browse/ARROW-2860
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Sam Oluwalana


{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

from datetime import datetime, timedelta


def generate_data(event_type, event_id, offset=0):
    """Generate data."""
    now = datetime.utcnow() + timedelta(seconds=offset)
    obj = {
        'event_type': event_type,
        'event_id': event_id,
        'event_date': now.date(),
        'foo': None,
        'bar': u'hello',
    }
    if event_type == 2:
        obj['foo'] = 1
        obj['bar'] = u'world'
    if event_type == 3:
        obj['different'] = u'data'
        obj['bar'] = u'event type 3'
    else:
        obj['different'] = None
    return obj


data = [
    generate_data(1, 1, 1),
    generate_data(1, 1, 3600 * 72),
    generate_data(2, 1, 1),
    generate_data(2, 1, 3600 * 72),
    generate_data(3, 1, 1),
    generate_data(3, 1, 3600 * 72),
]

df = pd.DataFrame.from_records(data, index='event_id')
table = pa.Table.from_pandas(df)

pq.write_to_dataset(table, root_path='/tmp/events',
                    partition_cols=['event_type', 'event_date'])

dataset = pq.ParquetDataset('/tmp/events')
table = dataset.read()
print(table.num_rows)
{code}

Expected output:
{code:python}
6
{code}

Actual:
{code:python}
python example_failure.py
Traceback (most recent call last):
  File "example_failure.py", line 43, in 
dataset = pq.ParquetDataset('/tmp/events')
  File 
"/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
 line 745, in __init__
self.validate_schemas()
  File 
"/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
 line 775, in validate_schemas
dataset_schema))
ValueError: Schema in partition[event_type=2, event_date=0] 
/tmp/events/event_type=3/event_date=2018-07-16 
00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
bar: string
different: string
foo: double
event_id: int64
metadata

{'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
"columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
"numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
"field_name": "different", "name": "different", "numpy_type": "object", 
"pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": 
"foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
"field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
"pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}

vs

bar: string
different: null
foo: double
event_id: int64
metadata

{'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
"columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
"numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
"field_name": "different", "name": "different", "numpy_type": "object", 
"pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": "foo", 
"numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
"field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
"pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
{code}

Apparently what is happening is that pyarrow is interpreting the schema from 
each of the partitions individually, and the partitions for `event_type=3 / 
event_date=*` both have values for the column `different`, whereas the other 
partitions do not. The discrepancy causes the `None` values of the other 
partitions to be labeled with `pandas_type` `empty` instead of `unicode`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2859) [Python] Handle objects exporting the buffer protocol in open_stream, open_file, and RecordBatch*Reader APIs

2018-07-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2859:
---

 Summary: [Python] Handle objects exporting the buffer protocol in 
open_stream, open_file, and RecordBatch*Reader APIs
 Key: ARROW-2859
 URL: https://issues.apache.org/jira/browse/ARROW-2859
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.10.0


I have hit this rough edge several times when doing interactive demos. If we 
can do so safely then this would improve usability



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2858) Add unit tests for crossbow

2018-07-16 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2858:


 Summary: Add unit tests for crossbow
 Key: ARROW-2858
 URL: https://issues.apache.org/jira/browse/ARROW-2858
 Project: Apache Arrow
  Issue Type: Task
Reporter: Phillip Cloud


As this code grows we should start adding unit tests to make sure we can make 
changes safely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Arrow meetup in Hyderabad July 24

2018-07-16 Thread Kelly Stirman
We're organizing a meetup in Hyderabad next week. Would anyone like to give
a talk? Apologies, I know it's a long shot due to location and short notice
(some of our Mountain View team will be visiting our team there, which is
working on Gandiva).

https://www.meetup.com/Apache-Arrow-Meetup/events/252744998/

This will be in HITEC City, so close to lots of engineering teams in case
you have friends in the area you think would be interested.

Thanks,
Kelly


Re: Pyarrow Plasma client.release() fault

2018-07-16 Thread Robert Nishihara
Are you using the same plasma client from all of the different threads? If
so, that could cause race conditions as the client is not thread safe.

Alternatively, if you have a separate plasma client for each thread, then
you may be running out of file descriptors somewhere (either the client
process or the store).

Can you check if the object store is evicting objects (it prints something to
stdout/stderr when this happens)? Could you be running out of memory but
failing to release the objects?



Re: pyarrow read/write schema as json?

2018-07-16 Thread Wes McKinney
hi Patrick,

The JSON representation of schemas wasn't intended as a public API.
Can you use the pyarrow Schema directly? I'm not sure I would advise
using the JSON for building any kind of production software.

Although, I'm not opposed to exposing this functionality in Python
with the clear caveat that the JSON representation is not to be used
for persistence. We have only designed it to be used for integration
testing.

see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json.h
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json-internal.h

I just created https://issues.apache.org/jira/browse/ARROW-2857. If
someone wants to submit a patch I will be happy to take a look.

Thanks
Wes

On Fri, Jul 13, 2018 at 10:17 AM, Patrick Surry  wrote:
> Feels like I’m missing something obvious, but is there an easy way to
> read/write Arrow schema objects as JSON in pyarrow? It looks like the Java API
> has toJSON methods, but I can’t see if/how they’re exposed in the Python API.
>
> wesmckinn (via slack) said: we haven't exposed JSON functionality in Python
> yet afaik.
>
> In the github, it looked like there might be some route via the pyarrow
> jvm, e.g.
> https://github.com/apache/arrow/blob/4481b070c9eca4140aaa3a2470ede920411598a0/python/pyarrow/tests/test_jvm.py#L139
> but import pyarrow.jvm as pa_jvm doesn't work for me either, so now stuck :(
>
> I'm on
>
> >>> pa.__version__
> '0.9.0.post1'
> Hoping to use pyarrow schema as a way to explicitly declare layout of some
> pandas dataframes for validation and maybe type coercion for edge cases
> like a numeric column which is entirely null and gets inferred by pandas as
> a different type.
>
> Thanks,
> Patrick
> --
> Patrick Surry (hopper.com)
> Chief Data Scientist
> (857) 919 1700 | @patricksurry


[jira] [Created] (ARROW-2857) [Python] Expose integration test JSON read/write in Python API

2018-07-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2857:
---

 Summary: [Python] Expose integration test JSON read/write in 
Python API
 Key: ARROW-2857
 URL: https://issues.apache.org/jira/browse/ARROW-2857
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Wes McKinney


This should be clearly marked to not be used for persistence



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Passing Arrow object across language

2018-07-16 Thread Wes McKinney
I discussed some of these things at a high level in my talk at SciPy
2018 last week

https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919



Re: Passing Arrow object across language

2018-07-16 Thread Wes McKinney
hi Yurui,

You can also share data structures through JNI without using the IPC
tools at all, which could require memory copying to produce the IPC
messages.

What you can do is obtain the memory addresses for the component
buffers of an array (or vector, as called in Java) and construct the
data structure from the memory addresses on the other side. We are
doing exactly this already in Python using JPype (which is JNI-based):

https://github.com/apache/arrow/blob/master/python/pyarrow/jvm.py

The Gandiva project uses JNI to pass Java Netty buffer memory
addresses to C++, you can see the code for creating the arrays from
the memory addresses and then constructing a RecordBatch:

https://github.com/dremio/gandiva/blob/master/cpp/src/jni/native_builder.cc#L602

I believe as time goes on we will have better and more standardized
APIs to deal with JNI<->C++ zero-copy passing, these implementations
have only been done relatively recently. Your contributions to the
Arrow project around this would be most welcomed!

Thanks,
Wes



Re: Passing Arrow object across language

2018-07-16 Thread Philipp Moritz
Hey Yuri,

you can use the Arrow IPC mechanism to do this:

- https://github.com/apache/arrow/blob/master/format/IPC.md
- Python: https://arrow.apache.org/docs/python/ipc.html
- C++: https://arrow.apache.org/docs/cpp/namespacearrow_1_1ipc.html
- For Java, see the org.apache.arrow.vector.ipc namespace

On the C++ side, you can for example use a RecordBatchStreamWriter to write
the IPC message, and then on the Java side you could use the
ArrowStreamReader to read it.

There are some tests here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc
https://github.com/apache/arrow/tree/master/java/vector/src/test/java/org/apache/arrow/vector/ipc

There are also integration tests here, although I'm not really familiar with
them:

https://github.com/apache/arrow/tree/master/integration

If you could write a little tutorial/intro on how to do this (maybe using
Plasma for exchanging the data) and contribute it to the documentation,
that would be amazing!

Best,
Philipp.



Re: Proposed Java ArrowStreamReader/MessageReader API Changes

2018-07-16 Thread Bryan Cutler
Thanks for the comments Li.  For your concerns about memory ownership, I
don't think anything is really changed here, but we can discuss further in
the PR.  I'm not sure I quite understand your concern when you say
"complexity of maintaining both style APIs"?  The proposed changes are for
1 coherent API, not 2, and I think it simplifies things for the user and
our codebase.  The holder object is just a simple data struct to allow
returning all of the message info, including the message, to the user. I
would really like to get this API right this time to avoid any more
changes, so suggestions for improvements would be much appreciated.

Thanks,
Bryan

On Sun, Jul 15, 2018 at 12:56 PM, Li Jin  wrote:

> Bryan,
>
> Sorry for the delay. I did a round of review of the API change (high
> level).
>
> I understand the proposed API changes allow users of the Arrow Java
> library to implement an Arrow reader that reads Arrow data into on-heap
> memory directly. However, my main feedback is
> that the proposed API changes introduce a new style of read API - instead
> of returning the Arrow message and data directly from the read method, there
> are new APIs introduced to read the message into a "holder" object. I am a
> little concerned about the complexity of maintaining both style APIs. There
> is another concern about memory ownership, which I have replied to in more
> detail in the PR - the buffer allocator is moved into the message reader
> class itself, instead of being provided by the caller - this could cause
> some confusion about memory ownership.
>
> That being said, I don't feel 100% confident saying +1, +0 or -1 to the
> proposed change alone, and I'd ask that another Java committer review this
> change together with me.
>
> Thank you,
> Li
>
> On Mon, Jul 9, 2018 at 11:50 PM, Li Jin  wrote:
>
> > Bryan,
> >
> > Sorry I am traveling now but I will try to take a look in the next few
> > days.
> >
> > Li
> >
> > On Mon, Jul 9, 2018 at 11:16 PM, Bryan Cutler  wrote:
> >
> >> Hi All,
> >>
> >> I'm proposing some Java API changes to MessageReader, with minor changes
> >> to
> >> ArrowStreamReader and MessageSerializer, as part of ARROW-2704 [1] and
> can
> >> be seen in the PR [2]. These changes are to improve processing an Arrow
> >> stream on a per Message basis.
> >>
> >> A while ago I introduced the MessageReader interface in anticipation of
> >> some future work, but it fell a little short of what I needed. So after
> >> tweaking the APIs a bit, it can now allow the user to implement message
> >> stream processing without knowing specifics about the stream format and
> in
> >> an optimal way that avoids unnecessary buffer copying. If you have some
> >> concerns about these APIs, please discuss in the PR at [2]. If no one
> >> objects, it would be great to get these changes in version 0.10.0.
> >> Thanks!
> >>
> >> Bryan
> >>
> >>
> >> [1]: https://issues.apache.org/jira/browse/ARROW-2704
> >> [2]: https://github.com/apache/arrow/pull/2139
> >>
> >
> >
>


[jira] [Created] (ARROW-2856) [Python/C++] Array constructor should not truncate floats when casting to int

2018-07-16 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-2856:
-

 Summary: [Python/C++] Array constructor should not truncate floats 
when casting to int
 Key: ARROW-2856
 URL: https://issues.apache.org/jira/browse/ARROW-2856
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Florian Jetter


I would expect the following code to raise instead of truncating the float
{code}
In [4]: pa.array([1.9], type=pa.int8())
Out[4]:

[
  1
]
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Passing Arrow object across language

2018-07-16 Thread 周宇睿(闻拙)
Hi guys:

I might be missing something quite obvious, but how does Arrow pass objects 
across languages? Let’s say I have a Java program that invokes a C++ function via 
JNI; how does the C++ function pass an Arrow RecordBatch object back to the Java 
runtime without memory copies?

Any advice would be appreciated.
Thanks
Yurui

from Alimail macOS