[
https://issues.apache.org/jira/browse/ARROW-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429948#comment-15429948
]
Philipp Moritz commented on ARROW-263:
--------------------------------------
Hey Micah,
thanks for your answer!
I got the trick of unlinking the domain socket from here:
https://troydhanson.github.io/network/Unix_domain_sockets.html ("Unlink before
bind"). On Linux and Mac OS it seems to work and prevents leaking of the file.
Note that at some point we need to introduce a named object that can be seen by
all processes to bootstrap the communication between them, and this has
been the least problematic way of doing that I have seen.
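
A minimal sketch of the "unlink before bind" idea, written in Python for brevity. The function name `bind_unix_socket` and the backlog value are illustrative assumptions, not code from the object store:

```python
import os
import socket


def bind_unix_socket(path, backlog=5):
    """Bind a listening Unix domain socket at `path` (hypothetical helper)."""
    # Remove a stale socket file left behind by an earlier run.
    # Without this, bind() fails with EADDRINUSE if the path still exists.
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)       # creates the socket file at `path`
    server.listen(backlog)
    return server
```

Calling the helper twice with the same path succeeds both times, because the second call removes the file created by the first before binding again.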
At the moment I'm also working on a distributed version of the object store
(with a separate process that can be used to ship objects between object stores
on different nodes in a network) and investigating libuv to do it in a
platform-independent way. Libuv is a small dependency and my experience so far
has been pretty enjoyable. It also includes limited functionality to exchange
file descriptors, but this might not work on Windows (see also
https://groups.google.com/forum/#!msg/libuv/0xxXBIGlzLc/H1HbL-igb84J, I haven't
tried it yet).
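
For reference, on POSIX systems file descriptors are usually exchanged over a Unix domain socket as SCM_RIGHTS ancillary data. The sketch below shows that mechanism with Python's `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only). It does not use libuv, and the helper names `send_fd` and `recv_fd` are hypothetical:

```python
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a connected AF_UNIX socket."""
    # At least one byte of ordinary data must accompany the ancillary data.
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock):
    """Receive one file descriptor sent by send_fd()."""
    # recv_fds returns (data, fds, msg_flags, address).
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    return fds[0]
```

The receiver gets a new descriptor number that refers to the same open file as the sender's descriptor.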
Concerning your last comment: The plasma store is a long-running process that
keeps its file descriptor and the data alive. Are page faults still a problem
if data does not need to be reloaded from hard disk?
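
One way a long-running process can keep data alive without leaving a visible file behind is to unlink the backing file right after creating it and hold on to the descriptor and the mapping. The following is a sketch under that assumption; the helper name `create_unlinked_buffer` is hypothetical and the code is not taken from the plasma source:

```python
import mmap
import os
import tempfile


def create_unlinked_buffer(size):
    """Return (fd, mapping) for a shared buffer with no name on disk."""
    fd, path = tempfile.mkstemp()
    # The name disappears immediately; the storage stays alive for as long
    # as the descriptor or a mapping of it remains open.
    os.unlink(path)
    os.ftruncate(fd, size)        # give the file its final size
    buf = mmap.mmap(fd, size)     # shared mapping by default on Unix
    return fd, buf
```

Any process that obtains the descriptor can map the same pages, so writes through one mapping are visible through the other.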
If somebody else has a platform-independent way of achieving some of these
goals, I'd be happy to learn about their ideas.
> Design an initial IPC mechanism for Arrow Vectors
> -------------------------------------------------
>
> Key: ARROW-263
> URL: https://issues.apache.org/jira/browse/ARROW-263
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
>
> Prior discussion on this topic [1].
> Use-cases:
> 1. User-defined function (UDF) execution: One process wants to execute a
> user-defined function written in another language (e.g. Java executing a
> function defined in Python; this involves creating Arrow Arrays in Java,
> sending them to Python and receiving a new set of Arrow Arrays produced in
> Python back in the Java process).
> 2. If a storage system and a query engine are running on the same host, we
> might want to use IPC instead of RPC (e.g. Apache Drill querying Apache Kudu).
> Assumptions:
> 1. The IPC mechanism should be usable from the core set of supported languages
> (Java, Python, C) on POSIX and ideally Windows systems. Ideally, we would
> not need to add dependencies on additional libraries outside of each
> languages outside of this document.
> We want to leverage shared memory for Arrays to avoid doubling RAM requirements
> by duplicating the same Array in different memory locations.
> 2. Under some circumstances shared memory might be more efficient than FIFOs
> or sockets (in other scenarios it won’t be; see the thread below).
> 3. Security is not a concern for V1; we assume all processes running are
> “trusted”.
> Requirements:
> 1. Resource management:
> a. Both processes need a way of allocating memory for Arrow Arrays so
> that data can be passed from one process to another.
> b. There must be a mechanism to clean up unused Arrow Arrays to limit
> resource usage but avoid race conditions when processing arrays.
> 2. Schema negotiation - before sending data, both processes need to agree on
> the schema each one will produce.
> Out of scope requirements:
> 1. IPC channel metadata discovery is out of scope of this document.
> Discovery can be provided by passing appropriate command line arguments,
> configuration files or other mechanisms like RPC (in which case RPC channel
> discovery is still an issue).
> [1]
> http://mail-archives.apache.org/mod_mbox/arrow-dev/201603.mbox/%3c8d5f7e3237b3ed47b84cf187bb17b666148e7...@shsmsx103.ccr.corp.intel.com%3E