I've been under the impression that exposing memory to be shared directly,
rather than copied, WAS in fact the responsibility of Arrow. I read as much
in [1], and it is what turned me on to Arrow in the first place.
[1] http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/

On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jh...@apache.org> wrote:

> This is all very interesting stuff, but just so we're clear: it is not
> Arrow's responsibility to provide an RPC/IPC/LPC mechanism, nor facilities
> for resource management. If we DID decide to make this Arrow's
> responsibility, it would overlap with other components which specialize in
> such stuff.
>
> > On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > @Todd: agree entirely on prototyping design. My goal is to throw out
> > some ideas and some POC code, and then we can explore from there.
> >
> > My main thoughts have initially been around lifecycle management. I've
> > done some work previously where a consistently sized shared buffer using
> > mmap has improved performance. This is more complicated given the
> > requirements for providing collaborative allocation and cross-process
> > reference counts.
> >
> > With regards to whether this is more generally applicable: I think it
> > could ultimately be more general, but I suggest we initially focus on
> > the particular application of moving long-lived Arrow record batches
> > between a producer and a consumer. Constraining the problem means we
> > will get to something workable sooner. We can abstract to a more general
> > solution as other clear requirements emerge.
> >
> > With regards to Cap'n Proto, I believe that when they talk about
> > zero-copy shared memory, they are simply saying that the structure
> > supports it (same as any memory-layout-based design). I don't believe
> > they have actually implemented a protocol and multi-language
> > implementation for zero-copy cross-process communication.
> >
> > One other note to make here is that my goal is not just about
> > performance but also about memory footprint.
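The "consistently sized shared buffer using mmap" that Jacques mentions can
be sketched roughly as follows. This is a minimal illustration, not code
from the thread: the file name and buffer size are invented, and a temp
directory stands in for a tmpfs such as /dev/shm. Two independent mappings
of the same fixed-size file play the roles of the producer and consumer
processes.

```python
import mmap
import os
import tempfile

# Invented name/size; in practice this would live on a tmpfs (/dev/shm).
BUF_SIZE = 4096
path = os.path.join(tempfile.mkdtemp(), "arrow_poc_buffer")

with open(path, "wb") as f:
    f.truncate(BUF_SIZE)          # reserve the full region up front

# "Producer" maps the file and writes a payload in place.
producer_file = open(path, "r+b")
producer = mmap.mmap(producer_file.fileno(), BUF_SIZE)
producer[0:5] = b"hello"

# A second, independent mapping (standing in for another process)
# observes the same pages -- no serialization or copy step involved.
consumer_file = open(path, "r+b")
consumer = mmap.mmap(consumer_file.fileno(), BUF_SIZE)
result = bytes(consumer[0:5])
print(result)                     # b'hello'

for m in (producer, consumer):
    m.close()
for fh in (producer_file, consumer_file):
    fh.close()
os.unlink(path)
```

The fixed size matters: both sides agree on the region length at setup
time, so no resizing coordination is needed afterward.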
> > Being able to have a shared memory protocol allows multiple tools to
> > interact with the same hot dataset.
> >
> > RE: ACLs, for the initial focus, I suggest that we consider the two
> > sharing processes to be "trusted" and expect the initial Arrow API
> > reference implementations to manage memory access.
> >
> > Regarding other questions that Todd threw out:
> >
> > - if you are using an mmapped file in /dev/shm/, how do you make sure it
> > gets cleaned up if the process crashes?
> >
> >>> Agreed that it needs to get resolved. If I recall, destruction can be
> > applied once the associated processes are attached to the memory, and
> > this allows the kernel to recover it once all attaching processes have
> > exited. If this isn't enough, then we may very well need a simple
> > external coordinator.
> >
> > - how do you allocate memory to it? there's nothing ensuring that
> > /dev/shm doesn't swap out if you try to put too much in there, and then
> > your in-memory super-fast access will basically collapse under swap
> > thrashing
> >
> >>> The simplest model initially is probably one where we assume a master
> > and a slave (ideally negotiated on initial connection). The master is
> > responsible for allocating memory and giving it to the slave, and for
> > managing reasonable memory allocation limits just like any other
> > allocator. Slaves that need to allocate memory must ask the master (at
> > whatever chunk size makes sense) and will get rejected if they are too
> > aggressive. (This probably means that at any point an IPC can fall back
> > to RPC??)
> >
> > - how do you do lifecycle management across the two processes? If, say,
> > Kudu wants to pass a block of data to some Python program, how does it
> > know when the Python program is done reading it and it should be
> > deleted? What if the Python program crashed in the middle - when can
> > Kudu release it?
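The crash-cleanup scheme Jacques recalls -- apply destruction once the
processes are attached, and let the kernel reclaim the memory when the last
one exits -- corresponds to unlinking a file while mappings of it are still
live. A hedged sketch, with an invented file name and assuming POSIX
unlink semantics:

```python
import mmap
import os
import tempfile

# Invented name; a temp dir stands in for /dev/shm. POSIX semantics
# assumed: unlinking an open/mapped file only removes the name.
path = os.path.join(tempfile.mkdtemp(), "arrow_poc_unlink")
with open(path, "wb") as f:
    f.truncate(4096)

fh = open(path, "r+b")
buf = mmap.mmap(fh.fileno(), 4096)
os.unlink(path)                 # the name is gone immediately...

name_gone = not os.path.exists(path)
buf[0:4] = b"live"              # ...but every attached mapping still works
data = bytes(buf[0:4])
print(name_gone, data)          # True b'live'

buf.close()
fh.close()
```

Because no name survives the unlink, a crash of either process leaves
nothing stale behind: the kernel frees the pages when the last mapping
disappears.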
> >>> My thinking, as mentioned earlier, is a shared reference count model
> > for complex situations, and possibly a "request/response" ownership
> > model for simpler cases.
> >
> > - how do you do security? If both sides of the connection don't trust
> > each other, and use length prefixes and offsets, you have to be
> > constantly validating and re-validating everything you read.
> >
> >>> I'm suggesting that we start with trust so we don't get too wrapped up
> > in all the extra complexities of security. My experience with these
> > things is that a lot of users will pick performance or footprint over
> > security for quite some time. For example, if I recall correctly, with
> > the shared file descriptor model that was initially implemented in the
> > HDFS client, people used short-circuit reads for years before security
> > was correctly implemented. (Am I remembering this right?)
> >
> > Lastly, as I mentioned above, I don't think there should be any
> > requirement that Arrow communication be limited to only 'IPC'. As Todd
> > points out, in many cases unix domain sockets will be just fine.
> >
> > We need to implement both models because we all know that locality will
> > never be guaranteed. The IPC design/implementation needs to be good for
> > anything to make it into Arrow.
> >
> > thanks
> > Jacques
> >
> >
> > On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <z...@apache.org> wrote:
> >
> >> I have similar concerns as Todd stated below. With an mmap-based
> >> approach, we are treating shared memory objects like files. This brings
> >> in all filesystem-related considerations like ACLs and lifecycle mgmt.
> >>
> >> Stepping back a little, the shared-memory work isn't really specific to
> >> Arrow. A few questions related to this:
> >> 1) Has the topic been discussed in the context of protobuf (or other
> >> IPC protocols) before? Cap'n Proto (https://capnproto.org/) seems to
> >> have zero-copy shared memory.
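The shared reference count model Jacques proposes can be sketched as below.
This is a hypothetical illustration, not a design from the thread: the
counter is a `multiprocessing.Value`, which lives in anonymous shared
memory and works across forked processes, though plain threads stand in for
the producer and consumer here to keep the sketch self-contained.

```python
import threading
from multiprocessing import Value

# One shared counter per buffer; producer + one consumer hold references.
refcount = Value("i", 2)


def release(count):
    """Drop one reference; return True if the buffer can now be freed."""
    with count.get_lock():      # cross-process lock guarding the count
        count.value -= 1
        return count.value == 0


results = []
# The "consumer" finishes with the data and drops its reference.
consumer = threading.Thread(target=lambda: results.append(release(refcount)))
consumer.start()
consumer.join()

# The "producer" drops the last reference and may now reclaim the memory.
can_free = release(refcount)
print(results[0], can_free)     # False True
```

The count itself would live in the shared region's header in a real
implementation, so that either side can observe it after the other crashes
(which is exactly the hard part Todd raises: a crashed holder never calls
release, so a watchdog or kernel-assisted cleanup is still needed).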
> >> I haven't read the implementation details though.
> >> 2) If the shared-memory work benefits a wide range of protocols, should
> >> it be a generalized and standalone library?
> >>
> >> Thanks,
> >> Zhe
> >>
> >> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <t...@cloudera.com> wrote:
> >>
> >>> Having thought about this quite a bit in the past, I think the
> >>> mechanics of how to share memory are by far the easiest part. The much
> >>> harder part is the resource management and ownership. Questions like:
> >>>
> >>> - if you are using an mmapped file in /dev/shm/, how do you make sure
> >>> it gets cleaned up if the process crashes?
> >>> - how do you allocate memory to it? there's nothing ensuring that
> >>> /dev/shm doesn't swap out if you try to put too much in there, and
> >>> then your in-memory super-fast access will basically collapse under
> >>> swap thrashing
> >>> - how do you do lifecycle management across the two processes? If,
> >>> say, Kudu wants to pass a block of data to some Python program, how
> >>> does it know when the Python program is done reading it and it should
> >>> be deleted? What if the Python program crashed in the middle - when
> >>> can Kudu release it?
> >>> - how do you do security? If both sides of the connection don't trust
> >>> each other, and use length prefixes and offsets, you have to be
> >>> constantly validating and re-validating everything you read.
> >>>
> >>> Another big factor is that shared memory is not, in my experience,
> >>> immediately faster than just copying data over a unix domain socket.
> >>> In particular, the first time you read an mmapped file, you'll end up
> >>> paying minor page fault overhead on every page. This can be improved
> >>> with HugePages, but huge page mmaps are not supported yet in current
> >>> Linux (work is going on currently to address this). So you're left
> >>> with hugetlbfs, which involves static allocations and much more pain.
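The minor-fault cost Todd describes is per page: the first access to each
mapped page traps into the kernel. One common mitigation (not proposed in
the thread, just shown for illustration) is to pre-touch every page at
setup time so the faults are paid before the latency-sensitive reads begin.
File name and sizes are invented:

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
SIZE = PAGE * 64                   # invented region size: 64 pages

path = os.path.join(tempfile.mkdtemp(), "arrow_poc_prefault")
with open(path, "wb") as f:
    f.write(b"\x00" * SIZE)

with open(path, "r+b") as f:
    buf = mmap.mmap(f.fileno(), SIZE)

# Touch one byte per page: each access faults its page in exactly once,
# so later reads of the region hit already-resident pages.
touched = sum(buf[off] for off in range(0, SIZE, PAGE))
print(touched)                     # 0 -- the file is zero-filled; the
                                   # useful side effect is the paging

buf.close()
os.unlink(path)
```

On Linux, `mmap.MAP_POPULATE` (where available) achieves the same
pre-faulting in a single call instead of a touch loop.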
> >>>
> >>> All the above is a long way to say: let's make sure we do the right
> >>> prototyping and up-front design before jumping into code.
> >>>
> >>> -Todd
> >>>
> >>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <jacq...@apache.org>
> >>> wrote:
> >>>
> >>>> @Corey
> >>>> The POC Steven and Wes are working on is based on MappedByteBuffer,
> >>>> but I'm looking at using netty's fork of tcnative to use shared
> >>>> memory directly.
> >>>>
> >>>> @Yiannis
> >>>> We need to have both RPC and a shared memory mechanism (what I'm
> >>>> inclined to call IPC, though it is really a specific kind of IPC).
> >>>> The idea is we negotiate via RPC and then, if we determine shared
> >>>> locality, we work over shared memory (preferably for both data and
> >>>> control). So the system interacting with HBase in your example would
> >>>> be the one responsible for placing collocated execution to take
> >>>> advantage of IPC.
> >>>>
> >>>> How do others feel about my redefinition of IPC to mean
> >>>> same-memory-space communication (either via shared memory or RDMA)
> >>>> versus RPC as socket-based communication?
> >>>>
> >>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cjno...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> I was seeing Netty's unsafe classes being used here, not
> >>>>> MappedByteBuffer. Not sure if that statement is completely correct,
> >>>>> but I'll have to dig through the code again to figure that out.
> >>>>>
> >>>>> The more I was looking at Unsafe, the more it makes sense why that
> >>>>> would be used. Apparently it's also supposed to be included in Java
> >>>>> 9 as a first-class API.
> >>>>>
> >>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <w...@cloudera.com> wrote:
> >>>>>
> >>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
> >>>>>> work with memory-mapped files as one way to share memory pages
> >>>>>> between Java (and non-Java) processes without copying.
> >>>>>>
> >>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
> >>>>>> sharing Java-to-Java and Java-to-C++ in the near future. Indeed
> >>>>>> this will have huge implications once we get it working end to end
> >>>>>> (for example, receiving memory from a Java process in Python
> >>>>>> without a heavy ser-de step -- it's what we've always dreamed of)
> >>>>>> and with the metadata and shared memory control flow standardized.
> >>>>>>
> >>>>>> - Wes
> >>>>>>
> >>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cjno...@gmail.com>
> >>>>>> wrote:
> >>>>>>> If I understand correctly, Arrow is using Netty underneath, which
> >>>>>>> is using Sun's Unsafe API to allocate direct byte buffers off
> >>>>>>> heap. It is using Netty to communicate between "client" and
> >>>>>>> "server" information about memory addresses for data that is
> >>>>>>> being requested.
> >>>>>>>
> >>>>>>> I've never attempted to use the Unsafe API to access off-heap
> >>>>>>> memory that has been allocated in one JVM from another JVM, but
> >>>>>>> I'm assuming this must be the case in order to claim that the
> >>>>>>> memory is being accessed directly without being copied, correct?
> >>>>>>>
> >>>>>>> The implication here is huge. If the memory is being directly
> >>>>>>> shared across processes by them being allowed to reach directly
> >>>>>>> into the direct byte buffers, that's true shared memory.
> >>>>>>> Otherwise, if there are copies going on, it's less appealing.
> >>>>>>>
> >>>>>>> Thanks.
> >>>>>>>
> >>>>>>> Sent from my iPad
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
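The consumer side of the zero-copy hand-off Wes describes -- Python reading
pages another process (e.g. a JVM via java.nio.MappedByteBuffer) has
written -- can be sketched as below. This is a hedged illustration: the
file name and contents are invented stand-ins for a real record batch, and
the point is only that `mmap` plus `memoryview` reads the mapped pages
without a copy or ser-de step.

```python
import mmap
import os
import tempfile

# Invented stand-in for a file another process mapped and wrote:
# a fake 4-byte length prefix followed by a 9-byte payload.
path = os.path.join(tempfile.mkdtemp(), "arrow_poc_batch")
with open(path, "wb") as f:
    f.write(b"\x09\x00\x00\x00ARROWDATA")

fh = open(path, "rb")
buf = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

view = memoryview(buf)          # zero-copy view over the mapped pages
payload = view[4:13]            # slicing a memoryview copies nothing either
payload_bytes = bytes(payload)  # only this final bytes() call copies
print(payload_bytes)            # b'ARROWDATA'

payload.release()
view.release()
buf.close()
fh.close()
os.unlink(path)
```

An Arrow reader would hand such views straight to columnar accessors
instead of materializing bytes, which is where the "no heavy ser-de step"
win comes from.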