[ https://issues.apache.org/jira/browse/ARROW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242298#comment-16242298 ]
Philipp Moritz commented on ARROW-1163:
---------------------------------------

Hey Lu Qi,

I have very limited experience with Java, but here are some thoughts that I hope are helpful:

You can do zero-copy reads in Java using an off-heap method like http://xcorpion.tech/2016/09/10/It-s-all-about-buffers-zero-copy-mmap-and-Java-NIO/. Given that the data already lives in (in-memory) memory-mapped files, this might be the best way forward here. We would essentially define our own Tensor class and then use code like https://github.com/apache/spark/tree/50ada2a4d31609b6c828158cad8e128c2f605b8d/common/unsafe/src/main/java/org/apache/spark/unsafe (see for example https://github.com/apache/spark/blob/50ada2a4d31609b6c828158cad8e128c2f605b8d/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java) to access the data without copies.

Arrow already has a Tensor class in C++ that does similar things, and the current Python serialization code uses it to read Tensors in a zero-copy way from the object store and expose them as numpy arrays to the user.

On the Java side I think not much is available yet for reading tensors; as a starting point, the code for parsing Tensor metadata is generated here: https://github.com/apache/arrow/blob/82eea49b3eea6047f53478113ab3ff9a38f0d344/java/format/pom.xml#L108

If you look at the code for reading C++ Tensors, you should be able to get a prototype of this working.

I'm also cc'ing some of the people who have done most of the work on the Java implementation for more input. [~bryanc] [~siddteotia] [~jnadeau]

-- Philipp.

> [Plasma] Java client for Plasma
> -------------------------------
>
> Key: ARROW-1163
> URL: https://issues.apache.org/jira/browse/ARROW-1163
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Philipp Moritz
>
> We should start thinking about what a Java client for plasma would look like.
> Given the focus of arrow to support Python, C++ and Java really well, it is
> the next important target after Python and C++.
> My preliminary thoughts on it are the following: We can either go with
> JNI and wrap the C++ client or (in my opinion preferable) write a pure Java
> client. It would communicate with the Plasma store via Java flatbuffers over
> sockets.
> It seems that the only thing blocking a pure Java client at the moment is the
> way we ship file descriptors for the memory mapped files between store and
> client (see the file fling.cc in the Plasma repo). We would need to get rid
> of that because there is no pure Java API that allows transferring file
> descriptors over a process boundary. So the way to transfer memory mapped
> files over process boundaries then is probably to use the file system and
> keep the memory mapped files in the file system instead of unlinking them
> immediately (as we do at the moment), so they can be opened by the client
> process via their path.
> The challenge in this case is how to clean the files up and make sure they
> are not lying around if the plasma store crashes. One option is to store the
> plasma store PID with the file (i.e. as part of the file name) and let the
> plasma store clean them up the next time it is started; maybe there is OS
> level support for temporary files we can reuse.
> I probably won't get to this for a while, so if anybody needs this or has
> free cycles, they should feel free to chime in. Also, opinions on the design
> are appreciated!
> -- Philipp.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
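
To make the zero-copy idea from the comment concrete, here is a minimal sketch of reading 64-bit integers straight out of a memory-mapped file that a client opens by path. It is not the Plasma or Arrow API: the class name `MappedLongTensor` and the helper `writeSample` are hypothetical, the little-endian layout is an assumption, and it uses plain `java.nio` rather than the `sun.misc.Unsafe` approach of the Spark `LongArray` linked above. A single mapping made this way is limited to `Integer.MAX_VALUE` bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical 1-D tensor of longs backed by a memory-mapped file. */
public class MappedLongTensor {
    // View over the mapped region; element reads go to the mapping, not to a heap copy.
    private final LongBuffer data;

    private MappedLongTensor(LongBuffer data) {
        this.data = data;
    }

    /** Maps the whole file read-only and exposes it as longs without copying it to the heap. */
    public static MappedLongTensor open(Path path) {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.order(ByteOrder.LITTLE_ENDIAN); // assumption: data is stored little-endian
            // The mapping stays valid after the channel is closed.
            return new MappedLongTensor(buf.asLongBuffer());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for the store: writes the values into a new temporary file through a mapping. */
    public static Path writeSample(long[] values) {
        try {
            Path path = Files.createTempFile("plasma-sketch-", ".bin");
            path.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                long size = (long) values.length * Long.BYTES;
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
                buf.order(ByteOrder.LITTLE_ENDIAN);
                buf.asLongBuffer().put(values);
                buf.force(); // flush the mapped pages to the file
            }
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int length() {
        return data.limit();
    }

    public long get(int index) {
        return data.get(index); // absolute read, does not move the buffer position
    }

    public long sum() {
        long total = 0;
        for (int i = 0; i < data.limit(); i++) {
            total += data.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        Path path = writeSample(new long[] {1L, 2L, 3L, 4L});
        MappedLongTensor tensor = open(path);
        System.out.println("length=" + tensor.length() + " sum=" + tensor.sum());
    }
}
```

A real Tensor class would also need the shape, strides and element type from the Tensor metadata mentioned above; the sketch only shows the copy-free access to the data buffer.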