liusitan commented on issue #13875: URL: https://github.com/apache/arrow/issues/13875#issuecomment-1214694078
The reason I am hacking on the Arrow IPC format is that I have recently been implementing a FUSE filesystem for vineyard, an immutable storage manager that uses the same columnar format as Arrow. We decided to let our clients access vineyard objects by reading them in the Arrow IPC format. This means that when a client wants to access an object stored in vineyard, the FUSE filesystem serializes the corresponding vineyard object on the fly, keeps the result in the FUSE process, and serves the serialized, Arrow-formatted object to the client. However, this approach may lead to heavy memory usage.

We are wondering whether it is possible to create a mapping from the Arrow IPC format to the vineyard objects. Given the information stored in the vineyard objects' metadata, this looks entirely possible, especially for dataframes, which we also store column by column. In theory, if a user wants to read bytes 100 to 200 of an Arrow-formatted vineyard object, and that range conceptually falls within the first column, my implementation could work out which part of the object the byte range corresponds to, fetch that data from vineyard, serialize it, and return what the client asked for.

In practice, going by the documentation so far, I haven't found a way to precompute the size of each part of the serialized Arrow IPC format. After condensing the problem, I raised the question above.
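
A minimal sketch of the byte-range lookup described above, assuming the size of every region of the serialized stream were already known (which is exactly the open question). The names `Segment`, `build_layout`, and `resolve_range`, the region labels, and all sizes are hypothetical and made up for illustration; none of them come from Arrow or vineyard, and the sizes are not derived from the Arrow IPC specification.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str    # hypothetical label for a region of the serialized stream
    start: int   # absolute byte offset of the region within the stream
    length: int  # size of the region in bytes


def build_layout(sizes):
    """Turn (name, size) pairs into contiguous segments starting at offset 0."""
    segments, offset = [], 0
    for name, size in sizes:
        segments.append(Segment(name, offset, size))
        offset += size
    return segments


def resolve_range(segments, start, end):
    """Map the half-open byte range [start, end) to (name, local_start, local_end) pieces."""
    if start < 0 or end < start:
        raise ValueError("invalid byte range")
    starts = [s.start for s in segments]
    # Index of the last segment whose start offset is <= `start`.
    i = max(bisect_right(starts, start) - 1, 0)
    pieces = []
    while i < len(segments) and segments[i].start < end:
        seg = segments[i]
        lo = max(start, seg.start)
        hi = min(end, seg.start + seg.length)
        if lo < hi:
            # Offsets are relative to the start of the segment.
            pieces.append((seg.name, lo - seg.start, hi - seg.start))
        i += 1
    return pieces


# Made-up sizes chosen so that bytes 100 to 200 fall inside the first column.
layout = build_layout([
    ("schema_message", 64),
    ("batch_metadata", 32),
    ("column_0_data", 400),
    ("column_1_data", 400),
])
print(resolve_range(layout, 100, 200))  # -> [('column_0_data', 4, 104)]
```

With a table like this, a FUSE `read` at a given offset would only need to fetch and serialize the pieces returned by `resolve_range` instead of holding the whole serialized object in memory. Building the table is the hard part, since it requires knowing the exact serialized size of each region in advance.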