Folks who are looking for some context about this work might like to read a recently-published paper from this research group:
https://arxiv.org/pdf/2105.09894.pdf On Tue, Jun 1, 2021 at 3:12 PM Jayjeet Chakraborty < jayjeetchakrabort...@gmail.com> wrote: > Dear Arrow Community, > > In our previous discussion, we planned on implementing a new Dataset API > like InMemoryDataset to interact with objects containing IPC data stored in > Ceph/RADOS <https://ceph.io/>. We had implemented this design and raised a > PR <https://github.com/apache/arrow/pull/8647>. But when we started adding > the dataset discovery functionality, we found ourselves reimplementing > filesystem abstractions and its metadata management. We closed the original > PR and raised a new PR <https://github.com/apache/arrow/pull/10431> where > we redesigned our implementation to use the Ceph filesystem as our file I/O > interface since it provides fast metadata support via the Ceph metadata > servers (MDS). We also decided to store data using one of the file formats > supported by Arrow. One of our driving use cases favored Parquet. > > Since we perform the scan operation inside the storage layer using Ceph > Object class > < > https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree > .> > methods which need to be invoked directly on objects, we utilize the > striping strategy information provided by CephFS to translate filename in > CephFS to object id in RADOS. To be able to have this one-to-one mapping, > we split Parquet files in a manner similar to how Spark splits Parquet > files for HDFS and ensure that each fragment is backed by a single RADOS > object. > > We are planning a new PR, we extend the FileFormat interface to create a > RadosParquetFileFormat > < > https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/dataset/file_rados_parquet.h#L129 > > > interface that offloads Parquet file scan operations to the RADOS layer in > Ceph. Since we now utilize a filesystem interface, we can just use the > FileSystemDataset API and plug in our new format to offload scan > operations. We have also added Python bindings for the new APIs that we > implemented. In all, our patch only consists of around 3,000 LoC and > introduces new dependencies to Ceph’s librados and object class SDK only > (that can be disabled via cmake flags). > > We have added an architecture > < > https://github.com/uccross/skyhookdm-arrow/blob/rados-parquet-pr/cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md > > > document with our PR which describes the overall architecture along with > the life of a dataset scan on using RadosParquet. Additionally, we recently > wrote up a paper <https://arxiv.org/abs/2105.09894> describing our design > and implementation along with some initial benchmarks given there. We plan > to raise a PR <https://github.com/apache/arrow/pull/10431> to upstream our > format to apache/arrow soon and hence look forward to your comments and > thoughts on this new feature. Please let us know if you have any questions. > Thank you. > > Best regards, > > Jayjeet Chakraborty > > On 2020/09/15 18:06:56, Micah Kornfield <emkornfi...@gmail.com> wrote: > > gmock is already a dependency. We haven't upgraded gmock/gtest in a > while, > > we might want to consider doing that (but this is orthogonal). > > > > On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou <anto...@python.org> > wrote: > > > > > > > > Hi Ivo, > > > > > > You can open a JIRA once you've got a PR ready. No need to do it > before > > > you think you're ready for submission. > > > > > > AFAIK, gmock is already a dependency. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > > > > Le 15/09/2020 à 18:49, Ivo Jimenez a écrit : > > > > Hi again, > > > > > > > > We noticed in the contribution guidelines that there needs to be an > > > issue for every PR in JIRA. Should we open one for the eventual PR for > the > > > work we're doing on implementing the dataset on Ceph's RADOS? > > > > > > > > Also, on a related note, we would like to mock the RADOS client so > that > > > we can integrate it in CI tests. Would it be OK to include gmock as a > > > dependency? > > > > > > > > thanks! > > > > > > > > On 2020/09/02 22:05:51, Ivo Jimenez <ivo.jime...@gmail.com> wrote: > > > >> Hi Ben, > > > >> > > > >> > > > >>>> Our main concern is that this new arrow::dataset::RadosFormat > class > > > will > > > >>> be > > > >>>> deriving from the arrow::dataset::FileFormat class, which seems to > > > raise > > > >>> a > > > >>>> conceptual mismatch as there isn’t really a RADOS format > > > >>> > > > >>> IIUC RADOS doesn't interact with a filesystem directly, so > > > RadosFileFormat > > > >>> would > > > >>> indeed be a conceptually problematic point of extension. If a RADOS > > > file > > > >>> system > > > >>> is not viable then I think the ideal approach would be to directly > > > >>> implement the > > > >>> Fragment [1] and Dataset [2] interfaces, forgoing a FileFormat > > > >>> implementation altogether. > > > >>> Unfortunately the only example we have of this approach is > > > >>> InMemoryFragment, > > > >>> which simply wraps a vector of record batches. > > > >>> > > > >> > > > >> This is what we will go with, as this seems to be the quickest way > for > > > us > > > >> to have a PoC and start experimenting with this. > > > >> > > > >> Thanks a lot for the invaluable feedback! 🙏 > > > >> > > > > > >