Re: Arrow Dataset API on Ceph

Wes McKinney Tue, 01 Jun 2021 14:07:26 -0700

Folks who are looking for some context about this work might like to read a
recently-published paper from this research group:


https://arxiv.org/pdf/2105.09894.pdf

On Tue, Jun 1, 2021 at 3:12 PM Jayjeet Chakraborty <
jayjeetchakrabort...@gmail.com> wrote:

> Dear Arrow Community,
>
> In our previous discussion, we planned on implementing a new Dataset API
> like InMemoryDataset to interact with objects containing IPC data stored in
> Ceph/RADOS <https://ceph.io/>. We had implemented this design and raised a
> PR <https://github.com/apache/arrow/pull/8647>. But when we started adding
> the dataset discovery functionality, we found ourselves reimplementing
> filesystem abstractions and its metadata management. We closed the original
> PR and raised a new PR <https://github.com/apache/arrow/pull/10431> where
> we redesigned our implementation to use the Ceph filesystem as our file I/O
> interface since it provides fast metadata support via the Ceph metadata
> servers (MDS). We also decided to store data using one of the file formats
> supported by Arrow. One of our driving use cases favored Parquet.
>
> Since we perform the scan operation inside the storage layer using Ceph
> Object class
> <
> https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree
> .>
> methods which need to be invoked directly on objects, we utilize the
> striping strategy information provided by CephFS to translate filename in
> CephFS to  object id in RADOS. To be able to have this one-to-one mapping,
> we split Parquet files in a manner similar to how Spark splits Parquet
> files for HDFS and ensure that each fragment is backed by a single RADOS
> object.
>
> We are planning a new PR, we extend the FileFormat interface to create a
> RadosParquetFileFormat
> <
> https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/dataset/file_rados_parquet.h#L129
> >
> interface that offloads Parquet file scan operations to the RADOS layer in
> Ceph. Since we now utilize a filesystem interface, we can just use the
> FileSystemDataset API and plug in our new format to offload scan
> operations. We have also added Python bindings for the new APIs that we
> implemented. In all, our patch only consists of around 3,000 LoC and
> introduces new dependencies to Ceph’s librados and object class SDK only
> (that can be disabled via cmake flags).
>
> We have added an architecture
> <
> https://github.com/uccross/skyhookdm-arrow/blob/rados-parquet-pr/cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md
> >
> document with our PR which describes the overall architecture along with
> the life of a dataset scan on using RadosParquet. Additionally, we recently
> wrote up a paper <https://arxiv.org/abs/2105.09894> describing our design
> and implementation along with some initial benchmarks given there. We plan
> to raise a PR <https://github.com/apache/arrow/pull/10431> to upstream our
> format to apache/arrow soon and hence look forward to your comments and
> thoughts on this new feature. Please let us know if you have any questions.
> Thank you.
>
> Best regards,
>
> Jayjeet Chakraborty
>
> On 2020/09/15 18:06:56, Micah Kornfield <emkornfi...@gmail.com> wrote:
> > gmock is already a dependency.  We haven't upgraded gmock/gtest in a
> while,
> > we might want to consider doing that (but this is orthogonal).
> >
> > On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> > >
> > > Hi Ivo,
> > >
> > > You can open a JIRA once you've got a PR ready.  No need to do it
> before
> > > you think you're ready for submission.
> > >
> > > AFAIK, gmock is already a dependency.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > Le 15/09/2020 à 18:49, Ivo Jimenez a écrit :
> > > > Hi again,
> > > >
> > > > We noticed in the contribution guidelines that there needs to be an
> > > issue for every PR in JIRA. Should we open one for the eventual PR for
> the
> > > work we're doing on implementing the dataset on Ceph's RADOS?
> > > >
> > > > Also, on a related note, we would like to mock the RADOS client so
> that
> > > we can integrate it in CI tests. Would it be OK to include gmock as a
> > > dependency?
> > > >
> > > > thanks!
> > > >
> > > > On 2020/09/02 22:05:51, Ivo Jimenez <ivo.jime...@gmail.com> wrote:
> > > >> Hi Ben,
> > > >>
> > > >>
> > > >>>> Our main concern is that this new arrow::dataset::RadosFormat
> class
> > > will
> > > >>> be
> > > >>>> deriving from the arrow::dataset::FileFormat class, which seems to
> > > raise
> > > >>> a
> > > >>>> conceptual mismatch as there isn’t really a RADOS format
> > > >>>
> > > >>> IIUC RADOS doesn't interact with a filesystem directly, so
> > > RadosFileFormat
> > > >>> would
> > > >>> indeed be a conceptually problematic point of extension. If a RADOS
> > > file
> > > >>> system
> > > >>> is not viable then I think the ideal approach would be to directly
> > > >>> implement the
> > > >>> Fragment [1] and Dataset [2] interfaces, forgoing a FileFormat
> > > >>> implementation altogether.
> > > >>> Unfortunately the only example we have of this approach is
> > > >>> InMemoryFragment,
> > > >>> which simply wraps a vector of record batches.
> > > >>>
> > > >>
> > > >> This is what we will go with, as this seems to be the quickest way
> for
> > > us
> > > >> to have a PoC and start experimenting with this.
> > > >>
> > > >> Thanks a lot for the invaluable feedback! 🙏
> > > >>
> > >
> >
>

Re: Arrow Dataset API on Ceph

Reply via email to