I am not involved with the design or development of the V2 API, so these may be naïve comments/thoughts. Just as Dataset abstracts away from RDD, which otherwise required somewhat intimate knowledge of Spark internals, I am guessing the absence of partition operations is due either to no current need for them or to a deliberate decision to abstract them away from the API user/programmer. Of course, the other thing about Dataset is that it comes with a schema, closely binding the two together.
However, for situations where a deeper understanding of the datasource is necessary to read/write data, that logic can potentially be embedded in the classes that implement DataSourceReader/DataSourceWriter or DataReader/DataWriter. For example, if you are writing data and need to create partitions dynamically on the fly, the DataSourceWriter can gather the current list of partitions and pass it on to each DataWriter via the DataWriterFactory. The DataWriter can consult that list and, if a row is encountered whose partition does not exist, create it. (Note that the partition-creation operation needs to be idempotent, or the DataWriter needs to check for the partition before trying to create it, as it may already have been created by another DataWriter.)

As for partition add/list/drop/alter, I don't think that notion applies to all datasources (e.g. a filesystem). Also, the concept of a Spark partition may not translate into the underlying datasource's partitions. That said, I did see a discussion thread on catalog operations for the V2 API, although Spark partitions probably do not map one-to-one to the underlying partitions.

A good place to introduce partition info might be to add a method/object called "meta" to a dataset and allow the datasource to describe itself (e.g. table permissions, table partitions and their specs, datasource info such as the cluster, etc.). For example, with just a meta method:

    dataset.meta = {optional datasource-specific info}

Or with meta as an intermediate object with several operations:

    dataset.meta.describe
    dataset.meta.update
    ....

However, if you are look

On 9/16/18, 1:24 AM, "tigerquoll" <tigerqu...@outlook.com> wrote:

    I've been following the development of the new data source abstraction
    with keen interest. One of the issues that has occurred to me as I sat
    down and planned how I would implement a data source is how I would
    support manipulating partitions.
    My reading of the current prototype is that the Data Source V2 APIs
    expose enough of the concept of a partition to communicate record
    distribution particulars to Catalyst, but do not represent partitions
    as a concept that the end user of a data source can manipulate. End
    users of data sources need to be able to add/drop/modify and list
    partitions. For example, many systems require partitions to be created
    before records are added to them.

    For batch use-cases, it may be possible for users to manipulate
    partitions from within the environment that the data source interfaces
    to, but for streaming use-cases this is not at all practical.

    Two ways I can think of doing this are:
    1. Allow "pass-through" commands to the underlying data source.
    2. Have a generic concept of partitions exposed to the end user via the
       data source API and Spark SQL DML.

    I'm keen for option 2 but recognise that it's possible there are better
    alternatives out there.

    --
    Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
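To make option 2 a little more concrete, here is a rough sketch of what a generic partition-manipulation mixin might look like. To be clear, everything here is hypothetical: the interface name, its methods, and the in-memory source are my own illustration, not part of the actual V2 API. The toy implementation exists only to show the idempotent-create property that concurrent DataWriters would rely on. I've used Java since the V2 interfaces themselves are defined in Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical mixin a data source could implement alongside the existing
// V2 interfaces. A partition is identified by a spec: column name -> value.
interface SupportsPartitions {
    List<Map<String, String>> listPartitions();
    boolean createPartition(Map<String, String> spec); // true if newly created
    boolean dropPartition(Map<String, String> spec);   // true if it existed
}

// Toy in-memory source. createPartition is idempotent: a second call with
// the same spec is a harmless no-op (returns false), so concurrent
// DataWriters can call it unconditionally without coordinating.
class InMemorySource implements SupportsPartitions {
    private final Set<Map<String, String>> partitions =
        ConcurrentHashMap.newKeySet();

    public List<Map<String, String>> listPartitions() {
        return new ArrayList<>(partitions);
    }

    public boolean createPartition(Map<String, String> spec) {
        return partitions.add(spec);
    }

    public boolean dropPartition(Map<String, String> spec) {
        return partitions.remove(spec);
    }
}

public class PartitionDemo {
    public static void main(String[] args) {
        SupportsPartitions source = new InMemorySource();
        Map<String, String> spec = Map.of("date", "2018-09-16");
        // Two writers racing to create the same partition: the loser just
        // sees "false" instead of an error.
        System.out.println(source.createPartition(spec)); // prints "true"
        System.out.println(source.createPartition(spec)); // prints "false"
    }
}
```

Spark SQL DML could then map something like ALTER TABLE ... ADD PARTITION onto createPartition for sources that implement the mixin, while sources where partitions make no sense (e.g. a plain filesystem) simply don't implement it.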