I am not involved with the design or development of the V2 API - so these could
be naïve comments/thoughts.
Just as Dataset abstracts away from RDD, which otherwise required somewhat more
intimate knowledge of Spark internals, I am guessing the absence of partition
operations is either because there is no current need for them or because of a
desire to abstract that detail away from the API user/programmer.
Of course, the other thing about Dataset is that it comes with a schema,
closely binding the two together.
However, for situations where a deeper understanding of the datasource is
necessary to read/write data, that logic can potentially be embedded into the
classes that implement DataSourceReader/DataSourceWriter or
DataReader/DataWriter.
E.g. if you are writing data and need to dynamically create partitions on the
fly as you write, the DataSourceWriter can gather the current list of
partitions and pass it on to each DataWriter via the DataWriterFactory. The
DataWriter can consult that list and, if it encounters a row whose partition
does not exist, create that partition (note that the partition creation
operation needs to be idempotent, OR the DataWriter needs to check for the
partition before trying to create it, as it may already have been created by
another DataWriter). A rough sketch of that pattern is below.
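A very rough sketch of what I mean (purely illustrative - PartitionSpec,
ExternalStore and rowToPartition here are hypothetical stand-ins for whatever
the underlying datasource actually provides, not part of the actual V2
interfaces):

    // Hypothetical partition descriptor, e.g. column name -> value.
    case class PartitionSpec(values: Map[String, String])

    // Hypothetical handle to the underlying datasource.
    trait ExternalStore {
      def createPartitionIfAbsent(spec: PartitionSpec): Unit
      def append(spec: PartitionSpec, row: org.apache.spark.sql.Row): Unit
    }

    // Purely illustrative; not the real DataWriter interface.
    // knownPartitions is the list gathered up front and handed to each
    // writer through the factory.
    class PartitionAwareWriter(
        store: ExternalStore,
        knownPartitions: scala.collection.mutable.Set[PartitionSpec]) {

      def write(row: org.apache.spark.sql.Row): Unit = {
        val partition = rowToPartition(row)
        if (!knownPartitions.contains(partition)) {
          // Must be idempotent: another DataWriter may have already
          // created the same partition concurrently.
          store.createPartitionIfAbsent(partition)
          knownPartitions += partition
        }
        store.append(partition, row)
      }

      // Hypothetical: derive the target partition from the row's columns,
      // here assuming a "dt" column.
      private def rowToPartition(row: org.apache.spark.sql.Row): PartitionSpec =
        PartitionSpec(Map("dt" -> row.getAs[String]("dt")))
    }

The createPartitionIfAbsent call is where the idempotency (or
check-then-create) requirement lives.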
As for partition add/list/drop/alter, I don't think that concept/notion applies
to all datasources (e.g. filesystem).
Also, the concept of a Spark partition may not translate to the underlying
datasource's notion of a partition.
At the same time, I did see a discussion thread on catalog operations for the
V2 API - although, again, Spark partitions probably do not map one-to-one to
the underlying partitions.
Probably a good place to introduce partition info is to add a method/object
called "meta" to a dataset and allow the datasource to describe itself (e.g.
table permissions, table partitions and specs, datasource info (e.g. cluster),
etc.).
E.g. something like this:

With just a meta method:
    dataset.meta = {optional datasource specific info}

Or with meta as an intermediate object with several operations:
    dataset.meta.describe
    dataset.meta.update
    ....
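To make the idea slightly more concrete, a hypothetical sketch (none of this
exists in Spark today; the names are illustrative only):

    // Hypothetical only - not an existing Spark API.
    trait DatasourceMeta {
      // Whatever the datasource can say about itself: table permissions,
      // partitions and their specs, cluster/datasource info, etc.
      def describe: Map[String, String]

      // Push metadata changes (e.g. add/drop a partition) back to the source.
      def update(changes: Map[String, String]): Unit
    }

    // Imagined usage, assuming Dataset exposed such a handle:
    //   dataset.meta.describe
    //   dataset.meta.update(Map("add.partition" -> "dt=2018-09-16"))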
However, if you are look
On 9/16/18, 1:24 AM, "tigerquoll" <[email protected]> wrote:
I've been following the development of the new data source abstraction with
keen interest. One of the issues that has occurred to me as I sat down and
planned how I would implement a data source is how I would support
manipulating partitions.
My reading of the current prototype is that the Data Source V2 APIs expose
enough of a concept of a partition to support communicating record
distribution particulars to Catalyst, but do not represent partitions as a
concept that the end user of the data sources can manipulate.
The end users of data sources need to be able to add/drop/modify and list
partitions. For example, many systems require partitions to be created
before records are added to them.
For batch use-cases, it may be possible for users to manipulate partitions
from within the environment that the data source interfaces to, but for
streaming use-cases, this is not at all practical.
Two ways I can think of doing this are:
1. Allow "pass-through" commands to the underlying data source
2. Have a generic concept of partitions exposed to the end user via the data
source API and Spark SQL DML.
I'm keen for option 2 but recognise that it's possible there are better
alternatives out there.
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/