I am not involved with the design or development of the V2 API - so these could
be naïve comments/thoughts.
Just as Dataset abstracts away from RDD, which otherwise required somewhat more
intimate knowledge of Spark internals, I am guessing the absence of partition
operations is either because there is no current need for them or because of a
desire to abstract that detail away from the API user/programmer.
Of course, the other thing about Dataset is that it comes with a schema,
closely binding the two together.
However, for situations where a deeper understanding of the datasource is
necessary to read/write data, that logic can potentially be embedded into the
classes that implement DataSourceReader/DataSourceWriter or
DataReader/DataWriter.
E.g. if you are writing data and need to dynamically create partitions on the
fly as you write, the DataSourceWriter can gather the current list of
partitions and pass it on to each DataWriter via the DataWriterFactory. The
DataWriter can consult that list and, if it encounters a row whose partition
does not exist, create that partition (note that the partition creation
operation needs to be idempotent, OR the DataWriter needs to check for the
partition before trying to create it, as it may already have been created by
another DataWriter). A rough sketch of that pattern is below.
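A very rough sketch of what I mean (purely illustrative - PartitionSpec,
ExternalStore and rowToPartition here are hypothetical stand-ins for whatever
the underlying datasource actually provides, not part of the actual V2
interfaces):

    // Hypothetical partition descriptor, e.g. column name -> value.
    case class PartitionSpec(values: Map[String, String])

    // Hypothetical handle to the underlying datasource.
    trait ExternalStore {
      def createPartitionIfAbsent(spec: PartitionSpec): Unit
      def append(spec: PartitionSpec, row: org.apache.spark.sql.Row): Unit
    }

    // Purely illustrative; not the real DataWriter interface.
    // knownPartitions is the list gathered up front and handed to each
    // writer through the factory.
    class PartitionAwareWriter(
        store: ExternalStore,
        knownPartitions: scala.collection.mutable.Set[PartitionSpec]) {

      def write(row: org.apache.spark.sql.Row): Unit = {
        val partition = rowToPartition(row)
        if (!knownPartitions.contains(partition)) {
          // Must be idempotent: another DataWriter may have already
          // created the same partition concurrently.
          store.createPartitionIfAbsent(partition)
          knownPartitions += partition
        }
        store.append(partition, row)
      }

      // Hypothetical: derive the target partition from the row's columns,
      // here assuming a "dt" column.
      private def rowToPartition(row: org.apache.spark.sql.Row): PartitionSpec =
        PartitionSpec(Map("dt" -> row.getAs[String]("dt")))
    }

The createPartitionIfAbsent call is where the idempotency (or
check-then-create) requirement lives.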
As for partition add/list/drop/alter, I don't think that concept/notion applies
to all datasources (e.g. filesystem).
Also, the concept of a Spark partition may not translate to the underlying
datasource's notion of a partition.
At the same time, I did see a discussion thread on catalog operations for the
V2 API - although, again, Spark partitions probably do not map one-to-one to
the underlying partitions.
Probably a good place to introduce partition info is to add a method/object
called "meta" to a dataset and allow the datasource to describe itself (e.g.
table permissions, table partitions and specs, datasource info (e.g. cluster),
etc.).
E.g. something like this:

With just a meta method:
    dataset.meta = {optional datasource specific info}

Or with meta as an intermediate object with several operations:
    dataset.meta.describe
    dataset.meta.update
    ....
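To make the idea slightly more concrete, a hypothetical sketch (none of this
exists in Spark today; the names are illustrative only):

    // Hypothetical only - not an existing Spark API.
    trait DatasourceMeta {
      // Whatever the datasource can say about itself: table permissions,
      // partitions and their specs, cluster/datasource info, etc.
      def describe: Map[String, String]

      // Push metadata changes (e.g. add/drop a partition) back to the source.
      def update(changes: Map[String, String]): Unit
    }

    // Imagined usage, assuming Dataset exposed such a handle:
    //   dataset.meta.describe
    //   dataset.meta.update(Map("add.partition" -> "dt=2018-09-16"))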
However, if you are look
On 9/16/18, 1:24 AM, "tigerquoll" <[email protected]> wrote:
I've been following the development of the new data source abstraction with
keen interest. One of the issues that has occurred to me as I sat down and
planned how I would implement a data source is how I would support
manipulating partitions.
My reading of the current prototype is that the Data Source V2 APIs expose
enough of a concept of a partition to support communicating record
distribution particulars to Catalyst, but do not represent partitions as a
concept that the end user of the data sources can manipulate.
The end users of data sources need to be able to add/drop/modify and list
partitions. For example, many systems require partitions to be created
before records are added to them.
For batch use-cases, it may be possible for users to manipulate partitions
from within the environment that the data source interfaces to, but for
streaming use-cases, this is not at all practical.
Two ways I can think of doing this are:
1. Allow "pass-through" commands to the underlying data source
2. Have a generic concept of partitions exposed to the end user via the data
source API and Spark SQL DML.
I'm keen for option 2 but recognise that it's possible there are better
alternatives out there.
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/