We still need to support low-level data sources like plain Parquet files, which do not have a metastore.
BTW, I think we should leave metadata management to the catalog API after catalog federation. The data source API should only care about data.

On Mon, Sep 25, 2017 at 11:14 AM, Reynold Xin <r...@databricks.com> wrote:

Can there be an explicit create function?

On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

I agree it would be a clean approach if the data source were only responsible for writing into an already-configured table. However, without catalog federation, Spark doesn't have an API to ask an external system (like Cassandra) to create a table; currently that is all done by the data source write API, and implementations are responsible for creating a table or inserting into it according to the save mode.

As a workaround, I think it's acceptable to pass partitioning/bucketing information via data source options. Data sources should then decide whether to take this information and create the table, or throw an exception if it doesn't match the already-configured table.

On Fri, Sep 22, 2017 at 9:35 AM, Ryan Blue <rb...@netflix.com> wrote:

> input data requirement

Clustering and sorting within partitions are a good start. We can always add more later when they are needed.

The primary use case I'm thinking of for this is partitioning and bucketing. If I'm implementing a partitioned table format, I need to tell Spark to cluster by my partition columns. Should there also be a way to pass those columns separately, since they may not be stored the same way partitions are in the current format?

--
Ryan Blue
Software Engineer
Netflix

On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

I want to have some discussion about the Data Source V2 write path before starting a vote.

The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful:
1. Exposing an upper-level API like DataFrame to the data source API is bad for maintenance.
2. Data sources may need to preprocess the input data before writing, e.g., clustering or sorting the input by some columns. It's better to do that preprocessing in Spark than in each data source.
3. Data sources need to handle transactions themselves, which is hard, and different data sources tend to arrive at very similar transaction logic, which leads to a lot of duplicated code.

To solve these pain points, I'm proposing a data source writing framework that mirrors the reading framework, i.e., WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my prototype to see what it looks like: https://github.com/apache/spark/pull/19269
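To make the shape of the chain concrete, here is a minimal sketch. All names and signatures below are illustrative assumptions based on the description above, not the actual interfaces in the prototype PR:

```scala
// Illustrative sketch only -- signatures are assumptions, not prototype code.

// Mix-in for a data source that supports the v2 write path.
trait WriteSupport {
  def createWriter(jobId: String, options: Map[String, String]): DataSourceV2Writer
}

// Driver-side handle for a single write job.
trait DataSourceV2Writer {
  def createWriteTask(): WriteTask                      // produced on the driver
  def commit(messages: Seq[WriterCommitMessage]): Unit  // job-level commit
  def abort(): Unit                                     // job-level abort
}

// Serializable factory that is shipped to the executors.
trait WriteTask extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

// Executor-side writer for one partition attempt.
trait DataWriter {
  def write(row: Row): Unit          // called once per input row
  def commit(): WriterCommitMessage  // task-level commit
  def abort(): Unit                  // task-level abort
}

// Stand-ins for Spark's row and commit-message types.
trait Row
trait WriterCommitMessage extends Serializable
```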
There are some other details that need further discussion:

1. *partitioning/bucketing*
Currently only the built-in file-based data sources support them, but there is nothing stopping us from exposing them to all data sources. One question is: should we make them mix-in interfaces for the data source v2 reader/writer, or just encode them into the data source options (a string-to-string map)? Ideally they behave more like options: Spark just passes this user-provided information through to the data source and doesn't act on it. (See the first sketch at the end of this message.)

2. *input data requirement*
Data sources should be able to ask Spark to preprocess the input data, and this can be a mix-in interface for DataSourceV2Writer. I think we need to add a clustering requirement and a sort-within-partitions requirement; anything else? (See the second sketch at the end of this message.)

3. *transaction*
I think we can just follow `FileCommitProtocol`, the internal framework Spark uses to guarantee transactional writes for the built-in file-based data sources. Generally speaking, we need task-level and job-level commit/abort. Again, you can see more details in my prototype: https://github.com/apache/spark/pull/19269 (The third sketch at the end of this message shows the general shape.)

4. *data source table*
This is the trickiest one. In Spark you can create a table which points to a data source, so you can read/write that data source easily by referencing the table name. Ideally a data source table is just a pointer to a data source with a list of predefined options, to save users from typing these options again and again for each query.
If that were all, everything would be fine and we wouldn't need to add more interfaces to Data Source V2. However, data source tables also provide special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require extra capabilities from data sources. Currently these special operators only work for the built-in file-based data sources, and since I don't think we will extend them in the near future, I propose to mark them as out of scope.

Any comments are welcome!
Thanks,
Wenchen
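First, a rough sketch of the options-based encoding discussed in item 1, following the workaround described earlier in the thread. The option keys here are hypothetical, not real Spark option names:

```scala
object PartitioningViaOptions {
  def main(args: Array[String]): Unit = {
    // Hypothetical keys -- a real encoding would need to be standardized.
    val options = Map(
      "path"             -> "/data/events",
      "partitionColumns" -> "date,country",
      "numBuckets"       -> "256"
    )

    val requested: Seq[String] =
      options.get("partitionColumns").map(_.split(",").toSeq).getOrElse(Nil)
    val existing: Seq[String] = Seq("date", "country") // from the already-configured table

    // Per the workaround: use the information to create the table, or fail
    // loudly if it conflicts with the existing table's layout.
    if (requested.nonEmpty && requested != existing) {
      throw new IllegalArgumentException(
        s"Requested partitioning $requested does not match existing layout $existing")
    }
  }
}
```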
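Second, for item 2, the mix-in could look something like the following. `ClusteredDistribution` and `SortOrder` here are illustrative stand-ins defined for this sketch, not existing Spark classes:

```scala
// Illustrative requirement types; Spark would translate a clustering
// requirement into a shuffle and an ordering requirement into a local sort.
sealed trait Distribution
final case class ClusteredDistribution(columns: Seq[String]) extends Distribution

final case class SortOrder(column: String, ascending: Boolean = true)

// The proposed mix-in: a writer declares what preprocessing it needs, and
// Spark performs it before any rows reach the writer.
trait SupportsInputRequirements {
  def requiredDistribution: Option[Distribution] = None
  def requiredOrdering: Seq[SortOrder] = Nil
}

// Example: a partitioned, bucketed sink clusters by its partition columns
// (Ryan's use case) and sorts within each partition by event time.
class PartitionedSink extends SupportsInputRequirements {
  override def requiredDistribution: Option[Distribution] =
    Some(ClusteredDistribution(Seq("date", "country")))
  override def requiredOrdering: Seq[SortOrder] = Seq(SortOrder("event_time"))
}
```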
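Third, for item 3, the general shape of a `FileCommitProtocol`-style two-level commit, again with made-up names:

```scala
// Task-level results (e.g., paths of staged files) flow back to the driver,
// which commits the job only if every task committed.
trait TaskCommitMessage extends Serializable

trait TransactionalWrite {
  def writePartition(partitionId: Int): TaskCommitMessage // executor side
  def abortPartition(partitionId: Int): Unit              // executor side
  def commitJob(messages: Seq[TaskCommitMessage]): Unit   // driver side
  def abortJob(): Unit                                    // driver side
}

object CommitProtocolSketch {
  // Simplified driver-side flow: no partial output becomes visible unless
  // the job-level commit runs, and any failure aborts the whole job.
  def runJob(writer: TransactionalWrite, numPartitions: Int): Unit = {
    try {
      val messages = (0 until numPartitions).map { p =>
        try writer.writePartition(p)
        catch { case e: Throwable => writer.abortPartition(p); throw e }
      }
      writer.commitJob(messages)
    } catch {
      case e: Throwable => writer.abortJob(); throw e
    }
  }
}
```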