Re: [discuss] Data Source V2 write path

Ryan Blue Mon, 25 Sep 2017 13:38:21 -0700

I think it is a bad idea to let this problem leak into the new storage API.
By not setting the expectation that metadata for a table will exist, this
will needlessly complicate writers just to support the existing problematic
design. Why can't we use an in-memory catalog to store the configuration of
HadoopFS tables? I see no compelling reason why this needs to be passed
into the V2 write API.


If this is limited to an implementation hack for the Hadoop FS writers,
then I guess that's not terrible. I just don't understand why it is
necessary.

On Mon, Sep 25, 2017 at 11:26 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Catalog federation is to publish the Spark catalog API(kind of a data
> source API for metadata), so that Spark is able to read/write metadata from
> external systems. (SPARK-15777)
>
> Currently Spark can only read/write Hive metastore, which means for other
> systems like Cassandra, we can only implicitly create tables with data
> source API.
>
> Again this is not ideal but just a workaround before we finish catalog
> federation. That's why the save mode description mostly refer to how data
> will be handled instead of metadata.
>
> Because of this, I think we still need to pass metadata like
> partitioning/bucketing to the data source write API. And I propose to use
> data source options so that it's not at API level and we can easily ignore
> these options in the future if catalog federation is done.
>
> The same thing applies to Hadoop FS data sources, we need to pass metadata
> to the writer anyway.
>
>
>
> On Tue, Sep 26, 2017 at 1:08 AM, Ryan Blue <rb...@netflix.com> wrote:
>
>> However, without catalog federation, Spark doesn’t have an API to ask an
>> external system(like Cassandra) to create a table. Currently it’s all done
>> by data source write API. Data source implementations are responsible to
>> create or insert a table according to the save mode.
>>
>> What’s catalog federation? Is there a SPIP for it? It sounds
>> straight-forward based on your comments, but I’d rather make sure we’re
>> talking about the same thing.
>>
>> What I’m proposing doesn’t require a change to either the public API, nor
>> does it depend on being able to create tables. Why do writers necessarily
>> need to create tables? I think other components (e.g. a federated catalog)
>> should manage table creation outside of this abstraction. Just because data
>> sources currently create tables doesn’t mean that we are tied to that
>> implementation.
>>
>> I would also disagree that data source implementations are responsible
>> for creating for inserting according to save mode. The modes are “append”,
>> “overwrite”, “failIfExists” and “ignore”, and the descriptions indicate to
>> me that the mode refers to how *data* will be handled, not table
>> metadata. Overwrite’s docs
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/SaveMode.java#L37>
>> state that “existing *data* is expected to be overwritten.”
>>
>> Save mode currently introduces confusion because it isn’t clear whether
>> the mode applies to tables or to writes. In Hive, overwrite removes
>> conflicting partitions, but I think the Hadoop FS relations will delete
>> tables. We get around this some by using external tables and preserving
>> data, but this is an area where we should have clear semantics for external
>> systems like Cassandra. I’d like to see a cleaner public API that separates
>> these concerns, but that’s a different discussion. For now, I don’t think
>> requiring that a table exists is unreasonable. If a table has no metastore
>> (Hadoop FS tables) then we can just pass the table metadata in when
>> creating the writer since there is no existence in this case.
>>
>> rb
>> 
>>
>> On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I agree it would be a clean approach if data source is only responsible
>>> to write into an already-configured table. However, without catalog
>>> federation, Spark doesn't have an API to ask an external system(like
>>> Cassandra) to create a table. Currently it's all done by data source write
>>> API. Data source implementations are responsible to create or insert a
>>> table according to the save mode.
>>>
>>> As a workaround, I think it's acceptable to pass partitioning/bucketing
>>> information via data source options, and data sources should decide to take
>>> these informations and create the table, or throw exception if these
>>> informations don't match the already-configured table.
>>>
>>>
>>> On Fri, Sep 22, 2017 at 9:35 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> > input data requirement
>>>>
>>>> Clustering and sorting within partitions are a good start. We can
>>>> always add more later when they are needed.
>>>>
>>>> The primary use case I'm thinking of for this is partitioning and
>>>> bucketing. If I'm implementing a partitioned table format, I need to tell
>>>> Spark to cluster by my partition columns. Should there also be a way to
>>>> pass those columns separately, since they may not be stored in the same way
>>>> like partitions are in the current format?
>>>>
>>>> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I want to have some discussion about Data Source V2 write path before
>>>>> starting a voting.
>>>>>
>>>>> The Data Source V1 write path asks implementations to write a
>>>>> DataFrame directly, which is painful:
>>>>> 1. Exposing upper-level API like DataFrame to Data Source API is not
>>>>> good for maintenance.
>>>>> 2. Data sources may need to preprocess the input data before writing,
>>>>> like cluster/sort the input by some columns. It's better to do the
>>>>> preprocessing in Spark instead of in the data source.
>>>>> 3. Data sources need to take care of transaction themselves, which is
>>>>> hard. And different data sources may come up with a very similar approach
>>>>> for the transaction, which leads to many duplicated codes.
>>>>>
>>>>>
>>>>> To solve these pain points, I'm proposing a data source writing
>>>>> framework which is very similar to the reading framework, i.e.,
>>>>> WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can 
>>>>> take
>>>>> a look at my prototype to see what it looks like:
>>>>> https://github.com/apache/spark/pull/19269
>>>>>
>>>>> There are some other details need further discussion:
>>>>> 1. *partitioning/bucketing*
>>>>> Currently only the built-in file-based data sources support them, but
>>>>> there is nothing stopping us from exposing them to all data sources. One
>>>>> question is, shall we make them as mix-in interfaces for data source v2
>>>>> reader/writer, or just encode them into data source options(a
>>>>> string-to-string map)? Ideally it's more like options, Spark just 
>>>>> transfers
>>>>> these user-given informations to data sources, and doesn't do anything for
>>>>> it.
>>>>>
>>>>> 2. *input data requirement*
>>>>> Data sources should be able to ask Spark to preprocess the input data,
>>>>> and this can be a mix-in interface for DataSourceV2Writer. I think we need
>>>>> to add clustering request and sorting within partitions request, any more?
>>>>>
>>>>> 3. *transaction*
>>>>> I think we can just follow `FileCommitProtocol`, which is the internal
>>>>> framework Spark uses to guarantee transaction for built-in file-based data
>>>>> sources. Generally speaking, we need task level and job level 
>>>>> commit/abort.
>>>>> Again you can see more details in my prototype about it:
>>>>> https://github.com/apache/spark/pull/19269
>>>>>
>>>>> 4. *data source table*
>>>>> This is the trickiest one. In Spark you can create a table which
>>>>> points to a data source, so you can read/write this data source easily by
>>>>> referencing the table name. Ideally data source table is just a pointer
>>>>> which points to a data source with a list of predefined options, to save
>>>>> users from typing these options again and again for each query.
>>>>> If that's all, then everything is good, we don't need to add more
>>>>> interfaces to Data Source V2. However, data source tables provide special
>>>>> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which requires 
>>>>> data
>>>>> sources to have some extra ability.
>>>>> Currently these special operators only work for built-in file-based
>>>>> data sources, and I don't think we will extend it in the near future, I
>>>>> propose to mark them as out of the scope.
>>>>>
>>>>>
>>>>> Any comments are welcome!
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [discuss] Data Source V2 write path

Reply via email to