As far as changes to the public API go, I’d prefer deprecating the API that
mixes data and metadata operations. But I don’t think that requires that we
go with your proposal #1, where the current write API can’t use data source
v2 writers. I think we can separate the metadata operations for Hadoop
The main entry points for data sources in Spark are the SQL API and
`DataFrameReader/Writer`.
For the SQL API, I think the semantics are well defined: the data and metadata
operations are separated. E.g., INSERT INTO means writing data into an
existing table, while CREATE TABLE only creates the metadata. But t
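For illustration (made-up table names, and assuming the usual spark-shell `spark` session), the separation looks like this:
```scala
// Metadata-only operation: defines the table, writes no data.
spark.sql("CREATE TABLE logs (id BIGINT, msg STRING) USING parquet")
// Data-only operation: writes into a table that must already exist.
spark.sql("INSERT INTO logs SELECT id, msg FROM staging_logs")
```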
> Spark doesn't know how to create a table in external systems like
> Cassandra, and that's why it's currently done inside the data source writer.
This isn't a valid argument for doing this task in the writer for v2. If we
want to fix the problems with v1, we shouldn't continue to mix write
operatio
> When this CTAS logical node is turned into a physical plan, the relation
gets turned into a `DataSourceV2` instance and then Spark gets a writer and
configures it with the proposed API. The main point of this is to pass the
logical relation (with all of the user's options) through to the data
sou
On an unrelated note, is there any appetite for making the write path also
include an option to return elements that could not be processed for some
reason?
Usage might be like
saveAndIgnoreFailures() : Dataset
So that if some records cannot be parsed by the data source for writing, or
vio
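To make that concrete, a purely hypothetical sketch of such an interface; neither the trait nor the method exists in Spark today:
```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Hypothetical only: the write returns the rows the sink could not process
// (unparsable records, constraint violations, ...) instead of failing the job,
// so the caller can re-route them, e.g. to a dead-letter location.
trait SupportsRejectedRows {
  def saveAndIgnoreFailures(data: DataFrame): Dataset[Row]
}
```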
Comments inline. I've written up what I'm proposing with a bit more detail.
On Tue, Sep 26, 2017 at 11:17 AM, Wenchen Fan wrote:
> I'm trying to give a summary:
>
> Ideally the data source API should only deal with data, not metadata. But one
> key problem is that Spark still needs to support data source
I'm trying to give a summary:
Ideally the data source API should only deal with data, not metadata. But one
key problem is that Spark still needs to support data sources without a
metastore, e.g. file format data sources.
For this kind of data source, users have to pass the metadata information
like partit
> I think it is a bad idea to let this problem leak into the new storage
> API.
Well, I think using data source options is a good compromise for this. We
can't avoid this problem until catalog federation is done, and this may not
happen within Spark 2.3, but we definitely need the data source write API
I think it is a bad idea to let this problem leak into the new storage API.
Not setting the expectation that metadata for a table will exist needlessly
complicates writers just to support the existing problematic design. Why
can't we use an in-memory catalog to store the configuration
Catalog federation means publishing the Spark catalog API (a kind of data
source API for metadata), so that Spark can read/write metadata in external
systems (SPARK-15777).
Currently Spark can only read/write the Hive metastore, which means that for
other systems like Cassandra, we can only implicitly
However, without catalog federation, Spark doesn’t have an API to ask an
external system (like Cassandra) to create a table. Currently it’s all done
by the data source write API: data source implementations are responsible for
creating or inserting into a table according to the save mode.
What’s catalog federation?
We still need to support low-level data sources like pure parquet files,
which do not have a metastore.
BTW I think we should leave metadata management to the catalog API after
catalog federation. The data source API should only care about data.
On Mon, Sep 25, 2017 at 11:14 AM, Reynold Xin wrote:
Can there be an explicit create function?
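For concreteness, such a hook might look something like the following; the names are invented for illustration and are not part of any existing or proposed interface here:
```scala
import org.apache.spark.sql.types.StructType

// Invented names, illustration only: an "explicit create" hook, so that creating
// the table (metadata) is a separate call from writing data.
trait SupportsExplicitCreate {
  def createTable(
      name: String,
      schema: StructType,
      partitionColumns: Seq[String],
      options: Map[String, String]): Unit
}
```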
On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan wrote:
> I agree it would be a clean approach if the data source is only responsible
> for writing into an already-configured table. However, without catalog
> federation, Spark doesn't have an API to ask an externa
I agree it would be a clean approach if the data source is only responsible
for writing into an already-configured table. However, without catalog
federation, Spark doesn't have an API to ask an external system (like
Cassandra) to create a table. Currently it's all done by the data source
write API. Data sourc
> input data requirement
Clustering and sorting within partitions are a good start. We can always
add more later when they are needed.
The primary use case I'm thinking of for this is partitioning and
bucketing. If I'm implementing a partitioned table format, I need to tell
Spark to cluster by my
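As a sketch of the kind of hook this implies (the trait and method names are invented for illustration, not an existing or agreed-upon interface):
```scala
// Invented names, illustration only: a writer for a partitioned/bucketed format
// reports how its input should be organized, and Spark adds the shuffle/sort.
trait SupportsInputRequirements {
  // Columns the input should be clustered by (e.g. the table's partition columns),
  // so that all rows for one partition arrive at one write task.
  def requiredClustering: Seq[String]
  // Sort order to apply within each write task (e.g. bucket id, then sort columns).
  def requiredOrdering: Seq[String]
}
```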
Ah yes I agree. I was just saying it should be options (rather than
specific constructs). Having them at creation time makes a lot of sense.
One tricky thing, though, is what happens if they need to change, but we can
probably just special-case that.
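For concreteness, the options-based approach might look like this from the user side (the format name and option keys are made up; a real source would define its own, and `df` is an arbitrary DataFrame):
```scala
// Hypothetical usage: partitioning/bucketing passed as plain data source options
// at table creation time, rather than through dedicated API constructs.
df.write
  .format("com.example.mysource")
  .option("partitionColumns", "date")
  .option("bucketColumns", "id")
  .option("numBuckets", "8")
  .save("/path/to/table")
```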
On Thu, Sep 21, 2017 at 6:28 PM Ryan Blue wrote:
> I’
I’d just pass them [partitioning/bucketing] as options, until there are
clear (and strong) use cases to do them otherwise.
I don’t think it makes sense to pass partitioning and bucketing information
*into* this API. The writer should already know the table structure and
should pass relevant inform
On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan wrote:
> Hi all,
>
> I want to have some discussion about the Data Source V2 write path before
> starting a vote.
>
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1. Exposing an upper-level API like
Hi all,
I want to have some discussion about the Data Source V2 write path before
starting a vote.
The Data Source V1 write path asks implementations to write a DataFrame
directly, which is painful:
1. Exposing an upper-level API like DataFrame to the Data Source API is not
good for maintenance.
2. Data s
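To make point 1 concrete, here is a minimal sketch of the V1 write path via `CreatableRelationProvider` (an existing V1 interface): the source is handed the whole DataFrame and must also interpret the save mode, mixing metadata and data operations. The helper methods are hypothetical stand-ins for a real connector's calls.
```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}
import org.apache.spark.sql.types.StructType

// Sketch of a V1 writer: it is handed the DataFrame directly and has to decide,
// from SaveMode, whether to create the table (metadata) before inserting (data).
class ExampleV1Source extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    if (!tableExists(parameters)) {
      createTable(parameters, data.schema) // metadata operation
      insertData(parameters, data)         // data operation
    } else mode match {
      case SaveMode.Append        => insertData(parameters, data)
      case SaveMode.Overwrite     => truncateTable(parameters); insertData(parameters, data)
      case SaveMode.Ignore        => ()    // silently skip
      case SaveMode.ErrorIfExists => throw new IllegalStateException("table already exists")
    }
    relationFor(parameters)
  }

  // Hypothetical helpers standing in for a real connector's catalog/data calls.
  private def tableExists(params: Map[String, String]): Boolean = ???
  private def createTable(params: Map[String, String], schema: StructType): Unit = ???
  private def truncateTable(params: Map[String, String]): Unit = ???
  private def insertData(params: Map[String, String], data: DataFrame): Unit = ???
  private def relationFor(params: Map[String, String]): BaseRelation = ???
}
```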