Re: [Discuss] Datasource v2 support for manipulating partitions

Ryan Blue Wed, 19 Sep 2018 12:58:36 -0700

I'm open to exploring the idea of adding partition management as a catalog
API. The approach we're taking is to have an interface for each concern a
catalog might implement, like TableCatalog (proposed in SPARK-24252), but
also FunctionCatalog for stored functions and possibly
PartitionedTableCatalog for explicitly partitioned tables.
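
As a rough illustration of the interface-per-concern idea, here is a minimal
Java sketch. Only the names TableCatalog and PartitionedTableCatalog come from
this thread. CatalogPlugin, InMemoryPartitionedCatalog, AlterTablePlanner, and
every method signature are hypothetical and are not the API proposed in
SPARK-24252. The point is only that a planner can check whether a catalog
implements the partition concern before running a partition command.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base type for any catalog; not taken from SPARK-24252.
interface CatalogPlugin {
    String name();
}

// One interface per concern. The method names are illustrative assumptions.
interface TableCatalog extends CatalogPlugin {
    boolean tableExists(String table);
}

// Optional concern: only catalogs with explicit partitions implement this.
interface PartitionedTableCatalog extends TableCatalog {
    void addPartition(String table, Map<String, String> spec);
    boolean dropPartition(String table, Map<String, String> spec);
    List<Map<String, String>> listPartitions(String table);
}

// Toy catalog that keeps partition specs in memory.
final class InMemoryPartitionedCatalog implements PartitionedTableCatalog {
    private final Map<String, List<Map<String, String>>> tables = new LinkedHashMap<>();

    void createTable(String table) {
        tables.putIfAbsent(table, new ArrayList<>());
    }

    @Override public String name() { return "in_memory"; }

    @Override public boolean tableExists(String table) {
        return tables.containsKey(table);
    }

    @Override public void addPartition(String table, Map<String, String> spec) {
        List<Map<String, String>> parts = partitionsOf(table);
        if (!parts.contains(spec)) {
            parts.add(new LinkedHashMap<>(spec)); // defensive copy of the spec
        }
    }

    @Override public boolean dropPartition(String table, Map<String, String> spec) {
        return partitionsOf(table).remove(spec);
    }

    @Override public List<Map<String, String>> listPartitions(String table) {
        return new ArrayList<>(partitionsOf(table));
    }

    private List<Map<String, String>> partitionsOf(String table) {
        List<Map<String, String>> parts = tables.get(table);
        if (parts == null) {
            throw new IllegalArgumentException("No such table: " + table);
        }
        return parts;
    }
}

// Stand-in for the code path behind ALTER TABLE ... ADD PARTITION.
final class AlterTablePlanner {
    static void addPartition(CatalogPlugin catalog, String table, Map<String, String> spec) {
        if (!(catalog instanceof PartitionedTableCatalog)) {
            throw new UnsupportedOperationException(
                catalog.name() + " does not expose partitions");
        }
        ((PartitionedTableCatalog) catalog).addPartition(table, spec);
    }
}
```

With this shape, a catalog that hides partitioning simply does not implement
PartitionedTableCatalog, and the partition commands are rejected for it.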


That could definitely be used to implement ALTER TABLE ADD/DROP PARTITION
for Hive tables, although I'm not sure that we would want to continue
exposing partitions for simple tables. I know that this is important for
storage systems like Kudu, but I think it is needlessly difficult and
annoying for simple tables, like Hive tables, that are partitioned by a
regular transformation. That's why Iceberg hides partitioning everywhere
except table configuration. Hiding partitions also avoids problems where
SELECT DISTINCT queries are wrong because a partition exists but has no data.
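
To make the SELECT DISTINCT problem concrete, here is a small Java sketch. It
assumes a query engine that answers a distinct query on a partition column
from the partition list alone, without reading data. The class
EmptyPartitionDemo and its methods are hypothetical, and the table is modelled
as a plain map from partition value to rows.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class EmptyPartitionDemo {
    // Answers "distinct partition values" from metadata only:
    // every registered partition counts, even one with no rows.
    static Set<String> distinctFromMetadata(Map<String, List<String>> table) {
        return new TreeSet<>(table.keySet());
    }

    // Answers the same question from the rows actually stored:
    // a partition contributes a value only if it holds at least one row.
    static Set<String> distinctFromData(Map<String, List<String>> table) {
        Set<String> values = new TreeSet<>();
        for (Map.Entry<String, List<String>> partition : table.entrySet()) {
            if (!partition.getValue().isEmpty()) {
                values.add(partition.getKey());
            }
        }
        return values;
    }
}
```

If a partition for day 2018-09-19 has been added but holds no rows, the
metadata-only answer includes that day while the data-based answer does not.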

How useful is this outside of Kudu? Is it something that we should provide
an API for, or is it specific enough to Kudu that Spark shouldn't include
it in the API for all sources?

rb


On Tue, Sep 18, 2018 at 7:38 AM Thakrar, Jayesh <
[email protected]> wrote:

> Totally agree with you Dale, that there are situations for efficiency,
> performance and better control/visibility/manageability that we need to
> expose partition management.
>
> So as described, I suggested two things - the ability to do it in the
> current V2 API form via options and appropriate implementation in
> datasource reader/writer.
>
> And for long term, suggested that partition management can be made part of
> metadata/catalog management - SPARK-24252 (DataSourceV2: Add catalog
> support)?
>
>
> On 9/17/18, 8:26 PM, "tigerquoll" <[email protected]> wrote:
>
>     Hi Jayesh,
>     I get where you are coming from - partitions are just an implementation
>     optimisation that we really shouldn’t be bothering the end user with.
>     Unfortunately that view is like saying RPC is like a procedure call, and
>     details of the network transport should be hidden from the end user.
>     CORBA tried this approach for RPC and failed for the same reason that no
>     major vendor of DBMS systems that support partitions tries to hide them
>     from the end user.  They have a substantial real world effect that is
>     impossible to hide from the user (in particular when writing/modifying
>     the data source). Any attempt to “take care” of partitions automatically
>     invariably guesses wrong and ends up frustrating the end user (as
>     “substantial real world effect” turns to “show stopping performance
>     penalty” if the user attempts to fight against a partitioning scheme she
>     has no idea exists).
>
>     So if we are not hiding them from the user, we need to allow users to
>     manipulate them. Either by representing them generically in the API,
>     allowing pass-through commands to manipulate them, or by some other means.
>
>     Regards,
>     Dale.

-- 
Ryan Blue
Software Engineer
Netflix
