Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-20 Thread Thakrar, Jayesh
Here’s what can be done in PostgreSQL.

You can create a partitioned table with a partition clause, e.g.
CREATE TABLE measurement (...) PARTITION BY RANGE (logdate)

You can then add partitions by creating tables as partitions of the
partitioned table, e.g.
CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES FROM
('2006-02-01') TO ('2006-03-01')

Each “partition” is itself a full-fledged table and can be managed just like one.

And of course you can have nested partitioning.

As for partition management, you can attach/detach partitions – converting a
regular table into a table partition, or a table partition back into a regular
table – using the ALTER TABLE statement:

ALTER TABLE measurement ATTACH/DETACH PARTITION

There are similar options in Oracle, where converting a table into a
partition and vice-versa is referred to as “partition exchange”.
However, unlike PostgreSQL, Oracle does not treat table partitions as regular
tables.
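
For illustration, here is a minimal Scala sketch of driving this attach/detach
DDL over JDBC as a pass-through command. The connection URL, credentials, and
database are assumptions; only the ALTER TABLE statements come from the
PostgreSQL examples above.

import java.sql.DriverManager

// Sketch only: issue PostgreSQL partition DDL as a pass-through command.
// URL, credentials, and database are illustrative assumptions.
object PartitionExchangeDemo {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/metrics", "user", "secret")
    try {
      val stmt = conn.createStatement()
      // DETACH turns the partition back into a regular table...
      stmt.execute(
        "ALTER TABLE measurement DETACH PARTITION measurement_y2006m02")
      // ...and ATTACH converts a regular table into a partition again.
      stmt.execute(
        "ALTER TABLE measurement ATTACH PARTITION measurement_y2006m02 " +
          "FOR VALUES FROM ('2006-02-01') TO ('2006-03-01')")
    } finally conn.close()
  }
}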


As for the relevance of partition management to the Spark API, here are some thoughts:

Reading data from a table supporting predicate pushdown:
Without an explicit partition specification, we would need to rely on
partition pruning to select the appropriate partitions.
However, if we can provide a mechanism to specify the partition(s), that would
be great – it would need to be translated into the appropriate SQL clauses
under the covers.
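
For example, the user-facing side of such a mechanism might look like this
(a sketch only – the source name and the "partitions" option are hypothetical,
not an existing Spark API):

import org.apache.spark.sql.SparkSession

// Sketch only: "partitions" is a hypothetical reader option that the v2
// source would translate into the appropriate SQL clause under the covers.
val spark = SparkSession.builder().appName("partition-read").getOrCreate()
val df = spark.read
  .format("com.example.partitionedjdbc")  // hypothetical v2 source
  .option("table", "measurement")
  .option("partitions", "measurement_y2006m02")
  .load()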

Writing data to a table supporting partitions:
I think there is currently no way to support the above PostgreSQL/Oracle ways
of creating table partitions or doing partition exchanges intelligently, so
options or some appropriate interfaces would be required.
The ALTER TABLE-equivalent work above could then be done as part of the commit
(provided an appropriate interface is supported).
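
As a sketch of that commit-time idea (every type name below is a hypothetical
stand-in, not an existing or proposed Spark interface): write into a staging
table during the job, and perform the partition exchange only when the commit
fires.

// Hypothetical stand-ins that only illustrate doing the ALTER TABLE
// work as part of a successful commit.
trait WriterCommitMessage
trait CommitHook { def commit(messages: Seq[WriterCommitMessage]): Unit }

class ExchangeOnCommit(runDdl: String => Unit) extends CommitHook {
  override def commit(messages: Seq[WriterCommitMessage]): Unit = {
    // Data was written to measurement_staging; attach it atomically now.
    runDdl(
      "ALTER TABLE measurement ATTACH PARTITION measurement_staging " +
        "FOR VALUES FROM ('2006-03-01') TO ('2006-04-01')")
  }
}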

Here are Dale’s comments from earlier in the thread:
“So if we are not hiding them from the user, we need to allow users to
manipulate them. Either by representing them generically in the API,
allowing pass-through commands to manipulate them, or by some other means.”

I think we need to mull this over and also look beyond RDBMSes – say, at S3 –
for applicability.

In essence, I think partitions matter because they enable partition pruning
(= less resource-intensive reads) and allow partition setup and appropriate
targeting during writes.



Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
What does partition management look like in those systems and what are the
options we would standardize in an API?


-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Thakrar, Jayesh
I think a partition management feature would be very useful in RDBMSes that
support it – e.g. Oracle, PostgreSQL, and DB2.
In some cases adding partitions can be explicit and can be done outside of
data loads.
But in other cases it may need to be done implicitly, when supported by the
platform – similar to the static/dynamic partition loading in Hive and
Oracle.

So in short, I agree that partition management should be an optional interface.
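
For reference, a sketch of the static vs. dynamic distinction in Hive-style
SQL (the events/staging table names are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Static: the target partition is named explicitly up front.
spark.sql(
  "INSERT OVERWRITE TABLE events PARTITION (dt = '2018-09-19') " +
    "SELECT id, payload FROM staging WHERE dt = '2018-09-19'")

// Dynamic: partitions are derived (and created) from the data itself.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  "INSERT OVERWRITE TABLE events PARTITION (dt) " +
    "SELECT id, payload, dt FROM staging")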



Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
I'm open to exploring the idea of adding partition management as a catalog
API. The approach we're taking is to have an interface for each concern a
catalog might implement, like TableCatalog (proposed in SPARK-24252), but
also FunctionCatalog for stored functions and possibly
PartitionedTableCatalog for explicitly partitioned tables.
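
As a rough sketch of the shape such a mixin could take (hypothetical – nothing
below is part of SPARK-24252 or any proposal):

// Hypothetical shape for a partition-aware catalog mixin.
case class PartitionSpec(values: Map[String, String])

trait PartitionedTableCatalog {
  def listPartitions(table: String): Seq[PartitionSpec]
  def createPartition(table: String, spec: PartitionSpec): Unit
  def dropPartition(table: String, spec: PartitionSpec): Unit
}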

That could definitely be used to implement ALTER TABLE ADD/DROP PARTITION
for Hive tables, although I'm not sure that we would want to continue
exposing partitions for simple tables. I know that this is important for
storage systems like Kudu, but I think it is needlessly difficult and
annoying for simple tables that are partitioned by a regular transformation
like Hive tables. That's why Iceberg hides partitioning outside of table
configuration. That also avoids problems where SELECT DISTINCT queries are
wrong because a partition exists but has no data.

How useful is this outside of Kudu? Is it something that we should provide
an API for, or is it specific enough to Kudu that Spark shouldn't include
it in the API for all sources?

rb



-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-18 Thread Thakrar, Jayesh
Totally agree with you, Dale – there are situations where, for efficiency,
performance, and better control/visibility/manageability, we need to expose
partition management.

So as described, I suggested two things: the ability to do it in the current
V2 API form via options and an appropriate implementation in the datasource
reader/writer.

And for the long term, I suggested that partition management can be made part
of metadata/catalog management – SPARK-24252 (DataSourceV2: Add catalog
support).







Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-17 Thread tigerquoll
Hi Jayesh,
I get where you are coming from - partitions are just an implementation
optimisation that we really shouldn’t be bothering the end user with. 
Unfortunately that view is like saying RPC is just a procedure call, and
details of the network transport should be hidden from the end user. CORBA
tried this approach for RPC and failed, for the same reason that no major
vendor of DBMS systems that support partitions tries to hide them from the
end user. They have a substantial real-world effect that is impossible to
hide from the user (in particular when writing/modifying the data source).
Any attempt to “take care” of partitions automatically invariably guesses
wrong and ends up frustrating the end user (as “substantial real-world
effect” turns into “show-stopping performance penalty” when the user fights
against a partitioning scheme she has no idea exists).

So if we are not hiding them from the user, we need to allow users to
manipulate them. Either by representing them generically in the API,
allowing pass-through commands to manipulate them, or by some other means.

Regards,
Dale.







Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-16 Thread Thakrar, Jayesh
I am not involved with the design or development of the V2 API, so these may
be naïve comments/thoughts.
Just as Dataset abstracts away the RDD, which otherwise requires somewhat
more intimate knowledge of Spark internals, I am guessing the absence of
partition operations is due either to no current need for them or to a desire
to abstract them away from the API user/programmer.
Of course, the other thing about Dataset is that it comes with a schema,
closely binding the two together.

However, for situations where a deeper understanding of the datasource is
necessary to read/write data, that logic can potentially be embedded into the
classes that implement DataSourceReader/DataSourceWriter or
DataReader/DataWriter.

E.g. if you are writing data and need to create partitions dynamically on the
fly, the DataSourceWriter can gather the current list of partitions and pass
it on to the DataWriters via the DataWriterFactory. Each DataWriter can
consult that list, and if it encounters a row whose partition does not exist,
create that partition (note that the partition-creation operation needs to be
idempotent, OR the DataWriter needs to check for the partition before trying
to create it, as it may have already been created by another DataWriter).
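
A sketch of that check-then-create pattern (PartitionClient and the row type
are illustrative stand-ins, not Spark or datasource APIs):

import scala.collection.mutable

trait PartitionClient {
  def exists(partition: String): Boolean
  def create(partition: String): Unit  // must be idempotent
}

class PartitionCreatingWriter(client: PartitionClient) {
  private val known = mutable.Set.empty[String]

  def write(partition: String, row: Seq[Any]): Unit = {
    // Another DataWriter may race us to create the same partition, so
    // create() must tolerate "already exists" – hence the idempotency
    // requirement above.
    if (known.add(partition) && !client.exists(partition)) {
      client.create(partition)
    }
    // ... append the row to the partition ...
  }
}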

As for partition add/list/drop/alter, I don't think that concept/notion
applies to all datasources (e.g. filesystems).
Also, the concept of a Spark partition may not translate into the underlying
datasource's partitions.

At the same time, I did see a discussion thread on catalog operations for the
V2 API – although Spark partitions probably do not map one-to-one to the
underlying partitions.

Probably a good place to introduce partition info is to add a method/object
called "meta" to a dataset and allow the datasource to describe itself (e.g.
table permissions, table partitions and specs, datasource info such as the
cluster, etc.).

E.g. something like this:

With just a meta method:
dataset.meta = {optional datasource-specific info}

Or with meta as an intermediate object with several operations:
dataset.meta.describe
dataset.meta.update


However, if you are look 






[Discuss] Datasource v2 support for manipulating partitions

2018-09-16 Thread tigerquoll
I've been following the development of the new data source abstraction with
keen interest.  One of the issues that has occurred to me as I sat down and
planned how I would implement a data source is how I would support
manipulating partitions.

My reading of the current prototype is that the Data source v2 APIs expose
enough of a concept of a partition to support communicating record
distribution particulars to Catalyst, but do not represent partitions as a
concept that the end user of the data sources can manipulate.

The end users of data sources need to be able to add/drop/modify and list
partitions. For example, many systems require partitions to be created
before records are added to them.  

For batch use-cases, it may be possible for users to manipulate partitions
from within the environment that the data source interfaces to, but for
streaming use-cases, this is not at all practical.

Two ways I can think of doing this are:
1. Allow "pass-through" commands to the underlying data source
2. Have a generic concept of partitions exposed to the end user via the data
source API and Spark SQL DML.

I'm keen for option 2 but recognise that it's possible there are better
alternatives out there.
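
For option 2, the end-user experience might mirror the partition DML that
Spark SQL already accepts for Hive tables, but routed through the data source
API (a sketch; the events table is an assumption for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Generic partition DML, dispatched to whatever source backs the table.
spark.sql("ALTER TABLE events ADD PARTITION (dt = '2018-09-16')")
spark.sql("SHOW PARTITIONS events")
spark.sql("ALTER TABLE events DROP PARTITION (dt = '2018-09-16')")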


