Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-23 Thread Wenchen Fan
I don't think we want to add a lot of flexibility to the PARTITION BY
expressions. It's usually just columns or nested fields, or some common
functions like year, month, etc.

If you look at the parser, we create DS V2 Expressions directly.
The partition-specific expressions are for
`DataFrameWriterV2.partitionedBy` only. That method takes `Column` as input,
and we can't pass a DS V2 Expression to it directly. If there are better
ways to call this method, we can remove these partition-specific expressions.
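
As a minimal sketch of that call site (assuming the DataFrameWriterV2 API as
it stands in master for 3.0; the table names are hypothetical), the
partition-specific helpers exist so that a transform can be passed where a
`Column` is expected:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{bucket, col, years}

val spark = SparkSession.builder().getOrCreate()
val df = spark.table("source_events") // hypothetical source table

// partitionedBy takes Columns, so the transforms must be wrapped in
// Column form; a plain DS V2 Expression cannot be passed here.
df.writeTo("catalog.db.events") // hypothetical DS V2 target table
  .partitionedBy(years(col("ts")), bucket(16, col("id")))
  .create()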

A native function doesn't work. It's quite inconvenient if we force the
implementations to call a Java function to do the partitioning. This is
different from a UDF: a UDF means someone provides a function and asks Spark
to run it. Partitioning is the opposite.

Hope this helps.

Thanks,
Wenchen




Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-22 Thread Hyukjin Kwon
There's another PR open to expose this more publicly on the Python side (
https://github.com/apache/spark/pull/27331).

To sum up, I would like to make sure we have answers to the questions below:
- Is this expression only for partitioning, or do we plan to expose it to
replace other existing expressions as some kind of public DSv2 expression
API?
- Do we want to support other expressions here?
  - If so, why do we need partition-specific expressions?
  - If not, why don't we use a different syntax and class for this API?
- What about exposing a native function to allow transforms, like a UDF?

Ryan and Wenchen, do you mind answering these questions?


Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
Thanks for giving me some context and clarification, Ryan.

I think I was rather proposing to revert because I don't see an explicit
plan here, and the work has been left half-done for a long while.
From reading the PR description and the code, I could not guess in which
direction we should take this API (e.g., is this expression only for
partitioning, or a replacement for all expressions?). Also, if you take a
look at the commit log, it has not been touched for 10 months except for
moving code around and minor fixes.

Do you mind if I ask how we plan to extend this feature? For example:
- whether we want to replace the existing expressions in the end;
- whether we want to add a copy of the existing expressions, and for what
reasons;
- how we will handle ambiguity about supported expressions between other
data source implementations and Spark;
- if we can't tell which other expressions are supported here, why we don't
just use a different syntax to make that clear.

If we have such a plan or doc, so that people can make incremental
improvements accordingly, I am good to keep it.


Here are some follow-up questions and answers:

> I don't think there is reason to revert this simply because of some of
the early choices, like deciding to start a public expression API. If you'd
like to extend this to "fix" areas where you find it confusing, then please
do.

If it's clear that we should redesign the API, or there is no further plan
for that API at the moment, I think reverting can be an option, in
particular considering that the code freeze is close. For example, why
didn't we try a UDF-like way, exposing only a function interface?


> The idea was that Spark needs a public expression API anyway for other
uses

I was wondering why we need a public expression API in DSv2. Are there
places that UDFs can't cover?
As said above, it requires duplicating the existing expressions, and I
wonder if this is worthwhile.
With the stub of the Transform API, it looks like we want this, but I don't
know why.


> None of this has been confusing or misleading for our users, who caught
on quickly.

Maybe the simple cases wouldn't bring much confusion if users were already
told about them.
However, if we think about the differences and subtleties, I doubt the
users already know the answers:

CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*

  - It looks like an expression, as if other expressions and combinations
were allowed.
  - Since the expressions are handled by the DSv2 implementation, the
behaviour depends on it; e.g., using *transform* against data source
implementations A and B can differ.
  - Likewise, if Spark itself supports *transform* here, the behaviour will
differ again.



Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Ryan Blue
Hi everyone,

Let me recap some of the discussions that got us to where we are with this
today. Hopefully that will provide some clarity.

The purpose of partition transforms is to allow source implementations to
internally handle partitioning. Right now, users are responsible for this.
For example, users will transform timestamps into date strings when writing,
and other people will provide a filter on those date strings when scanning.
This is error-prone: users commonly forget to add partition filters in
addition to data filters; if anyone uses the wrong format or transformation,
queries will silently return incorrect results; and so on. But sources can
(and should) automatically handle storing and retrieving data internally,
because it is much easier for users.
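
As a sketch of that manual pattern (the column names, path, and helper
functions below are illustrative, not from this thread):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format}

// Writer side: the user must derive the partition column by hand and keep
// its format consistent with every reader. Assumes `events` has a
// timestamp column `ts`.
def writeManually(events: DataFrame): Unit =
  events
    .withColumn("dt", date_format(col("ts"), "yyyy-MM-dd"))
    .write.partitionBy("dt").parquet("/warehouse/events") // hypothetical path

// Reader side: the partition filter on dt must be added in addition to the
// data filter on ts; forgetting it (or using a mismatched date format)
// silently loses pruning or returns wrong results.
def readManually(spark: SparkSession): DataFrame =
  spark.read.parquet("/warehouse/events")
    .where(col("dt") === "2020-01-16" && col("ts") >= "2020-01-16 00:00:00")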

When we first proposed transforms, I wanted to use Expression. But Reynold
rightly pointed out that Expression is an internal API that should not be
exposed. So we decided to compromise by building a public expressions API
like the public Filter API for the initial purpose of passing transform
expressions to sources. The idea was that Spark needs a public expression
API anyway for other uses, like requesting a distribution and ordering for
a writer. To keep things simple, we chose to build a minimal public
expression API and expand it incrementally as we need more features.

We also considered whether to parse all expressions and convert only
transformations to the public API, or to parse just transformations. We
went with just parsing transformations because it was easier and we can
expand it to improve error messages later.

I don't think there is reason to revert this simply because of some of the
early choices, like deciding to start a public expression API. If you'd
like to extend this to "fix" areas where you find it confusing, then please
do. We know that by parsing more expressions we could improve error
messages. But that's not to say that we need to revert it.

None of this has been confusing or misleading for our users, who caught on
quickly.


Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
I think the problem here is whether there is an explicit plan or not.
The PR was merged one year ago, and not many changes have been made to this
API to address the main concerns mentioned.
Also, the follow-up JIRA that was requested still seems to be open:
https://issues.apache.org/jira/browse/SPARK-27386
I heard this was already discussed, but I cannot find the summary of the
meeting or any documentation.

I would like to be sure about how we plan to extend this. I had a couple of
questions, such as:
  - Why can't we use a UDF-like interface, for example? (See the
hypothetical sketch below.)
  - Is this expression only for partitioning, or do we plan to expose it to
replace other existing expressions?
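
(A purely hypothetical sketch of such a UDF-like interface, only to make the
question concrete -- none of these names exist in Spark:)

import java.sql.Timestamp

// Hypothetical: a source receives a function it can call, instead of a
// parsed transform name whose semantics live in documentation.
trait PartitionTransformFunction extends Serializable {
  def name: String
  def transform(value: Any): Any // maps a column value to a partition value
}

// Hypothetical example implementation: days since the epoch.
class DaysSinceEpoch extends PartitionTransformFunction {
  override def name: String = "days"
  override def transform(value: Any): Any = value match {
    case ts: Timestamp => ts.getTime / (24L * 60 * 60 * 1000)
    case other => throw new IllegalArgumentException(s"Not a timestamp: $other")
  }
}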

> About extensibility, it's similar to DS V1 Filter again. We don't cover
all the expressions at the beginning, but we can add more in future
versions when needed. The data source implementations should be defensive
and fail when seeing unrecognized Filter/Transform.

I think there are differences:
- A DSv1 filter works whether or not the filters are pushed down; however,
that is not the case here.
- There are too many expressions, whereas the number of predicates is
relatively limited. If we plan to push all expressions eventually, I doubt
this is a good idea.



Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Wenchen Fan
The DS v2 project is still evolving, so half-baked is sometimes inevitable.
This feature is definitely in the right direction to allow more flexible
partition implementations, but there are a few problems we can discuss.

About expression duplication: this is an existing design choice. We don't
want to expose the Expression class directly, but we do need to expose some
Expression-like stuff in the developer APIs. So we pick some basic
expressions, make a copy, and create a public version of them. This is what
we did for the DS V1 Filter, and I think we can continue to do this for the
DS v2 Transform.
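
As a small illustration of that DS V1 pattern (the matching logic is just a
sketch; `EqualTo` and `GreaterThan` are the real public Filter classes):

import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

// Spark exposes small, stable public copies of a few expression shapes in
// org.apache.spark.sql.sources instead of the internal Expression tree;
// sources simply pattern-match on them.
def describe(pushed: Array[Filter]): Unit = pushed.foreach {
  case EqualTo(attribute, value)     => println(s"$attribute = $value")
  case GreaterThan(attribute, value) => println(s"$attribute > $value")
  case other                         => println(s"unhandled filter: $other")
}

describe(Array(EqualTo("id", 1), GreaterThan("ts", "2020-01-16")))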

About extensibility: it's similar to the DS V1 Filter again. We don't cover
all the expressions at the beginning, but we can add more in future versions
when needed. The data source implementations should be defensive and fail
when seeing an unrecognized Filter/Transform.

About compatibility: this is the place where I have a concern as well. For
the DS V1 Filter, we just expose all the Filter classes, like `EqualTo`,
`GreaterThan`, etc. These classes have well-defined semantics. For the DS V2
Transform, we only expose the Transform interface, and data sources need to
look at `Transform#name` and search the documentation to find the semantics.
What's worse, the parser/analyzer allows an arbitrary string as a Transform
name, so it's impossible to have well-defined semantics, and different
sources may give different semantics to the same Transform name.
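
A hedged sketch of what that defensive handling looks like from a source's
side (the set of supported names here is illustrative):

import org.apache.spark.sql.connector.expressions.Transform

// All a source sees is Transform#name plus its arguments; the semantics
// behind each name are a documentation convention, not part of the API,
// so the source must reject names it does not recognize.
def validatePartitioning(partitioning: Array[Transform]): Unit =
  partitioning.foreach { t =>
    t.name match {
      case "identity" | "years" | "months" | "days" | "hours" | "bucket" =>
        () // supported by this source
      case other =>
        throw new UnsupportedOperationException(s"Unrecognized transform: $other")
    }
  }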

I'd suggest we forbid arbitrary strings as Transforms (the ApplyTransform
class). We can even follow the DS V1 Filter and expose the classes directly.



[DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Hyukjin Kwon
Hi all,

I would like to suggest taking one step back from
https://github.com/apache/spark/pull/24117 and rethinking it.
I am writing this email because I have raised the issue a few times without
getting enough responses promptly, and the code freeze is close.

In particular, please refer to the comments below for the full context:
- https://github.com/apache/spark/pull/24117#issuecomment-568891483
- https://github.com/apache/spark/pull/24117#issuecomment-568614961
- https://github.com/apache/spark/pull/24117#issuecomment-568891483


In short, this PR added an API in DSv2:

CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*

so that people can write classes for *transform(col)* specifically for
partition columns.

However, there are some design concerns that do not look properly addressed.

Note that one of the main points is to avoid half-baked or just-work-for-now
APIs. However, this definitely looks half-completed. Therefore, I would like
to propose taking one step back and reverting it for now.
Please see the concerns listed below.

*Duplication of existing expressions*
It seems existing expressions are going to be duplicated. See the new APIs
added below:

def years(column: String): YearsTransform = YearsTransform(reference(column))
def months(column: String): MonthsTransform = MonthsTransform(reference(column))
def days(column: String): DaysTransform = DaysTransform(reference(column))
def hours(column: String): HoursTransform = HoursTransform(reference(column))
...

It looks like this requires adding a copy of our existing expressions in the
future.
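
For comparison, the expressions these helpers mirror are already exposed
through org.apache.spark.sql.functions (a real API, shown only to illustrate
the duplication; `df` and the column name are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, year}

// functions.year already wraps the existing Catalyst expression that
// extracts the year from a date/timestamp column; YearsTransform above is
// a second, parallel public representation of the same idea.
def withYear(df: DataFrame): DataFrame = df.withColumn("y", year(col("ts")))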


*Limited Extensibility*
It has a clear limitation. It looks as though other expressions could be
allowed in combination (e.g., `concat(years(col) + days(col))`);
however, it looks impossible to support that with the current design. The
parser just maps transformName directly to an implementation class and
passes the arguments along:

transform
    ...
    | transformName=identifier
      '(' argument+=transformArgument (',' argument+=transformArgument)* ')'  #applyTransform
    ;

It looks as if general expressions are supported here; however, they are not.
- If we should support them, the design should have taken that into account.
- If we should not support them, a different syntax might have to be used
instead.

*Limited Compatibility Management*
The name can be arbitrary. For instance, if "transform" becomes supported on
the Spark side, the name is preempted by Spark.
If a data source ever supported such a name, it becomes incompatible.