I don't think we want to add a lot of flexibility to the PARTITION BY
expressions. It's usually just columns or nested fields, or some common
functions like year, month, etc.
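
As a minimal sketch of the typical usage (the catalog, table, and source
names below are hypothetical, and it assumes a DS V2 catalog/source that
supports these transforms):

    // Partition by a plain column plus a common transform such as years().
    spark.sql("""
      CREATE TABLE testcat.db.events (id BIGINT, ts TIMESTAMP)
      USING mysource
      PARTITIONED BY (id, years(ts))
    """)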

If you look at the parser, we create DS V2 Expressions directly.
The partition-specific expressions exist for
`DataFrameWriterV2.partitionedBy` only: that method takes `Column` as input,
so we can't pass a DS V2 Expression to it directly. If there is a better way
to call this method, we can remove these partition-specific expressions.
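
For reference, a rough sketch of the Column-based call site (the data frame
and table names are hypothetical, and it assumes the partition helper
functions in org.apache.spark.sql.functions):

    import org.apache.spark.sql.functions.{bucket, col, years}

    // partitionedBy takes Columns; internally they are converted into
    // DS V2 Transforms before being handed to the catalog.
    df.writeTo("testcat.db.events")
      .partitionedBy(years(col("ts")), bucket(16, col("id")))
      .create()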

A native function doesn't work. It's quite inconvenient if we force the
implementations to call a Java function to do the partitioning. This is also
different from a UDF: a UDF means someone provides a function and asks Spark
to run it, while partitioning is the opposite, where the source itself
handles the transform and Spark only passes it along.

Hope this helps.

Thanks,
Wenchen



On Thu, Jan 23, 2020 at 3:42 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> There's another PR open to expose this more publicly on the Python side (
> https://github.com/apache/spark/pull/27331).
>
> To sum up, I would like to make sure we are clear on the points below:
> - Is this expression only for partitioning, or do we plan to expose it to
> replace other existing expressions as some kind of public DSv2 expression
> API?
> - Do we want to support other expressions here?
>   - If so, why do we need partition-specific expressions?
>   - If not, why don't we use a different syntax and class for this API?
> - What about exposing a native function to allow transforms, like a UDF?
>
> Ryan and Wenchen, would you mind answering these questions?
>
> On Fri, Jan 17, 2020 at 10:25 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Thanks for giving me some context and clarification, Ryan.
>>
>> I think I was rather proposing to revert because I don't see an explicit
>> plan here, and it has been left half-done for a long while.
>> From reading the PR description and code, I could not tell in which
>> direction we should fix this API (e.g., is this expression only for
>> partitioning, or a replacement for all expressions?). Also, if you take a
>> look at the commit log, it has not changed for 10 months apart from code
>> moves and minor fixes.
>>
>> Do you mind if I ask how we plan to extend this feature? For example:
>> - Do we want to replace the existing expressions in the end?
>> - Do we want to add a copy of the expressions for some reason?
>> - How will we handle ambiguity between the expressions supported by Spark
>> and those supported by data source implementations?
>> - If we can't tell which other expressions are supported here, why don't
>> we just use a different syntax to clarify?
>>
>> If we have such a plan or doc, and people can then fix this accordingly
>> with incremental improvements, I am fine with keeping it.
>>
>>
>> Here are some follow-up questions and answers:
>>
>> > I don't think there is reason to revert this simply because of some of
>> the early choices, like deciding to start a public expression API. If you'd
>> like to extend this to "fix" areas where you find it confusing, then please
>> do.
>>
>> If it's clear that we should redesign the API, or there is no further plan
>> for the API at this moment, I think reverting can be an option, in
>> particular considering that the code freeze is close. For example, why
>> don't we try a UDF-like way that exposes only a function interface?
>>
>>
>> > The idea was that Spark needs a public expression API anyway for other
>> uses
>>
>> I was wondering why we need a public expression API in DSv2. Are there
>> places where UDFs can't cover this?
>> As said above, it requires duplicating the existing expressions, and I
>> wonder if this is worthwhile.
>> With the stub of the Transform API, it looks like we want this, but I
>> don't know why.
>>
>>
>> > None of this has been confusing or misleading for our users, who caught
>> on quickly.
>>
>> Maybe a simple case wouldn't bring so much confusion if users were already
>> told about it.
>> However, if we think about the differences and subtleties, I doubt the
>> users already know the answers:
>>
>> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>>
>>   - It looks like an expression, suggesting that other expressions and
>> combinations are allowed.
>>   - Since the expressions are handled by DSv2, the behaviours depend on
>> the DSv2 implementation, e.g., using *transform* against data source
>> implementations A and B can differ.
>>   - Likewise, if Spark itself supports *transform* here, the behaviour
>> will be different again.
>>
>>
>> On Fri, Jan 17, 2020 at 2:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Hi everyone,
>>>
>>> Let me recap some of the discussions that got us to where we are with
>>> this today. Hopefully that will provide some clarity.
>>>
>>> The purpose of partition transforms is to allow source implementations
>>> to internally handle partitioning. Right now, users are responsible for
>>> this. For example, users will transform timestamps into date strings when
>>> writing and other people will provide a filter on those date strings when
>>> scanning. This is error-prone: users commonly forget to add partition
>>> filters in addition to data filters; if anyone uses the wrong format or
>>> transformation, queries will silently return incorrect results; etc. But
>>> sources can (and should) automatically handle storing and retrieving data
>>> internally because it is much easier for users.
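>>>
>>> As a rough sketch of that manual pattern (paths and column names are made
>>> up, and it assumes df is a DataFrame with a timestamp column ts and spark
>>> is the session):
>>>
>>>   import org.apache.spark.sql.functions._
>>>
>>>   // Writing: the user derives a date-string partition column by hand.
>>>   df.withColumn("day", date_format(col("ts"), "yyyy-MM-dd"))
>>>     .write.partitionBy("day").parquet("/data/events")
>>>
>>>   // Reading: the filter on ts alone does not prune partitions; the user
>>>   // must remember the redundant filter on "day", in the same format.
>>>   spark.read.parquet("/data/events")
>>>     .where(col("ts") >= "2020-01-02" && col("day") >= "2020-01-02")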
>>>
>>> When we first proposed transforms, I wanted to use Expression. But
>>> Reynold rightly pointed out that Expression is an internal API that should
>>> not be exposed. So we decided to compromise by building a public
>>> expressions API like the public Filter API for the initial purpose of
>>> passing transform expressions to sources. The idea was that Spark needs a
>>> public expression API anyway for other uses, like requesting a distribution
>>> and ordering for a writer. To keep things simple, we chose to build a
>>> minimal public expression API and expand it incrementally as we need more
>>> features.
>>>
>>> We also considered whether to parse all expressions and convert only
>>> transformations to the public API, or to parse just transformations. We
>>> went with just parsing transformations because it was easier and we can
>>> expand it to improve error messages later.
>>>
>>> I don't think there is reason to revert this simply because of some of
>>> the early choices, like deciding to start a public expression API. If you'd
>>> like to extend this to "fix" areas where you find it confusing, then please
>>> do. We know that by parsing more expressions we could improve error
>>> messages. But that's not to say that we need to revert it.
>>>
>>> None of this has been confusing or misleading for our users, who caught
>>> on quickly.
>>>
>>> On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>>> I think the problem here is whether there is an explicit plan or not.
>>>> The PR was merged one year ago, and not many changes have been made to
>>>> this API to address the main concerns mentioned.
>>>> Also, the follow-up JIRA requested still seems to be open:
>>>> https://issues.apache.org/jira/browse/SPARK-27386
>>>> I heard this was already discussed, but I cannot find the summary of the
>>>> meeting or any documentation.
>>>>
>>>> I would like to be sure about how we plan to extend this. I had a couple
>>>> of questions, such as:
>>>>   - Why can't we use a UDF-like interface, for example?
>>>>   - Is this expression only for partitioning, or do we plan to expose it
>>>> to replace other existing expressions?
>>>>
>>>> > About extensibility, it's similar to DS V1 Filter again. We don't
>>>> cover all the expressions at the beginning, but we can add more in future
>>>> versions when needed. The data source implementations should be defensive
>>>> and fail when seeing unrecognized Filter/Transform.
>>>>
>>>> I think there are differences in that:
>>>> - A DSv1 filter works whether or not it is pushed down; however, this
>>>> case does not work.
>>>> - There are too many expressions, whereas the number of predicates is
>>>> relatively limited. If we plan to push all expressions eventually, I
>>>> doubt this is a good idea.
>>>>
>>>>
>>>> On Thu, Jan 16, 2020 at 9:22 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> The DS v2 project is still evolving, so half-baked APIs are sometimes
>>>>> inevitable. This feature is definitely in the right direction to allow
>>>>> more flexible partition implementations, but there are a few problems we
>>>>> can discuss.
>>>>>
>>>>> About expression duplication. This is an existing design choice. We
>>>>> don't want to expose the Expression class directly, but we do need to
>>>>> expose some Expression-like stuff in the developer APIs. So we pick some
>>>>> basic expressions, make a copy, and create public versions of them. This
>>>>> is what we did for the DS V1 Filter, and I think we can continue to do
>>>>> this for the DS V2 Transform.
>>>>>
>>>>> About extensibility, it's similar to DS V1 Filter again. We don't
>>>>> cover all the expressions at the beginning, but we can add more in future
>>>>> versions when needed. The data source implementations should be defensive
>>>>> and fail when seeing unrecognized Filter/Transform.
>>>>>
>>>>> About compatibility. This is the place where I have a concern as well.
>>>>> For the DS V1 Filter, we just expose all the Filter classes, like
>>>>> `EqualTo`, `GreaterThan`, etc. These classes have well-defined
>>>>> semantics. For the DS V2 Transform, we only expose the Transform
>>>>> interface, and data sources need to look at `Transform#name` and search
>>>>> the documentation to see the semantics. What's worse, the
>>>>> parser/analyzer allows an arbitrary string as the Transform name, so
>>>>> it's impossible to have well-defined semantics, and different sources
>>>>> may assign different semantics to the same Transform name.
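>>>>>
>>>>> To illustrate, here is a rough sketch of how a source has to dispatch on
>>>>> `Transform#name` and defensively reject anything it does not recognize
>>>>> (the supported names and the error message are just for illustration):
>>>>>
>>>>>   import org.apache.spark.sql.connector.expressions.Transform
>>>>>
>>>>>   // A source that only understands identity and years partitioning has
>>>>>   // to fail on anything else, since the name alone carries the
>>>>>   // semantics.
>>>>>   def validatePartitioning(transforms: Array[Transform]): Unit = {
>>>>>     transforms.foreach { t =>
>>>>>       t.name() match {
>>>>>         case "identity" | "years" => // supported by this source
>>>>>         case other => throw new UnsupportedOperationException(
>>>>>           s"Unsupported partition transform: $other")
>>>>>       }
>>>>>     }
>>>>>   }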
>>>>>
>>>>> I'd suggest we forbid arbitrary strings as Transform names (the
>>>>> ApplyTransform class). We can even follow the DS V1 Filter and expose
>>>>> the classes directly.
>>>>>
>>>>> On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon <gurwls...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to suggest taking one step back from
>>>>>> https://github.com/apache/spark/pull/24117 and rethinking it.
>>>>>> I am writing this email as I have raised the issue a few times but
>>>>>> could not get enough responses promptly, and the code freeze is close.
>>>>>>
>>>>>> In particular, please refer to the comments below for the full context:
>>>>>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>>>>>> - https://github.com/apache/spark/pull/24117#issuecomment-568614961
>>>>>>
>>>>>>
>>>>>> In short, this PR added an API in DSv2:
>>>>>>
>>>>>> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>>>>>>
>>>>>>
>>>>>> So people can write classes for *transform(col)* specifically for
>>>>>> partitioned columns.
>>>>>>
>>>>>> However, there are some design concerns that do not look properly
>>>>>> addressed.
>>>>>>
>>>>>> Note that one of the main goals is to avoid half-baked or
>>>>>> just-work-for-now APIs. However, this definitely looks half-completed.
>>>>>> Therefore, I would like to propose taking one step back and reverting
>>>>>> it for now.
>>>>>> Please see the concerns listed below.
>>>>>>
>>>>>> *Duplication of existing expressions*
>>>>>> It seems existing expressions are going to be duplicated. See the new
>>>>>> APIs added below:
>>>>>>
>>>>>> def years(column: String): YearsTransform = YearsTransform(reference(column))
>>>>>> def months(column: String): MonthsTransform = MonthsTransform(reference(column))
>>>>>> def days(column: String): DaysTransform = DaysTransform(reference(column))
>>>>>> def hours(column: String): HoursTransform = HoursTransform(reference(column))
>>>>>> ...
>>>>>>
>>>>>> It looks like this will require adding copies of our existing
>>>>>> expressions in the future.
>>>>>>
>>>>>>
>>>>>> *Limited Extensibility*
>>>>>> It has a clear limitation. It looks like other expressions could be
>>>>>> allowed in combination (e.g., `concat(years(col) + days(col))`);
>>>>>> however, that is impossible with the current design. It just maps
>>>>>> transformName directly to an implementation class and passes the
>>>>>> arguments:
>>>>>>
>>>>>> transform
>>>>>>     ...
>>>>>>     | transformName=identifier
>>>>>>       '(' argument+=transformArgument (',' argument+=transformArgument)* ')'  #applyTransform
>>>>>>     ;
>>>>>>
>>>>>> It looks like general expressions are supported; however, they are not.
>>>>>> - If we should support them, the design has to take that into account.
>>>>>> - If we should not support them, a different syntax might have to be
>>>>>> used instead.
>>>>>>
>>>>>> *Limited Compatibility Management*
>>>>>> The transform name can be arbitrary. For instance, if *transform* is
>>>>>> later supported on the Spark side, the name is preempted by Spark.
>>>>>> If a data source had already supported such a name, it becomes
>>>>>> incompatible.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
