Hi everyone,

Let me recap some of the discussions that got us to where we are with this
today. Hopefully that will provide some clarity.

The purpose of partition transforms is to allow source implementations to
internally handle partitioning. Right now, users are responsible for this.
For example, users will transform timestamps into date strings when writing,
and other people will filter on those date strings when scanning. This is
error-prone: users commonly forget to add partition filters in addition to
data filters, and if anyone uses the wrong format or transformation, queries
will silently return incorrect results. But sources can (and should) handle
storing and retrieving partitioned data internally, because that is much
easier for users.
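
To make that concrete, here is a rough sketch of the manual pattern described
above (the table name, path, and column names are just for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder().getOrCreate()

// Write side: the user derives a partition column by hand from the timestamp.
val events = spark.table("raw_events")  // illustrative source table
events
  .withColumn("day", date_format(col("ts"), "yyyy-MM-dd"))
  .write.partitionBy("day").parquet("/tmp/events")

// Read side: the user has to remember to filter on the derived partition
// column too, using exactly the same format, or the scan reads everything.
spark.read.parquet("/tmp/events")
  .where(col("ts") >= "2020-01-01")
  .where(col("day") >= "2020-01-01")

With a transform like days(ts) declared on the table, the source can do both
halves of that internally.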

When we first proposed transforms, I wanted to use Expression. But Reynold
rightly pointed out that Expression is an internal API that should not be
exposed. So we decided to compromise by building a public expressions API
like the public Filter API for the initial purpose of passing transform
expressions to sources. The idea was that Spark needs a public expression
API anyway for other uses, like requesting a distribution and ordering for
a writer. To keep things simple, we chose to build a minimal public
expression API and expand it incrementally as we need more features.

We also considered whether to parse all expressions and convert only
transformations to the public API, or to parse just transformations. We
went with just parsing transformations because it was easier and we can
expand it to improve error messages later.

I don't think there is reason to revert this simply because of some of the
early choices, like deciding to start a public expression API. If you'd
like to extend this to "fix" areas where you find it confusing, then please
do. We know that by parsing more expressions we could improve error
messages. But that's not to say that we need to revert it.

None of this has been confusing or misleading for our users, who caught on
quickly.

On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> I think the problem here is whether or not there is an explicit plan.
> The PR was merged one year ago and not many changes have been made to this
> API to address the main concerns mentioned.
> Also, the follow-up JIRA that was requested still seems to be open:
> https://issues.apache.org/jira/browse/SPARK-27386
> I heard this was already discussed, but I cannot find the summary of the
> meeting or any documentation.
>
> I would like to understand how we plan to extend this. I had a couple of
> questions, such as:
>   - Why can't we use something like the UDF interface, for example?
>   - Is this expression only for partitioning, or do we plan to expose it to
> replace other existing expressions?
>
> > About extensibility, it's similar to DS V1 Filter again. We don't cover
> all the expressions at the beginning, but we can add more in future
> versions when needed. The data source implementations should be defensive
> and fail when seeing unrecognized Filter/Transform.
>
> I think there are differences:
> - DSv1 filters work whether the filters are pushed down or not. However, this
> case does not work that way.
> - There are too many expressions, whereas the number of predicates is
> relatively limited. If we plan to push all expressions eventually, I doubt
> this is a good idea.
>
>
> On Thu, Jan 16, 2020 at 9:22 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> The DS v2 project is still evolving, so half-baked features are sometimes
>> inevitable. This feature is definitely in the right direction to allow more
>> flexible partition implementations, but there are a few problems we can
>> discuss.
>>
>> About expression duplication. This is an existing design choice. We don't
>> want to expose the Expression class directly but we do need to expose some
>> Expression-like stuff in the developer APIs. So we pick some basic
>> expressions, make a copy and create a public version of them. This is what
>> we did for DS V1 Filter, and I think we can continue to do this for DS v2
>> Transform.
>>
>> About extensibility, it's similar to DS V1 Filter again. We don't cover
>> all the expressions at the beginning, but we can add more in future
>> versions when needed. The data source implementations should be defensive
>> and fail when seeing unrecognized Filter/Transform.
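>>
>> For example, here is a rough sketch of what I mean by defensive, using the
>> DS V1 Filter classes (the supported set is just illustrative; each source
>> decides its own):
>>
>> import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan, IsNotNull}
>>
>> // A source keeps the filters it understands and returns the rest to Spark,
>> // so an unrecognized Filter never silently changes results.
>> def pushFilters(filters: Array[Filter]): Array[Filter] = {
>>   val (supported, unsupported) = filters.partition {
>>     case _: EqualTo | _: GreaterThan | _: IsNotNull => true
>>     case _ => false
>>   }
>>   // ... use `supported` to prune data internally ...
>>   unsupported  // handed back to Spark to evaluate on the returned rows
>> }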
>>
>> About compatibility. This is the place where I have a concern as well. For
>> DS V1 Filter, we just expose all the Filter classes, like `EqualTo`,
>> `GreaterThan`, etc. These classes have well-defined semantics. For DS V2
>> Transform, we only expose the Transform interface, and data sources need to
>> look at `Transform#name` and search the documentation to see the semantics.
>> What's worse, the parser/analyzer allows an arbitrary string as the Transform
>> name, so it's impossible to have well-defined semantics, and different
>> sources may have different semantics for the same Transform name.
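>>
>> Concretely, a source today ends up dispatching on the name string, something
>> like this sketch (the set of supported names is just illustrative):
>>
>> import org.apache.spark.sql.connector.expressions.Transform
>>
>> // The semantics of each name live in documentation, not in the type,
>> // and unknown names have to be rejected explicitly.
>> def validatePartitioning(transforms: Array[Transform]): Unit =
>>   transforms.foreach { t =>
>>     t.name() match {
>>       case "identity" | "years" | "months" | "days" | "hours" | "bucket" =>
>>         // supported: the semantics are whatever this source documented
>>       case other =>
>>         throw new IllegalArgumentException(s"Unsupported transform: $other")
>>     }
>>   }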
>>
>> I'd suggest we forbid arbitrary strings as Transform names (the ApplyTransform
>> class). We can even follow DS V1 Filter and expose the classes directly.
>>
>> On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I would like to suggest taking one step back at
>>> https://github.com/apache/spark/pull/24117 and rethinking it.
>>> I am writing this email because I raised the issue a few times but could not
>>> get enough responses promptly, and
>>> the code freeze is getting close.
>>>
>>> In particular, please refer to the comments below for the full context:
>>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>>> - https://github.com/apache/spark/pull/24117#issuecomment-568614961
>>>
>>>
>>> In short, this PR added an API in DSv2:
>>>
>>> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>>>
>>>
>>> So people can write classes for *transform(col)* specifically for
>>> partitioned columns.
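>>>
>>> For a concrete example, assuming an active SparkSession `spark` and a
>>> hypothetical v2 source that understands the days transform:
>>>
>>> spark.sql("""
>>>   CREATE TABLE events (ts TIMESTAMP, data STRING)
>>>   USING some_v2_source
>>>   PARTITIONED BY (days(ts))
>>> """)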
>>>
>>> However, there are some design concerns that do not look properly
>>> addressed.
>>>
>>> Note that one of the main points is to avoid half-baked or
>>> just-work-for-now APIs. However, this definitely looks half-completed.
>>> Therefore, I would like to propose taking one step back and reverting it
>>> for now.
>>> Please see the concerns listed below.
>>>
>>> *Duplication of existing expressions*
>>> It seems like existing expressions are going to be duplicated. See the new
>>> APIs added below:
>>>
>>> def years(column: String): YearsTransform = YearsTransform(reference(column))
>>> def months(column: String): MonthsTransform = MonthsTransform(reference(column))
>>> def days(column: String): DaysTransform = DaysTransform(reference(column))
>>> def hours(column: String): HoursTransform = HoursTransform(reference(column))
>>> ...
>>>
>>> It looks like it will require adding copies of our existing expressions in
>>> the future.
>>>
>>>
>>> *Limited Extensibility*
>>> It has a clear limitation. It looks as if other expressions could be
>>> combined (e.g., `concat(years(col) + days(col))`);
>>> however, it looks impossible to extend this with the current design. It just
>>> directly maps transformName to an implementation class
>>> and passes the arguments:
>>>
>>> transform
>>>     ...
>>>     | transformName=identifier
>>>       '(' argument+=transformArgument (',' argument+=transformArgument)* ')'  #applyTransform
>>>     ;
>>>
>>> It looks as if arbitrary expressions are supported; however, they are not.
>>> - If we should support them, the design had to consider that.
>>> - If we should not support them, a different syntax might have to be used
>>> instead.
>>>
>>> *Limited Compatibility Management*
>>> The transform name can be an arbitrary string. For instance, if a name is
>>> later supported on the Spark side, that name is preempted by Spark.
>>> If a data source had already supported such a name, it becomes incompatible.
>>>
>>>
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix
