I think I found a good solution to the problem of using Expression in the
TableCatalog API and in the DeleteSupport API.

For DeleteSupport, there is already a stable, public subset of Expression,
named Filter, that can be used to pass filters. The reason DeleteSupport
would use Expression is to support more complex expressions, like
to_date(ts) = '2018-08-15' translated to ts >= 1534316400000000 AND ts <
1534402800000000. But that translation can be done in Spark instead of in
the data sources, so I think DeleteSupport should use Filter instead. I
updated the DeleteSupport PR #21308
<https://github.com/apache/spark/pull/21308> with these changes.
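
For example, the translation could look roughly like this on the Spark
side, using the existing public Filter classes (a sketch of the idea, not
the actual rewrite rule, which would live in the optimizer):

    import org.apache.spark.sql.sources.{And, Filter, GreaterThanOrEqual, LessThan}

    // Rewrite to_date(ts) = '2018-08-15' into a range over the raw
    // timestamp column (in microseconds) before passing filters down.
    def dateEqualsToRange(column: String, startMicros: Long, endMicros: Long): Filter =
      And(GreaterThanOrEqual(column, startMicros), LessThan(column, endMicros))

    val filter = dateEqualsToRange("ts", 1534316400000000L, 1534402800000000L)
    // And(GreaterThanOrEqual(ts,1534316400000000), LessThan(ts,1534402800000000))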

I also agree that the DataSourceV2 API should not expose Expression, so I
opened SPARK-25127 <https://issues.apache.org/jira/browse/SPARK-25127> to
track removing SupportsPushDownCatalystFilters.
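
The stable, Filter-based mix-in, SupportsPushDownFilters, should be enough
for sources. Something like this (an abstract sketch that omits the scan
methods, assuming the current v2 reader package):

    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters

    // Accepts every pushed filter. pushFilters returns the filters the
    // source cannot handle, which Spark must then evaluate itself.
    abstract class ExampleReader extends SupportsPushDownFilters {
      private var pushed: Array[Filter] = Array.empty

      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        pushed = filters
        Array.empty
      }

      override def pushedFilters(): Array[Filter] = pushed
    }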

For TableCatalog, I took a similar approach rather than introducing a
parallel Expression API: I created a PartitionTransform API (like Filter)
that communicates the transformation function, the function's parameters
(like the number of buckets), and the column references. I updated the
TableCatalog PR #21306 <https://github.com/apache/spark/pull/21306> to use
PartitionTransform instead of Expression, and I updated the text of the
SPIP doc
<https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>.
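
To give a feel for the shape of that API, here is a rough sketch (the
names below are illustrative only, not the exact classes in the PR):

    // Hypothetical: a Filter-like ADT for partition transforms.
    sealed trait PartitionTransform {
      def name: String
      def references: Seq[String]  // columns the transform reads
    }

    case class Identity(column: String) extends PartitionTransform {
      def name: String = "identity"
      def references: Seq[String] = Seq(column)
    }

    case class Bucket(numBuckets: Int, column: String) extends PartitionTransform {
      def name: String = "bucket"
      def references: Seq[String] = Seq(column)
    }

    // Escape hatch: a transform identified only by name (see below).
    case class Apply(function: String, column: String) extends PartitionTransform {
      def name: String = function
      def references: Seq[String] = Seq(column)
    }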

I also raised a concern about needing to wait for Spark to add support for
new expressions (now partition transforms). To get around this, I added an
apply transform that passes the name of a function and an input column.
That way, users can still pass transforms that Spark doesn't know about to
data sources by name: apply("source_function", "colName").
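
With the hypothetical Apply case from the sketch above, that looks like:

    // Spark forwards the function name untouched; the source that
    // understands "source_function" applies it to colName.
    val byName: PartitionTransform = Apply("source_function", "colName")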

Please have a look at the updated pull requests and SPIP doc and comment!

rb
