HyukjinKwon commented on a change in pull request #27331:
URL: https://github.com/apache/spark/pull/27331#discussion_r445512162



##########
File path: python/pyspark/sql/readwriter.py
##########
@@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None):
         self.mode(mode)._jwrite.jdbc(url, table, jprop)
 
 
+class DataFrameWriterV2(object):
+    """
+    Interface used to write a :class:`pyspark.sql.dataframe.DataFrame`
+    to external storage using the v2 API.
+
+    .. versionadded:: 3.1.0
+    """
+
+    def __init__(self, df, table):
+        self._df = df
+        self._spark = df.sql_ctx
+        self._jwriter = df._jdf.writeTo(table)
+
+    @since(3.1)
+    def using(self, provider):
+        """
+        Specifies a provider for the underlying output data source.
+        Spark's default catalog supports "parquet", "json", etc.
+        """
+        self._jwriter.using(provider)
+        return self
+
+    @since(3.1)
+    def option(self, key, value):
+        """
+        Add a write option.
+        """
+        self._jwriter.option(key, to_str(value))
+        return self
+
+    @since(3.1)
+    def options(self, **options):
+        """
+        Add write options.
+        """
+        options = {k: to_str(v) for k, v in options.items()}
+        self._jwriter.options(options)
+        return self
+
+    @since(3.1)
+    def partitionedBy(self, col, *cols):

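The hunk above defines a chained-builder interface: each configuration method records state and returns `self`. A minimal stand-in sketch of that pattern which runs without Spark (the `WriterV2Sketch` class and the simplified `to_str` below are illustrative stand-ins, not actual Spark API; pyspark's real `to_str` handles more cases):

```python
def to_str(value):
    # Simplified version of pyspark's to_str helper: booleans become
    # lowercase JVM-style literals, everything else is stringified.
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)


class WriterV2Sketch:
    """Plain-Python stand-in mimicking DataFrameWriterV2's fluent builder."""

    def __init__(self, table):
        self.table = table
        self.provider = None
        self._options = {}

    def using(self, provider):
        # Record the data source provider and return self so calls chain.
        self.provider = provider
        return self

    def option(self, key, value):
        self._options[key] = to_str(value)
        return self

    def options(self, **options):
        self._options.update({k: to_str(v) for k, v in options.items()})
        return self


# Chained usage, mirroring df.writeTo(table).using(...).option(...):
w = WriterV2Sketch("db.events").using("parquet").option("mergeSchema", True)
```

Each method returning `self` is what lets a single expression configure the writer end to end, which is the same design the diff uses for the JVM-backed `_jwriter` calls.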
Review comment:
       @rdblue, @brkyvz, @cloud-fan, should we at least use a dedicated class for these partition column expressions, such as `PartitioningColumn`, like we do for `TypedColumn`, and add an `asPartitioningColumn` method to `Column`?
   
   I remember we basically want to remove these partitioning-specific expressions, per [[DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Revert-and-revisit-the-public-custom-expression-API-for-partition-a-k-a-Transform-API-td28683.html), if we find a better way to do it.
   
   I suspect `PartitioningColumn` is acceptable as a temporary solution because we can guard it by typing, and we can move these partitioning-specific expressions into a separate package. At least that way they stay clearly distinguished. I can work on it as well if this sounds fine to you.

##########
File path: python/pyspark/sql/readwriter.py
##########
@@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None):
         self.mode(mode)._jwrite.jdbc(url, table, jprop)
 
 

Review comment:
       @rdblue, @brkyvz, @cloud-fan, should we at least use a dedicated class for these partition column expressions, such as `PartitionedColumn`, like we do for `TypedColumn`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
