[jira] [Commented] (SPARK-38483) Column name or alias as an attribute of the PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-38483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514788#comment-17514788 ] Brian Schaefer commented on SPARK-38483:

I've been thinking about this for the past few weeks and would like to propose a minimal version of the suggested feature:
* a property {{Column._name}} that simply returns {{Column._jc.toString()}}
* an instance variable {{Column._alias}} that is set when {{Column.alias()}} is called

The combination of these two provides a convenient interface for Python users without promising too much. A common use case in my own work would be re-using an alias (mentioned in the ticket description):
{code:python}
>>> def process_values(col):
...     new_values = ...
...     return new_values.alias(col._alias or col._name)
...
>>> values = F.col("original_values").alias("values")
>>> df.select(process_values(values))
{code}

> Column name or alias as an attribute of the PySpark Column class
>
> Key: SPARK-38483
> URL: https://issues.apache.org/jira/browse/SPARK-38483
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.2.1
> Reporter: Brian Schaefer
> Priority: Minor
> Labels: starter
>
> Having the name of a column as an attribute of PySpark {{Column}} class instances can enable some convenient patterns, for example:
> Applying a function to a column and aliasing with the original name:
> {code:python}
> values = F.col("values")
> # repeating the column name as an alias
> distinct_values = F.array_distinct(values).alias("values")
> # re-using the existing column name
> distinct_values = F.array_distinct(values).alias(values._name)
> {code}
> Checking the column name inside a custom function and applying conditional logic on the name:
> {code:python}
> def custom_function(col: Column) -> Column:
>     if col._name == "my_column":
>         return col.astype("int")
>     return col.astype("string")
> {code}
> The proposal in this issue is to add a property {{Column._name}} that obtains the name or alias of a column in a similar way as currently done in the {{Column.__repr__}} method: [https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L1062]. The choice of {{_name}} intentionally avoids collision with the existing {{Column.name}} method, which is an alias for {{Column.alias}}.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
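The proposed pair of attributes can be sketched without a Spark session. The {{MockColumn}} class below is a hypothetical stand-in for {{pyspark.sql.Column}} (its {{_jc_str}} field imitates what {{Column._jc.toString()}} would return); it is illustrative only, not PySpark API:

```python
# Hypothetical sketch of the proposed minimal API, using a mock stand-in
# for pyspark.sql.Column so it runs without a Spark session.
class MockColumn:
    def __init__(self, expr):
        self._jc_str = expr   # imitates Column._jc.toString()
        self._alias = None    # proposed instance variable, None until alias() is called

    @property
    def _name(self):
        # proposed property: the raw expression string
        return self._jc_str

    def alias(self, name):
        # proposed behavior: record the alias when .alias() is called
        new = MockColumn(self._jc_str)
        new._alias = name
        return new

values = MockColumn("original_values").alias("values")
# re-using the alias, falling back to the expression string
print(values._alias or values._name)  # -> values
```

With this shape, the {{col._alias or col._name}} idiom from the comment above resolves to the user-supplied alias when one exists and to the underlying expression string otherwise.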
[jira] [Commented] (SPARK-38483) Column name or alias as an attribute of the PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-38483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504529#comment-17504529 ] Brian Schaefer commented on SPARK-38483:

The column name does differ between the two when selecting a struct field, but handling that case seems fairly straightforward:
{code:python}
>>> df = spark.createDataFrame([{"struct": {"outer_field": {"inner_field": 1}}}])
>>> values = F.col("struct.outer_field.inner_field")
>>> print(df.select(values).schema[0].name)
inner_field
>>> print(values._jc.toString())
struct.outer_field.inner_field
>>> print(values._jc.toString().split(".")[-1])
inner_field
{code}
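The dot-splitting step above can be captured as a tiny plain-Python helper. This sketch assumes the expression string has the dot-separated form shown in the comment and that field names themselves contain no literal dots (a backticked name containing dots would break this simple split):

```python
# Plain-Python sketch of deriving the leaf field name from the expression
# string returned by Column._jc.toString(). Assumes dot-separated struct
# access and no literal dots inside field names.
def leaf_field_name(expr: str) -> str:
    return expr.split(".")[-1]

print(leaf_field_name("struct.outer_field.inner_field"))  # -> inner_field
print(leaf_field_name("values"))                          # -> values
```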
[jira] [Commented] (SPARK-38483) Column name or alias as an attribute of the PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-38483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504448#comment-17504448 ] Brian Schaefer commented on SPARK-38483:

Could you provide an example of when the real column names would be different? At least for basic examples, the real column names match those found using {{Column._jc.toString()}}. With some careful regex it may also be possible to catch aliases:
{code:python}
>>> df = spark.createDataFrame([{"values": [1, 2, 3]}])
>>> values = F.col("values")
>>> print(df.select(values).schema[0].name)
values
>>> print(values._jc.toString())
values
>>> import re
>>> aliased_values = F.col("values").alias("aliased")
>>> print(df.select(aliased_values).schema[0].name)
aliased
>>> print(re.match(".*`(.*)`", aliased_values._jc.toString())[1])
aliased
{code}
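The regex above can be exercised without Spark. The aliased expression string below ({{values AS `aliased`}}) is an assumed example of what {{Column._jc.toString()}} might return for an aliased column; the exact format is an internal detail and may vary across Spark versions:

```python
import re

def extract_name(jc_string: str) -> str:
    # Pull an alias out of backticks if one is present;
    # otherwise fall back to the raw expression string.
    m = re.match(".*`(.*)`", jc_string)
    return m[1] if m else jc_string

# "values AS `aliased`" is an assumed toString() form, for illustration only.
print(extract_name("values"))               # -> values
print(extract_name("values AS `aliased`"))  # -> aliased
```

Because the greedy `.*` backtracks to the last backtick-delimited group, the pattern picks out the final backticked token, which is where the alias appears in the assumed format.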
[jira] [Commented] (SPARK-38483) Column name or alias as an attribute of the PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-38483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503963#comment-17503963 ] Hyukjin Kwon commented on SPARK-38483:

The real column names can only be known after the plan is resolved by the analyzer. You would have to do it with, for example:
{code:python}
values = F.col("values")
df.select(values).schema[0]
{code}
[jira] [Commented] (SPARK-38483) Column name or alias as an attribute of the PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-38483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503782#comment-17503782 ] Brian Schaefer commented on SPARK-38483:

Extracting the column name from the {{Column.__repr__}} method has been discussed on Stack Overflow: [https://stackoverflow.com/a/43150264]. However, it would be useful to have the column name more easily accessible.