[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...
Github user mmolimar closed the pull request at:

    https://github.com/apache/spark/pull/18447

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r212820817

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -98,6 +98,29 @@ case class AssertTrue(child: Expression) extends UnaryExpression with ImplicitCa
   override def sql: String = s"assert_true(${child.sql})"
 }
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns the data type of the `expr`.",
+  extended = """
+    Examples:
+      > SELECT _FUNC_("a");
+       string
+      > SELECT _FUNC_(0L);
+       bigint
+  """)
--- End diff --

`since` should be added.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r212820670

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -98,6 +98,29 @@ case class AssertTrue(child: Expression) extends UnaryExpression with ImplicitCa
   override def sql: String = s"assert_true(${child.sql})"
 }
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns the data type of the `expr`.",
+  extended = """
+    Examples:
+      > SELECT _FUNC_("a");
+       string
+      > SELECT _FUNC_(0L);
+       bigint
+  """)
+case class GetDataType(child: Expression) extends UnaryExpression {
+
+  override def dataType: DataType = StringType
+
+  override def nullSafeEval(input: Any): Any = UTF8String.fromString(child.dataType.simpleString)
--- End diff --

It should be `catalogString` instead of `simpleString`.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r212820641

--- Diff: R/pkg/R/generics.R ---
@@ -950,6 +949,10 @@ setGeneric("grouping_bit", function(x) { standardGeneric("grouping_bit") })
 #' @name NULL
 setGeneric("grouping_id", function(x, ...) { standardGeneric("grouping_id") })
+#' @rdname column_misc_functions
+#' @name NULL
+setGeneric("hash", function(x, ...) { standardGeneric("hash") })
--- End diff --

I would avoid unrelated changes.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r186278707

--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
             column(jc)
           })
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @note data_type since 2.3.0
--- End diff --

2.4.0
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r186278702

--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
             column(jc)
           })
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
--- End diff --

See this line of the code example for hash: https://github.com/mmolimar/spark/blob/ed52e2f856f78fb2dca23b6be2f682caa0a88c81/R/pkg/R/functions.R#L176
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r186278685

--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
             column(jc)
           })
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
+setMethod("data_type",
--- End diff --

2.4.0
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r186252381

--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
             column(jc)
           })
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
+setMethod("data_type",
--- End diff --

Add `@note` like the example above.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r186252423

--- Diff: R/pkg/NAMESPACE ---
@@ -236,6 +236,7 @@ exportMethods("%<=>%",
               "current_date",
               "current_timestamp",
               "hash",
+              "data_type",
--- End diff --

I know this list isn't completely sorted, but let's sort this? You can move "hash" and "data_type".
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r186252398

--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
             column(jc)
          })
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
--- End diff --

The example should not be added here; see the example for `hash`.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r185703059

--- Diff: R/pkg/R/functions.R ---
@@ -653,6 +653,25 @@ setMethod("hash",
             column(jc)
           })
+#' data_type
+#'
+#' Returns the data type of a given column.
+#'
+#' @param x Column to get the data type.
+#'
+#' @rdname data_type
+#' @name data_type
+#' @family misc functions
--- End diff --

Can you follow the pattern, like for "hash" above? We don't have individual rdname etc. for each function.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r130032578

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

Not sure. I think we need a reference to an equivalent function in an RDBMS (SQL, to be more correct).

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
Github user mmolimar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r130025210

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

@HyukjinKwon so?
Github user mmolimar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r126710311

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

I can't think of a SQL database which queries the data type without using the table schema. However, for example in MongoDB, you can get something like that using the `$type` operator or `typeof`.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r125195855

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

For me, it sounds like the point is to know where it is `null`. I think it'd be more persuasive if you leave some links here to equivalent SQL functions in other databases and match the behaviour.
Github user mmolimar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r124545289

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

The idea would be to know the type based on the value itself, not by the schema (i.e. the value could be null):

```scala
val df = spark.sparkContext.parallelize(
  StringData(null) :: StringData("a") :: Nil).toDF()

df.select(data_type(col("s")))         // you get null and string in this case
df.schema.map(_.dataType.simpleString) // you just get string
```

On the other hand, it'd be nice to have this SQL function as we do in some databases.
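The per-value semantics described above (a null value yields null, while the schema always reports the declared type) can be sketched in plain Python. This is an illustration only, not PySpark API: `spark_type_of` and `_TYPE_NAMES` are hypothetical helpers, and the type-name mapping is a simplified subset of Spark SQL's type names.

```python
# Hypothetical helper mimicking the proposed data_type() semantics:
# resolve a Spark SQL type name from the value itself, returning None
# for null values instead of falling back to the schema's declared type.
_TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}

def spark_type_of(value):
    if value is None:
        return None  # a null value yields null, unlike a schema-based lookup
    # Exact type lookup, so True maps to "boolean" rather than "bigint".
    return _TYPE_NAMES.get(type(value), type(value).__name__)

# The StringData(null) :: StringData("a") example above, in miniature:
print([spark_type_of(v) for v in [None, "a"]])  # -> [None, 'string']
```

This captures why a schema lookup alone is not equivalent: the schema would report `string` for both rows, losing the information about which row is null.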
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r124473417

--- Diff: R/pkg/R/functions.R ---
@@ -598,6 +598,25 @@ setMethod("hash",
             column(jc)
           })
+#' data_type
+#'
+#' Returns the data type of a given column.
+#'
+#' @param x Column to get the data type.
+#'
+#' @rdname data_type
+#' @name data_type
+#' @family misc functions
+#' @aliases data_type,Column-method
+#' @export
+#' @examples \dontrun{data_type(df$c)}
--- End diff --

In R, I think we could do this as below:

```r
> df <- createDataFrame(iris)
> lapply(schema(df)$fields(), function(s) { s$dataType.simpleString() })
[[1]]
[1] "double"

[[2]]
[1] "double"

[[3]]
[1] "double"

[[4]]
[1] "double"

[[5]]
[1] "string"
```
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r124473373

--- Diff: python/pyspark/sql/functions.py ---
@@ -1248,6 +1248,17 @@ def hash(*cols):
     return Column(jc)
+def data_type(col):
+    """Returns the data type of the given column.
+
+    >>> spark.createDataFrame([('ABC',)], ['a']).select(data_type('a').alias('data_type')).collect()
--- End diff --

In Python, I think we could do this as below:

```python
>>> df = spark.createDataFrame([('ABC',)], ['a'])
>>> [s.dataType.simpleString() for s in df.schema]
['string']
```
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18447#discussion_r124470787

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

I think we can easily get the types via

```scala
val df = Seq(("a", false)).toDF("a", "b")
df.schema.map(_.dataType.simpleString)
```

I just wonder in which case we need these types per each row in the dataframe.
GitHub user mmolimar opened a pull request:

    https://github.com/apache/spark/pull/18447

[SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL function - Data_Type

## What changes were proposed in this pull request?

New built-in function to get the data type of columns in SQL.

## How was this patch tested?

Unit tests included.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mmolimar/spark SPARK-21232

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18447.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18447

commit ef2b2189994f4d790560dbf8bfddf0008a520ccf
Author: Mario Molina
Date: 2017-06-28T06:22:29Z

    New data_type SQL function

commit f4bea7061d09246c8fcaa97865043a70228913e3
Author: Mario Molina
Date: 2017-06-28T06:22:57Z

    Tests for data_type function

commit 6fad8e9f518567b503345a21fcea8a4ddf1e5d9b
Author: Mario Molina
Date: 2017-06-28T06:25:18Z

    Python support for data_type function

commit 959cccf4357abef2bd90957c5402c2a2d67c6262
Author: Mario Molina
Date: 2017-06-28T06:25:31Z

    R support for data_type function