[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-09-10 Thread mmolimar
Github user mmolimar closed the pull request at:

https://github.com/apache/spark/pull/18447


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-08-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r212820817
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -98,6 +98,29 @@ case class AssertTrue(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
   override def sql: String = s"assert_true(${child.sql})"
 }
 
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns the data type of the `expr`.",
+  extended = """
+    Examples:
+      > SELECT _FUNC_("a");
+       string
+      > SELECT _FUNC_(0L);
+       bigint
+  """)
--- End diff --

`since` should be added


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-08-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r212820670
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -98,6 +98,29 @@ case class AssertTrue(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
   override def sql: String = s"assert_true(${child.sql})"
 }
 
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns the data type of the `expr`.",
+  extended = """
+    Examples:
+      > SELECT _FUNC_("a");
+       string
+      > SELECT _FUNC_(0L);
+       bigint
+  """)
+case class GetDataType(child: Expression) extends UnaryExpression {
+
+  override def dataType: DataType = StringType
+
+  override def nullSafeEval(input: Any): Any = UTF8String.fromString(child.dataType.simpleString)
--- End diff --

It should be `catalogString` instead of `simpleString`.
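For context, Spark's `simpleString` is a display-oriented rendering that may truncate wide struct types, while `catalogString` keeps every field. The difference can be sketched with toy Python helpers — the function names and the artificially low truncation cutoff below are hypothetical, not Spark's actual implementation:

```python
# Toy model of the two renderings; the names and the cutoff of 2 fields
# are hypothetical, chosen only to make the truncation visible.
fields = [("f1", "int"), ("f2", "string"), ("f3", "double")]

def catalog_string(fields):
    # Lossless rendering: every field appears.
    return "struct<" + ",".join(f"{n}:{t}" for n, t in fields) + ">"

def simple_string(fields, max_fields=2):
    # Display-oriented rendering: truncates after max_fields entries.
    shown = [f"{n}:{t}" for n, t in fields[:max_fields]]
    if len(fields) > max_fields:
        shown.append(f"... {len(fields) - max_fields} more fields")
    return "struct<" + ",".join(shown) + ">"
```

With a lossless rendering, the type string round-trips through the catalog; a truncated one does not.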


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-08-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r212820641
  
--- Diff: R/pkg/R/generics.R ---
@@ -950,6 +949,10 @@ setGeneric("grouping_bit", function(x) { standardGeneric("grouping_bit") })
 #' @name NULL
 setGeneric("grouping_id", function(x, ...) { standardGeneric("grouping_id") })
 
+#' @rdname column_misc_functions
+#' @name NULL
+setGeneric("hash", function(x, ...) { standardGeneric("hash") })
--- End diff --

I would avoid unrelated changes.


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-05-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r186278707
  
--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
 column(jc)
   })
 
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @note data_type since 2.3.0
--- End diff --

2.4.0


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-05-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r186278702
  
--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
 column(jc)
   })
 
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
--- End diff --

see this line of the code example of hash

https://github.com/mmolimar/spark/blob/ed52e2f856f78fb2dca23b6be2f682caa0a88c81/R/pkg/R/functions.R#L176



---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-05-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r186278685
  
--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
 column(jc)
   })
 
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
+setMethod("data_type",
--- End diff --

2.4.0


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-05-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r186252381
  
--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
 column(jc)
   })
 
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
+setMethod("data_type",
--- End diff --

add `@note` like example above


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-05-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r186252423
  
--- Diff: R/pkg/NAMESPACE ---
@@ -236,6 +236,7 @@ exportMethods("%<=>%",
   "current_date",
   "current_timestamp",
   "hash",
+  "data_type",
--- End diff --

I know this list isn't completely sorted, let's sort this?
you can move "hash" and "data_type"


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-05-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r186252398
  
--- Diff: R/pkg/R/functions.R ---
@@ -679,6 +679,19 @@ setMethod("hash",
 column(jc)
   })
 
+#' @details
+#' \code{data_type}: Returns the data type of a given column.
+#'
+#' @rdname column_misc_functions
+#' @aliases data_type data_type,Column-method
+#' @examples \dontrun{data_type(df$c)}
--- End diff --

example should not be added here - see the example for `hash`


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2018-05-03 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r185703059
  
--- Diff: R/pkg/R/functions.R ---
@@ -653,6 +653,25 @@ setMethod("hash",
 column(jc)
   })
 
+#' data_type
+#'
+#' Returns the data type of a given column.
+#'
+#' @param x Column to get the data type.
+#'
+#' @rdname data_type
+#' @name data_type
+#' @family misc functions
--- End diff --

can you follow the pattern, like for "hash" above? we don't have individual 
rdname etc for each function


---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-07-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r130032578
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

Not sure. I think we need a reference to an equivalent function in an RDBMS 
(in SQL, to be more precise).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-07-28 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r130025210
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

@HyukjinKwon so?


---



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-07-11 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r126710311
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

I can't think of a SQL database that queries the data type without using the 
table schema. However, in MongoDB, for example, you can get something like that 
using the `$type` operator or `typeof`.


---



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r125195855
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

For me, it sounds like the point is to know where the value is `null`. I think it'd be 
more persuasive if you left some links here to equivalent SQL functions 
in other databases and matched the behaviour.


---



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-06-28 Thread mmolimar
Github user mmolimar commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r124545289
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

The idea would be to know the type based on the value itself, not by the 
schema (i.e. the value could be null):
```scala
val df = spark.sparkContext.parallelize(
  StringData(null) ::
  StringData("a") :: Nil).toDF()
df.select(data_type(col("s"))) // you get null and string in this case
df.schema.map(_.dataType.simpleString) // you just get string
```
On the other hand, it'd be nice to have this SQL function as we do in some 
databases.



---



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r124473417
  
--- Diff: R/pkg/R/functions.R ---
@@ -598,6 +598,25 @@ setMethod("hash",
 column(jc)
   })
 
+#' data_type
+#'
+#' Returns the data type of a given column.
+#'
+#' @param x Column to get the data type.
+#'
+#' @rdname data_type
+#' @name data_type
+#' @family misc functions
+#' @aliases data_type,Column-method
+#' @export
+#' @examples \dontrun{data_type(df$c)}
--- End diff --

In R, I think we could do this as below:

```r
> df <- createDataFrame(iris)
> lapply(schema(df)$fields(), function(s) { s$dataType.simpleString() })
[[1]]
[1] "double"

[[2]]
[1] "double"

[[3]]
[1] "double"

[[4]]
[1] "double"

[[5]]
[1] "string"
```


---



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r124473373
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1248,6 +1248,17 @@ def hash(*cols):
     return Column(jc)
 
 
+def data_type(col):
+    """Returns the data type of the given column.
+
+    >>> spark.createDataFrame([('ABC',)], ['a']).select(data_type('a').alias('data_type')).collect()
--- End diff --

In Python, I think we could do this as below:

```python
>>> df = spark.createDataFrame([('ABC',)], ['a'])
>>> [s.dataType.simpleString() for s in df.schema]
['string']
```


---



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-06-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18447#discussion_r124470787
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -209,6 +209,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(2743272264L, 2180413220L))
   }
 
+  test("misc data_type function") {
+    val df = Seq(("a", false)).toDF("a", "b")
+
+    checkAnswer(
+      df.select(data_type($"a"), data_type($"b")),
--- End diff --

I think we can easily get the types via

```scala
val df = Seq(("a", false)).toDF("a", "b")
df.schema.map(_.dataType.simpleString)
```

I just wonder in which case we need these types per row in the dataframe.
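The distinction at stake — a schema lookup yields one static type per column, while a per-row `data_type` expression can surface nulls — can be sketched in plain Python, with no Spark involved (the sample `rows`, the `schema` dict, and the `per_row_type` helper are all hypothetical, purely for illustration):

```python
# Hypothetical illustration (not Spark code): schema-level typing gives one
# type per column, while a per-row data_type function can distinguish nulls.
rows = [{"a": "x", "b": False}, {"a": None, "b": True}]

# Schema-level: one declared type per column, regardless of null values.
schema = {"a": "string", "b": "boolean"}
schema_types = [schema[c] for c in ("a", "b")]  # ['string', 'boolean']

def per_row_type(value, declared):
    # Mimics a nullSafeEval-style expression: null input yields null output.
    return None if value is None else declared

per_row = [per_row_type(r["a"], schema["a"]) for r in rows]
# ['string', None] -- the second row's null is visible, unlike with the schema.
```

This is essentially the trade-off the thread debates: the schema answer is free and static, while the per-row answer only adds information when values can be null.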


---



[GitHub] spark pull request #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in ...

2017-06-28 Thread mmolimar
GitHub user mmolimar opened a pull request:

https://github.com/apache/spark/pull/18447

[SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL function - Data_Type

## What changes were proposed in this pull request?

New built-in function to get the data type of columns in SQL.

## How was this patch tested?

Unit tests included.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mmolimar/spark SPARK-21232

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18447.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18447


commit ef2b2189994f4d790560dbf8bfddf0008a520ccf
Author: Mario Molina 
Date:   2017-06-28T06:22:29Z

New data_type SQL function

commit f4bea7061d09246c8fcaa97865043a70228913e3
Author: Mario Molina 
Date:   2017-06-28T06:22:57Z

Tests for data_type function

commit 6fad8e9f518567b503345a21fcea8a4ddf1e5d9b
Author: Mario Molina 
Date:   2017-06-28T06:25:18Z

Python support for data_type function

commit 959cccf4357abef2bd90957c5402c2a2d67c6262
Author: Mario Molina 
Date:   2017-06-28T06:25:31Z

R support for data_type function




---