[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...

2018-12-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23252#discussion_r240022921
  
--- Diff: core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala ---
@@ -440,12 +473,27 @@ class SecurityManagerSuite extends SparkFunSuite with ResetSystemProperties {
 intercept[IllegalArgumentException] {
   mgr.getSecretKey()
 }
+  case FILE =>
+    val secretFile = createTempSecretFile()
+    conf.set(AUTH_SECRET_FILE, secretFile.getAbsolutePath)
+    mgr.initializeAuth()
+    assert(encodeFileAsBase64(secretFile) === mgr.getSecretKey())
 }
   }
 }
   )
 }
   }
 
+  private def encodeFileAsBase64(secretFile: File) = {
+    Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))
+  }
+
+  private def createTempSecretFile(contents: String = "test-secret"): File = {
+    val secretDir = Utils.createTempDir("temp-secrets")
+    val secretFile = new File(secretDir, "temp-secret.txt")
+    Files.write(secretFile.toPath, contents.getBytes(StandardCharsets.UTF_8))
+    secretFile
--- End diff --

ah it's fine


---

-



[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...

2018-12-08 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23224
  
LGTM


---

-



[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...

2018-12-08 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23256
  
ideally, but really not for this PR


---

-



[GitHub] spark pull request #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to ...

2018-12-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23256#discussion_r239997109
  
--- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R ---
@@ -84,19 +84,20 @@ test_that("spark.fpGrowth", {
 })
 
 test_that("spark.prefixSpan", {
-df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
-  list(list(list(1L), list(3L, 2L), list(1L, 2L))),
-  list(list(list(1L, 2L), list(5L))),
-  list(list(list(6L, schema = c("sequence"))
-result1 <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L,
-                                                maxLocalProjDBSize = 3200L)
-
-expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
-                                        list(list(list(3L)), 2L),
-                                        list(list(list(2L)), 3L),
-                                        list(list(list(1L, 2L)), 3L),
-                                        list(list(list(1L), list(3L)), 2L)),
-                                   schema = c("sequence", "freq"))
-  })
+  df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+                             list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+                             list(list(list(1L, 2L), list(5L))),
+                             list(list(list(6L,
+                        schema = c("sequence"))
+  result <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L,
+                                                 maxLocalProjDBSize = 3200L)
+
+  expected_result <- createDataFrame(list(list(list(list(1L)), 3L), list(list(list(3L)), 2L),
+                                          list(list(list(2L)), 3L), list(list(list(1L, 2L)), 3L),
+                                          list(list(list(1L), list(3L)), 2L)),
+                                     schema = c("sequence", "freq"))
+
+  expect_equivalent(expected_result, result)
--- End diff --

this is an important fix..


---

-



[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-06 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23218
  
do we need to relnote jvm compatibility?


---

-



[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...

2018-12-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23252#discussion_r239705869
  
--- Diff: core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala ---
@@ -440,12 +473,27 @@ class SecurityManagerSuite extends SparkFunSuite with ResetSystemProperties {
 intercept[IllegalArgumentException] {
   mgr.getSecretKey()
 }
+  case FILE =>
+    val secretFile = createTempSecretFile()
+    conf.set(AUTH_SECRET_FILE, secretFile.getAbsolutePath)
+    mgr.initializeAuth()
+    assert(encodeFileAsBase64(secretFile) === mgr.getSecretKey())
 }
   }
 }
   )
 }
   }
 
+  private def encodeFileAsBase64(secretFile: File) = {
+    Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))
+  }
+
+  private def createTempSecretFile(contents: String = "test-secret"): File = {
+    val secretDir = Utils.createTempDir("temp-secrets")
+    val secretFile = new File(secretDir, "temp-secret.txt")
+    Files.write(secretFile.toPath, contents.getBytes(StandardCharsets.UTF_8))
+    secretFile
--- End diff --

can this secret be recovered on disk, or do we trust that the tempDir ACL is sufficient?


---

-



[GitHub] spark issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregati...

2018-12-05 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22305
  
I can help if this looks good to @ueshin 


---

-



[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-02 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238120855
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
@@ -225,4 +225,10 @@ private[sql] object SQLUtils extends Logging {
 }
 sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
   }
+
+  def createArrayType(elementType: DataType): ArrayType = DataTypes.createArrayType(elementType)
--- End diff --

yea, it's really minor, but really no one except you is working on R APIs :)


---

-



[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-02 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238120812
  
--- Diff: R/pkg/R/functions.R ---
@@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x = "character"),
 column(jc)
   })
 
+setClassUnion("characterOrstructTypeOrColumn", c("character", "structType", "Column"))
--- End diff --

yes


---

-



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r238087240
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala ---
@@ -64,4 +64,3 @@ object FPGrowthExample {
 spark.stop()
   }
 }
-// scalastyle:on println
--- End diff --

yes, println is not used


---

-



[GitHub] spark issue #23161: [SPARK-26189][R]Fix unionAll doc in SparkR

2018-11-30 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23161
  
merged to master


---

-



[GitHub] spark issue #23161: [SPARK-26189][R]Fix unionAll doc in SparkR

2018-11-30 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23161
  
LGTM


---

-



[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-11-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238055143
  
--- Diff: R/pkg/R/functions.R ---
@@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x = "character"),
 column(jc)
   })
 
+setClassUnion("characterOrstructTypeOrColumn", c("character", "structType", "Column"))
--- End diff --

we should probably try to pull all the setClassUnion calls into one place (to avoid conflicts or duplication)


---

-



[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-11-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238055087
  
--- Diff: R/pkg/R/functions.R ---
@@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x = "character"),
 column(jc)
   })
 
+setClassUnion("characterOrstructTypeOrColumn", c("character", "structType", "Column"))
+
 #' @details
 #' \code{from_json}: Parses a column containing a JSON string into a Column of \code{structType}
 #' with the specified \code{schema} or array of \code{structType} if \code{as.json.array} is set
 #' to \code{TRUE}. If the string is unparseable, the Column will contain the value NA.
 #'
 #' @rdname column_collection_functions
 #' @param as.json.array indicating if input string is JSON array of objects or a single object.
-#' @aliases from_json from_json,Column,characterOrstructType-method
+#' @aliases from_json from_json,Column,characterOrstructTypeOrColumn-method
 #' @examples
 #'
 #' \dontrun{
 #' df2 <- sql("SELECT named_struct('date', cast('2000-01-01' as date)) as d")
 #' df2 <- mutate(df2, d2 = to_json(df2$d, dateFormat = 'dd/MM/'))
 #' schema <- structType(structField("date", "string"))
 #' head(select(df2, from_json(df2$d2, schema, dateFormat = 'dd/MM/')))
-
 #' df2 <- sql("SELECT named_struct('name', 'Bob') as people")
 #' df2 <- mutate(df2, people_json = to_json(df2$people))
 #' schema <- structType(structField("name", "string"))
 #' head(select(df2, from_json(df2$people_json, schema)))
-#' head(select(df2, from_json(df2$people_json, "name STRING")))}
+#' head(select(df2, from_json(df2$people_json, "name STRING")))
+#' head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json}
 #' @note from_json since 2.2.0
-setMethod("from_json", signature(x = "Column", schema = "characterOrstructType"),
+setMethod("from_json", signature(x = "Column", schema = "characterOrstructTypeOrColumn"),
   function(x, schema, as.json.array = FALSE, ...) {
 if (is.character(schema)) {
-  schema <- structType(schema)
+  jschema <- structType(schema)$jobj
+} else if (class(schema) == "structType") {
+  jschema <- schema$jobj
+} else {
+  jschema <- schema@jc
 }
 
 if (as.json.array) {
-  jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
-                         "createArrayType",
-                         schema$jobj)
-} else {
-  jschema <- schema$jobj
+  # This case is R-specifically different. Unlike Scala and Python side,
--- End diff --

what if `as.json.array` is TRUE but schema is also set?


---

-



[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-11-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238055126
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala ---
@@ -225,4 +225,10 @@ private[sql] object SQLUtils extends Logging {
 }
 sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
   }
+
+  def createArrayType(elementType: DataType): ArrayType = DataTypes.createArrayType(elementType)
--- End diff --

as mentioned before, I kinda have to disagree with you here... I'd prefer less stuff in r/SQLUtils, and instead, call the scala/jvm method directly from R...


---

-



[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-11-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238055173
  
--- Diff: R/pkg/R/functions.R ---
@@ -202,8 +202,9 @@ NULL
 #'  \itemize{
 #'  \item \code{from_json}: a structType object to use as the schema to use
 #'  when parsing the JSON string. Since Spark 2.3, the DDL-formatted string is
-#'  also supported for the schema.
-#'  \item \code{from_csv}: a DDL-formatted string
+#'  also supported for the schema. Since Spark 3.0, \code{schema_of_json} or
+#'  a string literal can also be accepted.
--- End diff --

that's true...? they are the same?


---

-



[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-29 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22939
  
Error looks reasonable...




---

-



[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR

2018-11-29 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23161#discussion_r237383462
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2732,13 +2732,24 @@ setMethod("union",
 dataFrame(unioned)
   })
 
-#' Return a new SparkDataFrame containing the union of rows
-#'
-#' This is an alias for `union`.
+#' Return a new SparkDataFrame containing the union of rows.
+#' This is an alias for \code{union}.
--- End diff --

actually, we do need a newline - L2735 is the title as this is a new page. could you build the roxygen2 doc to double check?


---

-



[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR

2018-11-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23161#discussion_r236972877
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2732,14 +2732,24 @@ setMethod("union",
 dataFrame(unioned)
   })
 
-#' Return a new SparkDataFrame containing the union of rows
-#'
-#' This is an alias for `union`.
+#' Return a new SparkDataFrame containing the union of rows.
+#' This is an alias for \code{union}.
 #'
-#' @rdname union
-#' @name unionAll
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the unionAll operation.
+#' @family SparkDataFrame functions
 #' @aliases unionAll,SparkDataFrame,SparkDataFrame-method
-#' @note unionAll since 1.4.0
+#' @rdname unionAll
+#' @name unionAll
--- End diff --

I'd add a `@seealso` to union


---

-



[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR

2018-11-28 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23161#discussion_r236973169
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2732,14 +2732,24 @@ setMethod("union",
 dataFrame(unioned)
   })
 
-#' Return a new SparkDataFrame containing the union of rows
-#'
-#' This is an alias for `union`.
+#' Return a new SparkDataFrame containing the union of rows.
+#' This is an alias for \code{union}.
 #'
-#' @rdname union
-#' @name unionAll
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the unionAll operation.
+#' @family SparkDataFrame functions
 #' @aliases unionAll,SparkDataFrame,SparkDataFrame-method
-#' @note unionAll since 1.4.0
+#' @rdname unionAll
+#' @name unionAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' unionAllDF <- unionAll(df1, df2)
+#' }
+#' @note unionAll since 3.0.0
--- End diff --

this should be `1.4.0` as above


---

-



[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23025#discussion_r236970732
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -767,6 +767,14 @@ setMethod("repartition",
 #'  using \code{spark.sql.shuffle.partitions} as number of partitions.}
 #'}
 #'
+#' At least one partition-by expression must be specified.
--- End diff --

761 is significant also, but correct.

essentially:
1. the first line of the block is the title (L760)
2. the second block of text, after an "empty line", is the description (L762)
3. the third block, after another "empty line", is the "details" note, which is stashed all the way at the bottom of the doc page

so generally you want the "important" part of the description on top and not in the "details" section, because it is easily missed.

this is the most common pattern in this code base. there's another, where multiple functions are documented together as a group, eg. the collection SQL functions (in functions.R). other, finer control is possible as well but not used today in this code base.

similarly, L829 is good, but L831 is a bit fuzzy - I'd personally prefer dropping L831 to keep the whole text in the description section of the doc.
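
For anyone following the roxygen2 layout described above, a minimal sketch of the three blocks (the tags and wording here are illustrative only, not the actual repartition doc):

```r
#' Title: the first block becomes the page title
#'
#' Description: the second block, after one empty line, becomes the description,
#' so anything important (e.g. "At least one partition-by expression must be
#' specified.") should live here.
#'
#' Details: a third block, after another empty line, is pushed into the
#' "Details" section at the bottom of the generated page and is easily missed.
#'
#' @param x a SparkDataFrame.
#' @rdname repartition
NULL
```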



---

-



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r236771417
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
--- End diff --

could you open a separate PR with just this file (minus R) and FPGrowthExample.scala on branch-2.4?


---

-



[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r236770223
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+} else {
+  col <- x@jc
--- End diff --

maybe think about the design of the API in R, Scala, and elsewhere - what does it look like when the user passes in a column that is not a literal string? probably worthwhile to follow up on separately.


---

-



[GitHub] spark pull request #23145: [MINOR][Docs][WIP] Fix Typos

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23145#discussion_r236765511
  
--- Diff: docs/index.md ---
@@ -67,7 +67,7 @@ Example applications are also provided in Python. For example,
 ./bin/spark-submit examples/src/main/python/pi.py 10
 
 Spark also provides an experimental [R API](sparkr.html) since 1.4 (only DataFrames APIs included).
--- End diff --

let's remove `experimental`, WDYT?


---

-



[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23098#discussion_r236764795
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -269,7 +269,7 @@ sparkR.sparkContext <- function(
 #' sparkR.session("yarn-client", "SparkR", "/home/spark",
 #'list(spark.executor.memory="4g"),
 #'c("one.jar", "two.jar", "three.jar"),
-#'c("com.databricks:spark-avro_2.11:2.0.1"))
+#'c("com.databricks:spark-avro_2.12:2.0.1"))
--- End diff --

yes, dummy name is completely fine with me.


---

-



[GitHub] spark pull request #23131: [SPARK-25908][SQL][FOLLOW-UP] Add back unionAll

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23131#discussion_r236763355
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2732,6 +2732,20 @@ setMethod("union",
 dataFrame(unioned)
   })
 
+#' Return a new SparkDataFrame containing the union of rows
+#'
+#' This is an alias for `union`.
--- End diff --

also, backticks don't format with roxygen2. this should be
```
This is an alias for \code{union}.
```


---

-



[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23025#discussion_r236762465
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -767,6 +767,14 @@ setMethod("repartition",
 #'  using \code{spark.sql.shuffle.partitions} as number of partitions.}
 #'}
 #'
+#' At least one partition-by expression must be specified.
--- End diff --

this won't be formatted correctly in the R doc because the "empty line" is significant. L769 should be removed to ensure this text ends up in the description


---

-



[GitHub] spark pull request #23131: [SPARK-25908][SQL][FOLLOW-UP] Add back unionAll

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23131#discussion_r236760822
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2732,6 +2732,20 @@ setMethod("union",
 dataFrame(unioned)
   })
 
+#' Return a new SparkDataFrame containing the union of rows
+#'
+#' This is an alias for `union`.
--- End diff --

If the goal is for this to be like the other *All functions, this should go into a separate doc page, plus a seealso, examples, etc.

The way this was written, as it was a deprecated function, this doc page merges with union - as it is, none of the text above will show up, and unionAll will also not be listed in the method index.



---

-



[GitHub] spark pull request #23145: [MINOR][Docs] "a R interpreter" -> "an R interpre...

2018-11-26 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23145#discussion_r236546043
  
--- Diff: docs/index.md ---
@@ -67,7 +67,7 @@ Example applications are also provided in Python. For example,
 ./bin/spark-submit examples/src/main/python/pi.py 10
 
 Spark also provides an experimental [R API](sparkr.html) since 1.4 (only DataFrames APIs included).
--- End diff --

tbh, I'm not sure this should be called "an experimental [R API]"


---

-



[GitHub] spark issue #23089: [SPARK-26120][TESTS][SS][SPARKR]Fix a streaming query le...

2018-11-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23089
  
Thanks!





---

-



[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22939
  
Sorry for the delay, will do another pass in 1 or 2 days





---

-



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r234432181
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param srcCol Param for the name of the input column for source vertex IDs.
+#' @param dstCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
+    dstCol = "dst", weightCol = NULL) {
--- End diff --

I think we try to avoid srcCol/dstCol in R (I think there are other R ML APIs like that)


---

-



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r234432019
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param srcCol Param for the name of the input column for source vertex IDs.
+#' @param dstCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
--- End diff --

set the valid values for initMode and check against them - eg. https://github.com/apache/spark/pull/23072/files#diff-d9f92e07db6424e2527a7f9d7caa9013R355

and `match.arg(initMode)`
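
A small self-contained sketch of that `match.arg` pattern (the wrapper name and choices below are made up for illustration; this is not the actual `spark.assignClusters` signature):

```r
# match.arg() restricts an argument to a fixed set of values and errors otherwise.
picInitMode <- function(initMode = c("random", "degree")) {
  # With no user value, match.arg() returns the first choice ("random");
  # with a user value, it must match one of the listed choices.
  match.arg(initMode)
}

picInitMode()            # "random"
picInitMode("degree")    # "degree"
# picInitMode("bogus")   # error: 'arg' should be one of "random", "degree"
```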


---

-



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r234432049
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. `spark.assignClusters` method runs the PIC algorithm and returns a cluster assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+  list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+head(spark.assignClusters(df, initMode="degree", weightCol="weight"))
--- End diff --

spacing: `initMode = "degree", weightCol = "weight"`


---

-



[GitHub] spark pull request #23073: [SPARK-26104] expose pci info to task scheduler

2018-11-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23073#discussion_r234431864
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorData.scala ---
@@ -27,12 +27,14 @@ import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef}
  * @param executorHost The hostname that this executor is running on
  * @param freeCores  The current number of cores available for work on the executor
  * @param totalCores The total number of cores available to the executor
+ * @param pcis The external devices avaliable to the executor
--- End diff --

available


---

-



[GitHub] spark issue #23073: [SPARK-26104] expose pci info to task scheduler

2018-11-17 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23073
  
please put the ^ comment into the PR description (because comments are not included in the commit message once the PR is merged)


---

-



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-14 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23012
  
Yea, there are some problems with some packages we depend on that are not installable from CRAN (eg. too old), so it will be hard to move to a new version of R and a new installation.

So to clarify, with deprecation as-is, do we still test on R 3.1?





---

-



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-13 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23012
  
Hey shane, I don't think we are saying to test multiple R versions at all. In fact quite the opposite: just the new(er) version at some point in the future.

(We don't have a better solution for packages though. There's another PR for the R arrow package, for example.)



---

-



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-13 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23012
  
I think it's easier to say unsupported if we are not testing it in Jenkins or AppVeyor. I don't know if we have any coverage at release time for older R versions anyway, so it's better to say unsupported than deprecated.

but agreed, maybe the way to do this is to deprecate without updating R in Jenkins


---

-



[GitHub] spark issue #23017: [WIP][SPARK-26015][K8S] Set a default UID for Spark on K...

2018-11-13 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23017
  
noted test issue. let's kick off test though


---

-



[GitHub] spark issue #23017: [WIP][SPARK-26015][K8S] Set a default UID for Spark on K...

2018-11-13 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23017
  
ok to test


---

-



[GitHub] spark issue #23007: [SPARK-26010][R] fix vignette eval with Java 11

2018-11-12 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23007
  
merged to master/2.4


---

-



[GitHub] spark issue #22866: WIP [SPARK-12172][SPARKR] Remove internal-only RDD metho...

2018-11-12 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22866
  
thx, but DO NOT MERGE - there's some nasty bug I'm still investigating..


---

-



[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...

2018-11-12 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23012#discussion_r232881732
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -283,6 +283,10 @@ sparkR.session <- function(
   enableHiveSupport = TRUE,
   ...) {
 
+  if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) {
+warning("R prior to version 3.4 is deprecated as of Spark 3.0.")
+  }
--- End diff --

ditto
`Support for R prior to version 3.4 is deprecated since Spark 3.0.0`


---

-



[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...

2018-11-12 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23012#discussion_r232882419
  
--- Diff: docs/index.md ---
@@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy
 locally on one machine --- all you need is to have `java` installed on your system `PATH`,
 or the `JAVA_HOME` environment variable pointing to a Java installation.
 
-Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}}
+Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0.
--- End diff --

`R prior to version 3.4 support is deprecated as of Spark 3.0.0.`


---

-



[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...

2018-11-12 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23012#discussion_r232882178
  
--- Diff: docs/index.md ---
@@ -31,7 +31,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy
 locally on one machine --- all you need is to have `java` installed on your system `PATH`,
 or the `JAVA_HOME` environment variable pointing to a Java installation.
 
-Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}}
+Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. R prior to version 3.4 is deprecated as of Spark 3.0.
--- End diff --

with all the other changes, we haven't listed all the deprecations here, or have we?


---

-



[GitHub] spark pull request #23012: [SPARK-26014][R] Deprecate R prior to version 3.4...

2018-11-12 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23012#discussion_r232881594
  
--- Diff: R/WINDOWS.md ---
@@ -3,7 +3,7 @@
 To build SparkR on Windows, the following steps are required
 
 1. Install R (>= 3.1) and [Rtools](http://cran.r-project.org/bin/windows/Rtools/). Make sure to
-include Rtools and R in `PATH`.
+include Rtools and R in `PATH`. Note that R prior to version 3.4 is deprecated as of Spark 3.0.
--- End diff --

I really would prefer "unsupported" but if we go with this it should say
`Note that support for R prior to version 3.4 is deprecated as of Spark 3.0.0.`


---

-



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-12 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23012
  
FYI, this is unused code; I'm going to remove it:

https://github.com/apache/spark/blob/master/R/pkg/src-native/string_hash_code.c





---

-



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-12 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23012
  
Also I think the warning should be in .First in general.R




---

-



[GitHub] spark issue #23012: [SPARK-26014][R] Deprecate R prior to version 3.4 in Spa...

2018-11-12 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/23012
  
I think this should say unsupported (ie could still work) instead of 
deprecated

Also the compareVersion should check both major and minor ie 3.4.0
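
For reference, `utils::compareVersion` compares version strings component by component and returns -1/0/1, so checking against "3.4.0" covers both major and minor; a small illustration (not part of the patch):

```r
utils::compareVersion("3.1.2", "3.4.0")    # -1: older than 3.4
utils::compareVersion("3.4.0", "3.4.0")    #  0: equal
utils::compareVersion("3.10.1", "3.4.0")   #  1: parts are compared numerically, not lexically
```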



---

-



[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r232500194
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+} else {
+  col <- x@jc
--- End diff --

just that I thought the shortcut syntax in scala is nicer looking than `lit("string")` in R


---

-



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232500065
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,36 +257,72 @@ getDefaultSqlSource <- function() {
 createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
 numPartitions = NULL) {
   sparkSession <- getSparkSession()
-
+  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == 
"true"
+  shouldUseArrow <- FALSE
+  firstRow <- NULL
   if (is.data.frame(data)) {
-  # Convert data into a list of rows. Each row is a list.
-
-  # get the names of columns, they will be put into RDD
-  if (is.null(schema)) {
-schema <- names(data)
-  }
+# get the names of columns, they will be put into RDD
+if (is.null(schema)) {
+  schema <- names(data)
+}
 
-  # get rid of factor type
-  cleanCols <- function(x) {
-if (is.factor(x)) {
-  as.character(x)
-} else {
-  x
-}
+# get rid of factor type
+cleanCols <- function(x) {
+  if (is.factor(x)) {
+as.character(x)
+  } else {
+x
   }
+}
+data[] <- lapply(data, cleanCols)
+
+args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+if (arrowEnabled) {
+  shouldUseArrow <- tryCatch({
+stopifnot(length(data) > 0)
+dataHead <- head(data, 1)
+checkTypeRequirementForArrow(data, schema)
+fileName <- writeToTempFileInArrow(data, numPartitions)
+tryCatch(
+  jrddInArrow <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
+ "readArrowStreamFromFile",
+ sparkSession,
+ fileName),
+finally = {
+  file.remove(fileName)
+})
+
+firstRow <- do.call(mapply, append(args, dataHead))[[1]]
+TRUE
+  },
+  error = function(e) {
+warning(paste0("createDataFrame attempted Arrow optimization 
because ",
+   "'spark.sql.execution.arrow.enabled' is set to 
true; however, ",
+   "failed, attempting non-optimization. Reason: ",
+   e))
+return(FALSE)
--- End diff --

nit: just `FALSE` is good


---

-



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232499902
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,91 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  requireNamespace1 <- requireNamespace
+
+  # For some reasons, Arrow R API requires to load 'defer_parent' which is 
from 'withr' package.
+  # This is a workaround to avoid this error. Otherwise, we should 
directly load 'withr'
+  # package, which CRAN complains about.
+  defer_parent <- function(x, ...) {
+  if (requireNamespace1("withr", quietly = TRUE)) {
+  defer_parent <- get("defer_parent", envir = asNamespace("withr"), 
inherits = FALSE)
+  defer_parent(x, ...)
+} else {
+  stop("'withr' package should be installed.")
+}
+  }
+
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+numPartitions <- if (!is.null(numPartitions)) {
+  numToInt(numPartitions)
+} else {
+  1
--- End diff --

future: consolidate the default here and inside makeSplits


---

-



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232499848
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
   unsetHiveContext()
 })
 
+test_that("createDataFrame Arrow optimization", {
+  skip_if_not_installed("arrow")
+  skip_if_not_installed("withr")
--- End diff --

are we going to ask shane to install arrow/withr on the Jenkins machines?


---

-



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232499794
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -307,6 +307,64 @@ test_that("create DataFrame from RDD", {
   unsetHiveContext()
 })
 
+test_that("createDataFrame Arrow optimization", {
+  skip_if_not_installed("arrow")
+  skip_if_not_installed("withr")
+
+  conf <- callJMethod(sparkSession, "conf")
+  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]]
+
+  callJMethod(conf, "set", "spark.sql.execution.arrow.enabled", "false")
+  tryCatch({
--- End diff --

does this fail? or just a way to inject a finally?


---

-



[GitHub] spark pull request #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cle...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22993#discussion_r232499645
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java ---
@@ -67,6 +67,59 @@
 unaligned = _unaligned;
   }
 
+  // Access fields and constructors once and store them, for performance:
+
+  private static final Constructor DBB_CONSTRUCTOR;
+  private static final Field DBB_CLEANER_FIELD;
+  static {
+try {
+  Class cls = Class.forName("java.nio.DirectByteBuffer");
+  Constructor constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE);
+  constructor.setAccessible(true);
+  Field cleanerField = cls.getDeclaredField("cleaner");
+  cleanerField.setAccessible(true);
+  DBB_CONSTRUCTOR = constructor;
+  DBB_CLEANER_FIELD = cleanerField;
+} catch (ClassNotFoundException | NoSuchMethodException | NoSuchFieldException e) {
+  throw new IllegalStateException(e);
+}
+  }
+
+  private static final Method CLEANER_CREATE_METHOD;
+  static {
+// The implementation of Cleaner changed from JDK 8 to 9
+int majorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]);
--- End diff --

looks like it could be `java.version=1.8.0_192` or `java.version=11.0.1`

ie. first integer or second integer (1.8 => 8?)


---

-



[GitHub] spark pull request #23007: [SPARK-26010] fix vignette eval with Java 11

2018-11-11 Thread felixcheung
GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/23007

[SPARK-26010] fix vignette eval with Java 11

## What changes were proposed in this pull request?

changes in vignette only to disable eval

## How was this patch tested?

Jenkins

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark rjavavervig

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23007.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23007


commit dc4ba6cd813d91622032508e6696b6f56fb19bd9
Author: Felix Cheung 
Date:   2018-11-11T18:30:11Z

fix




---

-



[GitHub] spark pull request #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cle...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22993#discussion_r232477875
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java ---
@@ -67,6 +67,59 @@
 unaligned = _unaligned;
   }
 
+  // Access fields and constructors once and store them, for performance:
+
+  private static final Constructor DBB_CONSTRUCTOR;
+  private static final Field DBB_CLEANER_FIELD;
+  static {
+try {
+  Class cls = Class.forName("java.nio.DirectByteBuffer");
+  Constructor constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE);
+  constructor.setAccessible(true);
+  Field cleanerField = cls.getDeclaredField("cleaner");
+  cleanerField.setAccessible(true);
+  DBB_CONSTRUCTOR = constructor;
+  DBB_CLEANER_FIELD = cleanerField;
+} catch (ClassNotFoundException | NoSuchMethodException | NoSuchFieldException e) {
+  throw new IllegalStateException(e);
+}
+  }
+
+  private static final Method CLEANER_CREATE_METHOD;
+  static {
+// The implementation of Cleaner changed from JDK 8 to 9
+int majorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]);
--- End diff --

is there a defined, fixed format for this?
we are doing some java version checks and found very different formats from different JDK sources (Oracle vs OpenJDK vs IBM ...)


---

-



[GitHub] spark issue #22993: [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in...

2018-11-11 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22993
  
what settings do we need to allow `illegal reflective access`?


---

-



[GitHub] spark issue #22989: [SPARK-25986][Build] Banning throw new OutOfMemoryErrors

2018-11-11 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22989
  
and catching Error or Throwable..


---

-



[GitHub] spark pull request #22989: [SPARK-25986][Build] Banning throw new OutOfMemor...

2018-11-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22989#discussion_r232477783
  
--- Diff: scalastyle-config.xml ---
@@ -240,6 +240,18 @@ This file is divided into 3 sections:
 ]]>
   
 
+  
+throw new OutOfMemoryError
+

[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22977
  
right, I mean both this and that should be part of the process "post-release"


---

-



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232477365
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
--- End diff --

that will be good. circumventing the CRAN check for method names is... problematic..
(there are other, more hacky ways too, but ..)


---

-



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232477325
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+shouldUseArrow <- tryCatch({
+  stopifnot(length(data) > 0)
+  dataHead <- head(data, 1)
+  # Currenty Arrow optimization does not support POSIXct and raw 
for now.
+  # Also, it does not support explicit float type set by users. It 
leads to
+  # incorrect conversion. We will fall back to the path without 
Arrow optimization.
+  if (any(sapply(dataHead, function(x) is(x, "POSIXct" {
+stop("Arrow optimization with R DataFrame does not support 
POSIXct type yet.")
+  }
+  if (any(sapply(dataHead, is.raw))) {
+stop("Arrow optimization with R DataFrame does not support raw 
type yet.")
+  }
+  if (inherits(schema, "structType")) {
+if (any(sapply(schema$fields(), function(x) 
x$dataType.toString() == "FloatType"))) {
+  stop("Arrow optimization with R DataFrame does not support 
FloatType type yet.")
--- End diff --

maybe out of bit range? R doubles only carry a 53-bit mantissa: 
https://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html
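
For reference, a minimal base-R check of the limits being referred to here 
(base R only, nothing package-specific):

    # R numerics are 64-bit doubles; base R has no 32-bit float type, so a
    # Spark FloatType value cannot be represented exactly on the R side.
    .Machine$double.digits   # 53 -- bits in the double mantissa
    .Machine$integer.max     # 2147483647 -- R integers are 32-bit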



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232477271
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+shouldUseArrow <- tryCatch({
+  stopifnot(length(data) > 0)
+  dataHead <- head(data, 1)
+  # Currenty Arrow optimization does not support POSIXct and raw 
for now.
+  # Also, it does not support explicit float type set by users. It 
leads to
+  # incorrect conversion. We will fall back to the path without 
Arrow optimization.
+  if (any(sapply(dataHead, function(x) is(x, "POSIXct" {
+stop("Arrow optimization with R DataFrame does not support 
POSIXct type yet.")
+  }
+  if (any(sapply(dataHead, is.raw))) {
+stop("Arrow optimization with R DataFrame does not support raw 
type yet.")
+  }
+  if (inherits(schema, "structType")) {
+if (any(sapply(schema$fields(), function(x) 
x$dataType.toString() == "FloatType"))) {
+  stop("Arrow optimization with R DataFrame does not support 
FloatType type yet.")
+}
+  }
+  firstRow <- do.call(mapply, append(args, dataHead))[[1]]
+  fileName <- writeToTempFileInArrow(data, numPartitions)
+  tryCatch(
+jrddInArrow <- 
callJStatic("org.apache.spark.sql.api.r.SQLUtils",
+   "readArrowStreamFromFile",
+   sparkSession,
+   fileName),
+  finally = {
+file.remove(fileName)
--- End diff --

yes, just more consistent. I also don't know for sure why all other 
instances are calling unlink
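
For what it's worth, a small sketch of how the two cleanup calls differ 
(`fileName` here stands for the Arrow temp file from the diff above):

    file.remove(fileName)   # warns and returns FALSE if the file is already gone
    unlink(fileName)        # silent if the file is missing; can also remove
                            # directories with recursive = TRUE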


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232477257
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+shouldUseArrow <- tryCatch({
+  stopifnot(length(data) > 0)
+  dataHead <- head(data, 1)
+  # Currenty Arrow optimization does not support POSIXct and raw 
for now.
+  # Also, it does not support explicit float type set by users. It 
leads to
+  # incorrect conversion. We will fall back to the path without 
Arrow optimization.
+  if (any(sapply(dataHead, function(x) is(x, "POSIXct" {
--- End diff --

LG, I tested a few cases too


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232477171
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
 createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
 numPartitions = NULL) {
   sparkSession <- getSparkSession()
-
+  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == 
"true"
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232477155
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
+  }
+  fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
+  chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
+  rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = 
chunk)[1:nrow(rdf)])
--- End diff --

ah, any idea why it was done that way in Python? this seems to be different 
from sc.parallelize, which is what 
https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152
 does


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232477131
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala 
---
@@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
 }
 sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
   }
+
+  /**
+   * R callable function to read a file in Arrow stream format and create 
a `RDD`
+   * using each serialized ArrowRecordBatch as a partition.
+   */
+  def readArrowStreamFromFile(
+  sparkSession: SparkSession,
+  filename: String): JavaRDD[Array[Byte]] = {
+ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, 
filename)
--- End diff --

hmm, I see your point... but there could be hundreds of these wrappers we 
would need to add if we set this as a practice, I'm guessing.

a few problems I see with these wrappers:
1. they are extra work to add and maintain
2. many are very simple and don't add much value
3. many get abandoned over the years - they are not called and not removed

but I kinda see your way; let's keep this one and review any new ones in the 
future.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...

2018-11-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22997
  
btw, please see the page https://spark.apache.org/contributing.html and 
particularly "Pull Request" on the format.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22997: SPARK-25999: make-distribution.sh failure with --r and -...

2018-11-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22997
  
thx, but I'm not sure about this approach. this step will now cause the hadoop 
jars to be packaged into the release tarball of hadoop-provided, which defeats 
the point of hadoop-provided.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...

2018-11-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22977
  
I think there is also a hive metastore test that downloads the spark release 
jar?





---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r232178323
  
--- Diff: docs/sparkr.md ---
@@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` 
commands, or if initiali
 
 
 {% highlight r %}
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
--- End diff --

yes!
(although, let's not use spark here - don't want to encourage naming 
packages with spark in the name)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232170936
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
+  }
+  fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
+  chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
+  rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = 
chunk)[1:nrow(rdf)])
--- End diff --

`1 : ceiling`? `1 : nrow`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232176721
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+shouldUseArrow <- tryCatch({
+  stopifnot(length(data) > 0)
+  dataHead <- head(data, 1)
+  # Currenty Arrow optimization does not support POSIXct and raw 
for now.
+  # Also, it does not support explicit float type set by users. It 
leads to
+  # incorrect conversion. We will fall back to the path without 
Arrow optimization.
+  if (any(sapply(dataHead, function(x) is(x, "POSIXct" {
+stop("Arrow optimization with R DataFrame does not support 
POSIXct type yet.")
+  }
+  if (any(sapply(dataHead, is.raw))) {
+stop("Arrow optimization with R DataFrame does not support raw 
type yet.")
+  }
+  if (inherits(schema, "structType")) {
+if (any(sapply(schema$fields(), function(x) 
x$dataType.toString() == "FloatType"))) {
+  stop("Arrow optimization with R DataFrame does not support 
FloatType type yet.")
+}
+  }
+  firstRow <- do.call(mapply, append(args, dataHead))[[1]]
+  fileName <- writeToTempFileInArrow(data, numPartitions)
+  tryCatch(
+jrddInArrow <- 
callJStatic("org.apache.spark.sql.api.r.SQLUtils",
+   "readArrowStreamFromFile",
+   sparkSession,
+   fileName),
+  finally = {
+file.remove(fileName)
+  })
+  TRUE
+},
+error = function(e) {
+  message(paste0("WARN: createDataFrame attempted Arrow 
optimization because ",
--- End diff --

why not use `warning()` here instead of a "WARN:" message? 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/warning.html
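
i.e. something along these lines - a sketch only; the exact wording is 
illustrative and `e` is the condition object from the surrounding error handler:

    warning(paste0("createDataFrame attempted Arrow optimization because ",
                   "'spark.sql.execution.arrow.enabled' is set to true, but failed; ",
                   "falling back to the non-Arrow path. Error: ", conditionMessage(e)),
            call. = FALSE)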


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232172687
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
+  }
+  fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
+  chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
+  rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = 
chunk)[1:nrow(rdf)])
+  stream_writer <- NULL
+  for (rdf_slice in rdf_slices) {
+batch <- record_batch(rdf_slice)
+if (is.null(stream_writer)) {
+  # We should avoid private calls like 'close_on_exit' (CRAN 
disallows) but looks
+  # there's no exposed API for it. Here's a workaround but ideally 
this should
+  # be removed.
+  close_on_exit <- get("close_on_exit", envir = 
asNamespace("arrow"), inherits = FALSE)
--- End diff --

actually, I think if you use withr here it will call close_on_exit for you? 
(but when?)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232169938
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
--- End diff --

should this default to the cluster's default parallelism?
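
A sketch of what that could look like; treat `getJavaSparkContext` on SQLUtils 
and `defaultParallelism` on the Java SparkContext as assumptions rather than 
verified calls:

    sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
                      "getJavaSparkContext", sparkSession)
    numPartitions <- if (!is.null(numPartitions)) {
      numToInt(numPartitions)
    } else {
      callJMethod(sc, "defaultParallelism")   # follow Spark's default parallelism
    }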


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232173367
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+shouldUseArrow <- tryCatch({
+  stopifnot(length(data) > 0)
+  dataHead <- head(data, 1)
+  # Currenty Arrow optimization does not support POSIXct and raw 
for now.
+  # Also, it does not support explicit float type set by users. It 
leads to
+  # incorrect conversion. We will fall back to the path without 
Arrow optimization.
+  if (any(sapply(dataHead, function(x) is(x, "POSIXct" {
--- End diff --

can you check - I think `is` / `is.x` doesn't do the right thing when

head(df, 1) has a field that is `NA`
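
A quick standalone way to check (not the PR code; just an illustration of the 
concern):

    df <- data.frame(ts = as.POSIXct(c(NA, "2018-11-09 00:00:00")))
    dataHead <- head(df, 1)                          # first row only has NA in ts
    sapply(dataHead, function(x) is(x, "POSIXct"))   # does the column class survive the NA?
    sapply(dataHead, is.raw)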


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232173043
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,10 +221,10 @@ getDefaultSqlSource <- function() {
 createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
 numPartitions = NULL) {
   sparkSession <- getSparkSession()
-
+  arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == 
"true"
--- End diff --

is it always in the conf? I think you can also pass a default value to 
sparkR.conf
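
for example, a sketch assuming the `sparkR.conf(key, defaultValue)` form, which 
returns a named list:

    arrowEnabledConf <- sparkR.conf("spark.sql.execution.arrow.enabled", "false")
    arrowEnabled <- tolower(arrowEnabledConf[[1]]) == "true"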


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232167634
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
--- End diff --

require1 sounds a bit like a hack though...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232172546
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
+  }
+  fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
+  chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
+  rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = 
chunk)[1:nrow(rdf)])
+  stream_writer <- NULL
+  for (rdf_slice in rdf_slices) {
+batch <- record_batch(rdf_slice)
+if (is.null(stream_writer)) {
+  # We should avoid private calls like 'close_on_exit' (CRAN 
disallows) but looks
+  # there's no exposed API for it. Here's a workaround but ideally 
this should
+  # be removed.
+  close_on_exit <- get("close_on_exit", envir = 
asNamespace("arrow"), inherits = FALSE)
--- End diff --

so is this an API missing in Arrow?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232177132
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+shouldUseArrow <- tryCatch({
--- End diff --

refactor this into a method?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232167926
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala 
---
@@ -225,4 +226,25 @@ private[sql] object SQLUtils extends Logging {
 }
 sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
   }
+
+  /**
+   * R callable function to read a file in Arrow stream format and create 
a `RDD`
+   * using each serialized ArrowRecordBatch as a partition.
+   */
+  def readArrowStreamFromFile(
+  sparkSession: SparkSession,
+  filename: String): JavaRDD[Array[Byte]] = {
+ArrowConverters.readArrowStreamFromFile(sparkSession.sqlContext, 
filename)
--- End diff --

what's the advantage of adding this wrapper here - I've been thinking of 
eliminating most of these if possible and just using callJMethod on 
`ArrowConverters`, say?
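
i.e. something like this on the R side - a sketch; the fully-qualified 
`ArrowConverters` name and its signature are assumptions based on the quoted 
Scala code:

    jrddInArrow <- callJStatic("org.apache.spark.sql.execution.arrow.ArrowConverters",
                               "readArrowStreamFromFile",
                               callJMethod(sparkSession, "sqlContext"),
                               fileName)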


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232176774
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -189,19 +238,67 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   x
 }
   }
+  data[] <- lapply(data, cleanCols)
 
-  # drop factors and wrap lists
-  data <- setNames(lapply(data, cleanCols), NULL)
+  args <- list(FUN = list, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  if (arrowEnabled) {
+shouldUseArrow <- tryCatch({
+  stopifnot(length(data) > 0)
+  dataHead <- head(data, 1)
+  # Currenty Arrow optimization does not support POSIXct and raw 
for now.
+  # Also, it does not support explicit float type set by users. It 
leads to
+  # incorrect conversion. We will fall back to the path without 
Arrow optimization.
+  if (any(sapply(dataHead, function(x) is(x, "POSIXct" {
+stop("Arrow optimization with R DataFrame does not support 
POSIXct type yet.")
+  }
+  if (any(sapply(dataHead, is.raw))) {
+stop("Arrow optimization with R DataFrame does not support raw 
type yet.")
+  }
+  if (inherits(schema, "structType")) {
+if (any(sapply(schema$fields(), function(x) 
x$dataType.toString() == "FloatType"))) {
+  stop("Arrow optimization with R DataFrame does not support 
FloatType type yet.")
+}
+  }
+  firstRow <- do.call(mapply, append(args, dataHead))[[1]]
+  fileName <- writeToTempFileInArrow(data, numPartitions)
+  tryCatch(
+jrddInArrow <- 
callJStatic("org.apache.spark.sql.api.r.SQLUtils",
+   "readArrowStreamFromFile",
+   sparkSession,
+   fileName),
+  finally = {
+file.remove(fileName)
--- End diff --

unlink?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232171176
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
+
+# Currently arrow requires withr; otherwise, write APIs don't work.
+# Direct 'require' is not recommended by CRAN. Here's a workaround.
+require1 <- require
+if (require1("withr", quietly = TRUE)) {
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
+  }
+  fileName <- tempfile(pattern = "spark-arrow", fileext = ".tmp")
+  chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))
+  rdf_slices <- split(rdf, rep(1:ceiling(nrow(rdf) / chunk), each = 
chunk)[1:nrow(rdf)])
--- End diff --

how are these slices computed? is it similar to 
https://github.com/apache/spark/blob/c3b4a94a91d66c172cf332321d3a78dba29ef8f0/R/pkg/R/context.R#L152
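
A standalone sketch of what the quoted split/rep line computes (names follow 
the diff; the values are just for illustration):

    rdf <- data.frame(x = 1:10)
    numPartitions <- 3
    chunk <- as.integer(ceiling(nrow(rdf) / numPartitions))                      # 4 rows per chunk
    sliceIds <- rep(1:ceiling(nrow(rdf) / chunk), each = chunk)[1:nrow(rdf)]     # 1 1 1 1 2 2 2 2 3 3
    rdf_slices <- split(rdf, sliceIds)
    sapply(rdf_slices, nrow)                                                     # 4 4 2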


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232167480
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,55 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  # R API in Arrow is not yet released. CRAN requires to add the package 
in requireNamespace
+  # at DESCRIPTION. Later, CRAN checks if the package is available or not. 
Therefore, it works
+  # around by avoiding direct requireNamespace.
+  requireNamespace1 <- requireNamespace
+  if (requireNamespace1("arrow", quietly = TRUE)) {
+record_batch <- get("record_batch", envir = asNamespace("arrow"), 
inherits = FALSE)
+record_batch_stream_writer <- get(
+  "record_batch_stream_writer", envir = asNamespace("arrow"), inherits 
= FALSE)
+file_output_stream <- get(
+  "file_output_stream", envir = asNamespace("arrow"), inherits = FALSE)
+write_record_batch <- get(
+  "write_record_batch", envir = asNamespace("arrow"), inherits = FALSE)
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [SPARK-25981][R] Enables Arrow optimization from ...

2018-11-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r232167110
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -215,14 +278,16 @@ createDataFrame <- function(data, schema = NULL, 
samplingRatio = 1.0,
   }
 
   if (is.null(schema) || (!inherits(schema, "structType") && 
is.null(names(schema {
-row <- firstRDD(rdd)
+if (is.null(firstRow)) {
+  firstRow <- firstRDD(rdd)
--- End diff --

I <3 4 digits!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r232166370
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", 
schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in 
DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", 
x)
+} else {
+  col <- x@jc
--- End diff --

hm.. why not just support a string then? it's a kinda odd usage in R: 
`schema_of_csv(lit("Amsterdam,2018"))`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r231819016
  
--- Diff: docs/sparkr.md ---
@@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` 
commands, or if initiali
 
 
 {% highlight r %}
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
--- End diff --

I'd not worry about this example too much - this could be 
`com.databricks:spark-avro_2.12:3.0.0`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...

2018-11-07 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22932
  
Does it have different values for the new native ORC writer and the old Hive 
ORC writer?






---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22948: [SPARK-25944][R][BUILD] AppVeyor change to latest...

2018-11-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22948#discussion_r231598045
  
--- Diff: dev/appveyor-install-dependencies.ps1 ---
@@ -115,7 +115,7 @@ $env:Path += ";$env:HADOOP_HOME\bin"
 Pop-Location
 
 # == R
-$rVer = "3.4.1"
+$rVer = "3.5.1"
 $rToolsVer = "3.4.0"
--- End diff --

is there an update to Rtools?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r231596680
  
--- Diff: R/pkg/R/functions.R ---
@@ -1663,9 +1692,24 @@ setMethod("toDegrees",
 #' @aliases toRadians toRadians,Column-method
 #' @note toRadians since 1.4.0
 setMethod("toRadians",
+signature(x = "Column"),
--- End diff --

fix indentation? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r231403827
  
--- Diff: R/pkg/R/functions.R ---
@@ -319,6 +319,27 @@ setMethod("acos",
 column(jc)
   })
 
+#' @details
+#' \code{approx_count_distinct}: Returns the approximate number of 
distinct items in a group.
+#'
+#' @rdname column_aggregate_functions
+#' @aliases approx_count_distinct approx_count_distinct,Column-method
+#' @examples
+#'
+#' \dontrun{
+#' head(select(df, approx_count_distinct(df$gear)))
+#' head(select(df, approx_count_distinct(df$gear, 0.02)))
+#' head(select(df, countDistinct(df$gear, df$cyl)))
+#' head(select(df, n_distinct(df$gear)))
+#' head(distinct(select(df, "gear")))}
--- End diff --

we only need one set - they are both `@rdname column_aggregate_functions`, so 
this will duplicate all the other examples


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r231403096
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", 
schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in 
DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", 
x)
+} else {
+  col <- x@jc
--- End diff --

you are saying this `select(df, schema_of_csv(df$schemaCol))` is not 
allowed?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402726
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
+  stopifnot(require("withr", quietly = TRUE))
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
+  }
+  fileName <- tempfile()
--- End diff --

might need to give it a dir prefix - the default tempfile location is not 
CRAN compliant, and there could be ACL issues


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402235
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
--- End diff --

btw, is it worthwhile to check the arrow package version?
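
for example, a minimal sketch of such a check - the minimum version string is 
a placeholder, not a verified requirement:

    if (requireNamespace("arrow", quietly = TRUE) &&
        utils::packageVersion("arrow") < "0.11.0") {
      stop("Arrow optimization in SparkR requires the 'arrow' R package 0.11.0 or later.")
    }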


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402297
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,15 +196,17 @@ getDefaultSqlSource <- function() {
 createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
 numPartitions = NULL) {
   sparkSession <- getSparkSession()
-
+  conf <- callJMethod(sparkSession, "conf")
+  arrowEnabled <- tolower(callJMethod(conf, "get", 
"spark.sql.execution.arrow.enabled")) == "true"
--- End diff --

I think you can use sparkR.conf


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402063
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
+  stopifnot(require("withr", quietly = TRUE))
--- End diff --

is it possible to not depend on this withr? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231401994
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
--- End diff --

perhaps best to add a clearer error message?
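
e.g. something like this (the wording is illustrative only):

    if (!requireNamespace("arrow", quietly = TRUE)) {
      stop("The 'arrow' R package must be installed when ",
           "'spark.sql.execution.arrow.enabled' is set to true.")
    }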


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r231025592
  
--- Diff: R/pkg/R/functions.R ---
@@ -205,11 +205,18 @@ NULL
 #'  also supported for the schema.
 #'  \item \code{from_csv}: a DDL-formatted string
 #'  }
-#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and 
\code{from_json},
-#'this contains additional named properties to control how it 
is converted, accepts
-#'the same options as the JSON/CSV data source. Additionally 
\code{to_json} supports
-#'the "pretty" option which enables pretty JSON generation. In 
\code{arrays_zip},
-#'this contains additional Columns of arrays to be merged.
+#' @param ... additional argument(s).
+#'  \itemize{
+#'  \item \code{to_json}, \code{from_json} and 
\code{schema_of_json}: this contains
+#'  additional named properties to control how it is converted 
and accepts the
+#'  same options as the JSON data source.
+#'  \item \code{to_json}: it supports the "pretty" option which 
enables pretty
--- End diff --

I know it was there before, but I'd like to suggest giving an example - a doc 
or code example below.
it's a bit different from python/scala, I think
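
for instance, a short sketch of the kind of example I mean, for the "pretty" 
option of to_json (how the option value is passed follows the JSON data 
source; treat the exact call as illustrative):

    df <- sql("SELECT * FROM range(1)")
    head(select(df, to_json(struct(df$id), pretty = "true")))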


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


