[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22919#discussion_r230701471
  
--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

Yeah... but `-i` doesn't handle implicits like `toDF` or symbols, which are pretty basic ones. I think we'd better avoid documenting it for now.


---




[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22913
  
We map `POSIXct` to `TimestampType` in R. I think it makes more sense not to support this. Historically, Parquet has also chosen not to support some types - for instance, `UINT_*`.


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230698731
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala 
---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Yeah, but a date is a date. Otherwise, users should use timestamp types if they don't want truncation. I don't think we should switch the type to work around its capability.


---




[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22919#discussion_r230689462
  
--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

Oh haha, sorry. That's in `SparkSubmitArguments.printUsageAndExit` - that's why I left https://github.com/apache/spark/pull/22919#discussion_r230554455


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
Let me clean up and deal with other comments.


---




[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r230688803
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+} else {
+  col <- x@jc
--- End diff --

That's actually related to the Scala API. There are too many overloaded versions of functions in `functions.scala`, so we're trying to reduce them. `Column` is preferred over other specific types because a `Column` can cover the other expression cases; in Python or R, the other types can be easily supported, so both plain types and `Column` are accepted there. To cut it short: for consistency with the Scala API.
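
As a rough illustration of that convention (a sketch only - `schema_of_json` taking a `Column` does exist in `functions.scala`, but the snippet is illustrative, not the actual source):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, schema_of_json}

// One Column-based overload is enough: a plain string can always be
// lifted into a Column with lit(), so no String overload is needed.
val schema: Column = schema_of_json(lit("""{"name":"Bob"}"""))
```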


---




[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r230686967
  
--- Diff: R/pkg/R/functions.R ---
@@ -205,11 +205,18 @@ NULL
 #'  also supported for the schema.
 #'  \item \code{from_csv}: a DDL-formatted string
 #'  }
-#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and 
\code{from_json},
-#'this contains additional named properties to control how it 
is converted, accepts
-#'the same options as the JSON/CSV data source. Additionally 
\code{to_json} supports
-#'the "pretty" option which enables pretty JSON generation. In 
\code{arrays_zip},
-#'this contains additional Columns of arrays to be merged.
+#' @param ... additional argument(s).
+#'  \itemize{
+#'  \item \code{to_json}, \code{from_json} and 
\code{schema_of_json}: this contains
+#'  additional named properties to control how it is converted 
and accepts the
+#'  same options as the JSON data source.
+#'  \item \code{to_json}: it supports the "pretty" option which 
enables pretty
--- End diff --

Yup.


---




[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22919#discussion_r230675018
  
--- Diff: bin/spark-shell ---
@@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then
   source "$(dirname "$0")"/find-spark-home
 fi
 
-export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
+export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]
+
+Scala REPL options:
+  -I <file>                   preload <file>, enforcing line-by-line interpretation"
--- End diff --

I tested the other options and this one looks like the only valid one. I described it in the PR description.


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
Will make another PR after this gets merged to allow the cases below:


```r
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_json(df$people))
head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))
```

```
  from_json(people_json)
1                    Bob
```

```r
df <- sql("SELECT named_struct('name', 'Bob') as people")
df <- mutate(df, people_json = to_csv(df$people))
head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))
```

```
  from_csv(people_json)
1                   Bob
```


---




[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r230616072
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -306,7 +306,15 @@ case class FileSourceScanExec(
   withOptPartitionCount
 }
 
-withSelectedBucketsCount
+val withOptColumnCount = relation.fileFormat match {
+  case columnar: ColumnarFileFormat =>
+val sqlConf = relation.sparkSession.sessionState.conf
+val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
+withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

I was wondering how important it is to know whether the columns are pruned or not. By that reasoning, other logs should be put in metadata too. For instance, we're not even showing the actual filters on the Spark side (not the Catalyst filters, but the actual pushed filters that are going to be applied at each source implementation level, such as filters from `ParquetFilters.createFilter`).


---




[GitHub] spark issue #18581: [SPARK-21289][SQL][ML] Supports custom line separator fo...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18581
  
What you see is what you get. It's not yet finished. See also 
https://github.com/apache/spark/pull/20877#issuecomment-429182740


---




[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22914
  
retest this please


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
Makes sense. Let me separate it tomorrow.


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
retest this please


---




[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r230586828
  
--- Diff: R/pkg/R/functions.R ---
@@ -202,14 +202,18 @@ NULL
 #'  \itemize{
 #'  \item \code{from_json}: a structType object to use as the 
schema to use
 #'  when parsing the JSON string. Since Spark 2.3, the 
DDL-formatted string is
-#'  also supported for the schema.
-#'  \item \code{from_csv}: a DDL-formatted string
+#'  also supported for the schema. Since Spark 3.0, 
\code{schema_of_json} or
+#'  a DDL-formatted string literal can also be accepted.
+#'  \item \code{from_csv}: a structType object, DDL-formatted 
string, \code{schema_of_csv}
+#'  or DDL-formatted string literal
 #'  }
-#' @param ... additional argument(s). In \code{to_json}, \code{to_csv} and 
\code{from_json},
-#'this contains additional named properties to control how it 
is converted, accepts
-#'the same options as the JSON/CSV data source. Additionally 
\code{to_json} supports
-#'the "pretty" option which enables pretty JSON generation. In 
\code{arrays_zip},
-#'this contains additional Columns of arrays to be merged.
+#' @param ... additional argument(s). In \code{to_json}, \code{from_json} 
and
--- End diff --

Yea I think so. Let me try.


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r230581932
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -552,13 +552,19 @@ case class JsonToStructs(
 
   // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = nullableSchema match {
-case _: StructType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
-case _: ArrayType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
-case _: MapType =>
-  (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
+  lazy val converter = (rows: Iterator[InternalRow]) => {
+if (rows.hasNext) {
+  val result = rows.next()
+  // JSON's parser produces one record only.
+  assert(!rows.hasNext)
+  nullableSchema match {
+case _: StructType => result
--- End diff --

@MaxGekk, this looks like it's going to type-dispatch here per record.
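
A minimal sketch of the concern (illustrative, not the PR's code): the match on `nullableSchema` sits inside the per-record path, so it could be bound once instead.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

// Sketch: dispatch on nullableSchema once, then reuse the bound extractor
// for every record (nullableSchema comes from the surrounding expression).
lazy val converter: Iterator[InternalRow] => Any = {
  val extract: InternalRow => Any = nullableSchema match {
    case _: StructType => row => row
    case _: ArrayType  => row => row.getArray(0)
    case _: MapType    => row => row.getMap(0)
  }
  rows => if (rows.hasNext) extract(rows.next()) else null
}
```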


---




[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22938
  
ok to test


---




[GitHub] spark issue #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22940
  
@felixcheung, I don't feel super strongly about it but it bugged me. WDYT?


---




[GitHub] spark pull request #22940: [MINOR][R] Rename SQLUtils name to RSQLUtils in R

2018-11-04 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22940

[MINOR][R] Rename SQLUtils name to RSQLUtils in R

## What changes were proposed in this pull request?

This PR proposes to rename `SQLUtils` to `RSQLUtils` in R. This bugged me before, but I didn't propose the change because of backporting concerns. Since we're going to 3.0.0, the backporting concern is likely small. The name `SQLUtils` sounds quite confusing to me.

This basically targets matching the Python-side utils:

`org.apache.spark.sql.api.python.PythonSQLUtils`
`org.apache.spark.api.python.PythonUtils`

**Before:**

`org.apache.spark.sql.api.r.SQLUtils`
`org.apache.spark.api.r.RUtils`

**After:**

`org.apache.spark.sql.api.r.RSQLUtils`
`org.apache.spark.api.r.RUtils`


## How was this patch tested?

Existing tests should cover.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark match-names

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22940.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22940


commit fa4b6f91f0f0b69d54142132be1d7f07cc3dd9b9
Author: hyukjinkwon 
Date:   2018-11-04T09:37:43Z

Match names to SQLUtils in R




---




[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22938
  
add to whitelist


---




[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22938
  
ok to test


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
cc @felixcheung and @MaxGekk 


---




[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-04 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22939

[SPARK-25446][R] Add schema_of_json() and schema_of_csv() to R

## What changes were proposed in this pull request?

This PR proposes to expose `schema_of_json` and `schema_of_csv` at R side.

**`schema_of_json`**:

```r
> json <- '{"name":"Bob"}'
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_json(json)))
  schema_of_json({"name":"Bob"})
1            struct<name:string>
```

**`schema_of_csv`**:

```r
> csv <- "Amsterdam,2018"
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_csv(csv)))
  schema_of_csv(Amsterdam,2018)
1    struct<_c0:string,_c1:int>
```

This is useful when it's used with [to|from]_[csv|json]:

```r
> df <- sql("SELECT named_struct('name', 'Bob') as people")
> df <- mutate(df, people_json = to_json(df$people))
> head(select(df, from_json(df$people_json, schema_of_json(head(df)$people_json))))
  from_json(people_json)
1                    Bob
```

```r
> df <- sql("SELECT named_struct('name', 'Bob') as people")
> df <- mutate(df, people_json = to_csv(df$people))
> head(select(df, from_csv(df$people_json, schema_of_csv(head(df)$people_json))))
  from_csv(people_json)
1                   Bob
```

## How was this patch tested?

Manually tested, unit tests added, documentation manually built and 
verified.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-25446

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22939.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22939


commit c4a78fc0b14876a857bdd9b2f8f094744dd76c04
Author: hyukjinkwon 
Date:   2018-11-04T08:46:20Z

Add schema_of_json() and schema_of_csv() to R




---




[GitHub] spark issue #22934: [INFRA] Close stale PRs

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22934
  
Merged to master.


---




[GitHub] spark issue #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22626
  
Merged to master.


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r230577431
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -174,3 +176,68 @@ case class SchemaOfCsv(
 
   override def prettyName: String = "schema_of_csv"
 }
+
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(expr[, options]) - Returns a CSV string with a given 
struct value",
+  examples = """
+Examples:
+  > SELECT _FUNC_(named_struct('a', 1, 'b', 2));
+   1,2
+  > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+   "26/08/2015"
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class StructsToCsv(
+ options: Map[String, String],
+ child: Expression,
+ timeZoneId: Option[String] = None)
+  extends UnaryExpression with TimeZoneAwareExpression with 
CodegenFallback with ExpectsInputTypes {
+  override def nullable: Boolean = true
+
+  def this(options: Map[String, String], child: Expression) = 
this(options, child, None)
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression) = this(Map.empty, child, None)
+
+  def this(child: Expression, options: Expression) =
+this(
+  options = ExprUtils.convertToMapData(options),
+  child = child,
+  timeZoneId = None)
+
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val inputSchema: StructType = child.dataType match {
+case st: StructType => st
+case other =>
+  throw new IllegalArgumentException(s"Unsupported input type 
${other.catalogString}")
+  }
+
+  @transient
+  lazy val gen = new UnivocityGenerator(
+inputSchema, writer, new CSVOptions(options, columnPruning = true, timeZoneId.get))
--- End diff --

nit: We wouldn't need `lazy val writer` then - just `new CharArrayWriter()` here.
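
In other words, roughly (a sketch of the nit, reusing the fields from the quoted diff):

```scala
import java.io.CharArrayWriter

// Sketch: since the converter goes through gen.writeToString and never
// reads `writer` directly, the writer can be passed inline rather than
// kept as a separate lazy val field.
@transient
lazy val gen = new UnivocityGenerator(
  inputSchema, new CharArrayWriter(), new CSVOptions(options, columnPruning = true, timeZoneId.get))
```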


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Documents '-I' option (from Scala R...

2018-11-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Let me get this into master and branch-2.4 in a few days if there are no more comments.


---




[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22914
  
retest this please


---




[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22919#discussion_r230554455
  
--- Diff: bin/spark-shell2.cmd ---
@@ -20,7 +20,13 @@ rem
 rem Figure out where the Spark framework is installed
 call "%~dp0find-spark-home.cmd"
 
-set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]
+set LF=^
+
+
+rem two empty lines are required
+set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^
+Scala REPL options:^%LF%%LF%^
--- End diff --

Script-specific information looks like it's included in `_SPARK_CMD_USAGE` - that looks like a more appropriate place than somewhere in `SparkSubmitArguments.printUsageAndExit`.


---




[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22914
  
retest this please


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Documents '-I' option (from Scala R...

2018-11-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Should be ready for a review.


---




[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22919#discussion_r230554256
  
--- Diff: bin/spark-shell2.cmd ---
@@ -20,7 +20,13 @@ rem
 rem Figure out where the Spark framework is installed
 call "%~dp0find-spark-home.cmd"
 
-set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]
+set LF=^
+
+
+rem two empty lines are required
+set _SPARK_CMD_USAGE=Usage: .\bin\spark-shell.cmd [options]^%LF%%LF%^%LF%%LF%^
+Scala REPL options:^%LF%%LF%^
--- End diff --

There seems to be no cleverer way than this to set newlines in variables in batch files.


---




[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-03 Thread HyukjinKwon
GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/spark/pull/22919

[SPARK-25906][SHELL] Documents '-I' option (from Scala REPL) in spark-shell

## What changes were proposed in this pull request?

Looks like we mistakenly changed the `-i` option behaviour in `spark-shell`. This PR targets restoring the previous option and behaviour.

The root cause seems to be 
https://github.com/scala/scala/commit/99dad60d984d3f72338f3bad4c4fe905090edd51. 
They changed what `-i` means in that commit; the previous `-i` behaviour is now under `-I`.

The _newly replaced_ option `-i` at Scala 2.11.12 works like `:paste` (previously it worked like `:load`). `:paste` doesn't seem to work with implicits - at least I verified this on Spark 2.4.0, 2.3.2, 2.0.0, and the current master:

```bash
scala> :paste test.scala
Pasting file test.scala...
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
       spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
                                                                                ^
```

Note that `./bin/spark-shell --help` does not describe this option, so I guess it's not explicitly documented; however, it's best not to break it. The changes are only two lines.

In particular, we should backport this to branch-2.4.

## How was this patch tested?

Manually tested.


With the input below:

```bash
$ cat test.scala
spark.version
case class Record(key: Int, value: String)
spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
```

**Spark 2.3.2:**

```scala
$ bin/spark-shell -i test.scala
...
+---+-+
|key|value|
+---+-+
|  1|val_1|
|  2|val_2|
+---+-+
```

**Before:**

```scala
$ bin/spark-shell -i test.scala
...
test.scala:17: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
       spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
                                                                                ^
```

**After:**

```scala
$ ./bin/spark-shell -i test.scala
...
+---+-+
|key|value|
+---+-+
|  1|val_1|
|  2|val_2|
+---+-+
```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25906

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22919.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22919


commit 5f3cb87c8798e72cc6852e71c02ffc2077c748d7
Author: hyukjinkwon 
Date:   2018-11-03T12:48:25Z

Documents '-I' option (from Scala REPL) in spark-shell




---




[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...

2018-11-03 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/22919


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22921
  
Looks okay to me too, but I'd also leave this open for a few more days.


---




[GitHub] spark pull request #22908: [MINOR][SQL] Replace all TreeNode's node name in ...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22908#discussion_r230544923
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -56,7 +56,7 @@ case class DataSourceV2Relation(
 
   override def pushedFilters: Seq[Expression] = Seq.empty
 
-  override def simpleString: String = "RelationV2 " + metadataString
+  override def simpleString: String = s"$nodeName " + metadataString
--- End diff --

I'd follow this comment, actually.


---




[GitHub] spark pull request #22913: [SPARK-25902][SQL] Add support for dates with mil...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22913#discussion_r230544838
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala 
---
@@ -71,6 +71,7 @@ object ArrowUtils {
 case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
 case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
 case ts: ArrowType.Timestamp if ts.getUnit == TimeUnit.MICROSECOND => TimestampType
+case date: ArrowType.Date if date.getUnit == DateUnit.MILLISECOND => TimestampType
--- End diff --

Wait... is it correct to map it to `TimestampType`? Looks like this is why `Date` with `MILLISECOND` was not added.


---




[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22913
  
Can we add a test in `ArrowUtilsSuite.scala`?


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r230544775
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql ---
@@ -15,3 +15,10 @@ CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * 
FROM VALUES ('1,abc', 'a
 SELECT schema_of_csv(csvField) FROM csvTable;
 -- Clean up
 DROP VIEW IF EXISTS csvTable;
+-- to_csv
+select to_csv(named_struct('a', 1, 'b', 2));
+select to_csv(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+-- Check if errors handled
+select to_csv(named_struct('a', 1, 'b', 2), named_struct('mode', 'PERMISSIVE'));
+select to_csv(named_struct('a', 1, 'b', 2), map('mode', 1));
--- End diff --

This one too, since the exception is from `convertToMapData`. We only need one test - this one or the one right above; one of them can be removed.


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r230544760
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql ---
@@ -15,3 +15,10 @@ CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * 
FROM VALUES ('1,abc', 'a
 SELECT schema_of_csv(csvField) FROM csvTable;
 -- Clean up
 DROP VIEW IF EXISTS csvTable;
+-- to_csv
+select to_csv(named_struct('a', 1, 'b', 2));
+select to_csv(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+-- Check if errors handled
+select to_csv(named_struct('a', 1, 'b', 2), named_struct('mode', 'PERMISSIVE'));
+select to_csv(named_struct('a', 1, 'b', 2), map('mode', 1));
+select to_csv();
--- End diff --

I think we don't have to test this since it's not specific to this 
expression.


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r230544717
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -174,3 +176,66 @@ case class SchemaOfCsv(
 
   override def prettyName: String = "schema_of_csv"
 }
+
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(expr[, options]) - Returns a CSV string with a given 
struct value",
+  examples = """
+Examples:
+  > SELECT _FUNC_(named_struct('a', 1, 'b', 2));
+   1,2
+  > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+   "26/08/2015"
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class StructsToCsv(
+ options: Map[String, String],
+ child: Expression,
+ timeZoneId: Option[String] = None)
+  extends UnaryExpression with TimeZoneAwareExpression with 
CodegenFallback with ExpectsInputTypes {
+  override def nullable: Boolean = true
+
+  def this(options: Map[String, String], child: Expression) = 
this(options, child, None)
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression) = this(Map.empty, child, None)
+
+  def this(child: Expression, options: Expression) =
+this(
+  options = ExprUtils.convertToMapData(options),
+  child = child,
+  timeZoneId = None)
+
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val inputSchema: StructType = child.dataType match {
+case st: StructType => st
+case other =>
+  throw new IllegalArgumentException(s"Unsupported input type 
${other.catalogString}")
+  }
+
+  @transient
+  lazy val gen = new UnivocityGenerator(
+inputSchema, writer, new CSVOptions(options, columnPruning = true, 
timeZoneId.get))
+
+  // This converts rows to the CSV output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+(row: Any) => 
UTF8String.fromString(gen.writeToString(row.asInstanceOf[InternalRow]))
+  }
+
+  override def dataType: DataType = StringType
+
+  override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
+copy(timeZoneId = Option(timeZoneId))
+
+  override def nullSafeEval(value: Any): Any = converter(value)
+
+  override def inputTypes: Seq[AbstractDataType] = TypeCollection(StructType) :: Nil
--- End diff --

I think we can use `StructType :: Nil`.
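
That is (a sketch of the suggestion - the `StructType` companion object already extends `AbstractDataType`, so no `TypeCollection` wrapper is needed for a single type):

```scala
override def inputTypes: Seq[AbstractDataType] = StructType :: Nil
```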


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r230544667
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -174,3 +176,66 @@ case class SchemaOfCsv(
 
   override def prettyName: String = "schema_of_csv"
 }
+
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(expr[, options]) - Returns a CSV string with a given 
struct value",
+  examples = """
+Examples:
+  > SELECT _FUNC_(named_struct('a', 1, 'b', 2));
+   1,2
+  > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+   "26/08/2015"
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class StructsToCsv(
+ options: Map[String, String],
+ child: Expression,
+ timeZoneId: Option[String] = None)
+  extends UnaryExpression with TimeZoneAwareExpression with 
CodegenFallback with ExpectsInputTypes {
+  override def nullable: Boolean = true
+
+  def this(options: Map[String, String], child: Expression) = 
this(options, child, None)
+
+  // Used in `FunctionRegistry`
+  def this(child: Expression) = this(Map.empty, child, None)
+
+  def this(child: Expression, options: Expression) =
+this(
+  options = ExprUtils.convertToMapData(options),
+  child = child,
+  timeZoneId = None)
+
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val inputSchema: StructType = child.dataType match {
+case st: StructType => st
+case other =>
+  throw new IllegalArgumentException(s"Unsupported input type 
${other.catalogString}")
+  }
+
+  @transient
+  lazy val gen = new UnivocityGenerator(
+inputSchema, writer, new CSVOptions(options, columnPruning = true, 
timeZoneId.get))
+
+  // This converts rows to the CSV output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+(row: Any) => UTF8String.fromString(gen.writeToString(row.asInstanceOf[InternalRow]))
--- End diff --

@MaxGekk, can we use the data from `writer`, like `writer.toString` and `writer.reset()`, as `to_json` does? Looks like we are going to skip the header (which is fine). If we explicitly set `header` to `false` in this expression, it looks like we don't need to add `writeToString` to `UnivocityGenerator`.
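
A rough sketch of the suggested shape, mirroring `StructsToJson` (assumptions: `UnivocityGenerator` exposes `write(row)` and a `flush()` analogous to `JacksonGenerator`; illustrative only):

```scala
// Sketch: reuse the shared writer buffer instead of a dedicated
// writeToString helper on UnivocityGenerator.
lazy val converter: Any => UTF8String = { (row: Any) =>
  gen.write(row.asInstanceOf[InternalRow]) // append one CSV line into `writer`
  gen.flush()                              // assumed flush, as in JacksonGenerator
  val csv = writer.toString
  writer.reset()
  UTF8String.fromString(csv)
}
```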


---




[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22913
  
ok to test


---




[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22914
  
retest this please


---




[GitHub] spark issue #22913: [SPARK-25902][SQL] Add support for dates with millisecon...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22913
  
cc @BryanCutler 


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r230544556
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala ---
@@ -45,7 +45,6 @@ class CsvFunctionsSuite extends QueryTest with 
SharedSQLContext {
   Row(Row(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))
   }
 
-
--- End diff --

Ah, I prefer not to include unrelated changes, but it's okay.


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r230544492
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala
 ---
@@ -15,18 +15,17 @@
  * limitations under the License.
  */
 
-package org.apache.spark.sql.execution.datasources.csv
+package org.apache.spark.sql.catalyst.csv
 
 import java.io.Writer
 
 import com.univocity.parsers.csv.CsvWriter
 
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.csv.CSVOptions
 import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.types._
 
-private[csv] class UnivocityGenerator(
+private[sql] class UnivocityGenerator(
--- End diff --

Let's remove `private[sql]`. We are already in an internal package 
`catalyst`.


---




[GitHub] spark issue #22923: [SPARK-25910][CORE] accumulator updates from previous st...

2018-11-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22923
  
retest this please


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Let me go ahead with the documenting one tomorrow, then.


---




[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r230250684
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala
 ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides 
some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

Thanks for the explanation. Mind changing it to `private[datasources]`? I am still not clear on:

1. Whether this information is worth putting in metadata, and/or why it should be there
2. Whether this information can be generalised
3. The purpose of this info - is it to check whether the columns are actually being pruned or not?


---




[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r230249905
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -306,7 +306,15 @@ case class FileSourceScanExec(
   withOptPartitionCount
 }
 
-withSelectedBucketsCount
+val withOptColumnCount = relation.fileFormat match {
+  case columnar: ColumnarFileFormat =>
+val sqlConf = relation.sparkSession.sessionState.conf
+val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema)
+withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

The purpose of this info is to check the number of columns actually selected, and that information can be shown via logging, no? Why should it be exposed in metadata, then?

Maybe add debug logging that shows the number of columns actually being selected by the underlying source.
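
For instance, something like this (a hypothetical sketch, not proposed code):

```scala
// Hypothetical: surface the selected column count at debug level
// instead of putting it into the plan's metadata map.
logDebug(s"FileSourceScanExec: selected $columnCount of " +
  s"${relation.dataSchema.fields.length} columns for ${relation.fileFormat}")
```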


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
True, but it's a bit difficult to call it a downside, because `-i` has a quite major issue as described above - users' code suddenly stops working with implicit methods (like the symbol `'id` or `.toDF`, which are pretty common) across a minor release bump. Also, it looks like the only reason `:load` was changed to `:paste` for the `-i` option is a bug (SI-7898).

For documentation, I can mention that Scala shell flags can be used. Looks like @cloud-fan is going to leave this JIRA as a known issue, so I guess we are good. Maybe I can leave an additional note in the migration guide as well if we're not going ahead with fixing it.


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Thanks, @dbtsai. Also adding @vanzin and @srowen - WDYT about the `-i` option at 2.4.x and 3.0.0?


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Also, the difference introduced between `-i` and `-I` sounds more like a bug fix ([SI-7898](https://issues.scala-lang.org/browse/SI-7898)) for a bug that already existed in the previous shell. I ended up thinking that end users upgrading Spark and facing such issues might matter more than keeping sync with the Scala 2.11.12 shell in Spark 2.4.x.

I don't feel super-duper strongly about it, but that was my conclusion. I will defer to the folks here about whether to go ahead this way or not.


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Note that the `python` shell also takes a file, but the `pyspark` shell doesn't:

```
$ cat tmp.py
print "a"
```

```
$ python2 tmp.py
a
```

```
$ ./bin/pyspark tmp.py
Running python applications through 'pyspark' is not supported as of Spark 2.0.
Use ./bin/spark-submit <python file>

```


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
I was actually thinking hard about that. My conclusion was that basically we should, but `spark-shell` is a Spark entry point for which we've made a release - so I thought it's better to keep it as-is in 2.4.0.

In 3.0.0, I actually think we should disallow both `-i` and `-I`, like the `pyspark` and `sparkr` shells (we allowed a file before, but not anymore).


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Yup, that's what I thought - and treat them the same.


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revives '-i' option in spark-shell

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
That's exactly what I initially thought. At least we could do that in 3.0.0, but I think we should keep it as-is at 2.4.x.


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r229981376
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3905,6 +3905,47 @@ object functions {
 withExpr(SchemaOfCsv(csv.expr, options.asScala.toMap))
   }
 
+  /**
+   * (Scala-specific) Converts a column containing a `StructType` into a 
CSV string
+   * with the specified schema. Throws an exception, in the case of an 
unsupported type.
+   *
+   * @param e a column containing a struct.
+   * @param options options to control how the struct column is converted 
into a CSV string.
+   *It accepts the same options as the CSV data source.
+   *
+   * @group collection_funcs
+   * @since 3.0.0
+   */
+  def to_csv(e: Column, options: Map[String, String]): Column = withExpr {
--- End diff --

Let's get rid of this Scala version for now.


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r229981416
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala ---
@@ -45,7 +45,6 @@ class CsvFunctionsSuite extends QueryTest with 
SharedSQLContext {
   Row(Row(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))
   }
 
-
--- End diff --

Seems like a mistake.


---




[GitHub] spark pull request #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22626#discussion_r229981271
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala
 ---
@@ -174,3 +176,66 @@ case class SchemaOfCsv(
 
   override def prettyName: String = "schema_of_csv"
 }
+
+/**
+ * Converts a [[StructType]] to a CSV output string.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(expr[, options]) - Returns a CSV string with a given 
struct value",
+  examples = """
+Examples:
+  > SELECT _FUNC_(named_struct('a', 1, 'b', 2));
+   1,2
+  > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
+   "26/08/2015"
+  """,
+  since = "3.0.0")
+// scalastyle:on line.size.limit
+case class StructsToCsv(
+ options: Map[String, String],
+ child: Expression,
+ timeZoneId: Option[String] = None)
--- End diff --

Seems like an indentation mistake.
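
Per the usual Spark style, that would look like (sketch):

```scala
case class StructsToCsv(
    options: Map[String, String],
    child: Expression,
    timeZoneId: Option[String] = None)
  extends UnaryExpression with TimeZoneAwareExpression
  with CodegenFallback with ExpectsInputTypes {
  // ...
}
```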


---




[GitHub] spark issue #22626: [SPARK-25638][SQL] Adding new function - to_csv()

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22626
  
@MaxGekk, BTW, I added `schema_of_csv` on the R side. You can add `schema_of_json` on the R side in a similar way.


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revives '-i' option in spark-shell

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
cc @felixcheung as well. This is a similar code path to https://github.com/apache/zeppelin/pull/3206


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revives '-i' option in spark-shell

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
Note that some commands like `:replay` also wouldn't work if the `-i` option is given in the shell, since `:paste` itself looks broken. This PR fixes that as well.


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Revive -i option in spark-shell

2018-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22919
  
cc @dbtsai, @dongjoon-hyun, @gatorsmile and @cloud-fan 


---




[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Revive -i option in spark-sh...

2018-11-01 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22919

[SPARK-25906][SHELL] Revive -i option in spark-shell

## What changes were proposed in this pull request?

Looks like we mistakenly removed the `-i` option from `spark-shell`. This PR targets restoring the previous option and behaviour.

The root cause seems to be 
https://github.com/scala/scala/commit/99dad60d984d3f72338f3bad4c4fe905090edd51. 
They changed what `-i` means in that commit; the previous `-i` behaviour is now under `-I`.

The _newly replaced_ option `-i` at Scala 2.11.12 works like `:paste` (previously it worked like `:load`). `:paste` doesn't seem to work so far - at least I verified Spark 2.4.0, 2.3.2, 2.0.0, and the current master:

```
scala> :paste test.scala
Pasting file test.scala...
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
       spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
                                                                                ^
```

Note that `./bin/spark-shell --help` does not describe this option, so I guess it's not explicitly documented; however, it's best not to break it. The changes are only two lines.

## How was this patch tested?

Manually tested.


With the input below:

```
$ cat test.scala
spark.version
case class Record(key: Int, value: String)
spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
```

**Spark 2.3.2:**

```scala
$ bin/spark-shell -i test.scala
...
+---+-----+
|key|value|
+---+-----+
|  1|val_1|
|  2|val_2|
+---+-----+
```

**Before:**

```scala
$ bin/spark-shell -i test.scala
...
test.scala:17: error: value toDF is not a member of org.apache.spark.rdd.RDD[Record]
Error occurred in an application involving default arguments.
       spark.sparkContext.parallelize((1 to 2).map(i => Record(i, s"val_$i"))).toDF.show
```

**After:**

```scala
./bin/spark-shell -i test.scala
...
+---+-----+
|key|value|
+---+-----+
|  1|val_1|
|  2|val_2|
+---+-----+
```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25906

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22919.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22919


commit 17b3c6c5ad4a39b1ecfd33111c392de47c6df967
Author: hyukjinkwon 
Date:   2018-11-01T08:53:39Z

Revive -i option in spark-shell




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22912: [SPARK-25901][CORE] Use only one thread in BarrierTaskCo...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22912
  
cc @jiangxb1987 and @mengxr 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r229933689
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala
 ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides 
some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

If it's supposed to be exposed as an interface to external datasources, 
then I wouldn't even add this one. It looks like a rough guess that it can be 
generalised.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r229933544
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -306,7 +306,15 @@ case class FileSourceScanExec(
   withOptPartitionCount
 }
 
-withSelectedBucketsCount
+val withOptColumnCount = relation.fileFormat match {
+  case columnar: ColumnarFileFormat =>
+val sqlConf = relation.sparkSession.sessionState.conf
+val columnCount = columnar.columnCountForSchema(sqlConf, 
requiredSchema)
+withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

Is this something we really should include in the metadata? If the purpose 
of this is to check whether the column pruning works or not, logging should be 
good enough. Adding a trait for it sounds like overkill for the current status. 
Let's not add an abstraction just on a rough guess that it can be generalised.
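
To illustrate, a minimal sketch of the logging alternative (names are 
illustrative, assuming code living inside Spark's codebase, not an actual 
Spark change):

```scala
import org.apache.spark.internal.Logging

object ColumnPruningDebug extends Logging {
  // Report how many columns survive pruning instead of threading a new
  // trait through the scan node's metadata.
  def report(requiredColumns: Int, totalColumns: Int): Unit =
    logInfo(s"Column pruning kept $requiredColumns of $totalColumns columns")
}
```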


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r229932338
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala
 ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides 
some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

And I would actually make it `private[datasources]` since it's currently 
only used in ParquetFileFormat.
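
A quick sketch of that suggestion, with the signature taken from the 
proposed diff (the trait body is otherwise hypothetical):

```scala
package org.apache.spark.sql.execution.datasources

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

// Visible only within the datasources package, matching its single
// current user (ParquetFileFormat).
private[datasources] trait ColumnarFileFormat {
  def columnCountForSchema(conf: SQLConf, schema: StructType): Int
}
```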


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22905#discussion_r229932234
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala
 ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides 
some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

It's already in a private package. `private[sql]` can be removed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
Argh, sorry, it was my mistake.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
Ah no, I am sorry @MaxGekk. I mistakenly set the primary author to me. 
It showed my email first.

```
=== Pull Request #22666 ===
title   [SPARK-25672][SQL] schema_of_csv() - schema inference from an example
source  MaxGekk/schema_of_csv-function
target  master
url https://api.github.com/repos/apache/spark/pulls/22666

Proceed with merging pull request #22666? (y/n): y
git fetch apache-github pull/22666/head:PR_TOOL_MERGE_PR_22666
From https://github.com/apache/spark
 * [new ref] refs/pull/22666/head -> PR_TOOL_MERGE_PR_22666
git fetch apache master:PR_TOOL_MERGE_PR_22666_MASTER
remote: Counting objects: 303, done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 209 (delta 91), reused 0 (delta 0)
Receiving objects: 100% (209/209), 91.89 KiB | 445.00 KiB/s, done.
Resolving deltas: 100% (91/91), completed with 65 local objects.
From https://git-wip-us.apache.org/repos/asf/spark
 * [new branch]  master -> PR_TOOL_MERGE_PR_22666_MASTER
   57eddc7182e..c5ef477d2f6  master -> apache/master
git checkout PR_TOOL_MERGE_PR_22666_MASTER
Switched to branch 'PR_TOOL_MERGE_PR_22666_MASTER'
['git', 'merge', 'PR_TOOL_MERGE_PR_22666', '--squash']
Automatic merge went well; stopped before committing as requested
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%an <%ae>']
Enter primary author in the format of "name " [hyukjinkwon ]: hyukjinkwon 
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%h [%an] %s']
```

Looks like the commit order affects the name appearing for `Enter primary 
author in the format of "name "`.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22901: [SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22901
  
Late LGTM!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22895: [SPARK-25886][SQL][Minor] Improve error message of `Fail...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22895
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22902: [SPARK-25893][SQL] Show a directional error message for ...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22902
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
Thanks for confirmation.


---


[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22666
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22895: [SPARK-25886][SQL][Minor] Improve error message of `Fail...

2018-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22895
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
I also tested this patch against Spark RC5.


---


[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
BTW, I tested this via my Travis CI - 
https://travis-ci.org/HyukjinKwon/zeppelin/builds/448215776. The tests seem 
to have passed.


---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-30 Thread HyukjinKwon
GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/zeppelin/pull/3206

[WIP][ZEPPELIN-3810] Support Spark 2.4

### What is this PR for?

Spark 2.4 changed its Scala version from 2.11.8 to 2.11.12 (see 
SPARK-24418).

There are two problems for this upgrade at Zeppelin side:

1. Some methods that are private but were accessed by reflection, for 
instance `loopPostInit`, became inaccessible.

See:
 - 
https://github.com/scala/scala/blob/v2.11.8/src/repl/scala/tools/nsc/interpreter/ILoop.scala
 - 
https://github.com/scala/scala/blob/v2.11.12/src/repl/scala/tools/nsc/interpreter/ILoop.scala

To work around this, I manually ported `loopPostInit` from 2.11.8 to retain 
the behaviour. Some functions that exist in both Scala 2.11.8 and Scala 
2.11.12 are invoked inside the new `loopPostInit` by reflection.
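
A minimal sketch of that reflection pattern (the helper and any method names 
here are illustrative, not the actual Zeppelin code):

```scala
// Invoke a member that became inaccessible in Scala 2.11.12 via reflection.
def callPrivate(target: AnyRef, name: String, args: AnyRef*): AnyRef = {
  val method = target.getClass.getDeclaredMethods
    .find(_.getName == name)
    .getOrElse(throw new NoSuchMethodException(name))
  method.setAccessible(true) // bypass the private modifier
  method.invoke(target, args: _*)
}
```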


2. Upgrading from 2.11.8 to 2.11.12 requires a `jline.version` upgrade. 
Otherwise, we will hit:
```
Caused by: java.lang.NoSuchMethodError: jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
  at scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139)
```

To work around this, I upgraded jline from `2.12.1` to `2.14.3`.
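
A sketch of that bump, assuming the version is controlled by a Maven 
property as the name `jline.version` suggests:

```xml
<properties>
  <!-- was 2.12.1; 2.14.3 provides setPrintSpaceAfterFullCompletion -->
  <jline.version>2.14.3</jline.version>
</properties>
```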


### What type of PR is it?
[Improvement]

### Todos
* [ ] - Wait until Spark 2.4.0 is officially released.

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-3810

### How should this be tested?

Verified manually against Spark 2.4.0 RC3

### Questions:
* Does the licenses files need update? Yes
* Is there breaking changes for older versions? No
* Does this needs documentation? No


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/zeppelin ZEPPELIN-3810

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zeppelin/pull/3206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3206


commit 9ac1797a7e175a73350d61a0e62b3e4f89b29b8f
Author: hyukjinkwon 
Date:   2018-10-17T14:41:29Z

Support Spark 2.4




---


[GitHub] zeppelin pull request #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/zeppelin/pull/3206


---


[GitHub] spark pull request #22891: [SPARK-25881][pyspark] df.toPandas() convert deci...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22891#discussion_r229564044
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -2064,6 +2064,7 @@ def toDF(self, *cols):
 @since(1.3)
 def toPandas(self):
 """
+:param coerce_float: default False, if Ture, will handle decimal 
type to np.float64 instand of type object.
--- End diff --

There's no such param. Also, we don't convert it to string. And BTW, the 
Arrow optimization code path should be fixed as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22895#discussion_r229563551
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
   case NonFatal(e) => parseMode match {
 case PermissiveMode => nullResultRow
 case FailFastMode =>
-  throw new SparkException("Malformed records are detected in 
record parsing. " +
+  val msg = if (e.getMessage != null) {
+e.getMessage + "\n"
+  } else {
+""
+  }
+  throw new SparkException(msg + "Malformed records are detected 
in record parsing. " +
 s"Current parse Mode: ${FailFastMode.name}. To process 
malformed records as null " +
-"result, try setting the option 'mode' as 'PERMISSIVE'.", 
e.getCause)
+"result, try setting the option 'mode' as 'PERMISSIVE'.", e)
--- End diff --

How about we just remove the `msg` change? If I am not mistaken, usually the 
exception is just attached as the cause alone.
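
A minimal sketch of what that looks like (hand-written here, not the actual 
diff):

```scala
import org.apache.spark.SparkException

// The underlying exception rides along as the cause; its message shows up
// in the stack trace without being prepended by hand.
def fail(e: Throwable): Nothing =
  throw new SparkException(
    "Malformed records are detected in record parsing. Current parse " +
    "Mode: FAILFAST. To process malformed records as null result, try " +
    "setting the option 'mode' as 'PERMISSIVE'.", e)
```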


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22891: [SPARK-25881][pyspark] df.toPandas() convert decimal to ...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22891
  
I don't think we should map decimals to float. It loses precision and it's a 
breaking change.
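
For example, a quick sketch of the precision loss:

```scala
// Digits beyond double's ~17 significant figures are silently dropped.
val d = BigDecimal("0.12345678901234567890123456789")
println(d.toDouble) // 0.12345678901234568
```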


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22895#discussion_r229562284
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
   case NonFatal(e) => parseMode match {
 case PermissiveMode => nullResultRow
 case FailFastMode =>
-  throw new SparkException("Malformed records are detected in 
record parsing. " +
+  val msg = if (e.getMessage != null) {
+e.getMessage + "\n"
+  } else {
+""
+  }
+  throw new SparkException(msg + "Malformed records are detected 
in record parsing. " +
 s"Current parse Mode: ${FailFastMode.name}. To process 
malformed records as null " +
-"result, try setting the option 'mode' as 'PERMISSIVE'.", 
e.getCause)
+"result, try setting the option 'mode' as 'PERMISSIVE'.", e)
--- End diff --

Adding the cause looks fine. I was actually wondering whether adding `msg` 
is necessary.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22896: [SPARKR]found some extra whitespace in the R tests

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22896
  
@shaneknapp, I already merged this into the master branch 
(https://github.com/apache/spark/commit/243ce319a06f20365d5b08d479642d75748645d9)
 :-). The AppVeyor tests passed 
(https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/19932574).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22896: [SPARKR]found some extra whitespace in the R tests

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22896
  
@shaneknapp, I just got this in since it's orthogonal to this PR :-).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22896: [SPARKR]found some extra whitespace in the R tests

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22896
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22844
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22895#discussion_r229539289
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
   case NonFatal(e) => parseMode match {
 case PermissiveMode => nullResultRow
 case FailFastMode =>
-  throw new SparkException("Malformed records are detected in 
record parsing. " +
+  val msg = if (e.getMessage != null) {
+e.getMessage + "\n"
+  } else {
+""
+  }
--- End diff --

I think it's okay to inline this if-else.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22895#discussion_r229539055
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
   case NonFatal(e) => parseMode match {
 case PermissiveMode => nullResultRow
 case FailFastMode =>
-  throw new SparkException("Malformed records are detected in 
record parsing. " +
+  val msg = if (e.getMessage != null) {
+e.getMessage + "\n"
+  } else {
+""
+  }
+  throw new SparkException(msg + "Malformed records are detected 
in record parsing. " +
 s"Current parse Mode: ${FailFastMode.name}. To process 
malformed records as null " +
-"result, try setting the option 'mode' as 'PERMISSIVE'.", 
e.getCause)
+"result, try setting the option 'mode' as 'PERMISSIVE'.", e)
--- End diff --

@gengliangwang, looks okay. How does the error message look before/after, 
BTW?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22844
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22844: [SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use ...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22844
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16429
  
Install the Spark build I mentioned above.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22893
  
How much performance does it gain in an end-to-end test, and how does it 
provide better performance?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem

2018-10-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22893
  
Also, please fill in the PR description. How much performance does it gain?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


