[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23262
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5879/
Test PASSed.


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23262
  
**[Test build #99861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99861/testReport)** for PR 23262 at commit [`ddb2528`](https://github.com/apache/spark/commit/ddb252892a439281b16bc14fdfdb7faf756f1067).


---




[GitHub] spark pull request #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to ...

2018-12-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23256#discussion_r239997109
  
--- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R ---
@@ -84,19 +84,20 @@ test_that("spark.fpGrowth", {
 })
 
 test_that("spark.prefixSpan", {
-df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
-  list(list(list(1L), list(3L, 2L), list(1L, 2L))),
-  list(list(list(1L, 2L), list(5L))),
-  list(list(list(6L, schema = c("sequence"))
-result1 <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L,
-maxLocalProjDBSize = 3200L)
-
-expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
-list(list(list(3L)), 2L),
-list(list(list(2L)), 3L),
-list(list(list(1L, 2L)), 3L),
-list(list(list(1L), list(3L)), 2L)),
-schema = c("sequence", "freq"))
-  })
+  df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+ list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+ list(list(list(1L, 2L), list(5L))),
+ list(list(list(6L,
+schema = c("sequence"))
+  result <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L,
+ maxLocalProjDBSize = 3200L)
+
+  expected_result <- createDataFrame(list(list(list(list(1L)), 3L), list(list(list(3L)), 2L),
+  list(list(list(2L)), 3L), list(list(list(1L, 2L)), 3L),
+  list(list(list(1L), list(3L)), 2L)),
+ schema = c("sequence", "freq"))
+
+  expect_equivalent(expected_result, result)
--- End diff --

this is an important fix..


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23262
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23261
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23261
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99859/
Test FAILed.


---




[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23261
  
**[Test build #99859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99859/testReport)** for PR 23261 at commit [`56805d6`](https://github.com/apache/spark/commit/56805d60004142106b400b94eb94f8cf87486494).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class TransformStart(transformer: Transformer, input: Dataset[_]) extends MLEvent`
  * `case class TransformEnd(transformer: Transformer, output: Dataset[_]) extends MLEvent`
  * `case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent`
  * `case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent`
  * `case class LoadInstanceStart[T](reader: MLReader[T], path: String) extends MLEvent`
  * `case class LoadInstanceEnd[T](reader: MLReader[T], instance: T) extends MLEvent`
  * `case class SaveInstanceStart(writer: MLWriter, path: String) extends MLEvent`
  * `case class SaveInstanceEnd(writer: MLWriter, path: String) extends MLEvent`


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-07 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/23262

[SPARK-26312][SQL] Converting converters in RDDConversions into arrays to improve their access performance


## What changes were proposed in this pull request?

`RDDConversions` gets disproportionately slower as the number of columns in the query increases.
This PR converts the `converters` in `RDDConversions` into arrays to improve their access performance: the previous type of `converters` was `scala.collection.immutable.::`, a subtype of `List`, whose indexed access is O(n).

The test in `PrunedScanSuite` for 2000 columns and 20k rows takes 409 seconds before this PR, and 361 seconds after.
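For illustration only, here is a minimal sketch (not the actual Spark code; names are made up) of the access-cost difference the change exploits:

```scala
// Per-row conversion loop over per-column converters.
// List(i) walks i cons cells (O(i) per access); Array(i) is O(1).
val converters: List[Any => Any] = List.fill(2000)((x: Any) => x)
val converterArray: Array[Any => Any] = converters.toArray

def convertRow(row: Array[Any]): Array[Any] = {
  val out = new Array[Any](row.length)
  var i = 0
  while (i < row.length) {
    out(i) = converterArray(i)(row(i)) // constant-time lookup per column
    i += 1
  }
  out
}
```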

## How was this patch tested?

Test case of `PrunedScanSuite` 

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark toarray

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23262.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23262


commit ddb252892a439281b16bc14fdfdb7faf756f1067
Author: 10129659 
Date:   2018-12-08T07:15:10Z

Converting converters in RDDConversions into arrays to improve their access 
performance




---




[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99850/
Test PASSed.


---




[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

2018-12-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22707#discussion_r239996822
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
   }
 }
   }
+
+  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
++ "should remove the historical partition first") {
+withTempDir { tmpDir =>
+  withTable("test_table") {
+(0 until 3).foreach { _ =>
+  sql("DROP TABLE IF EXISTS test_table")
+  sql(
+s"""
+   |CREATE EXTERNAL TABLE test_table (key int)
+   |PARTITIONED BY (p int)
+   |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
--- End diff --

nit: `|LOCATION '${tmpDir.toURI}'`?


---




[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23252
  
**[Test build #99850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99850/testReport)** for PR 23252 at commit [`662e38e`](https://github.com/apache/spark/commit/662e38e6f1a01297d75132b088afb6a6a761f70a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22707
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5878/
Test PASSed.


---




[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22707
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22707
  
**[Test build #99860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99860/testReport)** for PR 22707 at commit [`89f7223`](https://github.com/apache/spark/commit/89f7223593b171133bb716b0004aa13a592dee6a).


---




[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

2018-12-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22707#discussion_r239996528
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with 
TestHiveSingleton with BeforeAndAfter
   }
 }
   }
+
+  test("SPARK-25717: Insert overwrite a recreated external and partitioned 
table "
--- End diff --

Can you make the title shorter so it fits on a single line?


---




[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

2018-12-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22707
  
ok to test


---




[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23260
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23260
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99857/
Test PASSed.


---




[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23260
  
**[Test build #99857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99857/testReport)** for PR 23260 at commit [`65cc6a3`](https://github.com/apache/spark/commit/65cc6a32729cccba340f66c766c7255be4d7f356).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #23261: [SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23261
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23224
  
**[Test build #99858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99858/testReport)** for PR 23224 at commit [`71e569b`](https://github.com/apache/spark/commit/71e569b2a89ff0fe83b3f94fee6f57596c093590).


---




[GitHub] spark issue #23261: [SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23261
  
**[Test build #99859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99859/testReport)** for PR 23261 at commit [`56805d6`](https://github.com/apache/spark/commit/56805d60004142106b400b94eb94f8cf87486494).


---




[GitHub] spark issue #23261: [SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23261
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5877/
Test PASSed.


---




[GitHub] spark pull request #23261: [SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23261

[SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track them and add some actions for them.

## Introduction

This PR proposes to send some ML events, as SQL does. This is quite useful when people want to track corresponding ML operations and take actions on them. For instance, I have been working on integrating Spark with Atlas, and with some custom changes on top of this PR, I can visualise an ML pipeline as below:


![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say, tracking SQL operations has proven useful. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now).

## Implementation Details

### Sends events (but does not expose an ML-specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...Events

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```
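For illustration, a self-contained sketch of the wrapper pattern described above. The real PR posts to `SparkContext`'s `listenerBus` (accessible only inside Spark's own packages); a trivial stand-in bus is used here so the sketch runs as-is, and all names are illustrative:

```scala
sealed trait MLEvent
case class FitStart(estimatorName: String) extends MLEvent
case class FitEnd(estimatorName: String) extends MLEvent

object MLEventsSketch {
  private val listeners = scala.collection.mutable.ListBuffer.empty[MLEvent => Unit]
  def addListener(l: MLEvent => Unit): Unit = listeners += l
  private def post(event: MLEvent): Unit = listeners.foreach(_(event))

  // Wraps a body with start/end events, mirroring the `with...Event` wrappers above.
  def withFitEvent[M](estimatorName: String)(body: => M): M = {
    post(FitStart(estimatorName))
    val model = body
    post(FitEnd(estimatorName))
    model
  }
}
```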

This approach mimics both:

**1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added
- It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO that's invasive. Also, exposing another ML-specific listener sounds like too much for the current status. Therefore, I simply borrowed the file name and structure here.

**2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events

The current approach is rather close to this: it has a `with...` wrapper to send events. I borrowed this approach to be consistent.


### Add `...Impl` methods to wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

  Note that `saveImpl` was already implemented, unlike the other instances below.


```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl() }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl() }
+ def transformImpl(...): DataFrame
```

This approach follows the existing pattern in ML, as below:

**1. `transform` and `transformImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L190-L196


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

This keeps both source and binary backward compatibility. I considered enforcing `...Impl` by leaving them as abstract methods, but decided to give them a default body that throws `UnsupportedOperationException`, so that we can keep full source and binary compatibility.
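A self-contained sketch of this compatibility trick (names assumed, not the actual Spark code): the public method stays concrete and wraps a protected `...Impl` whose default body throws, so existing subclasses that only override the public method still compile and link:

```scala
abstract class EstimatorSketch[M] {
  // Public entry point: posts events around the real work.
  def fit(data: Seq[Double]): M = {
    println(s"FitStart(${getClass.getSimpleName})") // stand-in for posting an event
    val model = fitImpl(data)
    println(s"FitEnd(${getClass.getSimpleName})")
    model
  }

  // Default body throws instead of being abstract, preserving source and
  // binary compatibility for subclasses written before this change.
  protected def fitImpl(data: Seq[Double]): M =
    throw new UnsupportedOperationException("fitImpl is not implemented")
}
```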

- From the user-facing API perspective, there's no difference: `...Impl` methods are protected and not visible to end users.

- From the developer API perspective, if some developers want to override the `...` methods instead of `...Impl`, that's still fine. It only does n

[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23260
  
**[Test build #99857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99857/testReport)** for PR 23260 at commit [`65cc6a3`](https://github.com/apache/spark/commit/65cc6a32729cccba340f66c766c7255be4d7f356).


---




[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23260
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23260
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #23260: [SPARK-26311][YARN] New feature: custom log URL f...

2018-12-07 Thread HeartSaVioR
GitHub user HeartSaVioR opened a pull request:

https://github.com/apache/spark/pull/23260

[SPARK-26311][YARN] New feature: custom log URL for stdout/stderr

## What changes were proposed in this pull request?

This patch proposes adding a new configuration for YARN mode: a custom log URL. This enables end users to point application logs to an external log service, which can keep serving logs even when the NodeManager becomes unavailable.

Some pre-defined patterns are available for the custom log URL; they can be specified like path variables.

## How was this patch tested?

Manual test.

The run below changes the executor log URLs in the UI pages.

```
./bin/spark-submit --conf spark.yarn.custom.log.url="{{HttpScheme}}{{NodeHttpAddress}}/test/cluster/{{ClusterId}}/container/{{ContainerId}}/user/{{User}}/filename/{{FileName}}" --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-.jar
```

An example stdout log URL is below:


`http://node-address:node-port`/test/cluster/`workload1`/container/`container_e08_1542798098040_0012_01_02`/user/`spark`/filename/`stdout`
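To make the substitution concrete, here is a minimal sketch (not the actual implementation) of how the `{{Pattern}}` placeholders could be resolved, using the values from the example above:

```scala
// Resolve {{Key}} placeholders in a custom log URL template.
val template = "{{HttpScheme}}{{NodeHttpAddress}}/test/cluster/{{ClusterId}}" +
  "/container/{{ContainerId}}/user/{{User}}/filename/{{FileName}}"

val values = Map(
  "HttpScheme"      -> "http://",
  "NodeHttpAddress" -> "node-address:node-port",
  "ClusterId"       -> "workload1",
  "ContainerId"     -> "container_e08_1542798098040_0012_01_02",
  "User"            -> "spark",
  "FileName"        -> "stdout")

val resolved = values.foldLeft(template) { case (url, (key, value)) =>
  url.replace(s"{{$key}}", value)
}
// resolved == the example stdout URL shown above
```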


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HeartSaVioR/spark SPARK-26311

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23260


commit 65cc6a32729cccba340f66c766c7255be4d7f356
Author: Jungtaek Lim (HeartSaVioR) 
Date:   2018-12-08T06:32:46Z

[SPARK-26311][YARN] New feature: custom log URL for stdout/stderr




---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239995952
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// Range(nodeId = 2)
+val df = spark.range(9, -1, -1).sort('id).toDF()
--- End diff --

Using `Seq` instead of `Range` makes things simpler and more readable. I'll change it to use `Seq`.


---




[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23258
  
**[Test build #99856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99856/testReport)** for PR 23258 at commit [`6e36336`](https://github.com/apache/spark/commit/6e3633620af4e624c8e562ff290e4264cf33eb8b).


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239995901
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// Range(nodeId = 2)
+val df = spark.range(9, -1, -1).sort('id).toDF()
+testSparkPlanMetrics(df, 2, Map.empty)
+df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).getOrElse(assert(false))
--- End diff --

OK, I think this is more readable. Fixed.


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239995882
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -26,7 +26,7 @@ import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.aggregate.{Final, Partial}
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
-import org.apache.spark.sql.execution.{FilterExec, RangeExec, SparkPlan, 
WholeStageCodegenExec}
+import org.apache.spark.sql.execution._
--- End diff --

Ok, unfolded.


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-07 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r239995006
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -38,13 +38,21 @@ case class CollectLimitExec(limit: Int, child: 
SparkPlan) extends UnaryExecNode
   override def outputPartitioning: Partitioning = SinglePartition
   override def executeCollect(): Array[InternalRow] = 
child.executeTake(limit)
   private val serializer: Serializer = new 
UnsafeRowSerializer(child.output.size)
-  override lazy val metrics = 
SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
+  private lazy val writeMetrics =
+SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
+  private lazy val readMetrics =
+SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
--- End diff --

Yea, that was done and reverted in https://github.com/apache/spark/pull/23207/commits/7d104ebe854effb3d8ceb63cd408b9749cee1a8a; I will separate it into another PR after this one.


---




[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22952
  
**[Test build #99855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99855/testReport)** for PR 22952 at commit [`998e769`](https://github.com/apache/spark/commit/998e769c3b552c39736af7814f60be895dbd90d4).


---




[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22952
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99851/
Test FAILed.


---




[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22952
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22952
  
**[Test build #99851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99851/testReport)** for PR 22952 at commit [`22dce0c`](https://github.com/apache/spark/commit/22dce0c1d17de360c0d887a0352c1e8f57761db7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path,`


---




[GitHub] spark pull request #23242: [SPARK-26285][CORE] accumulator metrics sources f...

2018-12-07 Thread redsanket
Github user redsanket commented on a diff in the pull request:

https://github.com/apache/spark/pull/23242#discussion_r239994508
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/AccumulatorMetricsTest.scala 
---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples
+
+import org.apache.spark.metrics.source.{DoubleAccumulatorSource, 
LongAccumulatorSource}
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Usage: AccumulatorMetricsTest [partitions] [numElem] [blockSize]
+ */
+object AccumulatorMetricsTest {
--- End diff --

Naming an example `Test` is a bit confusing, I think... thoughts?
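For reference, a hedged sketch of how the source in this example might be used (the `register` signature is assumed from the quoted imports, not verified):

```scala
import org.apache.spark.metrics.source.LongAccumulatorSource
import org.apache.spark.sql.SparkSession

object AccumulatorMetricsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("accumulator-metrics-sketch").getOrCreate()
    val sc = spark.sparkContext
    val counter = sc.longAccumulator("my-counter")
    // Register the accumulator so its value is reported through the metrics system.
    LongAccumulatorSource.register(sc, Map("my-counter" -> counter))
    sc.parallelize(1 to 100).foreach(counter.add(_))
    spark.stop()
  }
}
```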


---




[GitHub] spark pull request #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reser...

2018-12-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/23259#discussion_r239994423
  
--- Diff: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -769,7 +774,7 @@ nonReserved
 | REVOKE | GRANT | LOCK | UNLOCK | MSCK | REPAIR | RECOVER | EXPORT | 
IMPORT | LOAD | VALUES | COMMENT | ROLE
 | ROLES | COMPACTIONS | PRINCIPALS | TRANSACTIONS | INDEX | INDEXES | 
LOCKS | OPTION | LOCAL | INPATH
 | ASC | DESC | LIMIT | RENAME | SETS
-| AT | NULLS | OVERWRITE | ALL | ANY | ALTER | AS | BETWEEN | BY | 
CREATE | DELETE
+| AT | NULLS | OVERWRITE | ANY | ALTER | AS | BETWEEN | BY | CREATE | 
DELETE
--- End diff --

yea, thanks. you're right.


---




[GitHub] spark pull request #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reser...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23259#discussion_r239994385
  
--- Diff: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -769,7 +774,7 @@ nonReserved
 | REVOKE | GRANT | LOCK | UNLOCK | MSCK | REPAIR | RECOVER | EXPORT | 
IMPORT | LOAD | VALUES | COMMENT | ROLE
 | ROLES | COMPACTIONS | PRINCIPALS | TRANSACTIONS | INDEX | INDEXES | 
LOCKS | OPTION | LOCAL | INPATH
 | ASC | DESC | LIMIT | RENAME | SETS
-| AT | NULLS | OVERWRITE | ALL | ANY | ALTER | AS | BETWEEN | BY | 
CREATE | DELETE
+| AT | NULLS | OVERWRITE | ANY | ALTER | AS | BETWEEN | BY | CREATE | 
DELETE
--- End diff --

Doesn't `ANY` move to `reserved`?


---




[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23253#discussion_r239994326
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - In Spark version 2.4 and earlier, CSV datasource converts a malformed 
CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, 
returned row can contain non-`null` fields if some of CSV column values were 
parsed and converted to desired types successfully.
 
+  - In Spark version 2.4 and earlier, JSON datasource and JSON functions 
like `from_json` convert a bad JSON record to a row with all `null`s in the 
PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, 
returned row can contain non-`null` fields if some of JSON column values were 
parsed and converted to desired types successfully.
+
--- End diff --

What if there is no corrupt column?
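For context, the quoted migration note can be illustrated with a hypothetical malformed record (values made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

val spark = SparkSession.builder().master("local[1]").appName("from-json-sketch").getOrCreate()
import spark.implicits._

val schema = new StructType().add("a", IntegerType).add("b", IntegerType)
// "b" cannot be converted to an integer, so the record is malformed.
val df = Seq("""{"a": 1, "b": "oops"}""").toDF("json")
  .select(from_json($"json", schema).alias("parsed"))
// Spark 2.4 (PERMISSIVE): parsed is a row of all nulls.
// Spark 3.0: parsed keeps the successfully converted field: [1, null].
df.show(truncate = false)
```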


---




[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23259
  
**[Test build #99854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99854/testReport)** for PR 23259 at commit [`01bc383`](https://github.com/apache/spark/commit/01bc38347496a6194b46ace0feb7d2cd1adb614e).


---




[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23259
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5876/
Test PASSed.


---




[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23259
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...

2018-12-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/23259
  
To help us discuss this topic smoothly, I made this PR.
Any comments/suggestions are welcome.

cc: @gatorsmile @cloud-fan @viirya 


---




[GitHub] spark pull request #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reser...

2018-12-07 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/23259

[SPARK-26215][SQL][WIP] Define reserved/non-reserved keywords based on the 
ANSI SQL standard

## What changes were proposed in this pull request?
This pr targeted to define reserved/non-reserved keywords based on the ANSI 
SQL standard.

TODO:
 - Where should we handle reserved keywords?
 - Which SQL standard does Spark SQL follow (e.g., 2011 or 2016)?
 - Where should we document the list of reserved/non-reserved keywords?
 - Others?

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-26215-WIP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23259


commit 01bc38347496a6194b46ace0feb7d2cd1adb614e
Author: Takeshi Yamamuro 
Date:   2018-12-06T08:04:49Z

WIP: SQL Reserved/Non-Reserved Key Words




---




[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-12-07 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20146
  
Thanks @holdenk for reviewing! I've resolved some comments and replied to the others. Please take a look. Thanks.


---




[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239994064
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, 
value)
+
+  private def countByValue(
+  dataset: Dataset[_],
+  inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+val aggregator = new StringIndexerAggregator(inputCols.length)
+implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+  .toDF
+  .groupBy().agg(aggregator.toColumn)
+  .as[Array[OpenHashMap[String, Long]]]
+  .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
 transformSchema(dataset.schema, logging = true)
-val values = dataset.na.drop(Array($(inputCol)))
-  .select(col($(inputCol)).cast(StringType))
-  .rdd.map(_.getString(0))
-val labels = $(stringOrderType) match {
-  case StringIndexer.frequencyDesc => 
values.countByValue().toSeq.sortBy(-_._2)
-.map(_._1).toArray
-  case StringIndexer.frequencyAsc => 
values.countByValue().toSeq.sortBy(_._2)
-.map(_._1).toArray
-  case StringIndexer.alphabetDesc => 
values.distinct.collect.sortWith(_ > _)
-  case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ 
< _)
-}
-copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+val (inputCols, _) = getInOutCols()
+
+val filteredDF = dataset.na.drop(inputCols)
+
+// In case of equal frequency when frequencyDesc/Asc, we further sort 
the strings by alphabet.
+val labelsArray = $(stringOrderType) match {
+  case StringIndexer.frequencyDesc =>
+countByValue(filteredDF, inputCols).map { counts =>
+  counts.toSeq.sortBy(_._1).sortBy(-_._2).map(_._1).toArray
--- End diff --

As one sorts by string and the other sorts by count, can we replace them with a single `sortBy` on a compound expression?
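A tiny, self-contained illustration of the equivalence being suggested (made-up data):

```scala
val counts = Seq(("b", 3L), ("a", 3L), ("c", 1L))
// Two stable sorts: by label, then by descending count ...
val twoPass = counts.sortBy(_._1).sortBy(-_._2).map(_._1)
// ... give the same order as one sort on a compound (-count, label) key.
val onePass = counts.sortBy { case (label, count) => (-count, label) }.map(_._1)
assert(twoPass == onePass) // both: List(a, b, c)
```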


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239993927
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// Range(nodeId = 2)
+val df = spark.range(9, -1, -1).sort('id).toDF()
--- End diff --

Either is fine to me, as we now add an assert to make sure the Sort node exists.


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239993889
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// Range(nodeId = 2)
+val df = spark.range(9, -1, -1).sort('id).toDF()
--- End diff --

Do we need to use `range` here? How about just writing `Seq(1, 3, 2, ...).toDF("id")`?


---




[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20146
  
**[Test build #99853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99853/testReport)** for PR 20146 at commit [`301fa4c`](https://github.com/apache/spark/commit/301fa4cbb4d62d9a180dcbafbe0d2b68dac5a3c8).


---




[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20146
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20146
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5875/
Test PASSed.


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239993718
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// Range(nodeId = 2)
+val df = spark.range(9, -1, -1).sort('id).toDF()
+testSparkPlanMetrics(df, 2, Map.empty)
+df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).getOrElse(assert(false))
--- End diff --


`assert(df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).isDefined)`?


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239993561
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -26,7 +26,7 @@ import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.aggregate.{Final, Partial}
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
-import org.apache.spark.sql.execution.{FilterExec, RangeExec, SparkPlan, 
WholeStageCodegenExec}
+import org.apache.spark.sql.execution._
--- End diff --

nit: unfold?


---




[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...

2018-12-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/23258
  
cc: @mgaido91


---




[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239993480
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, 
value)
+
+  private def countByValue(
+  dataset: Dataset[_],
+  inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+val aggregator = new StringIndexerAggregator(inputCols.length)
+implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+  .toDF
+  .groupBy().agg(aggregator.toColumn)
+  .as[Array[OpenHashMap[String, Long]]]
+  .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
 transformSchema(dataset.schema, logging = true)
-val values = dataset.na.drop(Array($(inputCol)))
-  .select(col($(inputCol)).cast(StringType))
-  .rdd.map(_.getString(0))
-val labels = $(stringOrderType) match {
-  case StringIndexer.frequencyDesc => 
values.countByValue().toSeq.sortBy(-_._2)
-.map(_._1).toArray
-  case StringIndexer.frequencyAsc => 
values.countByValue().toSeq.sortBy(_._2)
-.map(_._1).toArray
-  case StringIndexer.alphabetDesc => 
values.distinct.collect.sortWith(_ > _)
-  case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ 
< _)
-}
-copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+val (inputCols, _) = getInOutCols()
+
+val filteredDF = dataset.na.drop(inputCols)
+
+// In case of equal frequency when frequencyDesc/Asc, we further sort 
the strings by alphabet.
--- End diff --

I'll also add it to ml migration document.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23218
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99849/
Test PASSed.


---




[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23218
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23218
  
**[Test build #99849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99849/testReport)** for PR 23218 at commit [`b667d37`](https://github.com/apache/spark/commit/b667d37e9ee2d8cdce459806925cdc0fe725b7bf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239992942
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -310,11 +439,23 @@ object StringIndexerModel extends 
MLReadable[StringIndexerModel] {
 override def load(path: String): StringIndexerModel = {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
   val dataPath = new Path(path, "data").toString
-  val data = sparkSession.read.parquet(dataPath)
-.select("labels")
-.head()
-  val labels = data.getAs[Seq[String]](0).toArray
-  val model = new StringIndexerModel(metadata.uid, labels)
+
+  val (majorVersion, minorVersion) = 
majorMinorVersion(metadata.sparkVersion)
+  val labelsArray = if (majorVersion < 2 || (majorVersion == 2 && 
minorVersion <= 3)) {
--- End diff --

But this needs to change for Spark 3.0 now.


---




[GitHub] spark pull request #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metric...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23224#discussion_r239992863
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -80,8 +80,10 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
 // Assume the execution plan is
 // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Filter(nodeId = 
1))
 // TODO: update metrics in generated operators
-val ds = spark.range(10).filter('id < 5)
-testSparkPlanMetrics(ds.toDF(), 1, Map.empty)
+val df = spark.range(10).filter('id < 5).toDF()
+testSparkPlanMetrics(df, 1, Map.empty, true)
+
df.queryExecution.executedPlan.find(_.isInstanceOf[WholeStageCodegenExec])
+  .getOrElse(assert(false))
--- End diff --

Thank you @viirya. Very good suggestions.

After investigation, besides the whole-stage codegen related issue, I found another one. #20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375) introduced an optimizer rule to eliminate redundant Sorts. For the test case named "Sort metrics" in `SQLMetricsSuite`, because the range is already sorted, the Sort is removed by `RemoveRedundantSorts`, which makes the test case meaningless. This seems to be a quite different issue, so I opened a new PR. See #23258 for details.


---




[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239992845
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -310,11 +439,23 @@ object StringIndexerModel extends 
MLReadable[StringIndexerModel] {
 override def load(path: String): StringIndexerModel = {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
   val dataPath = new Path(path, "data").toString
-  val data = sparkSession.read.parquet(dataPath)
-.select("labels")
-.head()
-  val labels = data.getAs[Seq[String]](0).toArray
-  val model = new StringIndexerModel(metadata.uid, labels)
+
+  val (majorVersion, minorVersion) = 
majorMinorVersion(metadata.sparkVersion)
+  val labelsArray = if (majorVersion < 2 || (majorVersion == 2 && 
minorVersion <= 3)) {
--- End diff --

This is for loading an old `StringIndexerModel` saved by a previous Spark version. The previous model has `labels`, while the new model has `labelsArray`.


---




[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23258
  
**[Test build #99852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99852/testReport)** for PR 23258 at commit [`408ccf8`](https://github.com/apache/spark/commit/408ccf88325c23c6f2f6119cd6ba99c5056f95a2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23258
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/23258

[SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing

## What changes were proposed in this pull request?
#20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375) introduced an optimizer rule to eliminate redundant Sorts. For the test case named "Sort metrics" in `SQLMetricsSuite`, because the range is already sorted, the Sort is removed by `RemoveRedundantSorts`, which makes the test case meaningless.

This PR modifies the query for testing Sort metrics and checks Sort exists 
in the plan.

## How was this patch tested?
Modify the existing test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark sort-metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23258


commit 408ccf88325c23c6f2f6119cd6ba99c5056f95a2
Author: seancxmao 
Date:   2018-12-08T03:58:21Z

[SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239992579
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private def countByValue(
+  dataset: Dataset[_],
+  inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+val aggregator = new StringIndexerAggregator(inputCols.length)
+implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+  .toDF
+  .groupBy().agg(aggregator.toColumn)
+  .as[Array[OpenHashMap[String, Long]]]
+  .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
 transformSchema(dataset.schema, logging = true)
-val values = dataset.na.drop(Array($(inputCol)))
-  .select(col($(inputCol)).cast(StringType))
-  .rdd.map(_.getString(0))
-val labels = $(stringOrderType) match {
-  case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
-.map(_._1).toArray
-  case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
-.map(_._1).toArray
-  case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
-  case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
-}
-copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+val (inputCols, _) = getInOutCols()
+
+val filteredDF = dataset.na.drop(inputCols)
+
+// In case of equal frequency when frequencyDesc/Asc, we further sort the strings by alphabet.
+val labelsArray = $(stringOrderType) match {
+  case StringIndexer.frequencyDesc =>
+countByValue(filteredDF, inputCols).map { counts =>
+  counts.toSeq.sortBy(_._1).sortBy(-_._2).map(_._1).toArray
+}
+  case StringIndexer.frequencyAsc =>
+countByValue(filteredDF, inputCols).map { counts =>
+  counts.toSeq.sortBy(_._1).sortBy(_._2).map(_._1).toArray
+}
+  case StringIndexer.alphabetDesc =>
+import dataset.sparkSession.implicits._
+inputCols.map { inputCol =>
+  filteredDF.select(inputCol).distinct().sort(dataset(s"$inputCol").desc)
--- End diff --

That's good. I missed this.
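
A self-contained sketch of the tie-breaking trick in the diff above: Scala's `sortBy` is stable, so sorting by label first and then by descending count keeps equal-frequency labels in alphabetical order.

```scala
val counts = Seq("b" -> 2L, "a" -> 2L, "c" -> 5L)
// Stable sort: the alphabetical order from the first sortBy is preserved
// among elements with equal counts in the second sortBy.
val ordered = counts.sortBy(_._1).sortBy(-_._2).map(_._1)
assert(ordered == Seq("c", "a", "b"))  // "a" before "b" despite equal counts
```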


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23196
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23196
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99848/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23196
  
**[Test build #99848 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99848/testReport)**
 for PR 23196 at commit 
[`244654b`](https://github.com/apache/spark/commit/244654b95ae789de83e853f74feade2a66adf432).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239992360
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -79,26 +81,53 @@ private[feature] trait StringIndexerBase extends Params with HasHandleInvalid wi
   @Since("2.3.0")
   def getStringOrderType: String = $(stringOrderType)
 
-  /** Validates and transforms the input schema. */
-  protected def validateAndTransformSchema(schema: StructType): StructType = {
-val inputColName = $(inputCol)
+  /** Returns the input and output column names corresponding in pair. */
+  private[feature] def getInOutCols(): (Array[String], Array[String]) = {
+ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol), Seq(outputCols))
+
+if (isSet(inputCol)) {
+  (Array($(inputCol)), Array($(outputCol)))
+} else {
+  require($(inputCols).length == $(outputCols).length,
+"The number of input columns does not match output columns")
+  ($(inputCols), $(outputCols))
+}
+  }
+
+  private def validateAndTransformField(
+  schema: StructType,
+  inputColName: String,
+  outputColName: String): StructField = {
 val inputDataType = schema(inputColName).dataType
 require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
   s"The input column $inputColName must be either string type or numeric type, " +
 s"but got $inputDataType.")
-val inputFields = schema.fields
-val outputColName = $(outputCol)
-require(inputFields.forall(_.name != outputColName),
+require(schema.fields.forall(_.name != outputColName),
--- End diff --

Sounds good to me. Added.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239992378
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private def countByValue(
+  dataset: Dataset[_],
+  inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+val aggregator = new StringIndexerAggregator(inputCols.length)
+implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+  .toDF
+  .groupBy().agg(aggregator.toColumn)
+  .as[Array[OpenHashMap[String, Long]]]
+  .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
 transformSchema(dataset.schema, logging = true)
-val values = dataset.na.drop(Array($(inputCol)))
-  .select(col($(inputCol)).cast(StringType))
-  .rdd.map(_.getString(0))
-val labels = $(stringOrderType) match {
-  case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
-.map(_._1).toArray
-  case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
-.map(_._1).toArray
-  case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
-  case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
-}
-copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+val (inputCols, _) = getInOutCols()
+
+val filteredDF = dataset.na.drop(inputCols)
+
+// In case of equal frequency when frequencyDesc/Asc, we further sort the strings by alphabet.
--- End diff --

Moved to `stringOrderType`'s doc.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23196
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23196
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99847/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23196
  
**[Test build #99847 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99847/testReport)**
 for PR 23196 at commit 
[`6b6ea8a`](https://github.com/apache/spark/commit/6b6ea8a89128a8f459a846dddc48aa6af09a2c51).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20146#discussion_r239991238
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
--- End diff --

Yes. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES exceptio...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23226
  
**[Test build #4462 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4462/testReport)**
 for PR 23226 at commit 
[`486f2b5`](https://github.com/apache/spark/commit/486f2b5c8b51f0f12782d9c9d1907c96a9298c89).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r239990986
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -38,13 +38,21 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode
   override def outputPartitioning: Partitioning = SinglePartition
   override def executeCollect(): Array[InternalRow] = child.executeTake(limit)
   private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)
-  override lazy val metrics = SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
+  private lazy val writeMetrics =
+SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
+  private lazy val readMetrics =
+SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
--- End diff --

I feel it is better to rename SQLShuffleMetricsReporter to SQLShuffleReadMetricsReporter so that it matches SQLShuffleWriteMetricsReporter. It can be done in a follow-up.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...

2018-12-07 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/22952
  
@zsxwing Please also take a look: I guess I addressed the glob overlap issue as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22952
  
**[Test build #99851 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99851/testReport)**
 for PR 22952 at commit 
[`22dce0c`](https://github.com/apache/spark/commit/22dce0c1d17de360c0d887a0352c1e8f57761db7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5874/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23252
  
**[Test build #99850 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99850/testReport)**
 for PR 23252 at commit 
[`662e38e`](https://github.com/apache/spark/commit/662e38e6f1a01297d75132b088afb6a6a761f70a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...

2018-12-07 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/23252#discussion_r239990434
  
--- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStepSuite.scala ---
@@ -16,10 +16,13 @@
  */
 package org.apache.spark.deploy.k8s.features
 
-import scala.collection.JavaConverters._
+import java.io.File
+import java.nio.charset.StandardCharsets
+import java.nio.file.Files
 
 import io.fabric8.kubernetes.api.model._
 import org.scalatest.BeforeAndAfter
+import scala.collection.JavaConverters._
--- End diff --

Think I fixed this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...

2018-12-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23225#discussion_r239990036
  
--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -161,6 +161,10 @@ private void writeSortedFile(boolean isLastFile) {
 final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
   inMemSorter.getSortedIterator();
 
+// If there are no sorted records, we don't need to create an empty spill file.
+if (!sortedRecords.hasNext()) {
+  return;
+}
--- End diff --

If you're going to short-circuit, why not do it at the start of the 
function and save the rest of the work done above?
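
A hypothetical sketch of that suggestion (assuming `ShuffleInMemorySorter` exposes a record count; the check below is illustrative, not the final patch):

```java
private void writeSortedFile(boolean isLastFile) {
  // Bail out before any sorting, metrics setup or file creation when
  // there is nothing to spill.
  if (inMemSorter == null || inMemSorter.numRecords() == 0) {
    return;
  }
  final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
      inMemSorter.getSortedIterator();
  // ... rest of the method unchanged ...
}
```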


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...

2018-12-07 Thread wangjiaochun
Github user wangjiaochun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23225#discussion_r239989805
  
--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -161,6 +161,10 @@ private void writeSortedFile(boolean isLastFile) {
 final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
   inMemSorter.getSortedIterator();
 
+// If there are no sorted records, we don't need to create an empty spill file.
+if (!sortedRecords.hasNext()) {
+  return;
+}
--- End diff --

I think it's better not to do that, because it would change the original code style and it doesn't make an appreciable difference in readability.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99845/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23218
  
**[Test build #4461 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4461/testReport)**
 for PR 23218 at commit 
[`b667d37`](https://github.com/apache/spark/commit/b667d37e9ee2d8cdce459806925cdc0fe725b7bf).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23252
  
**[Test build #99845 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99845/testReport)**
 for PR 23252 at commit 
[`a6a44c6`](https://github.com/apache/spark/commit/a6a44c6099154bad8e1776df13aa1b00b258215a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99846/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23252
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23252
  
**[Test build #99846 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99846/testReport)**
 for PR 23252 at commit 
[`9443646`](https://github.com/apache/spark/commit/9443646e72b51a0cc4b94e98a495159942929c73).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23257
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23257
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99844/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options

2018-12-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23257
  
**[Test build #99844 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99844/testReport)**
 for PR 23257 at commit 
[`af15070`](https://github.com/apache/spark/commit/af1507093f42a206c2c22f6c40cf4c43290244b8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23160: [SPARK-26196][SPARK-26281][WebUI] Total tasks title in t...

2018-12-07 Thread shahidki31
Github user shahidki31 commented on the issue:

https://github.com/apache/spark/pull/23160
  
Thanks @srowen. This issue happens only in the master branch.

Thank you all for the review and comments.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...

2018-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23256
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


