[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...

2018-11-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
Hey all, I will merge this in a few days if there are no more comments. It's 
going to speed up the tests by roughly 12 to 15 minutes.


---




[GitHub] spark issue #23119: [SPARK-25954][SS][FOLLOWUP][test-maven] Add Zookeeper 3....

2018-11-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23119
  
LGTM too


---




[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23080#discussion_r235830894
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -377,6 +377,8 @@ final class DataStreamReader private[sql](sparkSession: 
SparkSession) extends Lo
* `multiLine` (default `false`): parse one record, which may span 
multiple lines.
* `locale` (default is `en-US`): sets a locale as language tag in 
IETF BCP 47 format.
* For instance, this is used while parsing dates and timestamps.
+   * `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the 
line separator
+   * that should be used for parsing. Maximum length is 2.
--- End diff --

I'm sorry, can you fix `Maximum length is 2` as well? Then this should be good to go.


---




[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
Yea, the improvement looks consistent:

`Tests passed in 1027 seconds`


---




[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23117
  
It's not urgent :) so it's okay. Actually, I'm on vacation for a week as 
well. Thanks for taking a look, @shaneknapp!!


---




[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23098#discussion_r235830558
  
--- Diff: bin/load-spark-env.cmd ---
@@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and 
ensures it is only loaded
 rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the 
current directory's
 rem conf\ subdirectory.
 
+set SPARK_ENV_CMD="spark-env.cmd"
 if [%SPARK_ENV_LOADED%] == [] (
   set SPARK_ENV_LOADED=1
 
   if [%SPARK_CONF_DIR%] == [] (
 set SPARK_CONF_DIR=%~dp0..\conf
   )
 
-  call :LoadSparkEnv
+  set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%"
+  if exist "%SPARK_ENV_CMD%" (
+call "%SPARK_ENV_CMD%"
+  )
 )
 
 rem Setting SPARK_SCALA_VERSION if not already set.
 
-set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11"
-set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12"
-
-if [%SPARK_SCALA_VERSION%] == [] (
-
-  if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% (
-echo "Presence of build for multiple Scala versions detected."
-echo "Either clean one of them or, set SPARK_SCALA_VERSION in 
spark-env.cmd."
-exit 1
-  )
-  if exist %ASSEMBLY_DIR2% (
-set SPARK_SCALA_VERSION=2.11
-  ) else (
-set SPARK_SCALA_VERSION=2.12
-  )
-)
+rem TODO: revisit for Scala 2.13 support
+set SPARK_SCALA_VERSION=2.12
--- End diff --

Here, I ran some simple commands:

```cmd
C:\>set A=aa

C:\>ECHO %A%
aa

C:\>set A="aa"

C:\>ECHO %A%
"aa"

C:\>call "python.exe"
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit 
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit(0)

C:\>call python.exe
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit 
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit(0)
```


---




[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23109
  
retest this please


---




[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
cc @rxin, @BryanCutler, @squito. This decreases the elapsed time (it's even 
faster than before the tests were split).


---




[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
Oh? It decreases drastically from, for instance,

```
Tests passed in 1770 seconds
```

to

```
Tests passed in 1171 seconds
```


---




[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
retest this please


---




[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23117
  
retest this please


---




[GitHub] spark pull request #23117: [WIP][SPARK-7721][INFRA] Run and generate test co...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23117#discussion_r235660674
  
--- Diff: dev/run-tests.py ---
@@ -594,7 +651,18 @@ def main():
 
 modules_with_python_tests = [m for m in test_modules if 
m.python_test_goals]
 if modules_with_python_tests:
-run_python_tests(modules_with_python_tests, opts.parallelism)
+# We only run PySpark tests with coverage report in one specific 
job with
+# Spark master with SBT in Jenkins.
+is_sbt_master_job = (
+os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", "") == 
"hadoop2.7"
+and os.environ.get("SPARK_BRANCH", "") == "master"
+and os.environ.get("AMPLAB_JENKINS", "") == "true"
+and os.environ.get("AMPLAB_JENKINS_BUILD_TOOL", "") == "sbt")
+is_sbt_master_job = True  # Will remove this right before getting 
merged.
--- End diff --

I should remove this before getting this in.


---




[GitHub] spark issue #23118: [SPARK-26144][BUILD] `build/mvn` should detect `scala.ve...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23118
  
Haha, today's not a holiday here :D.


---




[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23109
  
retest this please


---




[GitHub] spark pull request #23117: [WIP][SPARK-7721][INFRA] Run and generate test co...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23117#discussion_r235644712
  
--- Diff: dev/run-tests.py ---
@@ -594,7 +651,18 @@ def main():
 
 modules_with_python_tests = [m for m in test_modules if 
m.python_test_goals]
 if modules_with_python_tests:
-run_python_tests(modules_with_python_tests, opts.parallelism)
+# We only run PySpark tests with coverage report in one specific 
job with
+# Spark master with SBT in Jenkins.
+is_sbt_master_job = (
+os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", "") == 
"hadoop2.7"
--- End diff --

These environment variables were checked before in 
https://github.com/apache/spark/pull/17669


---




[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
retest this please


---




[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...

2018-11-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23117
  
cc @rxin, @JoshRosen, @shaneknapp, @gatorsmile, @BryanCutler, @holdenk, 
@felixcheung, @viirya, @ueshin, @icexelloss 


---




[GitHub] spark pull request #23117: [WIP][SPARK-7721][INFRA] Run and generate test co...

2018-11-22 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23117

[WIP][SPARK-7721][INFRA] Run and generate test coverage report from Python 
via Jenkins

## What changes were proposed in this pull request?


### Background

As of the current status, the test script that generates coverage information 
was merged into Spark in https://github.com/apache/spark/pull/20204.

So, we can generate the coverage report and site by, for example:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

like `run-tests` script in `./python`.


### Proposed change

The next step is to host this coverage report via `github.io` automatically
by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, @spark-test, which was shared with 
Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

To cut this short, this PR targets running the coverage in 

[spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/)

In that specific job, it will clone the site and rebase the up-to-date 
PySpark test coverage from the latest commit, for instance as below:

```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Copy generated coverage HTML.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Check out to a temporary branch.
git checkout --orphan latest_branch

# Add all the files.
git add -A

# Commit current test coverage results.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to master.
git branch -m gh-pages

# Finally, force update to our repository.
git push -f origin gh-pages
```

So a single, up-to-date coverage report can be shown on the `github.io` 
page. The commands above were manually tested.

### TODO:

- [ ] Write a draft
- [ ] Set a hidden `SPARK_TEST_KEY` for @spark-test's password in Jenkins via 
Jenkins's feature.
  This should be set both at `SparkPullRequestBuilder`, so that we (or I) 
can test, and at `spark-master-test-sbt-hadoop-2.7`
- [ ] Make the PR builder's tests pass
- [ ] Enable this only in the spark-master-test-sbt-hadoop-2.7 build right 
before getting this in.

## How was this patch tested?

It will be tested via Jenkins.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-7721

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23117.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23117


commit d88d5aa73db636f8c73ace9f83f339781ea50531
Author: hyukjinkwon 
Date:   2018-11-22T08:08:20Z

Run and generate test coverage report from Python via Jenkins




---




[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
retest this please


---




[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23109
  
retest this please


---




[GitHub] spark issue #23112: [GraphX] Remove unused variables left over by previous r...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23112
  
ok to test


---




[GitHub] spark issue #23112: [GraphX] Remove unused variables left over by previous r...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23112
  
Looks okay to go.


---




[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23109
  
retest this please

(the test failures don't look persistent)


---




[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23070
  
Merged to master.


---




[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
LGTM except https://github.com/apache/spark/pull/23080/files#r235589426


---




[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23080#discussion_r235589426
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala 
---
@@ -216,8 +232,13 @@ class CSVOptions(
 format.setDelimiter(delimiter)
 format.setQuote(quote)
 format.setQuoteEscape(escape)
+lineSeparator.foreach {sep =>
+  format.setLineSeparator(sep)
+  format.setNormalizedNewline(0x00.toChar)
--- End diff --

I know we have some problems here with setting newlines of more than one 
character, because `setNormalizedNewline` only supports one character. 

This is related to 
https://github.com/apache/spark/pull/18581#issuecomment-314037750 and 
https://github.com/uniVocity/univocity-parsers/issues/170

That's why I thought we can only support a single character for now (see the sketch below).
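
For reference, a minimal reader-side sketch of that limitation from the PySpark API, assuming the `lineSep` option name this PR adds (the path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single-character custom separator works under the current limitation;
# a longer separator would run into the setNormalizedNewline constraint above.
df = (spark.read
      .option("lineSep", "\u0001")
      .csv("/tmp/records.csv"))
```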



---




[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23080#discussion_r235589448
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala 
---
@@ -227,7 +248,10 @@ class CSVOptions(
 settings.setEmptyValue(emptyValueInRead)
 settings.setMaxCharsPerColumn(maxCharsPerColumn)
 
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
-settings.setLineSeparatorDetectionEnabled(multiLine == true)
+settings.setLineSeparatorDetectionEnabled(lineSeparatorInRead.isEmpty 
&& multiLine)
+lineSeparatorInRead.foreach { _ =>
--- End diff --

nice!


---




[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22979#discussion_r235588064
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala
 ---
@@ -79,4 +83,22 @@ object CSVExprUtils {
 throw new IllegalArgumentException(s"Delimiter cannot be more than 
one character: $str")
 }
   }
+
+  def getDecimalParser(useLegacyParser: Boolean, locale: Locale): String 
=> java.math.BigDecimal = {
+if (useLegacyParser) {
+  (s: String) => new BigDecimal(s.replaceAll(",", ""))
+} else {
+  val df = new DecimalFormat("", new DecimalFormatSymbols(locale))
--- End diff --

let's rename `df` to `decimalFormat` ..


---




[GitHub] spark issue #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark add InM...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23027
  
Looks fine. @dongjoon-hyun WDYT?


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
There are two more things to deal with:

The https://github.com/apache/spark/pull/23052#issuecomment-440687200 comment 
will still be valid - at least it should be double-checked, because DataFrames 
originating from an emptyRDD never write anything.

One thing is CSV among the text-based datasources, because it can write out 
headers. The thing is, the header is currently written only when the first row is 
written - this is what the https://github.com/apache/spark/pull/13252 PR targeted 
before. I closed that PR because there was no interest, but we should fix this 
(a minimal repro sketch follows).
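
A minimal PySpark sketch of the header scenario described above (the output path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An empty result: no rows are written, so under the current behaviour
# no header line is written either, even though header=True was requested.
df = spark.range(10).where("id < 0")
df.write.option("header", True).mode("overwrite").csv("/tmp/empty_csv")
```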


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
cc @cloud-fan as well


---




[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
retest this please


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r235583851
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -240,16 +240,6 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
   Seq(Row("1"), Row("2")))
   }
 
-  test("SPARK-11226 Skip empty line in json file") {
--- End diff --

Where was it moved to, then? Does that mean we don't have a regression test 
for SPARK-11226 anymore?


---




[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22938
  
Sorry for the late response. The change looks good to me in general but I 
had one question.


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r235583559
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 .text(path)
 
   val jsonDF = spark.read.option("multiLine", true).option("mode", 
"PERMISSIVE").json(path)
-  assert(jsonDF.count() === corruptRecordCount)
+  assert(jsonDF.count() === corruptRecordCount + 1) // null row for 
empty file
--- End diff --

If that's true, we should not do this. Empty files can be generated in many 
cases for now, and the behaviour is not currently well defined. If we rely on 
this behaviour, it will cause weird behaviours or bugs that are hard to fix.


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r235583349
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 .text(path)
 
   val jsonDF = spark.read.option("multiLine", true).option("mode", 
"PERMISSIVE").json(path)
-  assert(jsonDF.count() === corruptRecordCount)
+  assert(jsonDF.count() === corruptRecordCount + 1) // null row for 
empty file
--- End diff --

Wait, does this mean that it reads an empty record from an empty file after 
this change?


---




[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22938#discussion_r235583315
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1905,7 +1911,7 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   F.count($"dummy").as("valid"),
   F.count($"_corrupt_record").as("corrupt"),
   F.count("*").as("count"))
-  checkAnswer(counts, Row(1, 4, 6))
+  checkAnswer(counts, Row(1, 5, 7)) // null row for empty file
--- End diff --

Wait, does this mean that it reads an empty record from an empty file after 
this change?


---




[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
retest this please


---




[GitHub] spark issue #23113: [SPARK-26019][PYTHON] Fix race condition in accumulators...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23113
  
> race condition explained in 
https://issues.apache.org/jira/browse/SPARK-26019

How does the race condition happen? Can you clarify it in the PR description? I 
think @viirya's analysis matches what I investigated 
(https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16692722=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16692722).



---




[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23098#discussion_r235572895
  
--- Diff: bin/load-spark-env.cmd ---
@@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and 
ensures it is only loaded
 rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the 
current directory's
 rem conf\ subdirectory.
 
+set SPARK_ENV_CMD="spark-env.cmd"
 if [%SPARK_ENV_LOADED%] == [] (
   set SPARK_ENV_LOADED=1
 
   if [%SPARK_CONF_DIR%] == [] (
 set SPARK_CONF_DIR=%~dp0..\conf
   )
 
-  call :LoadSparkEnv
+  set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%"
+  if exist "%SPARK_ENV_CMD%" (
+call "%SPARK_ENV_CMD%"
+  )
 )
 
 rem Setting SPARK_SCALA_VERSION if not already set.
 
-set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11"
-set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12"
-
-if [%SPARK_SCALA_VERSION%] == [] (
-
-  if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% (
-echo "Presence of build for multiple Scala versions detected."
-echo "Either clean one of them or, set SPARK_SCALA_VERSION in 
spark-env.cmd."
-exit 1
-  )
-  if exist %ASSEMBLY_DIR2% (
-set SPARK_SCALA_VERSION=2.11
-  ) else (
-set SPARK_SCALA_VERSION=2.12
-  )
-)
+rem TODO: revisit for Scala 2.13 support
+set SPARK_SCALA_VERSION=2.12
--- End diff --

For `call`, it's a bit different IIRC (https://ss64.com/nt/cmd.html)


---




[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23098#discussion_r235572737
  
--- Diff: bin/load-spark-env.cmd ---
@@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and 
ensures it is only loaded
 rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the 
current directory's
 rem conf\ subdirectory.
 
+set SPARK_ENV_CMD="spark-env.cmd"
 if [%SPARK_ENV_LOADED%] == [] (
   set SPARK_ENV_LOADED=1
 
   if [%SPARK_CONF_DIR%] == [] (
 set SPARK_CONF_DIR=%~dp0..\conf
   )
 
-  call :LoadSparkEnv
+  set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%"
+  if exist "%SPARK_ENV_CMD%" (
+call "%SPARK_ENV_CMD%"
+  )
 )
 
 rem Setting SPARK_SCALA_VERSION if not already set.
 
-set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11"
-set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12"
-
-if [%SPARK_SCALA_VERSION%] == [] (
-
-  if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% (
-echo "Presence of build for multiple Scala versions detected."
-echo "Either clean one of them or, set SPARK_SCALA_VERSION in 
spark-env.cmd."
-exit 1
-  )
-  if exist %ASSEMBLY_DIR2% (
-set SPARK_SCALA_VERSION=2.11
-  ) else (
-set SPARK_SCALA_VERSION=2.12
-  )
-)
+rem TODO: revisit for Scala 2.13 support
+set SPARK_SCALA_VERSION=2.12
--- End diff --

but at least that's what I remember from when I tested the 
https://github.com/apache/spark/blob/master/bin/spark-sql2.cmd#L23 line.


---




[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23098#discussion_r235572442
  
--- Diff: bin/load-spark-env.cmd ---
@@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and 
ensures it is only loaded
 rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the 
current directory's
 rem conf\ subdirectory.
 
+set SPARK_ENV_CMD="spark-env.cmd"
 if [%SPARK_ENV_LOADED%] == [] (
   set SPARK_ENV_LOADED=1
 
   if [%SPARK_CONF_DIR%] == [] (
 set SPARK_CONF_DIR=%~dp0..\conf
   )
 
-  call :LoadSparkEnv
+  set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%"
+  if exist "%SPARK_ENV_CMD%" (
+call "%SPARK_ENV_CMD%"
+  )
 )
 
 rem Setting SPARK_SCALA_VERSION if not already set.
 
-set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11"
-set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12"
-
-if [%SPARK_SCALA_VERSION%] == [] (
-
-  if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% (
-echo "Presence of build for multiple Scala versions detected."
-echo "Either clean one of them or, set SPARK_SCALA_VERSION in 
spark-env.cmd."
-exit 1
-  )
-  if exist %ASSEMBLY_DIR2% (
-set SPARK_SCALA_VERSION=2.11
-  ) else (
-set SPARK_SCALA_VERSION=2.12
-  )
-)
+rem TODO: revisit for Scala 2.13 support
+set SPARK_SCALA_VERSION=2.12
--- End diff --

Yea, I think we shouldn't quote it, if I remember correctly. Let me test 
and get back to you today or tomorrow. I'll have to fly to Korea for my 
one-week vacation starting tomorrow :D.


---




[GitHub] spark pull request #23111: [DO-NOT-MERGE] Increases default parallelism in P...

2018-11-21 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23111

[DO-NOT-MERGE] Increases default parallelism in PySpark tests

## What changes were proposed in this pull request?

I'm trying to see if increasing parallelism decreases the elapsed time of the 
PySpark tests (a generic sketch of the idea is below).
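
This is not part of the change itself - just a generic, illustrative sketch of the idea: run per-module test goals in N parallel workers (the goal names and the `unittest` invocation here are hypothetical; the real logic lives in `python/run-tests.py`):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test goals; the real list comes from python/run-tests.py.
goals = ["pyspark.tests", "pyspark.sql.tests", "pyspark.ml.tests"]

def run_goal(goal):
    # Each goal runs in its own Python process, so threads suffice here.
    return subprocess.call(["python", "-m", "unittest", goal])

# Raising max_workers is the "parallelism" knob being tuned in this PR.
with ThreadPoolExecutor(max_workers=8) as pool:
    exit_codes = list(pool.map(run_goal, goals))
```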

## How was this patch tested?

Jenkins tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark parallelism

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23111


commit ec0f7301f693811f4f782a62aae05abac33acd34
Author: hyukjinkwon 
Date:   2018-11-22T00:10:10Z

Increases default parallelism in PySpark tests




---




[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23078
  
Thanks. Merged to master.


---




[GitHub] spark pull request #23102: [SPARK-26137][CORE] Use Java system property "fil...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23102#discussion_r235570156
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala 
---
@@ -65,7 +65,7 @@ private[deploy] object DependencyUtils extends Logging {
   .map {
 resolveGlobPaths(_, hadoopConf)
   .split(",")
-  .filterNot(_.contains(userJar.split("/").last))
+  
.filterNot(_.contains(userJar.split(System.getProperty("file.separator")).last))
--- End diff --

+1


---




[GitHub] spark issue #23102: [SPARK-26137][CORE] Use Java system property "file.separ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23102
  
@markpavey, can you write a test? I can run the test on Windows via 
AppVeyor.


---




[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
Sorry, may I ask you to take another look, please? I think it's quite a 
straightforward fix, done in a consistent way.


---




[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23078
  
@JoshRosen, can you take a look please when you're available? It's quite 
an obvious fix.


---




[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23109
  
retest this please


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
Also, it's not always the case that Parquet writes empty files. It does not write 
empty files when DataFrames are created from an emptyRDD (the case pointed out in 
the PR link I gave). We should match this behaviour as well.


---




[GitHub] spark issue #23101: [SPARK-26134][CORE] Upgrading Hadoop to 2.7.4 to fix jav...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23101
  
LGTM2


---




[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23052
  
@MaxGekk, I didn't mean to block this PR. Since we're going ahead for 3.0, 
it would be good to match and fix the behaviours across data sources. For 
instance, CSV should still be able to read the header. Shall we clarify each 
data source's behaviour?


---




[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3206
  
This fix is not released yet. This PR fixes exactly the problem you faced, 
and the fix will be available in the next release of Zeppelin.


---


[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23098#discussion_r235322452
  
--- Diff: bin/load-spark-env.cmd ---
@@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and 
ensures it is only loaded
 rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the 
current directory's
 rem conf\ subdirectory.
 
+set SPARK_ENV_CMD="spark-env.cmd"
 if [%SPARK_ENV_LOADED%] == [] (
   set SPARK_ENV_LOADED=1
 
   if [%SPARK_CONF_DIR%] == [] (
 set SPARK_CONF_DIR=%~dp0..\conf
   )
 
-  call :LoadSparkEnv
+  set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%"
+  if exist "%SPARK_ENV_CMD%" (
+call "%SPARK_ENV_CMD%"
+  )
 )
 
 rem Setting SPARK_SCALA_VERSION if not already set.
 
-set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11"
-set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12"
-
-if [%SPARK_SCALA_VERSION%] == [] (
-
-  if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% (
-echo "Presence of build for multiple Scala versions detected."
-echo "Either clean one of them or, set SPARK_SCALA_VERSION in 
spark-env.cmd."
-exit 1
-  )
-  if exist %ASSEMBLY_DIR2% (
-set SPARK_SCALA_VERSION=2.11
-  ) else (
-set SPARK_SCALA_VERSION=2.12
-  )
-)
+rem TODO: revisit for Scala 2.13 support
+set SPARK_SCALA_VERSION=2.12
--- End diff --

Nope, it takes it as-is, including the quotes ... haha, odd (to me).


---




[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23071
  
Thank you @MaxGekk.


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
Sure!


---




[GitHub] spark issue #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark add InM...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23027
  
@wangyum, why did you close this?


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
gentle ping, @felixcheung.


---




[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23070
  
retest this please


---




[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23071
  
I think we don't need this for now. Let's do it when more `from/to_...` 
functions are added later. The amount of code actually increases.


---




[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
@MaxGekk, let's rebase this one accordingly on top of the encoding support.


---




[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23080#discussion_r235244407
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala 
---
@@ -192,6 +192,20 @@ class CSVOptions(
*/
   val emptyValueInWrite = emptyValue.getOrElse("\"\"")
 
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
+require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
--- End diff --

Hm, I see.


---




[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23085
  
@mrandrewandrade, let's close this for now.


---




[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23078
  
@zsxwing, can you take a look please when you're available?


---




[GitHub] spark issue #23099: [WIP][SPARK-25954][SS] Upgrade to Kafka 2.1.0

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23099
  
retest this please


---




[GitHub] spark issue #23089: [SPARK-26120][TESTS][SS][SPARKR]Fix a streaming query le...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23089
  
Merged to master and branch-2.4.


---




[GitHub] spark issue #23091: [SPARK-26122][SQL] Support encoding for multiLine in CSV...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23091
  
Merged to master.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r235226292
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

I am doing this because:

1. Code consistency - usually the configuration is dealt with on the JVM 
side if possible; otherwise we would ideally have to send the configuration to the 
worker side and process it in the Python worker. What we need to do is explicitly 
disable the configuration under a certain condition.

2. If we do it only on the Python side, it's going to introduce a third 
state in the JVM (`memoryMb`): the configuration value exists in the JVM (which 
means it's enabled) but the functionality is disabled on the Python side. Handling 
this on the JVM side reduces the complexity.

Why are you so against disabling it on the JVM side? I don't see a big 
downside to doing this. I added a flag as you suggested as well. For clarification, 
it's a rather minor point.


---




[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23085
  
Because it's already documented. Also, it brings maintenance overhead.


---




[GitHub] spark issue #23087: [SPARK-26124][BUILD] Update plugins to latest versions

2018-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23087
  
retest this please


---




[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23004
  
Looks fine to me


---




[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23085
  
It's already documented on the official site anyway.


---




[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23085
  
I don't think it needs to change. It's likely to change, and we wouldn't 
want to update this doc every time we add a new datasource.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r234838984
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

For the current status, there's no difference. I'm just trying to match what 
we have usually been doing - configuration control is usually done on the JVM side, 
and I propose to disallow this configuration on Windows. I think it's a pretty 
minor point now because I added the flag as well ... 


---




[GitHub] spark issue #23091: [SPARK-26122][SQL] Support encoding for multiLine in CSV...

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23091
  
FYI, hey @priancho, IIRC you proposed a similar change before on the mailing 
list. I wasn't positive about it because I was thinking we should deprecate 
the `encoding` option at that time. It had a long, long discussion, and we're going 
to support this.


---




[GitHub] spark issue #23087: [MINOR][BUILD] Update plugins to latest versions

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23087
  
Looks fine to me.


---




[GitHub] spark pull request #23087: [MINOR][BUILD] Update plugins to latest versions

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23087#discussion_r234831263
  
--- Diff: pom.xml ---
@@ -2522,7 +2530,7 @@
   
 com.puppycrawl.tools
 checkstyle
-8.2
+8.14
--- End diff --

@srowen, I think 
https://github.com/apache/spark/blob/master/project/plugins.sbt#L4 should be 
updated too.


---




[GitHub] spark issue #22635: [SPARK-25591][PySpark][SQL] Avoid overwriting deserializ...

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22635
  
This was fixed in 2.4.0, and your issue happened when going from 2.3.1 to 
2.3.2. It's not related.


---




[GitHub] spark issue #22635: [SPARK-25591][PySpark][SQL] Avoid overwriting deserializ...

2018-11-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22635
  
How is this related to the JIRA? It doesn't look quite related at a cursory 
glance. Please leave some analysis next time, or at least test it before/after 
the specific commit. Let me take a look anyway.


---




[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
Ah, also, `CsvParser.beginParsing` takes an additional `Charset` argument, 
so it should be fairly easy to support encoding in `multiLine`. @MaxGekk, 
would you be able to find some time to work on it? If that change makes the 
current PR easier, we can merge that one first.


---




[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23080#discussion_r234476318
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala 
---
@@ -192,6 +192,20 @@ class CSVOptions(
*/
   val emptyValueInWrite = emptyValue.getOrElse("\"\"")
 
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
+require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
+sep
+  }
+
+  val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { 
lineSep =>
+lineSep.getBytes("UTF-8")
--- End diff --

@MaxGekk, CSV's multiLine mode does not support encoding, but I think normal mode 
supports `encoding`. It should be okay to get the bytes from it; we can just throw 
an exception when multiLine is enabled. A sketch of the combination is below.
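
A hedged PySpark-level sketch of that combination (the multiLine behaviour shown is the proposal above, not necessarily the current code; the path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Normal (line-by-line) mode: encoding plus lineSep is fine, since the
# separator can be converted to bytes in the given encoding.
df = (spark.read
      .option("encoding", "UTF-8")
      .option("lineSep", "\n")
      .csv("/tmp/in.csv"))

# multiLine mode: per the comment above, combining it with a custom
# lineSep would raise an exception rather than being silently ignored.
```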


---




[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23080#discussion_r234475595
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala 
---
@@ -192,6 +192,20 @@ class CSVOptions(
*/
   val emptyValueInWrite = emptyValue.getOrElse("\"\"")
 
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
+require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
--- End diff --

We could say the line separator should be 1 or 2 bytes (UTF-8) in the read path 
specifically.


---




[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23080#discussion_r234475228
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala 
---
@@ -192,6 +192,20 @@ class CSVOptions(
*/
   val emptyValueInWrite = emptyValue.getOrElse("\"\"")
 
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep 
=>
+require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
--- End diff --

@MaxGekk, it might not be a super big deal, but I believe this should be 
counted after converting it into `UTF-8`; see the illustration below.
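
A standalone illustration of why the count should happen after the UTF-8 conversion (a one-character separator can be several bytes once encoded):

```python
sep = "\u2028"                     # LINE SEPARATOR: a single character
print(len(sep))                    # 1 character
print(len(sep.encode("utf-8")))    # 3 bytes in UTF-8
```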


---




[GitHub] spark issue #23077: [SPARK-26105][PYTHON] Clean unittest2 imports up that we...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23077
  
Merged to master.

Thanks for reviewing this, @BryanCutler and @srowen.


---




[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
adding @BryanCutler and @ueshin as well FYI.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23078
  
cc @BryanCutler.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests ov...

2018-11-18 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23078

[SPARK-26106][PYTHON] Prioritizes ML unittests over the doctests in PySpark

## What changes were proposed in this pull request?

Arguably, unittests usually take longer than doctests, so we had better prioritize unittests over doctests.

In other modules, unittests are already prioritized over doctests; it looks like the ML module was simply missed in the first place.
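
The scheduling idea, as a hedged sketch (this is not the actual `run-tests.py` code; the sort key below is an assumption for illustration):

```python
def sort_key(goal):
    # Dispatch unittest modules (conventionally named "*tests*") to worker
    # processes before doctest-bearing modules, since they tend to run longer.
    return 0 if "tests" in goal else 1

goals = ["pyspark.ml.classification", "pyspark.ml.tests"]
print(sorted(goals, key=sort_key))
# ['pyspark.ml.tests', 'pyspark.ml.classification']
```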

## How was this patch tested?

Jenkins tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26106

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23078.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23078


commit 8eb5444f97e6441e934ba264bb1fd7c8063d48ad
Author: hyukjinkwon 
Date:   2018-11-18T08:41:49Z

Prioritizes ML unittests over the doctests in PySpark




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23077
  
cc @BryanCutler.

BTW, Bryan, do you have some time to work on the `has_numpy` stuff that we talked about before? I was thinking we should specify the package version and produce skip messages similar to those we have for Pandas and Arrow:


https://github.com/apache/spark/blob/a7a331df6e6fbcb181caf2363bffc3e05bdfc009/python/pyspark/testing/sqlutils.py#L30-L55

I'll try to work on this toward the end of next week if you're busy.
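
Something along these lines, as a hedged sketch mirroring the linked `sqlutils` pattern (the names are assumptions, not an agreed API):

```python
import unittest

numpy_requirement_message = None
try:
    import numpy  # noqa: F401
except ImportError as e:
    # Reuse the import error as the skip message, like the Pandas/Arrow guards.
    numpy_requirement_message = str(e)

have_numpy = numpy_requirement_message is None

@unittest.skipIf(not have_numpy, numpy_requirement_message)
class VectorTests(unittest.TestCase):
    def test_placeholder(self):
        self.assertTrue(True)
```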


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up ...

2018-11-18 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23077

[SPARK-25344][PYTHON] Clean unittest2 imports up that were added for Python 
2.6 before

## What changes were proposed in this pull request?

Currently, some PySpark tests still assume they could be run on Python 2.6 by importing `unittest2`. For instance:

```python
if sys.version_info[:2] <= (2, 6):
    try:
        import unittest2 as unittest
    except ImportError:
        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
        sys.exit(1)
else:
    import unittest
```

While I am here, I removed some unused imports and reordered imports per PEP 8.

We officially dropped Python 2.6 support a while ago and have started to discuss dropping Python 2 entirely. It's better to remove these imports now.
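
After the cleanup, the guard collapses to a plain, PEP 8-ordered import block; a sketch of the intended end state (the module names are illustrative):

```python
import os
import unittest

from pyspark.sql import SparkSession
```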

## How was this patch tested?

Manual tests, and existing tests via Jenkins.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26105

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23077.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23077


commit ed834f5062b1d8dff1fe92a6ad678d5d74df2c1c
Author: hyukjinkwon 
Date:   2018-11-18T08:17:48Z

Clean unittest2 imports up added Python 2.6 before




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...

2018-11-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23063
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...

2018-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23070
  
Looks good to me. I or someone else should take a closer look, though.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...

2018-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23070
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...

2018-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23070
  
add to whitelist


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23064: [MINOR][SQL] Fix typo in CTAS plan database string

2018-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23064
  
Merged to master, branch-2.4 and branch-2.3.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...

2018-11-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23063
  
Will merge this one tomorrow if it is not merged by then.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23050: [SPARK-26079][sql] Ensure listener event delivery in Str...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23050
  
Merged to master and branch-2.4.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23048: transform DenseVector x DenseVector sqdist from i...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23048#discussion_r234399421
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala ---
@@ -370,14 +370,19 @@ object Vectors {
       case (v1: DenseVector, v2: SparseVector) =>
         squaredDistance = sqdist(v2, v1)

-      case (DenseVector(vv1), DenseVector(vv2)) =>
-        var kv = 0
+      case (DenseVector(vv1), DenseVector(vv2)) => {
         val sz = vv1.length
-        while (kv < sz) {
-          val score = vv1(kv) - vv2(kv)
-          squaredDistance += score * score
-          kv += 1
+        @annotation.tailrec
--- End diff --

FWIW, I ran an experiment myself before, because someone proposed a similar change (turning `while` into a tail-recursive one). There was virtually no difference. If this change is only tail recursion vs `while`, I doubt how much it can improve performance. I believe the benchmark steps should be clarified.
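
For reference, the kind of benchmark that would settle it, as a hedged sketch (the real comparison would be JVM-side, e.g. with JMH, since it is `scalac` that rewrites `@annotation.tailrec` methods into plain loops):

```python
import timeit

def sqdist_while(v1, v2):
    # Straight port of the while-loop form from the diff above.
    acc, kv = 0.0, 0
    while kv < len(v1):
        d = v1[kv] - v2[kv]
        acc += d * d
        kv += 1
    return acc

def sqdist_rec(v1, v2, kv=0, acc=0.0):
    # Mimics the shape of the tail-recursive variant; note Python does not
    # eliminate tail calls, so this is only a sketch of the benchmark steps.
    if kv >= len(v1):
        return acc
    d = v1[kv] - v2[kv]
    return sqdist_rec(v1, v2, kv + 1, acc + d * d)

v1 = [float(i) for i in range(500)]
v2 = [float(i + 1) for i in range(500)]
for fn in (sqdist_while, sqdist_rec):
    # Repeat on identical inputs enough times for stable numbers.
    print(fn.__name__, timeit.timeit(lambda: fn(v1, v2), number=2000))
```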


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r234398745
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true)
   // each python worker gets an equal part of the allocation. the worker pool will grow to the
   // number of concurrent tasks, which is determined by the number of cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

> My point is that if resource can't be loaded for any reason, the code 
shouldn't fail. As it is, if resource can't be loaded then that is handled, but 
if the memory limit is set then the worker will still try to use it. That's 
what I think is brittle. There should be a flag for whether to attempt to use 
the resource API, based on whether it was loaded.

Ah, so the point is that the condition for the existence of `resource` might not be clear, so we should have a flag that keeps the worker from failing in that case. Yup, makes sense. Let me add a flag.

> If the worker operates as I described, then why make any changes on the 
JVM side?

I am making some changes on the JVM side so that we can explicitly disable this under a certain condition. For instance, suppose we don't make a change on the JVM side and only change the Python `worker`: later, some other change on the JVM side could refer to this configuration, which can be problematic.

If we keep the configuration around on the JVM side, it basically means the configuration could have another status besides simply enabled or disabled, which we would have to take care of.
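
Concretely, the flag pattern I have in mind on the worker side, as a hedged sketch (the names and the hard-coded limit are assumptions for illustration):

```python
has_resource_module = True
try:
    import resource
except ImportError:
    # e.g. on Windows, where the resource module does not exist;
    # the limit below is then skipped entirely instead of failing.
    has_resource_module = False

memory_limit_mb = 512  # stand-in for a value passed from the JVM side

if memory_limit_mb > 0 and has_resource_module:
    (soft_limit, hard_limit) = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = memory_limit_mb * 1024 * 1024
    if soft_limit == resource.RLIM_INFINITY or new_soft < soft_limit:
        resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard_limit))
```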



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23059: [SPARK-26091][SQL] Upgrade to 2.3.4 for Hive Metastore C...

2018-11-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23059
  
Looks good. Adding @gatorsmile and @wangyum.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


