[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 Hey all, I will merge this in a few days if there are no more comments. It's going to speed up the tests by roughly 12 to 15 minutes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23119: [SPARK-25954][SS][FOLLOWUP][test-maven] Add Zookeeper 3....
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23119 LGTM too
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r235830894 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala --- @@ -377,6 +377,8 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo * `multiLine` (default `false`): parse one record, which may span multiple lines. * `locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format. * For instance, this is used while parsing dates and timestamps. + * `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator + * that should be used for parsing. Maximum length is 2. --- End diff -- I'm sorry, can you fix `Maximum length is 2` as well? Then this should be good to go.
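As an editorial aside, a minimal pure-Python analogy (not Spark's actual CSV parser) of the difference the `lineSep` option docs describe: the default treats any of `\r`, `\r\n`, and `\n` as a record terminator, while a custom separator makes exactly one explicit string significant.

```python
text = "a\r\nb\nc\rd"

# Default-style handling: any of \r, \r\n, or \n ends a record,
# analogous to what the `lineSep` default covers.
print(text.splitlines())  # ['a', 'b', 'c', 'd']

# Custom-separator handling: only the explicit separator '|' splits
# records; a newline inside a field no longer terminates it.
print("a|b|c\nd".split("|"))  # ['a', 'b', 'c\nd']
```

This is only an analogy for reading the docs; the actual parsing is done by univocity-parsers underneath.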
[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 Yea, the improvement looks persistent: `Tests passed in 1027 seconds`
[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23117 It's not urgent :) so it's okay. Actually I'm on vacation for a week as well. Thanks for taking a look, @shaneknapp!!
[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23098#discussion_r235830558 --- Diff: bin/load-spark-env.cmd --- @@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and ensures it is only loaded rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the current directory's rem conf\ subdirectory. +set SPARK_ENV_CMD="spark-env.cmd" if [%SPARK_ENV_LOADED%] == [] ( set SPARK_ENV_LOADED=1 if [%SPARK_CONF_DIR%] == [] ( set SPARK_CONF_DIR=%~dp0..\conf ) - call :LoadSparkEnv + set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%" + if exist "%SPARK_ENV_CMD%" ( +call "%SPARK_ENV_CMD%" + ) ) rem Setting SPARK_SCALA_VERSION if not already set. -set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11" -set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12" - -if [%SPARK_SCALA_VERSION%] == [] ( - - if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% ( -echo "Presence of build for multiple Scala versions detected." -echo "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." -exit 1 - ) - if exist %ASSEMBLY_DIR2% ( -set SPARK_SCALA_VERSION=2.11 - ) else ( -set SPARK_SCALA_VERSION=2.12 - ) -) +rem TODO: revisit for Scala 2.13 support +set SPARK_SCALA_VERSION=2.12 --- End diff -- Here, I ran some simple commands: ```cmd C:\>set A=aa C:\>ECHO %A% aa C:\>set A="aa" C:\>ECHO %A% "aa" C:\>call "python.exe" Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> exit(0) C:\>call python.exe Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> exit(0) ```
[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23109 retest this please
[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 cc @rxin, @BryanCutler, @squito. This decreases elapsed time (even faster than before splitting the tests).
[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 Oh? It drastically decreases from, for instance, ``` Tests passed in 1770 seconds ``` to ``` Tests passed in 1171 seconds ```
[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 retest this please
[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23117 retest this please
[GitHub] spark pull request #23117: [WIP][SPARK-7721][INFRA] Run and generate test co...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23117#discussion_r235660674 --- Diff: dev/run-tests.py --- @@ -594,7 +651,18 @@ def main(): modules_with_python_tests = [m for m in test_modules if m.python_test_goals] if modules_with_python_tests: -run_python_tests(modules_with_python_tests, opts.parallelism) +# We only run PySpark tests with coverage report in one specific job with +# Spark master with SBT in Jenkins. +is_sbt_master_job = ( +os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", "") == "hadoop2.7" +and os.environ.get("SPARK_BRANCH", "") == "master" +and os.environ.get("AMPLAB_JENKINS", "") == "true" +and os.environ.get("AMPLAB_JENKINS_BUILD_TOOL", "") == "sbt") +is_sbt_master_job = True # Will remove this right before getting merged. --- End diff -- I should remove this before getting this in.
[GitHub] spark issue #23118: [SPARK-26144][BUILD] `build/mvn` should detect `scala.ve...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23118 Haha, today's not a holiday here :D.
[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23109 retest this please
[GitHub] spark pull request #23117: [WIP][SPARK-7721][INFRA] Run and generate test co...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23117#discussion_r235644712 --- Diff: dev/run-tests.py --- @@ -594,7 +651,18 @@ def main(): modules_with_python_tests = [m for m in test_modules if m.python_test_goals] if modules_with_python_tests: -run_python_tests(modules_with_python_tests, opts.parallelism) +# We only run PySpark tests with coverage report in one specific job with +# Spark master with SBT in Jenkins. +is_sbt_master_job = ( +os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", "") == "hadoop2.7" --- End diff -- These environment variables were checked before in https://github.com/apache/spark/pull/17669
[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 retest this please
[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23117 cc @rxin, @JoshRosen, @shaneknapp, @gatorsmile, @BryanCutler, @holdenk, @felixcheung, @viirya, @ueshin, @icexelloss
[GitHub] spark pull request #23117: [WIP][SPARK-7721][INFRA] Run and generate test co...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23117 [WIP][SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins ## What changes were proposed in this pull request? ### Background For the current status, the test script that generates coverage information was merged into Spark: https://github.com/apache/spark/pull/20204 So, we can generate the coverage report and site by, for example: ``` run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql ``` like the `run-tests` script in `./python`. ### Proposed change The next step is to host this coverage report via `github.io` automatically by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/). This uses my testing account for Spark, @spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor. To cut this short, this PR targets running the coverage in [spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/) In that specific job, it will clone the page and rebase the up-to-date PySpark test coverage from the latest commit. For instance, as below: ```bash # Clone PySpark coverage site. git clone https://github.com/spark-test/pyspark-coverage-site.git # Copy generated coverage HTML. cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/ # Check out to a temporary branch. git checkout --orphan latest_branch # Add all the files. git add -A # Commit current test coverage results. git commit -am "Coverage report at latest commit in Apache Spark" # Delete the old branch. git branch -D gh-pages # Rename the temporary branch to master. git branch -m gh-pages # Finally, force update to our repository. git push -f origin gh-pages ``` So, a single up-to-date coverage report can be shown on the `github.io` page. The commands above were manually tested.
### TODO: - [ ] Write a draft - [ ] Set hidden `SPARK_TEST_KEY` for @spark-test's password in Jenkins via Jenkins's feature. This should be set both at `SparkPullRequestBuilder`, so that we (or I) can test, and at `spark-master-test-sbt-hadoop-2.7` - [ ] Make the PR builder's tests pass - [ ] Enable this only at spark-master-test-sbt-hadoop-2.7 right before getting this in. ## How was this patch tested? It will be tested via Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-7721 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23117.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23117 commit d88d5aa73db636f8c73ace9f83f339781ea50531 Author: hyukjinkwon Date: 2018-11-22T08:08:20Z Run and generate test coverage report from Python via Jenkins
[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 retest this please
[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23109 retest this please
[GitHub] spark issue #23112: [GraphX] Remove unused variables left over by previous r...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23112 ok to test
[GitHub] spark issue #23112: [GraphX] Remove unused variables left over by previous r...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23112 Looks okay to go.
[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23109 retest this please (the test failures don't look persistent)
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23070 Merged to master.
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 LGTM except https://github.com/apache/spark/pull/23080/files#r235589426
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r235589426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala --- @@ -216,8 +232,13 @@ class CSVOptions( format.setDelimiter(delimiter) format.setQuote(quote) format.setQuoteEscape(escape) +lineSeparator.foreach {sep => + format.setLineSeparator(sep) + format.setNormalizedNewline(0x00.toChar) --- End diff -- I know we have some problems here with setting line separators of more than one character, because `setNormalizedNewline` only supports one character. This is related to https://github.com/apache/spark/pull/18581#issuecomment-314037750 and https://github.com/uniVocity/univocity-parsers/issues/170 That's why I thought we could only support a single character for now.
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r235589448 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala --- @@ -227,7 +248,10 @@ class CSVOptions( settings.setEmptyValue(emptyValueInRead) settings.setMaxCharsPerColumn(maxCharsPerColumn) settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) -settings.setLineSeparatorDetectionEnabled(multiLine == true) +settings.setLineSeparatorDetectionEnabled(lineSeparatorInRead.isEmpty && multiLine) +lineSeparatorInRead.foreach { _ => --- End diff -- nice!
[GitHub] spark pull request #22979: [SPARK-25977][SQL] Parsing decimals from CSV usin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22979#discussion_r235588064 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala --- @@ -79,4 +83,22 @@ object CSVExprUtils { throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str") } } + + def getDecimalParser(useLegacyParser: Boolean, locale: Locale): String => java.math.BigDecimal = { +if (useLegacyParser) { + (s: String) => new BigDecimal(s.replaceAll(",", "")) +} else { + val df = new DecimalFormat("", new DecimalFormatSymbols(locale)) --- End diff -- let's name `df` `decimalFormat` ..
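A rough pure-Python sketch of the two decimal-parsing strategies in the diff above. The function names and the simplified symbol substitution are illustrative only; the real code uses `java.text.DecimalFormat` with `DecimalFormatSymbols(locale)` rather than string replacement.

```python
from decimal import Decimal

def legacy_parse(s):
    # Legacy path: assume ',' is always a grouping separator and strip it,
    # mirroring `new BigDecimal(s.replaceAll(",", ""))`.
    return Decimal(s.replace(",", ""))

def locale_aware_parse(s, grouping=".", decimal=","):
    # Locale-aware path (simplified): strip the locale's grouping symbol,
    # then normalize its decimal symbol to '.' before parsing.
    return Decimal(s.replace(grouping, "").replace(decimal, "."))

print(legacy_parse("1,234.56"))        # 1234.56
print(locale_aware_parse("1.234,56"))  # 1234.56 (e.g. a de-DE style input)
```

The point of the locale-aware path is that the same digits written under a German-style locale ("1.234,56") parse to the same value the legacy parser got from the US-style "1,234.56".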
[GitHub] spark issue #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark add InM...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23027 Looks fine. @dongjoon-hyun WDYT?
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052 There are two more things to deal with: the https://github.com/apache/spark/pull/23052#issuecomment-440687200 comment will still be valid - at least it should be double-checked, because dataframes originating from an emptyRDD do not write anything in all cases. Another thing is CSV among the text-based datasources, because it can write out headers. The thing is, the header is currently written when the first row is written - this is what the https://github.com/apache/spark/pull/13252 PR targeted before. I closed that PR because there was no interest, but we should fix this.
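A tiny illustrative model of the "header is written with the first row" behavior described above (the class name is hypothetical, not Spark code): when zero rows are written, nothing at all is emitted, so an empty partition yields a file with no header.

```python
class LazyHeaderWriter:
    """Simplified model of a per-partition CSV writer: the header line
    is only emitted together with the first data row."""

    def __init__(self, header):
        self.header = header
        self.lines = []

    def write_row(self, row):
        # Emit the header lazily, right before the first row.
        if not self.lines:
            self.lines.append(",".join(self.header))
        self.lines.append(",".join(map(str, row)))

part_with_data = LazyHeaderWriter(["id", "name"])
part_with_data.write_row([1, "a"])
print(part_with_data.lines)  # ['id,name', '1,a']

empty_part = LazyHeaderWriter(["id", "name"])
print(empty_part.lines)      # [] -- no header line for an empty partition
```

This is why "prevent empty files for empty partitions" interacts with CSV headers: an empty partition currently produces no header line at all, not an empty file with just a header.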
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052 cc @cloud-fan as well
[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 retest this please
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r235583851 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -240,16 +240,6 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { Seq(Row("1"), Row("2"))) } - test("SPARK-11226 Skip empty line in json file") { --- End diff -- Where is it moved to then? Does that mean we don't have a regression test for SPARK-11226 anymore?
[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22938 Sorry for the late response. The change looks good to me in general but I had one question.
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r235583559 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { .text(path) val jsonDF = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path) - assert(jsonDF.count() === corruptRecordCount) + assert(jsonDF.count() === corruptRecordCount + 1) // null row for empty file --- End diff -- If that's true, we should not do this. Empty files can be generated in many cases for now, and the behaviour is not currently well defined. If we rely on this behaviour, it will cause some weird behaviours or bugs that are hard to fix.
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r235583349 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { .text(path) val jsonDF = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path) - assert(jsonDF.count() === corruptRecordCount) + assert(jsonDF.count() === corruptRecordCount + 1) // null row for empty file --- End diff -- Wait, does this mean that it reads an empty record from an empty file after this change?
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r235583315 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1905,7 +1911,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { F.count($"dummy").as("valid"), F.count($"_corrupt_record").as("corrupt"), F.count("*").as("count")) - checkAnswer(counts, Row(1, 4, 6)) + checkAnswer(counts, Row(1, 5, 7)) // null row for empty file --- End diff -- Wait, does this mean that it reads an empty record from an empty file after this change?
[GitHub] spark issue #23111: [DO-NOT-MERGE] Increases default parallelism in PySpark ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23111 retest this please
[GitHub] spark issue #23113: [SPARK-26019][PYTHON] Fix race condition in accumulators...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23113 > race condition explained in https://issues.apache.org/jira/browse/SPARK-26019 How does the race condition happen? Can you clarify it in the PR description? I think @viirya's analysis matches what I investigated (https://issues.apache.org/jira/browse/SPARK-26019?focusedCommentId=16692722=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16692722).
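As a generic illustration of the lost-update pattern behind this kind of race (the interleaving is written out deterministically; this is not the actual accumulator code from SPARK-26019):

```python
# Shared accumulator-like value updated by two handlers
# without synchronization.
counter = {"value": 0}

# Racy interleaving: both handlers read before either one writes.
read_1 = counter["value"]       # handler 1 reads 0
read_2 = counter["value"]       # handler 2 also reads 0
counter["value"] = read_1 + 5   # handler 1 writes 5
counter["value"] = read_2 + 3   # handler 2 writes 3, clobbering handler 1

print(counter["value"])  # 3, not the expected 8 -- handler 1's update is lost
```

With a lock around each read-modify-write pair, the second handler would read 5 and write 8, which is what the fix in this PR is about at a high level.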
[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23098#discussion_r235572895 --- Diff: bin/load-spark-env.cmd --- @@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and ensures it is only loaded rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the current directory's rem conf\ subdirectory. +set SPARK_ENV_CMD="spark-env.cmd" if [%SPARK_ENV_LOADED%] == [] ( set SPARK_ENV_LOADED=1 if [%SPARK_CONF_DIR%] == [] ( set SPARK_CONF_DIR=%~dp0..\conf ) - call :LoadSparkEnv + set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%" + if exist "%SPARK_ENV_CMD%" ( +call "%SPARK_ENV_CMD%" + ) ) rem Setting SPARK_SCALA_VERSION if not already set. -set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11" -set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12" - -if [%SPARK_SCALA_VERSION%] == [] ( - - if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% ( -echo "Presence of build for multiple Scala versions detected." -echo "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." -exit 1 - ) - if exist %ASSEMBLY_DIR2% ( -set SPARK_SCALA_VERSION=2.11 - ) else ( -set SPARK_SCALA_VERSION=2.12 - ) -) +rem TODO: revisit for Scala 2.13 support +set SPARK_SCALA_VERSION=2.12 --- End diff -- For `call`, it's a bit different IIRC (https://ss64.com/nt/cmd.html)
[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23098#discussion_r235572737 --- Diff: bin/load-spark-env.cmd --- @@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and ensures it is only loaded rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the current directory's rem conf\ subdirectory. +set SPARK_ENV_CMD="spark-env.cmd" if [%SPARK_ENV_LOADED%] == [] ( set SPARK_ENV_LOADED=1 if [%SPARK_CONF_DIR%] == [] ( set SPARK_CONF_DIR=%~dp0..\conf ) - call :LoadSparkEnv + set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%" + if exist "%SPARK_ENV_CMD%" ( +call "%SPARK_ENV_CMD%" + ) ) rem Setting SPARK_SCALA_VERSION if not already set. -set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11" -set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12" - -if [%SPARK_SCALA_VERSION%] == [] ( - - if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% ( -echo "Presence of build for multiple Scala versions detected." -echo "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." -exit 1 - ) - if exist %ASSEMBLY_DIR2% ( -set SPARK_SCALA_VERSION=2.11 - ) else ( -set SPARK_SCALA_VERSION=2.12 - ) -) +rem TODO: revisit for Scala 2.13 support +set SPARK_SCALA_VERSION=2.12 --- End diff -- but at least that's what I remember from when I tested the https://github.com/apache/spark/blob/master/bin/spark-sql2.cmd#L23 line.
[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23098#discussion_r235572442 --- Diff: bin/load-spark-env.cmd --- @@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and ensures it is only loaded rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the current directory's rem conf\ subdirectory. +set SPARK_ENV_CMD="spark-env.cmd" if [%SPARK_ENV_LOADED%] == [] ( set SPARK_ENV_LOADED=1 if [%SPARK_CONF_DIR%] == [] ( set SPARK_CONF_DIR=%~dp0..\conf ) - call :LoadSparkEnv + set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%" + if exist "%SPARK_ENV_CMD%" ( +call "%SPARK_ENV_CMD%" + ) ) rem Setting SPARK_SCALA_VERSION if not already set. -set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11" -set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12" - -if [%SPARK_SCALA_VERSION%] == [] ( - - if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% ( -echo "Presence of build for multiple Scala versions detected." -echo "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." -exit 1 - ) - if exist %ASSEMBLY_DIR2% ( -set SPARK_SCALA_VERSION=2.11 - ) else ( -set SPARK_SCALA_VERSION=2.12 - ) -) +rem TODO: revisit for Scala 2.13 support +set SPARK_SCALA_VERSION=2.12 --- End diff -- Yea, I think we shouldn't quote, if I remember this correctly. Let me test and get back to you today or tomorrow. I'll have to fly to Korea for my one-week vacation starting tomorrow :D.
[GitHub] spark pull request #23111: [DO-NOT-MERGE] Increases default parallelism in P...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23111 [DO-NOT-MERGE] Increases default parallelism in PySpark tests ## What changes were proposed in this pull request? I'm trying to see if increasing parallelism decreases elapsed time in PySpark tests. ## How was this patch tested? Jenkins tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark parallelism Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23111.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23111 commit ec0f7301f693811f4f782a62aae05abac33acd34 Author: hyukjinkwon Date: 2018-11-22T00:10:10Z Increases default parallelism in PySpark tests
[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23078 Thanks. Merged to master.
[GitHub] spark pull request #23102: [SPARK-26137][CORE] Use Java system property "fil...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23102#discussion_r235570156 --- Diff: core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala --- @@ -65,7 +65,7 @@ private[deploy] object DependencyUtils extends Logging { .map { resolveGlobPaths(_, hadoopConf) .split(",") - .filterNot(_.contains(userJar.split("/").last)) + .filterNot(_.contains(userJar.split(System.getProperty("file.separator")).last)) --- End diff -- +1
[GitHub] spark issue #23102: [SPARK-26137][CORE] Use Java system property "file.separ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23102 @markpavey, can you write a test? I can run the test on Windows via AppVeyor.
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23055 Sorry, may I ask you to take another look, please? I thought it was quite a straightforward fix done in a consistent way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23078 @JoshRosen, can you take a look please when you're available? The fix is quite obvious. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23109: [SPARK-26069][TESTS][FOLLOWUP]Add another possible error...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23109 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052 Also, Parquet does not always write empty files. It does not write empty files when data frames are created from an emptyRDD (the one pointed out in the PR link I gave). We should match this behaviour as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23101: [SPARK-26134][CORE] Upgrading Hadoop to 2.7.4 to fix jav...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23101 LGTM2 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23052 @MaxGekk I didn't mean to block this PR. Since we're going ahead for 3.0, it should be good to match and fix the behaviours across data sources. For instance, CSV should still be able to read the header. Shall we clarify each data source's behaviour? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [ZEPPELIN-3810] Support Spark 2.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3206 The fix is not released yet. This PR fixes exactly the problem you faced; it will be available in the next release of Zeppelin. ---
[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23098#discussion_r235322452 --- Diff: bin/load-spark-env.cmd --- @@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and ensures it is only loaded rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the current directory's rem conf\ subdirectory. +set SPARK_ENV_CMD="spark-env.cmd" if [%SPARK_ENV_LOADED%] == [] ( set SPARK_ENV_LOADED=1 if [%SPARK_CONF_DIR%] == [] ( set SPARK_CONF_DIR=%~dp0..\conf ) - call :LoadSparkEnv + set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%" + if exist "%SPARK_ENV_CMD%" ( +call "%SPARK_ENV_CMD%" + ) ) rem Setting SPARK_SCALA_VERSION if not already set. -set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11" -set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12" - -if [%SPARK_SCALA_VERSION%] == [] ( - - if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% ( -echo "Presence of build for multiple Scala versions detected." -echo "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." -exit 1 - ) - if exist %ASSEMBLY_DIR2% ( -set SPARK_SCALA_VERSION=2.11 - ) else ( -set SPARK_SCALA_VERSION=2.12 - ) -) +rem TODO: revisit for Scala 2.13 support +set SPARK_SCALA_VERSION=2.12 --- End diff -- Nope, it takes it as-is, including the quotes ... haha, odd (to me). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23071 Thank you @MaxGekk. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939 Sure! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark add InM...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23027 @wangyum, why did you close this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22939 gentle ping, @felixcheung. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23070 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23071: [SPARK-26102][SQL][TEST] Extracting common CSV/JSON func...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23071 I think we don't need this for now. Let's do this when more `from/to_...` functions are added later. The amount of code actually increases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 @MaxGekk, let's rebase this one accordingly, on top of the encoding support. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r235244407 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala --- @@ -192,6 +192,20 @@ class CSVOptions( */ val emptyValueInWrite = emptyValue.getOrElse("\"\"") + /** + * A string between two consecutive JSON records. + */ + val lineSeparator: Option[String] = parameters.get("lineSep").map { sep => +require(sep.nonEmpty, "'lineSep' cannot be an empty string.") +require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.") --- End diff -- Hm, I see. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23085 @mrandrewandrade, let's close this for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23078 @zsxwing can you take a look please when you're available --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23099: [WIP][SPARK-25954][SS] Upgrade to Kafka 2.1.0
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23099 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23089: [SPARK-26120][TESTS][SS][SPARKR]Fix a streaming query le...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23089 Merged to master and branch-2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23091: [SPARK-26122][SQL] Support encoding for multiLine in CSV...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23091 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r235226292 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true) // each python worker gets an equal part of the allocation. the worker pool will grow to the // number of concurrent tasks, which is determined by the number of cores in this executor. - private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY) + private val memoryMb = if (Utils.isWindows) { --- End diff -- I am doing this because: 1. Code consistency - the configuration is usually dealt with on the JVM side if possible - otherwise we should ideally send the configuration to the worker side and then process it on the Python worker side. What we need to do is explicitly disable the configuration under a certain condition. 2. If we do it only on the Python side, it's going to introduce a third state in the JVM (`memoryMb`): the configuration value exists in the JVM (which means it's enabled) but the functionality is disabled on the Python side. Avoiding that state reduces the complexity. Why are you so against disabling it on the JVM side? I don't see a big downside to doing this. I added a flag as you said as well. For clarification, it's a rather minor point. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23085 Because it's already documented. Also, it brings maintenance overhead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23087: [SPARK-26124][BUILD] Update plugins to latest versions
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23087 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23004 Looks fine to me --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23085 It's already documented in official site anyway. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23085: [Docs] Added csv, orc, and text output format options to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23085 I think it doesn't need to change. It's likely to change, and we wouldn't want to update this doc every time we add a new data source. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r234838984 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true) // each python worker gets an equal part of the allocation. the worker pool will grow to the // number of concurrent tasks, which is determined by the number of cores in this executor. - private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY) + private val memoryMb = if (Utils.isWindows) { --- End diff -- For the current status, there's no diff. I'm just trying to match what we have usually been doing - configuration control is usually done on the JVM side - and I propose to disallow this configuration on Windows. I think it's a pretty minor point now because I added the flag as well ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23091: [SPARK-26122][SQL] Support encoding for multiLine in CSV...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23091 FYI, hey @priancho. IIRC, you proposed a similar change before on the mailing list. I wasn't positive about it at that time because I was thinking we should deprecate the `encoding` option. There was a long, long discussion, and we're going to support this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23087: [MINOR][BUILD] Update plugins to latest versions
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23087 Looks fine to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23087: [MINOR][BUILD] Update plugins to latest versions
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23087#discussion_r234831263 --- Diff: pom.xml --- @@ -2522,7 +2530,7 @@ com.puppycrawl.tools checkstyle -8.2 +8.14 --- End diff -- @srowen, I think https://github.com/apache/spark/blob/master/project/plugins.sbt#L4 should be updated too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22635: [SPARK-25591][PySpark][SQL] Avoid overwriting deserializ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22635 This is fixed in 2.4.0, and your issue occurred when upgrading from 2.3.1 to 2.3.2. It's not related. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22635: [SPARK-25591][PySpark][SQL] Avoid overwriting deserializ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22635 How is it related to the JIRA? It looks not quite related from a cursory look. Please leave some analysis next time, or at least test before/after the specific commit. Let me take a look anyway. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 Ah, also, `CsvParser.beginParsing` takes an additional `Charset` argument. It should be fairly easy to support encoding in `multiLine` mode. @MaxGekk, would you be able to find some time to work on it? If that change makes the current PR easier, we can merge it first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r234476318 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala --- @@ -192,6 +192,20 @@ class CSVOptions( */ val emptyValueInWrite = emptyValue.getOrElse("\"\"") + /** + * A string between two consecutive JSON records. + */ + val lineSeparator: Option[String] = parameters.get("lineSep").map { sep => +require(sep.nonEmpty, "'lineSep' cannot be an empty string.") +require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.") +sep + } + + val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { lineSep => +lineSep.getBytes("UTF-8") --- End diff -- @MaxGekk, CSV's multiline does not support encoding but I think normal mode supports `encoding`. It should be okay to get bytes from it. We can just throw an exception when multiline is enabled. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r234475595 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala --- @@ -192,6 +192,20 @@ class CSVOptions( */ val emptyValueInWrite = emptyValue.getOrElse("\"\"") + /** + * A string between two consecutive JSON records. + */ + val lineSeparator: Option[String] = parameters.get("lineSep").map { sep => +require(sep.nonEmpty, "'lineSep' cannot be an empty string.") +require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.") --- End diff -- We could say the line separator should be 1 or 2 bytes (UTF-8) in read path specifically. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r234475228 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala --- @@ -192,6 +192,20 @@ class CSVOptions( */ val emptyValueInWrite = emptyValue.getOrElse("\"\"") + /** + * A string between two consecutive JSON records. + */ + val lineSeparator: Option[String] = parameters.get("lineSep").map { sep => +require(sep.nonEmpty, "'lineSep' cannot be an empty string.") +require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.") --- End diff -- @MaxGekk, might not be a super big deal but I believe this should be counted after converting it into `UTF-8`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23077: [SPARK-26105][PYTHON] Clean unittest2 imports up that we...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23077 Merged to master. Thanks for reviewing this, @BryanCutler and @srowen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23055 adding @BryanCutler and @ueshin as well FYI. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests over the ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23078 cc @BryanCutler. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23078: [SPARK-26106][PYTHON] Prioritizes ML unittests ov...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23078 [SPARK-26106][PYTHON] Prioritizes ML unittests over the doctests in PySpark ## What changes were proposed in this pull request? Arguably, unittests usually take longer than doctests. We'd better prioritize unittests over doctests. Other modules already prioritize unittests over doctests. Looks like the ML module was missed in the first place. ## How was this patch tested? Jenkins tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-26106 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23078.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23078 commit 8eb5444f97e6441e934ba264bb1fd7c8063d48ad Author: hyukjinkwon Date: 2018-11-18T08:41:49Z Prioritizes ML unittests over the doctests in PySpark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23077 cc @BryanCutler. BTW, Bryan, do you have some time to work on the `has_numpy` stuff that we talked about before? I was thinking we should specify the package version and produce similar skip messages like we do for Pandas and Arrow: https://github.com/apache/spark/blob/a7a331df6e6fbcb181caf2363bffc3e05bdfc009/python/pyspark/testing/sqlutils.py#L30-L55 I'll try to work on this towards the end of next week if you're busy. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
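The skip-message pattern referenced above (modeled on `pyspark/testing/sqlutils.py`) roughly works like this; a minimal sketch, where the names `have_numpy` and `numpy_requirement_message` are assumptions rather than the final API:

```python
import unittest

# A minimal sketch of the pattern in pyspark/testing/sqlutils.py; the names
# `have_numpy` and `numpy_requirement_message` are assumed, not the final
# API. The idea: probe the package once, remember why it is missing, and
# reuse that message in every unittest skip.
numpy_requirement_message = None
try:
    import numpy  # noqa: F401
except ImportError as e:
    numpy_requirement_message = str(e)

have_numpy = numpy_requirement_message is None

class NumpyDependentTest(unittest.TestCase):
    @unittest.skipIf(not have_numpy, numpy_requirement_message)
    def test_something_with_numpy(self):
        # Only runs when NumPy imported successfully.
        self.assertTrue(have_numpy)
```

A version floor could be added the same way, by setting the message when `numpy.__version__` is too old.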
[GitHub] spark pull request #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23077 [SPARK-25344][PYTHON] Clean unittest2 imports up that were added for Python 2.6 before ## What changes were proposed in this pull request? Currently, some PySpark tests still assume they could be run on Python 2.6 by importing `unittest2`. For instance:

```python
if sys.version_info[:2] <= (2, 6):
    try:
        import unittest2 as unittest
    except ImportError:
        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
        sys.exit(1)
else:
    import unittest
```

While I am here, I removed some unused imports and reordered imports per PEP 8. We officially dropped Python 2.6 support a while ago and started to discuss dropping Python 2. It's better to remove these imports. ## How was this patch tested? Manual tests, and existing tests via Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-26105 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23077.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23077 commit ed834f5062b1d8dff1fe92a6ad678d5d74df2c1c Author: hyukjinkwon Date: 2018-11-18T08:17:48Z Clean unittest2 imports up added for Python 2.6 before --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23063 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23070 Looks good to me. I or someone else should take a closer look, though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23070 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23070: [SPARK-26099][SQL] Verification of the corrupt column in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23070 add to whitelist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23064: [MINOR][SQL] Fix typo in CTAS plan database string
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23064 Merged to master, branch-2.4 and branch-2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23055 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23063: [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py fil...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23063 Will merge this one tomorrow if it's not merged by then. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23050: [SPARK-26079][sql] Ensure listener event delivery in Str...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23050 Merged to master and branch-2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23048: transform DenseVector x DenseVector sqdist from i...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23048#discussion_r234399421 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala --- @@ -370,14 +370,19 @@ object Vectors { case (v1: DenseVector, v2: SparseVector) => squaredDistance = sqdist(v2, v1) - case (DenseVector(vv1), DenseVector(vv2)) => -var kv = 0 + case (DenseVector(vv1), DenseVector(vv2)) => { val sz = vv1.length -while (kv < sz) { - val score = vv1(kv) - vv2(kv) - squaredDistance += score * score - kv += 1 +@annotation.tailrec --- End diff -- FWIW, I ran an experiment myself before because someone proposed a similar change (turning `while` into a tail-recursive one). There was virtually no difference. If this change is only tail recursion vs `while`, I doubt how much it can improve things. I believe the benchmark steps should be clarified. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
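For reference, the two shapes being compared above look roughly like this, sketched in Python (the PR itself is Scala, and CPython performs no tail-call optimization, so this only illustrates the comparison, not the JVM performance claim):

```python
# The two variants under discussion: a while loop vs a tail-recursive helper
# computing the squared Euclidean distance between two dense vectors.
# Python sketch only; the actual code is Scala, where scalac rewrites a
# @tailrec method into a loop, consistent with "no virtual difference".
def sqdist_loop(v1, v2):
    total, i = 0.0, 0
    while i < len(v1):
        score = v1[i] - v2[i]
        total += score * score
        i += 1
    return total

def sqdist_rec(v1, v2, i=0, total=0.0):
    if i == len(v1):
        return total
    score = v1[i] - v2[i]
    return sqdist_rec(v1, v2, i + 1, total + score * score)

print(sqdist_loop([1.0, 2.0], [3.0, 4.0]))  # 8.0
print(sqdist_rec([1.0, 2.0], [3.0, 4.0]))   # 8.0
```

Since both compile down to essentially the same loop on the JVM, a benchmark would need to isolate some other effect to justify the change.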
[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r234398745 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", true) // each python worker gets an equal part of the allocation. the worker pool will grow to the // number of concurrent tasks, which is determined by the number of cores in this executor. - private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY) + private val memoryMb = if (Utils.isWindows) { --- End diff -- > My point is that if resource can't be loaded for any reason, the code shouldn't fail. As it is, if resource can't be loaded then that is handled, but if the memory limit is set then the worker will still try to use it. That's what I think is brittle. There should be a flag for whether to attempt to use the resource API, based on whether it was loaded. Ah, so the point is that the condition for the existence of `resource` might not be clear - so we should have a flag so it doesn't fail in that case. Yup, makes sense. Let me add a flag. > If the worker operates as I described, then why make any changes on the JVM side? I am making some changes on the JVM side so that we can explicitly disable it under a certain condition. For instance, suppose we don't make a change on the JVM side and only make the changes in the Python `worker`. Later, we could have some other changes on the JVM side referring to this configuration - which can be problematic. If we keep the configuration somehow on the JVM side, it basically means we could have another state for this configuration, rather than just being disabled or enabled, which we should take care of. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
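The flag-based pattern discussed in this thread can be sketched as follows. This is an assumed, minimal illustration, not the actual PySpark worker code: probe for the platform-specific `resource` module once, and gate every later use on the resulting flag so nothing fails where the module is unavailable (e.g. on Windows).

```python
# Assumed, minimal sketch of the flag-based pattern discussed above (not the
# actual PySpark worker code): import the platform-specific `resource`
# module once, record whether it loaded, and gate later use on that flag.
try:
    import resource
    has_resource_module = True
except ImportError:
    has_resource_module = False

def maybe_set_memory_limit(memory_mb):
    # Apply a per-worker address-space limit only when both the module and
    # the configuration are present; otherwise the feature is disabled.
    if not has_resource_module or memory_mb is None:
        return False
    soft = memory_mb * 1024 * 1024
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return True

print(maybe_set_memory_limit(None))  # False: treated as disabled
```

With this shape, whether the JVM side also clears the configuration on Windows becomes the separate question debated above; the worker itself can never fail just because `resource` is missing.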
[GitHub] spark issue #23059: [SPARK-26091][SQL] Upgrade to 2.3.4 for Hive Metastore C...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23059 Looks good. Adding @gatorsmile and @wangyum --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org